The birth of Advanced Radiology
People love to ask me why I’m spending a huge fraction of my time clinically training for a job that will obviously be done by computers in the near future. My current go-to answer is: “for chess reasons,” a pure love of the game. To quote from Gwern’s Advanced Chess Obituary:
As automation and AI advance in any field, it will first find a task impossible, then gradually become capable of doing it at all, then eventually capable of better than many or most humans who try to do something, and then better than the best human. But improvement does not stop there, as ‘better than the best human’ may still be worse than ‘the best human using the best tool’; so this implies a further level of skill, where no human is able to improve the AI’s results at all rather than get in the way or harm it. We might call these different phases ‘subhuman’, ‘human’, ‘superhuman’, and ‘ultrahuman’.
We’re clearly in the ‘ultrahuman’ era of chess, and have long since surpassed the ability of humans to add anything to the performance of the best computers.1 And yet we still love to learn to play because it’s fun, because it’s challenging, because it’s social. So when I give this answer, I’m mostly being sincere. Radiology is also fun, and challenging, and social. Today I got on the phone with a cardiothoracic surgeon to talk about a case I was interpreting in real time. What I saw (or didn’t see) would help determine whether a patient needed to be taken back to the operating room for further surgery. This is objectively exciting, and stressful! Tomorrow I’ll read the operative report and find out if we were right or not.
The “chess reasons” answer, however, is not only a claim about my love of radiology, but also implicitly a claim about my belief in the likelihood of its automation. This is the part that is perhaps less than purely sincere, or at least less certain. I think my predictions are unsatisfying to both sides of the automation debate. I’m a bit more moderate than the most spiritually San Franciscan people I talk to, in the sense that I don’t think the whole job will be automated in the next five years. But I also think we’re closer to automation than many of my academic medical friends believe. A big part of my motivation in writing this post is to get them to “update”2 their estimates of how likely rapid AI progress is in our field, and to try to envision what the field will actually look like as this automation occurs. Sort of a “Radiology 2027,” if you will.3
Directly mapping the conceptual progression from the Advanced Chess Obituary onto radiology is hard because “Radiology” comprises many diverse tasks, rather than a single neatly demarcated activity. If you try to conceptualize the entire field as a single automatable task, you get a confusing picture. For example, there was good evidence back in 2017 that we were already at least in the superhuman (if not ultrahuman) era for radiology when “CheXNet outperformed four Stanford radiologists in diagnosing pneumonia accurately.” Conversely, in 2025, it looks like we are still in the subhuman era when the creators of a model that can draft reports from CT scans note that it “remains far from ready for clinical deployment.” Some might argue that even if you resolved this inter-task incongruence, measuring progress in automation would still be difficult due to what could be called the “dark matter of labor” theory4, as described by Arvind Narayanan here:
Maybe the “jobs are bundles of tasks” model in labor economics is incomplete. Paraphrasing something @MelMitchell1 pointed out to me, if you define jobs in terms of tasks maybe you’re actually defining away the most nuanced and hardest-to-automate aspects of jobs, which are at the boundaries between tasks.
Radiology certainly has no shortage of task-boundaries and messy edges5. But let’s say that to a first-order approximation, jobs are bundles of tasks, and those “boundaries between tasks” are actually just part of broader tasks that are more open-ended and harder to rigorously define and measure than others6.
Once we accept that approximation, it’s easy to see that Gwern’s chess automation progression is just one axis along which we should be evaluating automation. The chess analogy gets at the depth of automation for a particular task, but neglects the breadth of automation across the tasks of a job. Even if AI is ultrahuman at some particular subtasks,7 that does not imply it is performant at a task like “full interpretation of a chest radiograph,” which is itself a composite of other subtasks like “analyze clinical history,” “consider prior imaging,” “perceive any imaging abnormalities,” and “write an accurate report describing these findings.”
So how are things looking for radiology when we try to evaluate along both axes at once? I think things are moving pretty fast. For instance, take Rad Partners’ recent acquisition of Cognita. They claim that their new radiology report drafting software (which looks at chest radiographs and non-contrast head CTs) leads to a fourfold reduction in diagnostic errors and a read-time savings of up to 76%. This is remarkable, not just for the accuracy of the model, but also for the breadth of the task. Where in 2017 we were looking at the ability to produce accurate binary labels for a small set of 14 diagnoses on chest radiographs, models can now take in arbitrary sequences of images (like a full cross-sectional CT scan) with arbitrarily interleaved text (like clinical history, prior reports, etc.) and produce a full radiology report as output.
While I haven’t seen a full description of the data underlying the RP/Cognita claims that I mentioned above, if we take the claim from the press release at face value that “87-98% of the AI-generated results required one or fewer clinically significant edits,” I think we can grade the level of automation as certainly human, and quite likely nearly superhuman (since I’d anticipate that even two attending-level radiologists might disagree with at least one aspect of each other’s interpretations 2% of the time). It seems fair to say that we’re in the era of Advanced Radiology, by analogy to the era of Advanced Chess, where we see superhuman performance from radiologist + AI systems.
If this is true, what can we predict will come next? Looking at the progress of frontier models in non-medical domains over the last several years, we can see that the current generation of models (multimodal sequence-to-sequence models) and the current general training recipes yield massive improvements as data quality and resource investment (pre-training scale, post-training data generation, etc.) are scaled up. A big bottleneck to this progress in medicine has been how siloed the data is for legal/privacy reasons. But now Radiology Partners, the largest clinical radiology group in the U.S., has a full “technology services division,” has as of last week acquired a group of state-of-the-art AI researchers from Stanford in Cognita, and apparently intends to hire many more AI engineers. So over the next several years I would predict a lot more progress in both the breadth of tasks that are automated (e.g. generating reports for more modalities and indications, handling more complex clinical follow-up questions) and the accuracy of models at these tasks.
To temper the hype and enthusiasm a bit, it’s also important to note that unlike computer code and math, there are no automatically verifiable/programmatic environments within which to do reinforcement learning for arbitrary perception from clinical images (i.e. there is no way to programmatically generate histopathological ground truth at scale for arbitrary radiographic images). This likely caps the depth of automation attainable, as the models will mostly be constrained to learn to reproduce the types of reports and patterns that physicians have produced in the past.8 So for the foreseeable future, models will likely struggle with very rare and sparsely described cases, such as scans from patients with bizarre congenital abnormalities, and with differentiating pathology from post-surgical anatomy in patients with complex operative histories. Another possible limit on the speed of deployment is the cost of development. Whether these results are attained with models that use on the order of 10B parameters or 100B parameters (or more) will have a big impact on the cost required to iteratively improve and extend them. These sorts of details are, of course, unlikely to be released publicly, but we can make guesses on the basis of models like CT-CHAT, which found optimal performance with 70B transformers.
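To make the contrast concrete, here’s a minimal sketch of the asymmetry. The function names and structure are purely illustrative (this is not any real training pipeline): for code, the reward is a function you can actually write; for imaging perception, it isn’t.

```python
# Toy illustration of why RL-with-verifiable-rewards works for code but
# not (yet) for perception from clinical images. Names are illustrative.

import subprocess
import sys
import tempfile


def code_reward(program: str, test_suite: str) -> float:
    """Programmatic ground truth: run the tests; pass/fail is the reward."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n" + test_suite)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    return 1.0 if result.returncode == 0 else 0.0


def radiology_reward(report: str, scan_id: str) -> float:
    """No analogous verifier exists: the ground truth (histopathology,
    surgical findings, clinical follow-up) cannot be generated
    programmatically at scale, so the training signal is largely capped
    at imitating reports that humans have already written."""
    raise NotImplementedError("no programmatic verifier for arbitrary imaging findings")
```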
Even on the most conservative trajectory, it seems to me that within the next five years we’ll see similar gains across all of the most common imaging modalities in the hospital (MR, more CT scan types, ultrasound, nuclear medicine, etc.), with radiologists able to read at least 4x the number of studies at similar-to-improved error rates. What does the world look like in this most conservative scenario?
One obvious consequence is that imaging volumes will continue to rise. Soon it will essentially be standard of care for everyone coming to the ED to get cross-sectional imaging.9 I’d also anticipate a much lower threshold for ordering outpatient cross-sectional imaging. I don’t think radiation risks provide a very strong counter-argument here either, as we’re getting better at protocols that let us acquire MRI faster and faster (such as Neuromix), and better at extracting useful information from lower and lower dose CT scans. There should also be increased capacity to interpret full-body screening exams like those done by Prenuvo, Ezra, etc., as well as capacity for the additional scientific/clinical inquiry that would be necessary to make those scans useful to the people getting them.
Another consequence is that a greater fraction of the time on the job will be spent interpreting the hardest, most “out-of-distribution” scans.10 As the AI algorithms become better and better at triaging away the normal cases11, it’s not super clear to me how sustainable it is to spend all of your time reading only the most mentally taxing cases. In a world where you own the technology that gives you the productivity boost, maybe you can partially mitigate this by reading a bit slower, or taking breaks during the day or something. But I don’t think it’s obvious that practicing radiologists will really end up “in control” of this technology.
Further into the future (or sooner if progress happens faster), as AI gets better and better, the radiologist’s job starts to look more and more like the clinical pathologist or lab director. The AI algorithms will generate reports for almost all of the imaging, and the radiologist will help manage the operations, quality, and medical accuracy of the outputs. Radiologists will be responsible for validating new outputs, supervising the technologists who take the images, setting quality standards for model outputs, managing model drift, and troubleshooting unexpected results. Importantly, I don’t think our current training prepares us for this at all.
Back in 2016, Geoff Hinton said AI would replace radiologists. Then in 2019, Curt Langlotz said that radiologists who use AI will replace radiologists who don’t. In summary, I think my much less catchy iteration might be: a small number of radiologists working at big AI-powered radiology “labs” will replace the vast majority of radiologists. Or maybe just: radiologists are going to become clinical pathologists for images.
So what does this mean if you’re currently a radiology trainee? A few things. I’d guess that the relative value of a boilerplate ED read is going to go down over time, since this is what LLMs will be able to do first. That suggests a human’s value-add will be the sort of esoteric, poorly described stuff that barely makes it into training corpora. The job market is currently hot, and you can get a very well-compensated job in telerads without doing a fellowship. But that is exactly the kind of work that will be replaced first. So maybe “get your bag” doing that job now? Or do a fellowship? I’m not really sure.12 I’d definitely get as much hands-on experience as I could with how these AI tools are concretely deployed, monitored, and quality-assured.
Finally, what does this mean for independent radiology groups? Rather than many radiologists practicing across many small independent groups, I’d strongly bet that large firms like RP that own the technology to increase their radiologists’ productivity will be able to pick up more and more contracts, and will consequently consolidate more of the resources for further AI development, further productivity gains, and further consolidation. Large independent radiology practices may have the capacity to compete by developing tools like this internally, or may purchase tools from external firms like Harrison. Or they may get consolidated into the bigger groups!
In any case, I think this is quite an exciting time for radiology. The precise impact this will have on radiology as a profession is unclear, but the inevitable outcome of a greater volume of higher-quality imaging interpretation will clearly benefit both patients and biomedical science writ large.
Thank you to Sam Gilbert-Janizek, Eryney Marrogi, and Jack Sempliner for their feedback on a first draft of this post.
My friend Jack highlighted to me that one interesting thing about Advanced Chess that the Gwern article relegates to the last sentence of the post is the World Correspondence Chess Championship, which “gives a benchmark for the shade of difference between the superhuman and ultrahuman regimes for chess. You see that until recently there was still a chance for human + engine teams to meaningfully compete against each other. However in the 33rd world championship, the only decisive games were achieved when one of the humans piloting the engine died.”
I’ve only been down here for 5 months, have I become Spiritually San Franciscan so quickly?
Per the authors, AI 2027 was informed by “expert feedback, experience at OpenAI, and previous forecasting successes.” I definitely don’t claim any “superforecasting” ability, and I don’t have any experience working at OpenAI, but I hope my expertise in these fields suffices :)
I almost certainly stole this phrasing from @norvid_studies, but can’t find the specific tweet where he said it.
Laura Heacock, a breast radiologist and deep learning researcher at NYU, compiled a great list of examples from her work:
-Lack of consensus on specific type of implant rupture, with no real ground truth available as only very ruptured ones removed
-Rapid shift in guidelines of high-risk benign management affecting rad-path correlation/need for surgery.
-Determining if biopsied lesion was the recommended one
-Cross-correlation of multimodal breast studies to an edge finding on non-breast studies
-Extremely unusual presentation of a cancer being staged on MRI
-Lengthy discussion with patient about recommendation for biopsy vs patient’s preference to defer
-Determining finding is benign because it was seen four years ago on a separate study and not visualized again until current exam (very common in screening, but not currently solvable with screening mammo AI)
I think this is a reasonable approximation, as we’ve continued to make technical progress on rigorous, quantifiable evaluation of more “open-ended” tasks in radiology/NLP. The most useful insight of the “dark labor” theory is that the hard-to-measure and hard-to-record aspects of labor are also the hard-to-automate ones (given that AI automation works by taking lots of data and learning to predict the underlying pattern). However, I don’t think there’s any inherent qualitative difference in the type of labor between “dark labor” and the easier-to-measure stuff that would prevent it from ever being measured. Even if there is some dark-labor component that is inherently un-automatable, you could just measure the magnitude of that term, set it aside, and then measure progress on the remainder.
For example, the still-not-fully-explained results where conv nets can really, really accurately predict patient-reported race from plain film radiographs, which is something that radiologists can’t do AT ALL. This is its own really interesting topic that I could talk about for an entire post, but I’ll just link to Soham’s excellent paper describing a generative interpretability method applied to this task, as well as William Lotter’s paper identifying acquisition parameters as a possible mediating factor.
You can certainly imagine a world where research agents like Edison's Kosmos are given access to the full medical records/EHRs of patients and instructed to first train pattern-matching models to find novel radiographic correlates for clinical outcomes of interest, then go back and annotate these densely across the remainder of the dataset, and then learn new full sequence-to-sequence models that generate more clinically useful reports than humans did previously. It’s just significantly harder to do this than to set up a verifiable math or programming environment that automatically tells you whether “the program executes and produces the expected output.”
As opposed to the present state of affairs where it just feels like everyone who comes to the ED gets cross-sectional imaging.
In addition to the tougher post-surgical cases and congenital cases, this also likely applies to cardiac imaging. At my hospital, we do a lot of interpretation-time post-processing of functional imaging in external software like TeraRecon. These post-processed videos (e.g. standard cardiac views of the heart in motion, curved planar reformats, etc) don’t automatically get saved to the PACS along with the report, meaning that the map from input imaging to output report isn’t quite as neat as in other areas.
For what it’s worth, while most people I’ve talked to are very convinced that triaging away normal cases is an inherently easier problem than providing positive examples of findings, I’m not 100% sure this is the case. As a test case, I played around this past weekend with some different loss functions for a deep learning model to triage away normal chest radiographs. I tested a standard “14-label” approach as well as reducing the labels to a binary “Abnormal”/“Normal” label. For both of these approaches, I also considered a 10x multiplier on the cost of false negatives. The quality of the system was very dependent upon the loss used; a rough sketch of the two schemes is below. I’ll probably blog more about this later.
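For the curious, here’s a minimal PyTorch sketch of what those two labeling schemes with an asymmetric false-negative penalty might look like. The model, data pipeline, and the exact losses I actually tried aren’t shown; the names and constants here are illustrative only.

```python
# Illustrative sketch only: two labeling schemes for chest radiograph
# triage, each with a 10x penalty on false negatives (missed abnormals).

import torch
import torch.nn.functional as F

FN_MULTIPLIER = 10.0  # extra cost for calling an abnormal study normal


def binary_triage_loss(logits: torch.Tensor, abnormal: torch.Tensor) -> torch.Tensor:
    """Binary Abnormal/Normal scheme: one logit per study, target 1.0 if
    abnormal. pos_weight upweights the loss on abnormal studies, i.e.
    penalizes false negatives more heavily than false positives."""
    return F.binary_cross_entropy_with_logits(
        logits, abnormal, pos_weight=torch.tensor(FN_MULTIPLIER)
    )


def multilabel_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard 14-label scheme (one logit per finding, CheXNet-style),
    with the same 10x penalty on missed positive findings."""
    pos_weight = torch.full((labels.shape[-1],), FN_MULTIPLIER)
    return F.binary_cross_entropy_with_logits(logits, labels, pos_weight=pos_weight)
```

The pos_weight trick is just one way to encode asymmetric costs; you could also leave the loss symmetric and move the decision threshold on the sigmoid output at deployment time.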
In retrospect, I’m unsure if my career advice optimizes for outcomes that most people would like to attain. I got a PhD in machine learning in 2022 and, instead of going off to make a boatload of money at one of the big labs, went back to medical school to do such highly prestigious and well-compensated things as “learn to change a woundvac” and “get yelled at to do a better job retracting during surgery.”


