Updated March 5, 2026
0:00 Welcome to Colaberry AI podcast brought to you by Colaberry AI Research Labs and Carle Foundation. Welcome to this deep dive. Today, we're exploring, really, a remarkable leap in how artificial intelligence is being used for medical diagnosis. That's right. Specifically, we're looking at tackling some of the toughest cases, the ones that even expert physicians find challenging. 0:22 Mhmm. We're digging into cutting-edge research from Microsoft AI and particularly their new system, the Microsoft AI Diagnostic Orchestrator, or MAI-DxO for short. Yeah. And our mission for this deep dive really is to unpack the, well, the intricate methods and, frankly, the impressive results of this research. It's pretty groundbreaking. 0:41 Okay. We'll be getting into the technical details, you know, exploring how this AI can systematically investigate complex medical challenges Right. And also how its performance is rigorously benchmarked against, well, real-world diagnostic scenarios. So the goal is really understanding the how, not just the what. Exactly. 0:58 We wanna help you understand not just what this tech achieves, but how it actually does it and maybe more importantly, what that means for the future of health care. Okay. Let's unpack this then. The context here is pretty crucial, isn't it? We're all seeing this increasing demand for health care everywhere Absolutely. 1:14 Alongside costs that are just, well, rising unsustainably. Yeah. And, critically, we're still battling these persistent barriers to better health outcomes. Things like inaccurate diagnoses or diagnoses that just come too late. It's a huge problem. It really is. 1:29 And at the same time, people are leaning more and more on digital tools. I mean, get this. Microsoft's own AI products, like Bing and Copilot, they see over 50,000,000 health-related sessions every day. Wow. 50,000,000. 1:44 Yeah. People are definitely looking for health answers online. That's right. 
And that growing reliance, that sheer volume, it means we need highly capable AI. Now historically, when we try to evaluate AI in medical diagnosis, we often use benchmarks like the USMLE. 1:59 The medical licensing exam. Exactly. The United States Medical Licensing Examination. Yeah. And those tests were mainly multiple choice. 2:07 Right. I remember that discussion. So while these generative AI models, you know, they quickly got near-perfect scores, which sounds impressive. Yeah. Very impressive. 2:17 This approach often overstated their actual competence. It really favored memorization, like recalling facts, over genuine clinical understanding. Ah, okay. And that kind of obscured their limitations in, you know, real-world dynamic situations, which leads to the big question. How do we really test clinical reasoning, not just book smarts? 2:38 That distinction, memorization versus reasoning, that feels key. So if those older tests, the USMLE-style ones, fell short, how did this research team rethink it? How do you actually test reasoning? What's the new approach here? Well, the new approach really centers on something called sequential diagnosis. 2:54 Sequential diagnosis. Which is basically the cornerstone of how doctors actually work in the real world. Think of it like an iterative process, almost like a medical detective solving a complex case. A clinician starts with, say, an initial patient presentation. Mhmm. 3:07 Maybe just a cough and fever. Uh-huh. And based on that, they iteratively, step by step, select questions to ask the patient or diagnostic tests to order. Like blood tests, maybe an X-ray? Exactly. 3:20 They might order blood tests, then maybe a chest X-ray based on those results. And each new piece of information helps refine their understanding, gradually narrowing down the possibilities until they arrive at a final diagnosis, like, say, pneumonia. So it mirrors how a doctor actually thinks through a case over time. Precisely. 
It's about that dynamic process. 3:41 And to build a benchmark that really captured this complexity, the researchers turned to the New England Journal of Medicine, the NEJM, specifically their case records. The NEJM cases. Yeah. These are famous, aren't they? And known for being among the most diagnostically complex, really intellectually demanding challenges out there? 3:59 Absolutely. Yeah. They often require input from multiple specialists just to figure out what's going on. They represent the frontier of diagnostic difficulty. Sounds like the perfect, well, crucible for testing an advanced AI then. 4:10 It really is. So from these incredibly tough NEJM cases, they created what they call the Sequential Diagnosis Benchmark, or SD Bench. SD Bench. Got it. They meticulously took 304 recent NEJM cases and transformed them into these stepwise diagnostic encounters. 4:27 Okay. Now the crucial thing here is that both the AI models and human physicians could interact with these cases iteratively. They could ask questions, order tests. Just like in the real world? Exactly. 4:40 And as new info became available, they'd update their reasoning, trying to narrow down to the final diagnosis, which was then compared against the actual outcome published in the NEJM, the sort of gold standard. Makes sense. And here's a key technical detail. Every investigation requested, every test ordered incurred a virtual cost. Ah, okay. 4:58 So not just accuracy, but efficiency too. Precisely. This allows evaluation across two really crucial dimensions, diagnostic accuracy, obviously, but also resource expenditure, which directly mirrors real-world health care, where you have to be mindful of costs and not just order every test under the sun. Right. You can't just run endless tests. 5:15 Okay. So this is where it gets, I think, really interesting because the research isn't just about plugging individual AI models into this SD Bench. No. It introduces something more sophisticated. 
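To make the stepwise, cost-tracked encounter described above a little more concrete, here is a minimal sketch in Python. It is not the SD Bench interface, which this episode does not describe; the `Encounter` class, its method names, the findings, and the prices are all hypothetical and chosen only to mirror the cough-and-fever example from the conversation.

```python
from dataclasses import dataclass, field


@dataclass
class Encounter:
    """Hypothetical stand-in for one stepwise case (not the real SD Bench API)."""
    presentation: str
    findings: dict        # request name -> result revealed only on demand
    prices: dict          # request name -> virtual cost (made-up numbers)
    true_diagnosis: str
    spent: int = 0
    log: list = field(default_factory=list)

    def request(self, name: str) -> str:
        """Reveal one finding and charge its virtual cost."""
        self.spent += self.prices.get(name, 0)
        result = self.findings.get(name, "not available")
        self.log.append((name, result))
        return result

    def score(self, diagnosis: str) -> tuple:
        """Return (correct?, total virtual cost) for the final answer."""
        correct = diagnosis.strip().lower() == self.true_diagnosis.lower()
        return correct, self.spent


# Illustrative case: every value below is invented for the example.
case = Encounter(
    presentation="cough and fever",
    findings={
        "blood tests": "elevated white cell count",
        "chest x-ray": "right lower lobe consolidation",
    },
    prices={"blood tests": 50, "chest x-ray": 120},
    true_diagnosis="pneumonia",
)
case.request("blood tests")
case.request("chest x-ray")
correct, cost = case.score("Pneumonia")
print(correct, cost)
```

The point of the sketch is that a run is scored on two axes at once: whether the final answer matches the published diagnosis, and how much virtual cost was spent getting there.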
5:27 The Microsoft AI Diagnostic Orchestrator, MAI-DxO. What is an AI diagnostic orchestrator? How is that different from just using, say, GPT-4 on its own? Is it like a conductor? That's actually a great analogy. 5:42 Yeah. An AI orchestrator, you can think of it as a digital conductor. Its job is to coordinate multiple steps, often complex ones, to achieve a task. Okay. And that's absolutely vital in high-stakes situations like clinical workflows. 5:54 MAI-DxO's specific design is unique. It essentially emulates a virtual panel of physicians. A panel, like multiple specialists consulting. Exactly. Each potentially bringing diverse diagnostic approaches, all collaborating digitally to solve the case. 6:08 Interesting. And this architecture, this panel approach, lets it integrate diverse data sources much more effectively than any single model could on its own. Right. Pooling knowledge. Which enhances not just the diagnostic performance, but also things like safety, transparency, and adaptability, all critical in clinical settings. 6:26 And it's model agnostic, I read. What does that mean exactly? That's a really significant advantage. Mhmm. Model agnostic means it's not tied to one specific underlying AI model. 6:36 You could swap in GPT or Llama or Claude or whatever comes next. Right. Okay. This promotes auditability, meaning you can check the process, and resilience, which is key because the AI field evolves so fast. 6:48 True. Essentially, the orchestrator can turn any of these large language models into this virtual clinician panel. It enables the system to ask follow-up questions, order tests, propose a diagnosis Uh-huh. Then run a cost check on its own actions, and even verify its own reasoning before deciding, okay, do I need more info, or am I confident enough to make the call? So it's like it has layers of self-checking built in. 7:11 Exactly. It's a layered system of reasoning and decision making. Yep. 
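The loop just described (propose a step, run a cost check, then decide whether confidence is high enough to commit) can be sketched as follows. This is an assumption-laden illustration, not MAI-DxO's actual design: the `Model` protocol, the `orchestrate` function, the step dictionary format, the budget, and the confidence threshold are all hypothetical. Model agnosticism shows up only in the fact that any object with a `propose` method can be plugged in.

```python
from typing import Callable, Protocol


class Model(Protocol):
    """Any backend that can suggest the next step (hypothetical interface)."""
    def propose(self, evidence: list) -> dict: ...


def orchestrate(model: Model, run_action: Callable[[str], str],
                budget: int, min_confidence: float = 0.8, max_steps: int = 10):
    """Ask, test, or diagnose step by step, with a cost gate and a confidence gate."""
    evidence, spent = [], 0
    for _ in range(max_steps):
        step = model.propose(evidence)
        if step["kind"] == "diagnose":
            # Confidence gate: commit only when the model is sure enough.
            if step["confidence"] >= min_confidence:
                return step["value"], spent
            evidence.append(("note", "confidence too low, gather more information"))
            continue
        # Cost gate: skip any action that would exceed the virtual budget.
        cost = step.get("cost", 0)
        if spent + cost > budget:
            evidence.append(("note", f"{step['value']} skipped: over budget"))
            continue
        spent += cost
        evidence.append((step["value"], run_action(step["value"])))
    return None, spent  # no confident diagnosis within max_steps


class ScriptedModel:
    """Toy backend that replays fixed steps; a real LLM wrapper would go here."""
    def __init__(self, steps):
        self._steps = iter(steps)

    def propose(self, evidence):
        return next(self._steps)


# Invented steps and prices, purely for illustration.
steps = [
    {"kind": "ask", "value": "travel history", "cost": 0},
    {"kind": "test", "value": "chest x-ray", "cost": 120},
    {"kind": "test", "value": "full-body MRI", "cost": 2000},
    {"kind": "diagnose", "value": "pneumonia", "confidence": 0.9},
]
diagnosis, spent = orchestrate(ScriptedModel(steps),
                               lambda name: f"result of {name}", budget=500)
print(diagnosis, spent)
```

Swapping `ScriptedModel` for a wrapper around a different language model would leave `orchestrate` unchanged, which is the sense of "model agnostic" used in the conversation.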
Much more sophisticated than just a single model's output. And the results? Yeah. 7:18 Performance-wise, they seem, well, genuinely striking. They really are. The research shows MAI-DxO boosted the diagnostic performance of every single foundation model they tested. Yeah. That's right. 7:29 And that list includes all the big names, GPT, Llama, Claude, Gemini, Grok, DeepSeek. Mhmm. The whole lineup. But the top performer was MAI-DxO paired with OpenAI's o3. It correctly solved 85.5% of these incredibly tough NEJM benchmark cases. 7:47 85.5% on cases that often stump multiple human specialists. It's an astounding number. It really is. And then there's the comparison to human performance, which kind of puts it into perspective. Yes. 7:59 This is fascinating. They evaluated 21 practicing physicians from the US and UK, all experienced, you know, five to twenty years in practice. Okay. Seasoned doctors. Right. 8:07 On the exact same tasks Mhmm. Using the SD Bench interface, these human experts achieved a mean accuracy of only 20% across the cases they completed. 20% compared to 85.5% for the AI. 20% mean accuracy. So MAI-DxO delivered both dramatically higher diagnostic accuracy, and it did so with lower overall virtual testing costs compared to the physicians or indeed any individual foundation model tested on its own. 8:31 Wow. This directly tackles that huge issue of diagnostic overtesting in health care, you know, the millions of unnecessary tests run each year. So better results and less waste. That's the implication. Achieving better outcomes with less expenditure. 8:46 That's potentially a massive game changer for health care economics and patient experience. Okay. Wow. That 20% versus 85.5%. As a listener, that almost sounds unbelievable. 8:56 It's such a huge gap. It is stark. What do you think accounts for that? Is it purely that the human doctors in the test didn't have their usual resources, or is there more to it? 
That's a really great and crucial question to ask. 9:09 And it's certainly not meant as a criticism of human doctors who operate with incredible skill, empathy, and contextual understanding in the real world. Right. Of course. But you have to consider the specific constraints of this benchmark setup. The human physicians worked in isolation without access to colleagues for a quick consult, without instantly pulling up vast digital libraries or textbooks, and without the ability to virtually order 20 different tests simultaneously Right. 9:35 With no real-world cost or time penalty, which the AI could effectively do within the simulation. Right. The AI doesn't get tired or have budget meetings. Exactly. The AI in this specific benchmark context acts like a tireless super-collaborator. 9:50 It has essentially perfect recall across an unimaginable breadth of medical literature. It can process information at speeds no human possibly can, cross-referencing rare conditions with subtle symptoms in a way that's just beyond the scope of a single human brain no matter how expert. So the orchestrator design, the MAI-DxO, it achieves this kind of synergistic super expertise. That's a good way to put it. A synthesis of knowledge and processing power that no single human could ever realistically master across all potential diagnoses represented in these complex NEJM cases. 10:24 It can blend breadth and depth in a unique way. Precisely. Which really highlights the potential for AI to augment human capabilities, bringing this incredible analytical power to the table rather than replacing the human element entirely. So connecting this back to the bigger picture then, what does this research really mean for the future of health care transformation? It feels like more than just an incremental step. 10:46 Oh, absolutely. The implications are potentially vast. 
For patients, you could imagine AI like this empowering them, maybe for more effective self-management of routine issues, providing accessible, accurate initial guidance. Right. And for clinicians, it could become an incredibly powerful tool, offering advanced decision support, especially for those really complex, puzzling cases that demand integrating just massive amounts of information. 11:13 The ones that end up in the NEJM. Exactly. And beyond the individual diagnosis, these findings strongly suggest AI's potential to significantly reduce unnecessary health care costs. Which we know are a huge burden. A massive burden. 11:26 I mean, consider that US health spending is pushing towards 20% of GDP. Wow. And estimates suggest up to a quarter of that spending might be wasted, meaning it has little actual influence on patient outcomes. A quarter. That's staggering. 11:39 It is. So this is an area where AI, by improving accuracy and reducing unnecessary tests, could have a profound positive impact on the sustainability of our entire health care system. Of course. Like any responsible research, this paper also points out its limitations, right, which is crucial. Absolutely critical. 11:58 They mentioned that while MAI-DxO smashed these really complex NEJM cases, it still needs more testing on the more common everyday presentations doctors see most often. That's right. Bread-and-butter cases need evaluation too. And we already touched on how the human physicians in the study were working under benchmark conditions. No colleagues, no textbooks, no AI assistance themselves. 12:19 Yeah. Which was necessary for a clean comparison to the raw human performance baseline in that specific setup, but it's definitely different from normal clinical practice. A key difference. 
And they also note that while real-world health costs are complex and variable, they used a consistent virtual cost method across all agents, AI and human, to allow for a fair, quantifiable comparison of the trade-offs involved. Exactly. 12:42 Yeah. And it's absolutely vital to reiterate and emphasize this point. Both the SD Bench and MAI-DxO are currently research demonstrations only. Okay. Strictly research at this stage? 12:54 Strictly research. They are not approved for clinical use. The path forward requires really rigorous next steps. We absolutely need real-world evidence gathered from diverse clinical environments. Not just benchmark simulations. 13:08 Correct. We need comprehensive safety testing, extensive clinical validation across much broader patient populations, and the careful, thoughtful development of appropriate governance and regulatory frameworks. Makes sense. It has to be safe and reliable. Fundamentally. 13:22 And the vision here, it's important to stress, is that this technology is designed to complement doctors, not replace them. Augmentation, not replacement. Precisely. The idea is to allow clinicians to perhaps automate some routine information gathering, freeing them up, to help identify diseases earlier, to personalize treatment plans based on a much deeper dive into individual patient data, and maybe even enhance that shared decision making process between doctor and patient because both have access to clearer AI-analyzed information. It's really about augmenting human expertise and, importantly, human empathy with the analytical power and breadth of machine intelligence. 14:01 Okay. So bringing this all together for our listeners, here's a provocative thought to leave you with. Considering MAI-DxO's demonstrated ability, at least in this benchmark, to achieve higher diagnostic accuracy with significantly lower testing costs. 
How might this kind of technical capability fundamentally reshape not just that initial diagnostic phase, but perhaps the entire patient journey? That's a big question. 14:24 What new, maybe highly specialized roles could emerge for clinicians? And thinking from the patient perspective, what might change for you as a patient in a future where these AI orchestrators are managing and analyzing complex medical information and workflows alongside your human care team? Lots to think about there. Indeed. Thank you for listening in. 14:44 Subscribe and follow Colaberry on social media. Links are in the description. And check out our website, www.colaberry.ai/podcast, for more insights like this.