Updated March 5, 2026
0:00 Welcome to the Colaberry AI podcast, brought to you by Colaberry AI Research Labs and Carl Foundation. Today, we're taking a deep dive into a really important and timely topic. It's all about how cutting-edge AI, specifically large language models, is actually moving out of the lab and into the messy, crucial environment of real-world health care settings. Our sources highlight a significant gap. They actually call it the model implementation gap. 0:24 Basically, it's the difference between what these powerful AI models can do in theory and their actual boots-on-the-ground deployment for genuine patient benefit. So maybe to kick us off, what exactly is this model implementation gap, and how are researchers trying to tackle it? Yeah. It's a fascinating area. What's really striking is that large language models, LLMs, often show they can perform better than human doctors on certain health benchmarks. 0:47 I mean, that's a huge leap, right, in potential accuracy, reliability, even safety. Studies keep showing this. But their actual large-scale use in busy clinics has remained pretty limited. So that model implementation gap is exactly that challenge: taking impressive lab results and making them work effectively in the real world. 1:06 The specific paper we're digging into today really zeroes in on clinical decision support systems, CDS, as a kind of frontier use case. The main goal of a CDS system, especially an AI-powered one, is to give clinicians relevant, evidence-based info right when they need it at the point of care, bringing that benchmark performance into practice. Okay. So we see the why, this gap between AI's promise and its everyday use in health care. Let's get into the how. 1:32 Our sources talk about this really interesting real-world test of an LLM-powered CDS tool. They called it AI Consult. Where exactly did they put this tool through its paces? What were the conditions like there? Right.
1:43 So this study happened in Kenya, within Penda Health. Penda is a pretty big network of primary medical centers, mostly around Nairobi. The clinicians there are called clinical officers, and they handle an incredibly wide range of conditions every single day, which, given high patient numbers and maybe limited resources, can lead to quality challenges. Things like prescribing antibiotics too often or delays in diagnosis, common issues. But Penda Health was actually a great place for this because they already had a solid digital infrastructure, their EMR system, and quality programs in place. 2:17 So it provided this ideal high-volume, diverse setting to really test the AI tool's impact rigorously. They had the setup and the willingness. Okay. So a complex environment, high stakes. The design of AI Consult must have been absolutely critical then. 2:31 What were the core technical ideas, and maybe the iterative steps they took, to build something that could actually work there? Oh, absolutely critical. The design thinking behind AI Consult was key. They didn't want just an on-demand tool. They saw it as a kind of continuous safety net running in the background. 2:48 It was built with three main technical goals in mind. First, maximize coverage. The idea was the model reviews every patient visit and every major decision point in that visit. Without the clinician having to ask? Exactly. 3:00 Without them asking, it runs automatically. Second, minimize cognitive load. This is huge. Feedback only interrupts the workflow if a genuine material risk is spotted, trying to avoid that dreaded alert fatigue you hear so much about in medicine. 3:14 Right. Where people just start ignoring alerts because there are too many. Precisely. And third, maintain clinician autonomy. The system gives recommendations, suggestions, but the final call is always with the clinician.
3:25 To manage this, they came up with that traffic light interface you mentioned. Yeah. Green: no worries. Yellow: moderate concern, which clinicians can view via a little bell icon if they want. 3:34 And red: a safety-critical issue. That required a mandatory pop-up review before they could move on. That sounds smart. Tiered alerts. Yeah. 3:41 It was really about balancing finding important issues with not overwhelming the clinicians in a busy setting. That traffic light system sounds clever. Definitely helps with the alert fatigue problem. But how did they actually weave this into the clinic's daily operations? 3:55 How does it connect to their existing EMR, their digital records, and what kind of AI is actually crunching the data? Okay. Technically, the architecture is asynchronous and event-driven. That's vital for it to feel responsive, not sluggish. AI Consult is embedded right into Penda's own cloud-hosted electronic medical record, which is called EasyClinic. 4:16 It triggers calls to the LLM in the background, basically whenever a clinician finishes typing in or clicks away from critical fields. Like the main complaint or the diagnosis? Exactly. Chief complaint, clinical notes, tests ordered, diagnosis, medications prescribed, those key points. 4:31 It allows for almost real-time analysis as the clinician works. Now, for the LLM itself, the system design could technically use different models; it's model-agnostic. Right. For this specific study, Penda chose to use GPT-4o. Okay. 4:44 GPT-4o. Why that one specifically? Two main reasons. First, its strong few-shot reasoning. That means it learns tasks well from just a few examples, which is efficient. 4:54 And second, crucially, its low latency. Getting feedback to the clinician quickly, almost instantly, was a top priority. If it's slow, adoption tanks. Makes sense. 5:03 Can't wait minutes for advice. No way.
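As a rough illustration of the design just described (background review triggered when a clinician leaves a key field, with the result routed through a green/yellow/red tier), here is a minimal Python sketch. It is not Penda's code: the field names, the `review_visit` stand-in, and the return labels are all hypothetical, and a toy rule takes the place of the real LLM call. The one rule used, missing vital signs counting as red, comes from the discussion itself.

```python
import asyncio
from enum import Enum
from typing import Optional


class Severity(Enum):
    GREEN = "green"    # no concerns
    YELLOW = "yellow"  # moderate concern, optional to view
    RED = "red"        # safety-critical, must be reviewed


# Hypothetical names for the EMR fields whose focus-out event triggers a review.
CRITICAL_FIELDS = {
    "chief_complaint",
    "clinical_notes",
    "tests_ordered",
    "diagnosis",
    "medications",
}


def presentation(severity: Severity) -> str:
    """Map an alert tier to how it is surfaced to the clinician."""
    if severity is Severity.RED:
        return "blocking_popup"  # mandatory review before moving on
    if severity is Severity.YELLOW:
        return "bell_icon"       # available on demand, does not interrupt
    return "silent"              # green: nothing shown


async def review_visit(visit: dict) -> Severity:
    """Stand-in for the background LLM call (assumption: a toy rule, not a model)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if not visit.get("vital_signs"):
        return Severity.RED
    return Severity.GREEN


async def on_focus_out(field: str, visit: dict) -> Optional[str]:
    """Run a review only when the clinician leaves one of the critical fields."""
    if field not in CRITICAL_FIELDS:
        return None
    severity = await review_visit(visit)
    return presentation(severity)


if __name__ == "__main__":
    print(asyncio.run(on_focus_out("diagnosis", {"vital_signs": None})))
```

Keeping the tier-to-presentation mapping separate from the review call reflects the model-agnostic point: the model behind `review_visit` could change without touching how alerts are shown.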
And they put a lot of effort into prompt engineering, giving the LLM detailed context like Penda's own treatment protocols and Kenyan clinical guidelines. Then they augmented that with carefully made examples showing it exactly what a good red, yellow, or green response should look like in their context, really guiding its output and alert severity. So the design sounds thoughtful. The tech choice focused on speed. 5:29 But we know the real hurdle often isn't just the AI model itself. Right? It's making it fit with how people actually work. What kind of iteration, what testing did they have to do to really refine AI Consult and make sure it didn't just get in the way? You've hit the nail on the head. 5:43 Integration and refinement were absolutely central. They went through many, many cycles. For instance, they found out pretty early on that the AI alert sometimes came too late, like when the clinician was basically done and sending the prescription to the pharmacy. It's harder to backtrack then. Exactly. 6:00 It caused friction. Clinicians didn't want to undo things they'd already decided or communicated, so that feedback led them to reengineer it to trigger earlier, specifically when users focused out of those key EMR fields we mentioned, which gives an earlier chance to catch something. Smart adjustment. Yeah. 6:18 And they also had to really fine-tune the alert thresholds using ongoing prompt engineering. Missing vital signs, for example? Definitely a critical red alert. But initially, maybe it flagged things like overly detailed but not safety-critical historical notes as red too often. Leading to that alert fatigue again. Right. 6:35 So they adjusted, made the threshold a bit more lenient for things that weren't immediate safety risks, showing that nuanced understanding of what's truly critical versus just nice to have. And there was this induction period early in the study. They did clinician shadowing, got continuous feedback, and boom, hit a major technical snag. Uh-oh.
What was that? 6:56 System slowness. Too many simultaneous API calls were bogging things down. The real-time feedback just wasn't happening consistently. That could kill adoption right there. Absolutely. 7:06 So Penda's engineers had to basically re-architect parts of the code and optimize how it handled requests asynchronously. They managed to get the average response time down below three seconds. That was a huge technical win and really boosted usability and clinician buy-in. Wow. Okay. 7:21 That's some serious behind-the-scenes work, not just plug-and-play AI. So let's shift to the study itself. It sounds pretty large scale, tens of thousands of visits. How did they actually measure if AI Consult was making a difference on clinical quality, errors, and maybe patient outcomes? Yeah. 7:36 It was a robust study design, a pragmatic cluster-assigned trial. We're talking 39,849 patient visits and 106 clinicians across 15 different clinics. They split the clinicians into two main groups. The AI group got the live AI Consult alerts. The non-AI group had AI Consult running silently in shadow mode. 7:56 Shadow mode. So it recorded what it would have said? Exactly. It logged all the data, all the potential alerts, but didn't show them to the clinicians. That gave them a crucial baseline for comparison. 8:06 They focused their measurements on three key areas using pretty rigorous methods. First, clinical quality. This was fascinating. They had a panel of 108 independent physicians, completely blinded. They didn't know which group the notes came from. 8:19 These docs reviewed anonymized clinical notes using standard five-point Likert scales across four areas: history taking, investigations, diagnosis, and treatment. And they didn't just rate. They specifically counted defined failure modes, or errors. Okay. So independent blinded review, that adds a lot of weight. 8:34 What about the stats? Right.
They calculated things like relative risk reduction, RRR, how much the risk of an error decreased in the AI group, and number needed to treat, NNT, how many patients need the AI support to prevent one error. Crucially, to handle the real-world data complexities, like variation between clinicians, they used sophisticated stats: Fisher's exact test, generalized estimating equations, GEE, and modified Poisson regression. These help isolate the effect of the AI itself. 9:04 Got it. So quality was number one. What else? Second was use and usability. Pretty straightforward: clinician surveys asking about their experience, plus objective data like how long visits took or how long the clinical notes were. 9:15 And third, patient-reported outcomes. They tried to collect this via routine follow-up calls Penda already does, asking patients how they felt after about eight days. Plus, they tracked any serious patient safety reports, PSRs. Okay. Quite comprehensive. 9:29 So the big reveal. After all that careful setup and measurement, what did they find? Did AI Consult actually reduce errors significantly? Yes. Absolutely. 9:37 The results on error reduction were really quite striking and statistically significant. Meaningful improvements for the AI group compared to the non-AI group across the board. For history-taking errors, they saw a 31.8% relative risk reduction. That's nearly a third fewer errors in just gathering patient history. Wow. 9:54 For investigation errors, like ordering the right tests, a 10.3% RRR. Diagnostic errors saw a 16% RRR. And treatment errors, things like medication choices, showed a 12.7% RRR. Those are solid numbers. What does that mean in practical terms, like, with the NNT? 10:11 The NNTs were quite low, which is good. It means the tool is efficient. It was 18.1 for diagnostic errors. So support about 18 patients with AI Consult, prevent one diagnostic mistake. And even lower for treatment errors, 13.9.
Support about 14 patients to prevent one treatment error. 10:26 When you scale that up. Exactly. If Penda rolled this out across their roughly 400,000 annual visits, the estimates are it could prevent something like 22,000 diagnostic errors and nearly 29,000 treatment errors every single year. And interestingly, these error reductions were even stronger during the later active deployment phase, when they were really focusing on engagement and coaching, compared to the initial induction period, especially for history, diagnosis, and treatment. That's genuinely impressive impact potential. 10:56 Tens of thousands of errors avoided. But here's something I find really interesting. Did the tool just catch mistakes as they happened, or did it actually help clinicians get better over time? Did they see any evidence of learning? That is a fantastic question, and, yes, the data strongly suggests a learning effect was happening. 11:14 This is maybe one of the most exciting parts. They looked at the proportion of visits in the AI group that had an initial red alert, meaning the AI flagged a critical issue early on. That percentage actually dropped significantly during the study, from about 45% down to 35%. So clinicians were making fewer critical errors before the AI even flagged them? Precisely. 11:34 It suggests they were internalizing the feedback, improving their documentation, catching potential issues themselves proactively. It points towards the tool genuinely enhancing their intrinsic clinical practice, not just acting as a reactive crutch. Yeah. And another piece of evidence: the left-in-red rate. That's when an alert was shown, but the final record still had the red flag issue. 11:56 That rate also decreased significantly in the AI group compared to the shadow mode group. It shows they were actively paying attention to the critical alerts and changing their plans based on the AI's guidance. That's huge.
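The scaling estimate quoted above can be checked with simple arithmetic. The sketch below uses the standard definitions of RRR and NNT and, for the projection, assumes the NNT applies uniformly to every one of the roughly 400,000 annual visits, which is a simplification. The function names are illustrative, and the 0.20 and 0.15 error rates in the usage lines are made-up numbers to show the formulas, not figures from the study.

```python
def relative_risk_reduction(risk_control: float, risk_treated: float) -> float:
    """RRR = (control risk - treated risk) / control risk."""
    return (risk_control - risk_treated) / risk_control


def number_needed_to_treat(risk_control: float, risk_treated: float) -> float:
    """NNT = 1 / absolute risk reduction."""
    return 1.0 / (risk_control - risk_treated)


def errors_prevented(annual_visits: int, nnt: float) -> float:
    """Projected errors avoided per year if one error is prevented per NNT visits."""
    return annual_visits / nnt


if __name__ == "__main__":
    # Figures quoted in the episode: NNT of 18.1 (diagnosis) and 13.9 (treatment).
    print(round(errors_prevented(400_000, 18.1)))  # roughly 22,000
    print(round(errors_prevented(400_000, 13.9)))  # roughly 29,000

    # Hypothetical error rates, only to illustrate the definitions.
    print(relative_risk_reduction(0.20, 0.15))
    print(number_needed_to_treat(0.20, 0.15))
```

Dividing 400,000 by 18.1 gives about 22,100, and dividing by 13.9 gives about 28,800, which lines up with the "something like 22,000" and "nearly 29,000" figures mentioned.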
It's moving beyond just error correction to actually improving clinician skill and decision making. What about the people involved? How did the clinicians feel about it? 12:16 And what about the patients? Clinician feedback was overwhelmingly positive. Those in the AI group reported much higher satisfaction. They really felt AI Consult helped them deliver better care. It got a net promoter score of 78, which is exceptionally high for any system in health care, frankly. 12:32 Wow. Yeah. And qualitatively, you heard clinicians calling it things like an active clinical partner, something that helped them access real-time, evidence-based practices. Okay. So they liked it. 12:42 Did it slow them down, though? Interestingly, yes. Slightly. The median visit time for the AI group was a bit longer, about 16.4 minutes versus 13 minutes for the non-AI group. Is that a problem? 12:55 Well, here's the crucial part. That extra time was directly linked to them interacting with the AI feedback, and that interaction correlated directly with fewer treatment errors. So, yes, a bit more time, but it seemed to be time well spent improving care quality. Plus, their clinical notes were consistently longer and more detailed too. Better documentation overall. 13:16 Okay. That context is important. What about the ultimate goal, patient outcomes? Right. The patient outcomes piece. 13:22 Here, the study did not find a statistically significant difference in the general patient-reported outcomes, like patients saying they felt better after eight days. Oh, okay. So it didn't make patients feel better. Well, it's not quite that simple. It's really important to understand the limitations. 13:37 The study just wasn't designed or powered sufficiently to detect that kind of effect reliably. Sample size, maybe the measure itself, and they also had quite a bit of missing data for that specific outcome.
So we can't definitively say it didn't help from the patient perspective based on this data alone. Exactly. But what is critically important is that there was no evidence of AI Consult causing active harm. 14:01 No patient safety reports linked it to negative events. In fact, they noted several serious cases, including two where patients sadly died, where harm might potentially have been prevented if the AI Consult guidance had been available or followed. So it acted as a safety net, potentially averting harm even if the positive patient-reported benefit wasn't clearly measurable in this specific study setup. Precisely. A robust safety signal even without the clear outcome signal yet. 14:28 Okay. So this really paints a picture. It's not just about having a powerful AI model. It's the design, the integration, the iteration, the active deployment strategy. Looking forward, what does this study tell us about successfully getting these kinds of tools into real clinics? 14:46 What are the key ingredients? I think this work really highlights three absolutely essential components. Mhmm. And they all have to work together. First, you obviously need capable models. The LLMs themselves are getting incredibly powerful, moving beyond just benchmarks, as we've seen. 15:00 That's becoming more available. Second, and this is maybe the hardest part, is clinically aligned implementation. It's all about that deep workflow integration, that user-centered iterative design process. All that effort on prompt engineering, figuring out the exact right moment to trigger an alert in the EMR, tackling latency to get under three seconds, those details matter immensely. It has to fit the clinical reality. 15:22 It's not just dropping tech onto doctors? Not at all. And third is active deployment. You can't just switch it on and walk away. You need strategies to build real understanding and buy-in from clinicians.
15:34 Things like having peer champions for connection, using metrics like that left-in-red rate for personalized coaching and measurement, maybe even incentives. The study clearly showed the impact was way bigger during that active deployment phase. Technology needs change management and support to truly succeed. A powerful trifecta. Capable models, aligned implementation, and active deployment. 15:56 That seems like a solid framework. So for you, the listener, wrapping this deep dive up, what's the big takeaway here? And maybe what's a provocative thought to leave you with? Well, connecting the dots, I think this study gives us a really practical, effective, and, importantly, safe blueprint for how we can actually embed these AI systems into demanding clinical workflows. Yeah. Especially primary care, which is foundational to health care worldwide. 16:20 It really underscores that the biggest hurdle for health AI right now isn't necessarily building smarter models. It's closing that model implementation gap, making the tech work reliably and beneficially in the real world, day in, day out. Okay. So here's a provocative thought for you to mull over then. If AI can learn from our collective clinical experience, identify patterns we might miss, and essentially teach us through tools like this to make fewer mistakes, what entirely new level of human-AI collaboration might we unlock as these deeply integrated learning systems become not just experimental, but the expected standard of care? 16:53 What does that future look like? Thank you for listening in. Subscribe and follow Colaberry on social media, links in the description, and check out our website, www.colaberry.ai backslash podcast for more insights like this.