Updated March 5, 2026
0:00 Welcome to the Colaberry AI podcast, brought to you by Colaberry AI Research Labs and Carl Foundation. Ever found yourself buried in research, wishing you had an AI that could actually think, synthesize, and even learn how to research more effectively, just like a human expert? Today, we're doing a deep dive into a truly fascinating breakthrough at the forefront of AI: advanced deep research agents powered by large language models, or LLMs. Precisely. We're going to unpack a groundbreaking new framework from Google Cloud AI Research called the Test-Time Diffusion Deep Researcher, or TTD-DR.
0:35 And this isn't just about, you know, throwing more computational power or more data at an LLM. No. It's a fundamental rethinking of how these agents tackle complex, long-form research report generation.
0:46 It draws profound inspiration from the iterative, adaptive nature of human cognitive processes. That's a powerful concept. So, okay, let's unpack this TTD-DR. Our mission for you today is to explore the current limitations of AI in deep research, the ingenious human-inspired ideas that potentially crack that code,
1:02 the technical mechanisms that make TTD-DR a reality, and, of course, the impressive results it's achieving. We'll get into the precise details, the nitty-gritty of its diffusion process and self-evolutionary components. We wanna understand not just what they do, but how they work at a technical level. Right. And to start, we should probably talk about why this is needed.
1:22 Mhmm. Because deep research agents, or DR agents, powered by LLMs have seen rapid advancements in capability. Oh, yeah. Things like idea generation, information gathering with search tools, even executing analyses.
1:35 Exactly. But they consistently hit a performance plateau when the task complexity really escalates. A plateau? Yeah. It's not a minor slowdown.
1:44 It's a significant barrier to generating truly comprehensive, high-quality, long-form research reports, the kind of stuff that approaches human expertise. So what's the core issue causing this bottleneck when we're talking about generating a really complex, multifaceted research report? Well, the existing methods rely on these, let's say, generic test-time scaling algorithms, things like chain-of-thought prompting. Right. Or best-of-n sampling, simple self-refinement loops, that kind of thing.
2:12 Exactly. And they do improve performance to a degree. Sure. But they just fall short for these intricate, multi-step research tasks. It's like trying to navigate a dense jungle with just a basic map, maybe.
2:22 That's a good way to put it. The key limitation, as identified in the source material, is the lack of a deliberate design that truly mimics human cognitive behavior in writing and research. Current agents often just compile various algorithms and tools, maybe in a somewhat linear fashion. Without that dynamic back and forth. Precisely.
2:42 Without a principled, dynamic draft, search, and feedback mechanism. And this absence leads to critical issues. Like what? Like a loss of global context over long research trajectories, and a failure to identify and address critical dependencies between different pieces of information as the research progresses. Imagine trying to write a complex academic paper without ever revisiting your outline.
3:06 Or going back to earlier sections to check consistency. Or even searching for new information based on something you just wrote. That's the challenge these current agents face. That's a perfect analogy. And here's where TTD-DR gets really interesting.
3:19 Right? Because the framework is inspired by how you and I, as humans, actually conduct research and write. When we tackle complex topics, we don't just follow a rigid, linear path.
We plan, sure, but then we draft, and then we engage in multiple, often recursive revision cycles. Absolutely.
3:38 It's often a messy but highly effective process of continuous refinement. Think about it. We start with a plan, maybe a high-level one, generate a rough draft, then enter these revision phases. And during revision, we're constantly looking for more info, aren't we? Exactly.
3:53 Crucially, human writers actively seek out supplementary information. We go back to our sources, conduct new searches, maybe consult experts, all to refine and strengthen our arguments. This constant interplay between drafting and external retrieval is key. And the paper draws a parallel here, a striking resemblance actually, between this natural human pattern and the sampling process in a diffusion model augmented by retrieval.
4:16 Okay. Diffusion models, like taking a noisy, blurry image and iteratively refining it, step by step, into a clear one. Precisely that analogy. That's the core idea for TTD-DR. You start with a rough, perhaps noisy initial draft of a report and then make it more defined, more comprehensive, more accurate with each iterative step.
4:38 Like sharpening the focus. Exactly. So the TTD-DR framework formally conceptualizes this research report generation as an iterative diffusion process. It begins with a preliminary, often noisy draft. You can think of it as an updatable skeleton or a very early version of the final report.
4:55 So this draft isn't static? Not at all. It serves as an evolving foundation, dynamically guiding the subsequent research direction. This draft then undergoes iterative refinement through this denoising process. Just like the image diffusion model removing noise.
5:09 Right. Adding detail and coherence with each step. So it's not just about producing a final report from some initial searches and calling it done. The report itself evolves, and its current state actively guides the next step in research and information gathering.
And this denoising, it's not just internal LLM stuff.
5:26 It uses outside info. That's critical. It's dynamically informed by a retrieval mechanism. It actively pulls in external, real-time information at each step, making sure the most current and relevant data gets integrated. Wow.
5:39 Okay. This draft-centric design sounds like a big shift. It is. It's critical because it makes the report writing process more timely and coherent. By having this continuously evolving draft as the central artifact, it significantly reduces the information loss.
5:55 That you mentioned plagues other systems, where context degrades over long searches. Exactly. The evolving report acts as a persistent memory and, like, a guiding compass for the whole process. That makes perfect sense. So how does this diffusion process actually operate within the agent?
6:10 You mentioned a backbone architecture. Can you walk us through the basic stages? Absolutely. To put these advanced mechanisms, the diffusion and self-evolution, into context, TTD-DR uses a robust three-stage backbone deep research agent. Okay.
6:24 Stage one. First, we have stage one, research plan generation. Here, a dedicated unit LLM agent creates a structured, comprehensive plan. The blueprint. Pretty much.
6:37 It outlines the key areas, sections, and subtopics for the final report. It provides that initial high-level strategic direction. So, like a human researcher, it outlines the approach first. Mhmm. What's next?
6:48 Then we move into stage two, iterative search and synthesis. This is the core loop where the bulk of the research happens, and it contains two critical sub-agents. Two parts to this stage. Yes. First, stage two a, search question generation.
7:02 Based on that initial plan, the user's query, and, importantly, the context from previous search iterations. And maybe the evolving draft itself? Yes. Especially when the diffusion process is active.
This sub-agent formulates highly targeted search queries, not just simple keywords, but intelligent questions designed to extract specific, relevant information.
7:21 Mhmm. Following that, stage two b, answer searching. This sub-agent uses a retrieval-augmented generation, or RAG-like, system, often grounding its searches with a powerful tool like Google Search. To find relevant documents. Yeah.
7:34 Right. But, crucially, it also synthesizes precise answers from that raw data. It doesn't just give you a list of links. It provides a synthesized piece of information. And this loop repeats.
7:44 Exactly. This whole loop of question generation and answer synthesis continues iteratively until the research plan is adequately covered, or maybe it hits a predetermined limit. So it's constantly asking, searching, synthesizing, asking, searching, synthesizing. Got it. And finally, stage three.
7:59 Stage three is final report generation. Here, a final unit LLM agent takes all that gathered information, the structured plan from stage one and the entire series of refined question-answer pairs from stage two, and synthesizes it. Into the final product. Into a comprehensive, coherent, and polished final research report. This stage brings everything together.
8:20 That clarifies the structure nicely. Now, how do those human-inspired ideas, the diffusion and self-evolution, enhance this backbone? Let's get into the technical details of those two main mechanisms. First up, component-wise self-evolution. What's that about?
8:35 Okay. So beyond refining the overall report through that diffusion process, TTD-DR also significantly enhances the quality of each individual component within its workflow. So not just the final report, but every little step along the way, like generating a search query. Exactly. Every granular piece, generating the plan, a search query, synthesizing an answer, even initial drafts, gets smarter and more accurate through an evolutionary process.
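The three-stage backbone described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function `run_backbone`, its callable parameters, and the toy stand-ins are all hypothetical, and each callable is a placeholder for what would really be an LLM or search call.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # one (search question, synthesized answer) pair


def run_backbone(
    query: str,
    make_plan: Callable[[str], List[str]],
    next_question: Callable[[str, List[str], List[QA]], str],
    search_answer: Callable[[str], str],
    write_report: Callable[[str, List[str], List[QA]], str],
    max_steps: int = 5,
) -> str:
    """Hypothetical sketch of the plan -> search loop -> report pipeline."""
    plan = make_plan(query)                    # Stage 1: research plan
    history: List[QA] = []
    for _ in range(max_steps):                 # Stage 2: iterative search and synthesis
        question = next_question(query, plan, history)   # Stage 2a
        if not question:                       # assumption: "" means the plan is covered
            break
        answer = search_answer(question)       # Stage 2b
        history.append((question, answer))
    return write_report(query, plan, history)  # Stage 3: final report


# Toy stand-ins so the sketch runs without any model or search engine.
def toy_plan(query: str) -> List[str]:
    return ["background", "methods"]


def toy_question(query: str, plan: List[str], history: List[QA]) -> str:
    covered = len(history)
    return f"What about {plan[covered]}?" if covered < len(plan) else ""


def toy_answer(question: str) -> str:
    return f"answer to: {question}"


def toy_report(query: str, plan: List[str], history: List[QA]) -> str:
    return "\n".join(f"{q} -> {a}" for q, a in history)


report = run_backbone("topic", toy_plan, toy_question, toy_answer, toy_report)
print(report)
```

Passing the stages in as callables keeps the loop itself independent of any particular model or search tool, which is a design choice of this sketch rather than something the episode specifies.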
9:01 It sounds like a continuous learning loop for each subtask. Yeah. How does this component-wise self-evolutionary algorithm actually work? Well, this mechanism, as the paper illustrates, is applied to those critical components we mentioned. The goal is to intelligently explore diverse knowledge spaces and mitigate information loss at the level of each unit agent.
9:23 This ensures high-quality context feeds into the main diffusion process. Okay. How, practically? Practically, it starts by generating multiple diverse variants of an output for a specific component. Like different ways to phrase a search query, or different potential answers?
9:37 Precisely. Say, several possible answers to a single search query. And it achieves this diversity by using varied LLM parameters, things like temperature or top-k sampling. Tweaking the LLM's randomness or creativity settings. Right. That allows the system to explore a much wider search space for potentially more valuable or nuanced information than you'd get from a single deterministic output.
10:01 So it generates multiple hypotheses, not just one best guess. Then what happens to these different versions? Each variant then receives environmental feedback, and this comes from an LLM-as-a-judge. An AI judging another AI's work. Kinda like an AI acting as a critical peer reviewer.
10:17 Exactly. Yeah. This judge model provides quantifiable fitness scores, metrics like helpfulness and comprehensiveness, along with detailed textual critiques pointing out what could be better. Okay. So it gets rated and gets notes.
10:28 Yep. And based on that feedback, these variants then undergo revision steps in a continuous loop. They adapt and improve. Learning from the critique. Correct.
10:38 Finally, multiple revised variants, the ones that performed best according to the judge, are merged into a single high-quality output through a crossover process. Crossover, like in genetics. Conceptually similar.
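As a rough illustration of the generate, judge, revise, and crossover loop just described, here is a minimal sketch. Everything in it is an assumption made for illustration: `self_evolve`, its parameters, the temperature values, and the toy functions are hypothetical, and in a real system each callable would wrap an LLM call.

```python
from typing import Callable, List, Sequence, Tuple


def self_evolve(
    prompt: str,
    generate: Callable[[str, float], str],
    judge: Callable[[str, str], Tuple[float, str]],
    revise: Callable[[str, str, str], str],
    merge: Callable[[str, List[str]], str],
    temperatures: Sequence[float] = (0.2, 0.7, 1.0),
    rounds: int = 2,
    keep: int = 2,
) -> str:
    """Hypothetical sketch of component-wise self-evolution."""
    # 1. Diverse initial variants, one per sampling temperature.
    variants = [generate(prompt, t) for t in temperatures]

    # 2. Feedback and revision: the judge's critique drives each rewrite.
    for _ in range(rounds):
        revised = []
        for variant in variants:
            _score, critique = judge(prompt, variant)
            revised.append(revise(prompt, variant, critique))
        variants = revised

    # 3. Crossover: merge the highest-scoring revised variants into one output.
    ranked = sorted(variants, key=lambda v: judge(prompt, v)[0], reverse=True)
    return merge(prompt, ranked[:keep])


# Toy stand-ins so the sketch runs without a model.
def toy_generate(prompt: str, temperature: float) -> str:
    return f"{prompt}@{temperature}"


def toy_judge(prompt: str, text: str) -> Tuple[float, str]:
    return float(len(text)), "add more detail"  # (fitness score, critique)


def toy_revise(prompt: str, text: str, critique: str) -> str:
    return text + "+"


def toy_merge(prompt: str, texts: List[str]) -> str:
    return " | ".join(texts)


best = self_evolve("query", toy_generate, toy_judge, toy_revise, toy_merge)
print(best)
```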
The crossover effectively consolidates the best information and insights from multiple evolutionary paths into one superior output for that component. It's like an internal optimization loop for every key decision the agent makes.
11:04 That's remarkably sophisticated, like having an internal brainstorming session and peer review for every single step. Okay. Let's move to the second core mechanism: report-level denoising with retrieval. You mentioned the diffusion model analogy for refining images. How does that apply specifically to a research report?
11:20 Right. This mechanism is all about integrating the agent's findings in a more timely and coherent way, constantly guiding the research direction and maintaining that global context. As we discussed, an LLM generates an initial draft report. That's the noisy starting point. Then this current draft report is fed directly back into stage two a. The search question generation stage.
11:41 Exactly. Into stage two a of the backbone workflow. And this is crucial. The evolving draft itself dictates the next search query. So the report isn't just a passive output container.
11:52 It's an active participant, like an internal compass, pointing the agent towards what information it needs next based on what it already has. That's a great way to put it. It's highly dynamic. And after the search question in stage two a and the synthesis in two b, how does that new info get back into the report? So after a synthesized answer is obtained in stage two b using that RAG-like system, this new information is used to revise the current report draft.
12:17 How does it revise it? Adding things, changing things? It could be adding new details, correcting inaccuracies it found, verifying existing information, or even reorganizing sections based on the new insights. This forms a continuous feedback loop. The draft guides the search.
12:32 And the search refines the draft.
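The draft-guides-search, search-refines-draft loop can be sketched as below. This is only an illustration under stated assumptions: `denoise_report`, its callable parameters, and the "[?]" gap marker used by the toy functions are hypothetical, and real versions of these callables would be LLM and retrieval calls.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # one (search question, synthesized answer) pair


def denoise_report(
    query: str,
    first_draft: Callable[[str], str],
    question_from_draft: Callable[[str, str, List[QA]], str],
    search_answer: Callable[[str], str],
    revise_draft: Callable[[str, str, str], str],
    max_steps: int = 3,
) -> Tuple[str, List[QA]]:
    """Hypothetical sketch of report-level denoising with retrieval."""
    draft = first_draft(query)          # the "noisy" preliminary draft
    history: List[QA] = []
    for _ in range(max_steps):
        # Stage 2a: the current draft decides what to ask next.
        question = question_from_draft(query, draft, history)
        if not question:                # assumption: "" means nothing left to ask
            break
        # Stage 2b: retrieve and synthesize an answer.
        answer = search_answer(question)
        history.append((question, answer))
        # Denoising step: fold the new answer back into the draft.
        draft = revise_draft(draft, question, answer)
    return draft, history


# Toy stand-ins: "[?]" marks a gap the draft still needs to fill.
def toy_first_draft(query: str) -> str:
    return "A: [?]; B: [?]"


def toy_question(query: str, draft: str, history: List[QA]) -> str:
    return f"Fill gap {len(history) + 1}" if "[?]" in draft else ""


def toy_search(question: str) -> str:
    return "fact"


def toy_revise(draft: str, question: str, answer: str) -> str:
    return draft.replace("[?]", answer, 1)


final_draft, qa_history = denoise_report(
    "topic", toy_first_draft, toy_question, toy_search, toy_revise
)
print(final_draft)
```

The difference from a plain search loop is that the question generator sees the evolving draft, not just the plan and past answers, so each revision changes what gets asked next.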
This dynamic interplay ensures the report stays coherent, the research stays focused, and information is integrated incrementally, step by step. And this just keeps going? It repeats until the search concludes. The result is a final denoised report based on all the historical search answers and revisions made along the way.
12:52 And the synergy between this report-level denoising and the component-wise self-evolution is, as the paper highlights, really critical for achieving those high-quality, comprehensive research outcomes. This all sounds incredibly ambitious, making an AI truly iterative, self-improving, almost reflective. But the proof is in the pudding, right? How do we actually measure if it works better than what's already out there?
13:17 The paper details a pretty comprehensive experimental framework. They did. They employed a multifaceted approach because evaluating these long-form LLM outputs is, frankly, challenging. Yeah. It's not like checking a math problem.
13:28 Exactly. So for comprehensive long-form responses like these research reports, they focused on two primary metrics assessed by both human and AI evaluators. Okay. What are they? First, helpfulness.
13:40 This looks at how well the report satisfies the user's actual intent, how easy it is to understand (its fluency and coherence), its factual accuracy, and the appropriateness of the language used. Makes sense. And the second? Second, comprehensiveness. This metric is defined pretty simply.
13:56 The absence of missing key information. Is the report thorough and complete? These give you that robust qualitative assessment. And to quantify these somewhat subjective qualities, they used side-by-side quality comparison, pairwise evaluation? Yes.
14:12 Exactly. Where human raters directly compare two reports, say, one from TTD-DR and one from a baseline, and judge their preference on a granular scale, like report A is much better, slightly better, about the same, and so on. Okay.
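To make the side-by-side scale concrete, here is one way pairwise ratings could be turned into a win rate. This is a sketch under assumptions: the five label names are invented for illustration, and counting "about the same" as half a win is a common convention, not something the episode attributes to the paper.

```python
from typing import Dict, List

# Assumed five-point scale, from the candidate report's point of view.
# Counting a tie as half a win is an assumption of this sketch.
PREFERENCE: Dict[str, float] = {
    "much_better": 1.0,
    "slightly_better": 1.0,
    "same": 0.5,
    "slightly_worse": 0.0,
    "much_worse": 0.0,
}


def win_rate(ratings: List[str]) -> float:
    """Share of pairwise comparisons won by the candidate (ties count half)."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(PREFERENCE[r] for r in ratings) / len(ratings)


# Made-up ratings for four comparisons: (1 + 1 + 0.5 + 0) / 4 = 0.625
example_ratings = ["much_better", "slightly_better", "same", "slightly_worse"]
rate = win_rate(example_ratings)
print(f"win rate: {rate:.3f}")
```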
And what about for tasks that aren't long reports, maybe shorter answers? Right.
14:29 For multi-hop, short-form question answering tasks, they used correctness, basically comparing the LLM-generated answer to established ground truths. Is it factually right or wrong? And they used AI judges too? Yes. And this was crucial for scale.
14:42 To evaluate long-form outputs efficiently, they leveraged an LLM-as-a-judge model. Specifically, they used Gemini 1.5 Pro. But how reliable is an AI judging AI? That's the key question. They meticulously calibrated this AI judge.
14:56 They had it rate 200 reports from their agents and compared its ratings against human ratings for the same reports, specifically in comparisons against OpenAI Deep Research outputs. They found strong alignment, ensuring the AI judge was reliable and reflected human preferences well. This allowed for much broader and faster evaluation than humans alone could manage. That's smart. And they tested this on some tough benchmarks, didn't they?
15:22 Not just simple stuff. Absolutely. They didn't shy away from challenges. They used datasets like LongForm Research and DeepConsult for the comprehensive report generation tasks, which require synthesizing huge amounts of info. And for the harder reasoning tasks?
15:36 For multi-hop search and reasoning, they used Humanity's Last Exam (HLE) and GAIA. These datasets have queries that sometimes need up to 20 distinct search steps to find the answer. Wow. 20 steps. That's way beyond a simple search.
15:49 That requires serious reasoning and keeping track of information over time. Exactly. It demands extensive reasoning and really intricate information retrieval strategies. So with all that rigorous evaluation, what did the results actually show? Did TTD-DR live up to the hype?
16:05 Well, the results reported are quite compelling. They seem to demonstrate significant advancements across the board.
The source material indicates that TTD-DR consistently achieves state-of-the-art results across all these benchmarks. Outperforming existing agents? Substantially outperforming existing deep research agents, including strong baselines like OpenAI's Deep Research agent.
16:25 Okay. Give me some numbers. How much better? On those LongForm Research and DeepConsult tasks, the ones focused on generating comprehensive reports, TTD-DR achieved impressive win rates of 69.1% and 74.5%, respectively. Win rates in the side-by-side comparisons?
16:41 Yes. Against the OpenAI Deep Research baseline. That's a substantial margin of preference, meaning raters preferred the TTD-DR output roughly 70% of the time in those comparisons. That's significant. And on the multi-hop QA datasets, HLE and GAIA?
16:56 It also outperformed there, showing absolute score improvements of 4.8%, 7.7%, and 1.7% across those short-form evaluations against the same OpenAI baseline. While those percentages might seem smaller, these are tough benchmarks, so even modest gains are noteworthy. Interesting. Did they break down which parts of TTD-DR contributed most? The ablation study?
17:18 Yes. The ablation study was quite revealing. It systematically removed parts of the TTD-DR system to see the impact. What did it show? It showed that, okay, even advanced LLMs with just simple search tools offer some improvement over no search at all, but they significantly underperformed the backbone DR agent, the three-stage structure without the fancy new mechanisms.
17:40 So the basic structure itself is already an improvement. Right. Then adding just the self-evolution algorithm to that backbone leads to substantial gains. For instance, on LongForm Research, just adding self-evolution achieved a 60.9% win rate against the OpenAI baseline, which shows its standalone power. But the real magic happens when you put it all together.
17:58 Precisely.
When diffusion with retrieval is incorporated alongside self-evolution, TTD-DR achieves substantial additional gains across all benchmarks. That combination solidifies its superior performance. And you mentioned efficiency, the Pareto frontier. Yes.
18:14 Figure seven in the source shows the Pareto frontier. It plots performance gain against computational latency, basically. Okay. It shows TTD-DR achieves greater performance gains for each unit increase in latency compared to simpler methods. It means it's not just more powerful, but it's also exceptionally efficient in how it uses computation during this test-time scaling.
18:34 It gets more improvement for the resources it uses. Better results, more efficiently. Yeah. That's the sweet spot. Okay.
18:40 Let's dig a bit deeper. Why are these mechanisms so effective? What did the analysis show about their specific impact? Good question. The analysis suggests that self-evolution significantly increases the complexity and richness of the intermediate outputs, like the search questions and the synthesized answers.
18:56 How do they measure richness? They measured it quantitatively, using metrics like the number of key points extracted from the generated, more intricate inquiries at each step. Self-evolution directly enriches the information being gathered. So better ingredients for the final report? Essentially, yes. Yeah.
19:15 It prevents the agent from just asking simple questions or getting stuck in shallow research loops. And what about the denoising with retrieval? You mentioned novelty and timeliness. Right. The data shows it dramatically increases search query novelty.
19:30 Because the revised report guides the next search, it pushes the agent to explore new, previously unasked questions related to the evolving content. So it avoids just asking variations of the same question over and over. Exactly.
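The episode doesn't say how the paper computes query novelty, so the following is purely an illustrative stand-in: it treats a query as novel when its word overlap (Jaccard similarity) with every earlier query stays below a threshold. The function names, the threshold of 0.5, and the example queries are all assumptions.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two queries."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def novelty_rate(queries: List[str], threshold: float = 0.5) -> float:
    """Fraction of queries that are not near-duplicates of any earlier query."""
    if not queries:
        return 0.0
    seen: List[str] = []
    novel = 0
    for query in queries:
        # Novel if it overlaps less than `threshold` with everything asked before.
        if all(jaccard(query, previous) < threshold for previous in seen):
            novel += 1
        seen.append(query)
    return novel / len(queries)


# Made-up queries: the second is a near-repeat of the first (overlap 3/4).
example_queries = [
    "diffusion model sampling",
    "diffusion model sampling steps",
    "retrieval augmented generation",
]
novelty = novelty_rate(example_queries)
print(f"novelty: {novelty:.3f}")
```

With these toy queries, two of the three count as novel, so the rate is about 0.667; an agent that keeps rephrasing the same question would score much lower.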
It ensures the research process is continuously pushing into new information frontiers rather than retreading old ground.
19:50 And the timeliness aspect? Furthermore, it enables timely incorporation of information. The analysis shows that TTD-DR integrates a much higher percentage of the final report's key information earlier in the search process compared to methods without it. Like, how much earlier? For example, the paper mentions that at just step nine in a multi-step search process, TTD-DR might have already incorporated over 50% of the information that eventually makes it into the final report.
20:16 Wow. That's fast integration. It is. This leads to faster knowledge preservation and a more efficient learning curve as the agent progresses. This rapid, coherent integration seems to be a key differentiator driving its superior performance.
20:30 So, tying this all together, what does this mean for the bigger picture of AI, and for you, our listener, maybe trying to use these tools? Well, the TTD-DR framework offers a powerful new paradigm. It's a new way for LLM-powered agents to generate genuinely comprehensive, coherent, and accurate research reports. It seems particularly effective at addressing those complex multi-hop queries that current state-of-the-art LLMs often struggle with. Especially those relying just on their internal knowledge or maybe simpler, less integrated search tools.
21:00 Exactly. And this whole approach of test-time scaling, focusing on iterative refinement and self-optimization without needing more training data, that seems significant. It really is. It's potentially a blueprint for building more capable and, importantly, more reliable AI research companions. Think about applications across diverse industries.
21:19 Scientific discovery, market analysis, biomedical research. Any field requiring deep synthesis of information. It really pushes the boundaries of what we might consider autonomous, intelligent research by an AI.
But it's not the final frontier, right? The paper mentioned future directions.
21:37 Indeed. They acknowledge areas for future exploration. Currently, TTD-DR primarily focuses on using search tools for getting information. What else could it use? Future work could integrate other crucial tools, things like active web browsing, interacting with complex APIs, maybe even tools for running code or simulations as part of the research.
21:56 That would make these agents even more versatile. Absolutely. It would further expand their capabilities. But even as it stands, this deep dive has really given us a fascinating look into what might be the next generation of AI research, combining these human-inspired cognitive processes with advanced machine learning techniques. Thank you for listening in.
22:14 Subscribe and follow Colaberry on social media (links in the description), and check out our website, www.colaberry.ai/podcast, for more insights like this.