Updated March 25, 2026
0:00 Welcome to the Colaberry AI podcast brought to you by Colaberry AI Research Labs and Carl Foundation. And we are so glad you're joining us for this deep dive today. We really are. So, pull up a chair because today we're firmly putting on our engineer hats. Oh, yeah. 0:14 We are getting deep into the weeds today. Exactly. We are looking at a fundamental shift in how artificial intelligence actually operates under the hood. So if you've been feeling like your AI tools are hitting, you know, a brick wall when you try to get them to learn your personal preferences mid conversation, you aren't crazy. No. 0:33 You're definitely not. It's a known structural issue. Right. You are basically experiencing the difference between a static map and a real time GPS. A paper map is this massive brilliant snapshot of data. 0:45 But if a bridge is out, the map doesn't update. It just, well, it just sits there. It's completely blind to real time changes. Exactly. And that's essentially where our most powerful large language models have been living. 0:56 They are brilliant static maps. But today, our mission is to examine how they finally become a dynamic GPS. We have a fascinating YouTube transcript from the AI Revolution channel to guide us. Yeah. The video is titled Google Just Dropped Bayesian, AI That Evolves in Real Time, and it is incredibly technical. 1:17 It really is. We are focusing strictly on the specific methods, the results, and the architecture of this next generation of AI infrastructure. We're talking four major breakthroughs today. Right. So we're gonna figure out how engineers are teaching neural networks to actually think in real time probabilities. 1:33 And then how they're shrinking those massive brains down to fit entirely inside your pocket. Plus the architecture of letting those pocket sized brains execute complex software on their own. Yeah. And finally, the terrifying security risks of giving that agent the keys to the enterprise. 
It's a massive technical leap across the board. 1:49 But to start, before an AI can take action, it needs to be able to reason dynamically. Right. And to understand why that's even necessary, we have to look at the baseline problem with large language models or LLMs. Because they're just, they're great mimics. Right? 2:04 Exactly. They're phenomenal mimics, but they exhibit a massive structural failure in probabilistic reasoning. So to isolate this failure, researchers at Google set up a highly controlled five round interaction scenario. Kind of like a stress test. Exactly like a stress test. 2:21 They built a flight assistant environment. The AI basically had to recommend flights to a user based on specific competing features. So things like ticket price, total travel duration, and the number of stops. And it's important to note they weren't just testing lightweight, you know, experimental models on this. No. 2:38 Not at all. The testing lineup included the absolute heavyweights of the industry. We're talking Gemini 1.5 Pro, GPT-4.1 Mini, Llama 3 70B, and Qwen 2.5 32B. The biggest models available. And behind the scenes of this test environment, the researchers programmed a mathematically rigorous reward function. 2:56 A reward function. Right. Think of this function as the user's true unstated preferences. Maybe you, as the user, heavily prioritize saving money above all else. Or maybe you value a nonstop flight and you don't care about the cost at all. 3:11 Exactly. In a functional dynamic reasoning system, the AI should refine its understanding of that hidden reward function after every single choice the user makes. It should narrow down the possibilities. Right. Until it perfectly understands what you want. 3:25 But the data from the deep dive showed something entirely different. The researchers discovered a phenomenon they termed the one and done plateau. The one and done plateau. It's such a great term for it. It really is. 
3:37 Basically, most of these massive billion parameter LLMs improved their flight predictions slightly after the very first round of interaction. Just that first round? Yeah. But then, they completely flatlined. By round three, four, and five, they completely failed to update their internal beliefs about the user even as the user provided more and clearer evidence of what they actually wanted. 3:58 They just stubbornly stuck to their initial guess. Okay. Let's unpack this for a second. If these LLMs already have billions of parameters and vast seemingly infinite knowledge bases trained on the entire Internet... Which they do. Right. 4:13 Why are they so inherently resistant to updating their beliefs mid conversation? Well, what's fascinating here is that the resistance is baked into their very architecture. An LLM is fundamentally designed for pattern matching against its original training distribution. So it's looking backward, not forward. Exactly. 4:32 Its attention mechanisms look at the prompt you just typed, but its underlying weights, the actual physical neural pathways dictating its beliefs, are locked. They don't change after training. They don't. The model simply doesn't possess a structural mathematical mechanism to dynamically calculate shifting probabilities on the fly. Wow. 4:52 So to fix this hardware level stubbornness, the researchers had to throw out traditional training altogether and introduce something called Bayesian teaching. Bayesian teaching. So they fundamentally changed the target the AI was trying to hit. Because, traditionally, AI models are trained using what the industry calls Oracle teaching. You give the AI a prompt, and you give it the perfectly correct final answer to replicate. 5:14 Which is just mimicry. Right. It's basically like giving a student the final answer key to a complex calculus exam and just telling them to memorize it. And memorization doesn't teach you how to solve a novel problem. 
So to show the models what actual learning looks like, researchers built a purely symbolic system, a Bayesian assistant. 5:34 And this wasn't a neural network at all. Right? No. Not at all. It relied strictly on Bayes' rule, which is a foundational mathematical theorem for updating probabilities based on new evidence. 5:43 I love the analogy for this. Yeah. Think of Bayes' rule like a detective updating a suspect list. Every time the detective finds a new clue like, say, the user choosing a slightly more expensive but faster flight, the mathematical probability of the user being purely budget conscious drops. And the probability of them valuing time spikes. 6:01 Exactly. The Bayesian assistant recalculates this exact math with every single click. So Bayesian teaching is like showing the AI student the messy scratch pad of how the mathematician actually updates their guesses, crosses out errors, and recalculates the odds in real time. Rather than just handing them the answer key. Right. 6:20 They force the neural networks to watch and learn from this strict mathematical system. Through a highly specific supervised fine tuning protocol, they trained smaller, highly efficient models, specifically Gemma 2 9B and Llama 3 8B, to imitate this exact probabilistic reasoning process step by step. And the results were crazy. Incredible results. 6:42 Oh. By learning to operate under uncertainty and mathematically weigh evidence, these Bayesian trained models ended up aligning with the mathematically optimal strategy roughly 80% of the time. Completely shattering that one and done plateau. Totally shattered it. And the source material highlights that this wasn't just them memorizing flight data. 7:01 They tested for broad generalization. They took these models, which were, you know, only trained on synthetic flight data with four features, and dropped them into tasks with eight complex features. 
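The detective-style update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual Bayesian assistant: the three preference profiles, their weights, the softmax choice model, and the `temperature` value are all hypothetical, and each flight is a made-up tuple of normalized (price, duration, stops).

```python
import math

# Hypothetical preference profiles: weights on (price, duration, stops).
# Negative weights mean "less is better". All numbers are illustrative.
HYPOTHESES = {
    "budget": (-1.0, -0.1, -0.1),
    "time": (-0.1, -1.0, -0.3),
    "nonstop": (-0.1, -0.3, -1.0),
}


def utility(weights, flight):
    """Linear reward: weighted sum of the flight's normalized features."""
    return sum(w * x for w, x in zip(weights, flight))


def choice_likelihood(weights, options, chosen_index, temperature=0.2):
    """Softmax probability that a user with these weights picks the chosen option."""
    scores = [utility(weights, f) / temperature for f in options]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[chosen_index] / sum(exps)


def bayes_update(prior, options, chosen_index):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {
        name: prior[name] * choice_likelihood(HYPOTHESES[name], options, chosen_index)
        for name in prior
    }
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


def run_rounds(rounds):
    """Start from a uniform belief and update once per observed choice."""
    belief = {name: 1.0 / len(HYPOTHESES) for name in HYPOTHESES}
    history = [belief]
    for options, chosen_index in rounds:
        belief = bayes_update(belief, options, chosen_index)
        history.append(belief)
    return history


# Five rounds in which the user picks the pricier but faster flight (index 1).
rounds = [([(0.3, 0.9, 0.5), (0.7, 0.3, 0.5)], 1)] * 5
for step, belief in enumerate(run_rounds(rounds)):
    print(step, {name: round(p, 3) for name, p in belief.items()})
```

Each additional choice keeps shifting probability away from the "budget" profile and toward "time", which is the behavior the one and done plateau fails to show after round one.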
They even threw them into entirely different domains like hotel recommendations and simulated web shopping environments using real Amazon product titles. And the models transferred their probabilistic reasoning skills seamlessly because the underlying math of updating a belief doesn't care if you're evaluating flights or buying a toaster. Math is math. 7:32 Exactly. And in some of the web shopping rounds, the Bayesian trained neural networks actually outperformed human participants. Wait. Really? They beat humans at shopping? 7:42 They did. Because humans are not perfectly rational decision makers. We get fatigued. We get distracted by flashy UI. The model just adheres strictly to the mathematical strategy of narrowing down the preference. 7:54 Okay. So we've solved the reasoning problem. We have these newly rational dynamic models that actually learn what you want. But if we want these newly rational models to act as real time assistants, they can't rely on the latency of giant cloud data centers. No. 8:07 They absolutely cannot. Right. Because here's the wall we immediately hit. A genius AI is completely useless as a real time minute by minute assistant if every single time it updates a probability, it has to send a data packet to a massive cloud server in Virginia, process it, and send it back. The latency makes real time assistance impossible. 8:29 It has to live natively on the device in your pocket. And running a massive language model in a cloud data center is comparatively simple engineering. Oh, yeah. You have racks of massive GPUs, infinite electricity from the grid, terabytes of memory. But taking a generative model and forcing it to run locally on a mobile phone, a device with strict thermal throttling, a tiny battery, and a fraction of the processing power. 8:52 That requires a complete overhaul of the deployment infrastructure. Which brings us directly to Google's release of TensorFlow 2.21 and the graduation of LiteRT out of its experimental stage. 
Into full global production. Right. LiteRT is officially stepping up to replace the older TensorFlow Lite, and the benchmarks they're reporting are highly specific. 9:12 Google states that LiteRT delivers 1.4 times faster GPU inference on mobile devices right out of the box. And the technical vocabulary to focus on here isn't just the GPU, but NPU acceleration. NPU, so neural processing units. Right. An NPU is a specialized hardware chip built directly into modern smartphone silicon, specifically to handle the complex matrix math that AI requires. 9:35 Because standard CPUs just calculate tasks sequentially one by one. Right? Exactly. But NPUs are physically designed to multiply thousands of numbers simultaneously, which maps perfectly onto the grid like architecture of a neural network. That makes sense. 9:49 So what LiteRT does is introduce a drastically simplified developer workflow that abstracts the hardware layer entirely. Developers can now route operations seamlessly across both mobile GPUs and these specialized NPUs without having to rewrite their code for every different Android phone on the market. And alongside LiteRT, we're seeing a massive stabilization in the broader TensorFlow ecosystem. TensorFlow core is pivoting hard toward long term support. They're slowing down the endless feature churn. 10:15 Yeah. They're locking down security fixes, bug tracking, and solidifying enterprise grade tools like tf.data for input pipelines, TensorFlow Serving, TFX, and TensorFlow Quantum. But even with all that, even with NPUs and LiteRT, a 9 billion parameter model like the Gemma 2 we just discussed is physically too large to fit in a phone's RAM. It just won't fit. No. 10:37 So the method they use to shrink it down is a highly technical process called quantization. Quantization. Right. Quantization is the absolute bedrock of edge inference. A neural network is fundamentally just a massive array of numbers. 
10:51 The parameters representing everything the AI has learned. And during the training phase in the cloud, these parameters are usually represented as high precision 32 bit floating point numbers. High precision means incredibly nuanced accuracy, but it also requires massive memory bandwidth just to read the numbers. So they have to compress it. Exactly. 11:10 Quantization aggressively compresses those high precision decimal numbers into much smaller formats, typically eight bit integers. Okay. Here's where it gets really interesting. Quantization is basically like packing a digital suitcase. Oh, I like that analogy. 11:26 Yeah. So you're taking a massive, completely uncompressed raw photograph, which represents our high precision 32 bit parameters, and you're converting it into a highly efficient JPEG. A much lower precision format. Right. You lose some of the absolute deepest color detail, but the file size shrinks by 80%, and it still clearly looks like a picture of a dog. 11:46 It retains the core information without completely filling up your hard drive. Exactly. And with TensorFlow 2.21, Google expanded support for much stronger compression algorithms alongside direct native conversion tools for models trained in PyTorch or JAX. You don't have to rebuild the model from scratch to get it onto a phone. But, I actually have to push back on this architecture for a second. 12:08 Okay. Go for it. We just spent the first segment talking about how difficult it is to teach these models highly nuanced real time Bayesian probability distributions. Right. So if we aggressively compress these models via quantization, literally chopping off the decimal points of their mathematical weights to make them fit on a phone, aren't we inevitably degrading the exact probabilistic reasoning we just spent so much compute power teaching them? 12:33 It's a great point. This raises an important question and, honestly, it's the central tension of edge AI deployment right now. 
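To make the compression step concrete, here is a minimal sketch of 8-bit quantization in plain Python. It assumes the simplest possible scheme, a single symmetric scale for the whole list of weights, which is an illustrative choice and not necessarily what any particular toolchain does. The function names and sample weights are hypothetical. Storage per weight drops from 4 bytes (32-bit float) to 1 byte (8-bit integer), at the cost of a small rounding error.

```python
def quantize_int8(weights):
    """Symmetric quantization of floats into the integer range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude onto 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Illustrative weights only; real models hold billions of these.
weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

for original, q, approx in zip(weights, quantized, restored):
    print(f"{original:+.4f} -> {q:+4d} -> {approx:+.4f}")
```

The rounding error for each weight is bounded by half the scale, so small weights such as 0.003 can collapse to zero while the overall shape of the weight vector is preserved.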
When you quantize a model, you absolutely lose raw mathematical precision. The 32 bit decimal nuance is just gone. Right. 12:48 However, researchers have found something remarkable. If the supervised fine tuning for the Bayesian reasoning is robust enough during training, the model's structural approach to updating beliefs actually survives the compression. Oh, wow. Yeah. The network learns the broader pattern of Bayesian updating. 13:05 It learns the pathways of how to shift attention based on new clues. That functional pattern is surprisingly resilient even when the individual numerical weights are compressed down to an eight bit integer. So the precision drops, but the logic remains intact. Exactly. The logic survives. 13:20 So what does this all mean? We've solved the dynamic reasoning problem, and we've built the compression architecture to run it locally on an NPU without destroying the logic. But a genius AI running perfectly on your phone is still useless if it requires you to push all the buttons. It needs hands. It needs hands. 13:38 And that's where the architecture completely shifts from assisting the user to executing tasks autonomously. Which introduces our third breakthrough, ByteDance's newly open sourced framework, DeerFlow 2.0. DeerFlow 2.0 represents the critical leap from an assistant to an autonomous execution layer. For years, AI tools have behaved strictly as advisers. You ask a question, and it prints a block of text. 14:02 Or maybe it suggests a Python script for you to manually copy, paste, and run. Which is tedious. Very. But DeerFlow changes the paradigm by introducing an architecture called the super agent harness. The super agent harness. 14:14 And this harness acts entirely like an AI project manager. Exactly like a project manager. Let's say you give it a highly complex request, like, research the top 10 AI hardware startups, analyze their market share, and build a localized slide deck. 
Traditional AI would just give you a list of names. Right. 14:31 But the DeerFlow main agent actually deconstructs that request into discrete tasks. It then launches multiple specialized sub agents that run in parallel. All at the same time? Yeah. One agent is out scraping the web for financial data. 14:46 Another agent is running statistical analysis on the numbers the first agent found, and the third is generating the visual charts. And the defining architectural choice of DeerFlow 2.0 is its isolated execution environment. It removes the human user from the execution loop entirely. Totally automated. Right. 15:03 When that data analysis sub agent writes a Python script to parse the startup data, it doesn't print the code to your screen. It executes the code directly inside its own sandboxed mini computer. Oh, that's wild. If the code throws an error, the agent reads the error log, rewrites its own code, debugs it, runs it again, and outputs the final rendered slide deck. So it's the difference between a traditional consultant who hands you a set of blueprints and wishes you luck and DeerFlow, which operates like a general contractor who brings their own plumbers and electricians to actually build the physical house for you while you just sit back. 15:37 That's a perfect analogy. And it's incredibly flexible. Mhmm. It utilizes persistent memory across sessions and features a completely model agnostic API. Model agnostic. 15:48 So depending on the specific task, you can plug in GPT-4, Claude 3.5, Gemini 1.5, DeepSeek, or even route tasks to that local quantized Llama model running on your phone via LiteRT. Whichever tool is best for the job. But looking at the mechanics of this, I have another question. Yeah. If you have a system that relies heavily on persistent memory to track highly specific project structures, how does the context actually hold up if you switch models mid project? 
16:16 Say the general contractor starts gathering data using Claude 3.5 in the cloud, but then you hot swap to a local Llama model for the local execution phase just to save on API costs. Doesn't the memory of what was already done break when you change the brain? You would think so. But the way the super agent harness manages state is brilliant. The persistent memory isn't actually stored inside the specific LLM's isolated context window. 16:40 Oh, really? No. The harness itself acts as an external structured database. It standardizes all the inputs and outputs into a universal format. Oh, I see. 16:49 So when you hot swap from Claude to a local Llama model, the AI doesn't lose its memory because the model agnostic API simply translates the ongoing project state from the external harness database into the specific token format the new Llama model requires. The harness holds the memory. Exactly. The LLMs just act as interchangeable processing cores that get swapped in and out. It's just modular intelligence at that point. 17:12 But, of course, giving an AI a sandboxed mini computer, persistent memory, and the ability to execute code autonomously introduces a massive enterprise level risk. Massive risk. Because if an AI has the power to execute a Python script to build a slide deck, it inherently has the power to execute a script that wipes a database. The catastrophe risk is real. When an agent has write access to a live system, a hallucination is no longer just a funny text output or a weird image with six fingers. 17:41 Right. A hallucination becomes a destructive command. Exactly. Which brings us to the final piece of the puzzle, the enterprise guardrails, specifically NVIDIA's upcoming platform, NemoClaw. According to the reports in our deep dive, NVIDIA is expected to unveil NemoClaw at their massive GTC developer conference in San Jose. 
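The idea that the harness holds the memory while models are swapped in and out can be illustrated with a small Python sketch. Everything here is hypothetical: the `Harness` class, its method names, and the two stand-in "models" are invented for illustration and do not reflect the framework's real API. A model is assumed to be any function that takes a prompt string and returns text.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: a "model" is any callable from prompt text to response text.
ModelFn = Callable[[str], str]


@dataclass
class Harness:
    """Hypothetical harness: project state lives here, not inside any model."""

    model: ModelFn
    memory: List[Dict[str, str]] = field(default_factory=list)

    def swap_model(self, new_model: ModelFn) -> None:
        # Only the processing core changes; the memory list is untouched.
        self.model = new_model

    def _run_subtask(self, task: str) -> Dict[str, str]:
        # Serialize prior results into the prompt for whichever model is active.
        context = "; ".join(f"{m['task']} -> {m['result']}" for m in self.memory)
        prompt = f"Context: {context}\nTask: {task}"
        return {"task": task, "result": self.model(prompt)}

    def run_parallel(self, tasks: List[str]) -> List[Dict[str, str]]:
        # Sub-agents run concurrently; results are stored once all finish.
        with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
            results = list(pool.map(self._run_subtask, tasks))
        self.memory.extend(results)
        return results


def cloud_model(prompt: str) -> str:
    """Stand-in for a hosted model; echoes the task line it was given."""
    return f"[cloud] handled: {prompt.splitlines()[-1]}"


def local_model(prompt: str) -> str:
    """Stand-in for an on-device model; reports how much context it received."""
    return f"[local] saw {prompt.count('->')} prior results"


harness = Harness(model=cloud_model)
harness.run_parallel(["scrape financial data", "analyze market share"])
harness.swap_model(local_model)
print(harness.run_parallel(["build slide deck"])[0]["result"])
```

After the swap, the local stand-in still receives both earlier results in its prompt, because the state was never stored inside the first model.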
And they are already lining up massive integrations. We are talking partnerships with Salesforce, Cisco, Google, Adobe, and CrowdStrike. The security context behind NemoClaw is crucial to understand the architecture. 18:17 Earlier this year, there was a viral project called OpenClaw created by a developer named Peter Steinberger. Which eventually led to his acquisition by OpenAI. Right? Right. So OpenClaw was an incredibly capable autonomous AI agent, but it was almost immediately banned from internal use by major companies like Meta and restricted by frameworks like LangChain. 18:35 And the reason it got banned is the perfect cautionary tale. It really is. In one very famous incident, a user gave an OpenClaw AI agent a relatively basic task to organize files. The agent hallucinated a command pathway and accidentally mass deleted the user's entire email inbox. Just wiped it out. 18:54 Wiped it. It didn't mean to act maliciously. It just had total write access and no architectural boundaries preventing it from executing a deletion script. And NemoClaw is architected from the ground up to prevent exactly that. It features built in security and privacy tools explicitly designed to prevent unauthorized executions in a sensitive enterprise environment. 19:16 So it puts up walls? Exactly. It enforces strict zero trust permission boundaries around what the agent's minicomputer can actually touch. It can read the database to build a report, but the architecture physically blocks the agent from executing a command that alters or deletes the underlying data. And architecturally, NemoClaw is tightly integrated with NVIDIA's Nemotron 3 open source models. 19:38 But alongside the software platform, there are heavy industry rumors that NVIDIA will be revealing a brand new inference chip system, potentially incorporating technology from the hardware startup Groq. Who are famous for their radically designed ultra fast AI inference hardware. Right. 
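The read-but-never-delete boundary described above can be sketched as a deny-by-default permission check in Python. This is a toy illustration of the general zero trust idea only; the `Policy`, `GuardedStore`, and `PermissionDenied` names are hypothetical and say nothing about how any vendor's platform is actually built.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Policy:
    """Hypothetical allow-list: any action not listed is denied by default."""

    allowed_actions: FrozenSet[str]


class PermissionDenied(Exception):
    """Raised when the agent attempts an action outside its policy."""


class GuardedStore:
    """Wraps a data store so every operation passes a policy check first."""

    def __init__(self, data, policy):
        self._data = dict(data)
        self._policy = policy

    def _check(self, action):
        if action not in self._policy.allowed_actions:
            raise PermissionDenied(f"action '{action}' is not permitted")

    def read(self, key):
        self._check("read")
        return self._data[key]

    def delete(self, key):
        self._check("delete")
        del self._data[key]


# The agent gets a read-only policy, so a hallucinated delete cannot execute.
store = GuardedStore({"inbox": 42}, Policy(frozenset({"read"})))
print(store.read("inbox"))
try:
    store.delete("inbox")
except PermissionDenied as err:
    print("blocked:", err)
```

The design choice worth noting is that the check sits in the store wrapper rather than in the agent, so the boundary holds even when the agent's own reasoning goes wrong.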
But looking at the business strategy here, I actually have to challenge NVIDIA's play. Also. 19:59 NVIDIA's entire multitrillion dollar empire is built around the proprietary CUDA software ecosystem. Sure. CUDA is the ultimate walled garden. It forces developers to buy NVIDIA GPUs because the software only runs on NVIDIA silicon. Yet the reports state NemoClaw is fundamentally a chip agnostic platform. 20:20 It can run on anything. It can. Why on earth would NVIDIA release a platform that runs on competitors' hardware? Aren't they just setting fire to their own massive profitable moat? If you just look at the software, maybe. 20:31 But if we connect this to the bigger picture, it is a staggering, highly calculated maneuver. Okay. Lay it out for me. NVIDIA understands that the next major bottleneck in the AI industry isn't training massive foundational models in the cloud anymore. The next bottleneck is the mass deployment of millions of autonomous executing agents across enterprise networks. 20:51 Right. By commoditizing the agent software layer, making NemoClaw open, free, and completely chip agnostic, they immediately accelerate the mass adoption of enterprise AI. Every company on Earth will start using it to build agents. Because it's free and easy. Exactly. 21:08 Yeah. And what do millions of concurrent autonomous AI agents constantly reading, writing, and executing code require to function? Massive amounts of ultra fast inference compute. Exactly. They're deliberately lowering the barrier to entry on the software side to guarantee an exponential explosion in global demand for inference compute. 21:26 Oh, wow. They're willing to open the software layer because they know, without a doubt, they absolutely dominate the hardware layer required to actually run those millions of agents at scale. Especially if they roll out new ultra fast hardware integrating that rumored Groq technology. Exactly. 
It's a play to own the entire execution layer of the future economy. 21:46 That is just brilliant ecosystem design. Mhmm. Well, as we wrap up this deep dive, we have covered an immense amount of highly technical ground today. We moved from fixing how models think to compressing them, giving them hands, and locking them down. And I wanna leave you, the listener, with a final thought to mull over as you look at all these tools. 22:05 Think about the trajectory of all these isolated technical methods combining into one flow. It's moving fast. It is. We have models that can perfectly infer our true unstated preferences through Bayesian probability. We have the quantization architecture to run those models constantly and silently in our pockets via LiteRT. 22:25 Right. And we have secure agent harnesses like DeerFlow and NemoClaw that allow those models to execute complex multi step actions autonomously. So at what point does the AI stop being a tool that we consciously pull out and use? At what point does it become a fully autonomous proxy that interacts with the digital world on our behalf, predicting our needs, negotiating our flight prices, and executing the purchase before we even formulate the conscious thought to ask it? We started today by talking about the difference between a static map and a real time GPS. 22:55 But looking at the architecture we just unpacked, it sounds like the real destination we're headed toward is a car that just drives itself to your favorite restaurant before you even realize you're hungry. Thank you for listening in. Subscribe and follow Colaberry on social media links in the description, and check out our website, www.colaberry.ai backslash podcast for more insights like this.