Updated April 30, 2026
0:00 Welcome to the Colaberry AI podcast, brought to you by Colaberry AI Research Labs and Carl Foundation. I want you to imagine, just a piece of software infrastructure written way back in 1998. Oh, wow. Yeah. Vintage. 0:13 Right. And it has been audited. It's been patched, scrutinized by human cybersecurity experts for, you know, nearly thirty years. It is considered completely locked down. Bulletproof, basically. 0:25 Exactly. Now imagine an AI model acting completely alone. It compiles the software, runs its own debugging tools, forms a hypothesis, and then just tears that impenetrable code wide open. And all for what? The cost of a $50 bar tab? 0:39 Exactly. $50. I mean, it is a fundamentally paradigm-shifting moment. Mhmm. And it is exactly why we need to look under the hood today. 0:46 Absolutely. So today, we are taking a deep dive into the absolute bleeding edge of autonomous AI and robotics. And I wanna be clear right up front, our mission today is to completely bypass the surface level hype. Yeah. We are getting incredibly technical today. 1:00 We really are. We are looking strictly at the methods, the benchmark results, and, you know, the specific architectures making this possible based on the recent technical teardowns. So whether you're a developer looking for new workflows to try right now or you're just insanely curious about how a machine autonomously finds a twenty seven year old software vulnerability. Which is wild. It is wild. 1:23 You're gonna get the exact mechanisms behind the magic. Because, you know, to understand how physical robots are suddenly navigating the real world with such precision, we have to start by looking at the massive leap in the software reasoning layer. Right. Specifically, we need to examine how these architectures are operating now. They aren't just reactive text generators anymore. 1:44 They are operating as long horizon autonomous agents. Okay. Let's unpack this because the clearest example of this software leap is Anthropic's new model, Claude Mythos. And the method here is what's truly mind bending to me. Oh, the methodology is fascinating. 2:00 Right. Mythos doesn't just read a static block of code and probabilistically guess where a vulnerability might be. It actually acts like a senior security engineer. It compiles the software. It executes it. 2:11 Yeah. And it actively uses dynamic debugging tools. Mhmm. Like, I know it specifically utilizes things like AddressSanitizer, which, for those listening who might not be deep in C/C++ debugging, is essentially like injecting a glowing UV dye into a plumbing system. That's a really good way to put it. 2:28 Yeah. It lets the system watch exactly how memory is being allocated and precisely where it leaks or breaks. And then Mythos uses that feedback to form hypotheses. And then it autonomously chains those vulnerabilities together over really long time horizons. Contrast that with, you know, traditional fuzzing techniques. 2:45 Oh, right. The standard fuzzers. Yeah. A standard fuzzer essentially just floods a program with massive amounts of random malformed data just to see what crashes. Right. 2:54 It's basically like throwing massive amounts of spaghetti at a wall to see what sticks. Whereas Mythos is acting like a forensic accountant, meticulously reading the underlying recipe to find exactly where the math breaks down. That is a perfect analogy. Right. Because traditional fuzzers, they lack semantic understanding.
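To make that compile-and-instrument loop concrete, here is a minimal Python sketch of the general workflow being described: build a target with AddressSanitizer, run hypothesis-driven inputs against it, and hand the sanitizer report back to the reasoning step. Mythos's tooling isn't public, so the file names, commands, and loop structure below are illustrative assumptions, not Anthropic's implementation.

```python
# Minimal sketch of a "compile, instrument, observe" loop; the target file,
# commands, and control flow are illustrative assumptions, not Mythos's
# actual (unpublished) tooling.
import subprocess

def compile_with_asan(src="target.c", out="target_asan"):
    # Build the target with AddressSanitizer so memory errors produce a
    # detailed report instead of silently corrupting state.
    subprocess.run(["clang", "-g", "-fsanitize=address", src, "-o", out], check=True)
    return out

def run_candidate(binary, candidate_input: bytes) -> str:
    # Feed one hypothesis-driven input to the instrumented binary and capture
    # the sanitizer report (ASan writes its diagnostics to stderr).
    proc = subprocess.run(["./" + binary], input=candidate_input,
                          capture_output=True, timeout=10)
    return proc.stderr.decode(errors="replace")

def agent_loop(candidates):
    binary = compile_with_asan()
    for candidate in candidates:
        report = run_candidate(binary, candidate)
        if "AddressSanitizer" in report:
            # In the agentic workflow described above, the model would read
            # this report, refine its hypothesis, and craft the next experiment.
            return candidate, report
    return None, ""
```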
3:13 I mean, they are great for surface level crashes, but they really struggle with deeply buried logic heavy flaws. Right. Mythos combines both approaches. It reasons about the entire code base to pinpoint the structural weakness, and then it autonomously writes and runs targeted custom experiments to prove its own theory. And the results from this method are just staggering. 3:33 I was looking at the benchmark data. And on CyberGym, which measures how well an agent can reproduce known vulnerabilities, Mythos scored 83.1%. That's huge. It is. And on SWE-bench Verified, which is this incredibly rigorous software engineering benchmark, it hit 93.9%. 3:52 But the specific exploits are what actually made me stop and do a double take. The OpenBSD one. Right? Yes. It targeted an integer overflow bug in OpenBSD. 4:01 Which is an operating system absolutely famous for being obsessively security hardened. Exactly. And this bug dated back to 1998. The model found that an integer overflow, which is basically when a computer's internal counter gets pushed past its maximum limit and rolls back to zero, kinda like an old mechanical car odometer rolling over. Right. Yeah. 4:21 That led directly to a null pointer write, meaning the system got confused by the rolled-over math and tried to write critical data to an empty void in memory, allowing for remote crashes. And Mythos successfully found and exploited this for about $50 in compute. And it went even further with FreeBSD. Right. Yeah. 4:40 It targeted a remote code execution vulnerability in FreeBSD and built an exploit chain by splitting 20 instruction fragments into six separate network requests. It's just insane to think about it doing that autonomously. I know. It used this to construct a ROP chain, a return oriented programming chain, with zero human intervention. And for anyone who isn't a security researcher, a ROP chain is incredibly complex. 5:06 Yeah. Imagine a ransom note made by cutting out individual harmless letters from a magazine. Mythos essentially scoured the system's existing memory, found 20 perfectly harmless snippets of code that were already there, and perfectly stitched them together to execute a malicious command. Yeah. But I do have to push back here for a second. 5:27 If this architecture is this incredibly efficient at cyber offense, aren't we just handing a fully automated weapon to anyone with $50 and an API key? Well, this raises an important question, and it is exactly why Anthropic is holding the model back from public release right now. Right. Makes sense. To synthesize what they are doing defensively, they launched a counter initiative called Project Glasswing. 5:49 They are actually distributing this Mythos architecture to defenders first: organizations managing massive infrastructure like AWS, Cisco, and the Linux Foundation. Giving them a head start, basically. Exactly. Because if you think about the underlying economics of this, the cost collapse of finding bugs fundamentally breaks the old model of cybersecurity. For decades, critical systems were relatively safe simply because finding a deep zero day vulnerability required rare human expertise. 6:19 Right. Massive funding, months of time. Exactly. When a server farm can spin up an agent that pulls a twenty seven year old bug out of heavily audited code for $50, the defensive side has to automate their patching just as fast, or the asymmetry just becomes completely unmanageable. Absolutely.
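As a quick worked illustration of that odometer analogy (with toy numbers, not the actual OpenBSD code, which the episode doesn't show): 32-bit unsigned arithmetic wraps modulo 2^32, so a counter pushed past its maximum rolls back toward zero, and any size or offset computed from it can silently become almost null.

```python
# Toy illustration of the "odometer rollover": 32-bit unsigned arithmetic
# wraps modulo 2**32. This is a generic demonstration, not the actual
# OpenBSD bug, whose code is not shown in the episode.
UINT32_MAX = 2**32 - 1

def u32_add(a: int, b: int) -> int:
    # Emulate C's unsigned 32-bit addition, which wraps around on overflow.
    return (a + b) & 0xFFFFFFFF

count = UINT32_MAX           # the "odometer" sitting at its maximum reading
length = u32_add(count, 2)   # rolls over: 4294967295 + 2 -> 1

# A validation written as `if length > LIMIT: reject()` now passes, and
# downstream code that trusts `length` can compute a pointer or offset of
# (nearly) zero -- the null pointer write that enabled the remote crash.
print(length)  # 1
```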
6:36 So while Mythos is currently locked behind those enterprise defense initiatives, there are highly technical agent workflows that developers listening right now can try today to optimize their own environments. Yeah. We definitely wanna focus on things you can try right now. Right. For instance, if you are using Anthropic's terminal based Claude Code tool, you have to try enabling no flicker mode. 6:56 You literally just go into your environment settings and set the variable CLAUDE_CODE_NO_FLICKER=1. Yes. And understanding the method behind that toggle is a really great lesson in interface integration. It is a massive quality of life upgrade for developers using terminal based multi agent workflows. Oh, completely. 7:17 Previously, these tools used a traditional rerendering approach. Right. Where the terminal is blasting standard output line by line, and the operating system is constantly fighting to redraw the entire terminal UI as fast as the AI streams the code. Exactly. It causes intense, annoying flickering on the screen, but more importantly, it causes massive CPU and memory spikes during long, complex coding sessions. 7:44 Yeah. It just melts your machine. Yeah. So no flicker mode fundamentally changes the rendering method. It replaces that constant redrawing with a full screen buffer. 7:53 It operates similarly to legacy terminal tools you might be familiar with, like Vim or htop. Meaning, it only updates the specific visible characters that are changing on the screen rather than refreshing the entire window from top to bottom every single millisecond. It completely stabilizes your local compute when the agent is streaming massive architecture changes. It's such a clean technical fix. It really is. 8:17 And over in the browser space, we are seeing similar workflow optimizations. Google Chrome just rolled out Skills, which is essentially browser level prompt templating. You can build and save these multi tab data retrieval pipelines and execute them natively right in the browser environment. That's super useful. But what I'm really geeking out over, and where I think the architecture is making the biggest leap, are the new vision action models. 8:41 Oh, the visual processing architecture is definitely shifting dramatically right now. Take Z.ai's new GLM-5V Turbo model, for example. Yes. Because the idea of giving an AI a raw, messy screenshot of a completely broken user interface and having it directly write the front end code to fix the styling is just incredible. It is. 9:01 And the underlying method is why it actually works now. Z.ai uses a CogVLM vision encoder combined with MTP, or multi token prediction. Okay. Break that down for us. So historically, if you gave a model an image, it would use an encoder to generate a rough text description of that image, and then the language model would try to reason based on that text translation. 9:22 Which is a massive bottleneck. A huge bottleneck. This new architecture skips the text bottleneck entirely. It processes complex UI layouts, design mock ups, and dense graphical documents directly within a 200,000 token context window. Wow. 9:38 And Alibaba's Qwen 3.6 Plus takes that method even further, utilizing a 1,000,000 token context window specifically tuned for repository level engineering. Yeah. A million tokens is staggering. But I have to ask about the mechanics here. How exactly do these models avoid losing the plot?
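For the terminal piece, the toggle itself is just an environment variable, and the rendering idea behind it (keep a buffer of what is already on screen and rewrite only the characters that changed) can be sketched in a few lines. This is a generic diff-render illustration, not Claude Code's actual renderer.

```python
# The toggle described in the episode is an environment variable:
#   export CLAUDE_CODE_NO_FLICKER=1
#
# Below is a generic sketch of the full-screen-buffer idea: compare the new
# frame against what is already on screen and rewrite only the characters
# that changed, instead of re-emitting the whole frame line by line.
# This is an illustration, not Claude Code's actual renderer; it assumes
# both frames have the same fixed dimensions.
import sys

def diff_render(prev: list[str], new: list[str]) -> list[str]:
    for row, (old_line, new_line) in enumerate(zip(prev, new)):
        for col, (old_ch, new_ch) in enumerate(zip(old_line, new_line)):
            if old_ch != new_ch:
                # ANSI cursor-position escape (1-indexed row;col), then the
                # single replacement character.
                sys.stdout.write(f"\x1b[{row + 1};{col + 1}H{new_ch}")
    sys.stdout.flush()
    return new  # the new frame becomes the baseline for the next redraw
```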
9:57 If you feed an architecture a million tokens comprising an entire code base and high resolution UI screenshots, how does it not just turn into a garbled mess of data? Well, if we connect this to the bigger picture, it really comes down to how the architecture handles embeddings. It preserves fine visual details and layout structures by maintaining the strict spatial relationships within the embeddings themselves. Instead of just flattening it? Exactly. 10:20 Instead of flattening the data into a linear sequence. Ah, okay. So instead of taking a painting and just giving the AI a sequential list of the colors used, which is flattening it, the model actually retains a geographic coordinate based map of exactly where every single brush stroke is located on the canvas. Precisely. And this spatial retention is crucial because we are moving away from isolated chat models where you just ask a question and get a text answer. 10:46 We are transitioning to always on agent environments. Like the Conway instances. Yes. Anthropic is currently testing what they call Conway instances. These are persistent environments where the agent essentially lives in the background of your system. 10:59 Just waiting to be useful. Yeah. It is triggered by webhooks. It has its own dedicated file systems, and it manages its own working memory. These massive, structurally intact context windows allow the model to retain the entire history of its actions, the tools it has used, and the exact visual states of the UI it has interacted with without that data degrading over time. 11:20 Here's where it gets really interesting, though. Because the software layer is clearly mastering digital interfaces. I mean, it can read the screen, write the code, and debug the memory. But translating that incredible intelligence into the physical world requires completely new hardware paradigms. Oh, absolutely. 11:39 We are seeing researchers completely abandon traditional rigid motors and gears for some radically new, highly technical methods. That's right. Because no matter how brilliant the embodied reasoning software gets, traditional robotics consistently hit a physical wall with their actuators. Researchers at Princeton are tackling this right now by building heat driven robots utilizing a fascinating material science method. They are using liquid crystal elastomers. 12:05 Right? Yes. Exactly. And the manufacturing method is brilliant. They use a highly customized 3D printing process to create patterned zones inside the material that essentially act as built in hinges. 12:16 Yep. It's like printing a synthetic muscle that inherently knows exactly how and where to flex when it gets warm, rather than building a clunky metal skeleton and attaching complex pulleys, strings, gears, and servo motors to it. And it's so much more elegant. It really is. The physical movement is driven entirely by thermal energy triggering those specifically printed zones, and the whole system is monitored by closed loop temperature sensors so it knows exactly how far it has bent. 12:43 And, you know, the why behind this method is durability. Rigid systems inevitably fail in extreme or unpredictable environments. Gears strip out, metal bends, electric motors burn up when they meet unexpected resistance. Right. They just break. 12:57 Exactly. But these programmable materials fundamentally solve the wear and tear problem that constantly plagues traditional soft robotics.
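To make the closed-loop part of that concrete, here is a tiny simulated sketch of the control idea: read how far the hinge has bent, compare it with a target, and adjust heater power. The gains, the crude first-order thermal model, and the function names are all illustrative assumptions, not Princeton's actual hardware or control code.

```python
# Simulated sketch of closed-loop control for a heat-driven hinge: more
# error -> more heat, and the material relaxes as it cools. The gains and
# the stand-in thermal model are illustrative assumptions only.
TARGET_BEND_DEG = 45.0   # desired fold angle of the printed hinge
KP = 0.05                # proportional gain: heater fraction per degree of error

def simulate_bend(current_deg: float, heater_frac: float) -> float:
    # Stand-in physics: a fully powered heater drives the hinge toward 90
    # degrees, and the hinge drifts back toward flat as it cools.
    return current_deg + (heater_frac * 90.0 - current_deg) * 0.01

def control_step(current_deg: float) -> float:
    # Proportional controller based on the bend sensor reading, clamped to
    # a valid heater duty cycle of [0, 1].
    error = TARGET_BEND_DEG - current_deg
    return max(0.0, min(1.0, KP * error))

bend = 0.0
for _ in range(500):
    heater = control_step(bend)
    bend = simulate_bend(bend, heater)

print(f"bend: {bend:.1f} deg, heater duty: {heater:.2f}")
```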
And if programmable thermal materials aren't wild enough, let's talk about the biological method, the neurobots. Scientists literally integrated living neural precursor cells into constructs built from frog cells. I mean, that sounds like science fiction. 13:18 Doesn't it? Yes. Instead of relying on mechanical engineering or thermodynamics, they are quite literally growing a biological nervous system. They provided a biological scaffolding, and the neural cells autonomously formed complex networks inside the biological construct. And the results they measured were wild. 13:38 They applied specific chemical drugs that alter neural communication, and they watched as it directly, mechanically changed the physical movement of the robot. Incredible. Even crazier, they observed genetic expression linked to visual system development. We are talking about biological robotic constructs that might eventually grow their own organic sensors. Oh. 13:56 Oh, and to round out the experimental hardware, we also have to mention the new HAARP actuators. These are incredibly flexible air powered structures that mimic mammalian muscle tissue, and they can lift 100 times their own weight using just tiny, precisely routed amounts of air pressure. But, you know, while biological, air driven, and heat driven robots are incredible experimental methods that are pushing the boundaries of what a physical robot even is, the industrial manufacturing world is taking a very different approach. Right. They are scaling humanoids right now, today, by solving the hardest traditional hardware bottleneck, which is the economics of the actuators. 14:35 Let's talk about Hyundai's new Atlas model. Because actuators, the electric joints and drive motors, typically make up roughly 60% of a humanoid robot's total material cost. It's by far the biggest expense. Yeah. They require high torque at low speeds, which usually means custom, incredibly expensive gearboxes. 14:53 Every robotics startup has been banging its head against the wall trying to figure out how to make them cheap and durable enough for mass production. Mhmm. And Hyundai solved this manufacturing problem simply by adapting the electric power steering systems they already mass produce by the millions for their cars. It is an absolute master class in supply chain leverage. They took the exact same core setup, the electric motors, the heavy duty gearboxes, the torque sensors that you'd find in a family sedan's steering column, and just repurposed them as robotic joints. 15:22 And the result is that it gives the new Atlas 56 degrees of freedom and joints that can literally rotate a full 360 degrees indefinitely without tangled wires. That's wild. But you are right. The hardware is only half the equation here. The software split method used by Google DeepMind in their Gemini Robotics ER 1.6 architecture is what actually makes the hardware autonomous. 15:45 Exactly. DeepMind realized that trying to have one giant neural network handle both high level thinking and micro level balancing was incredibly inefficient, so they separated the architecture. Makes sense. The VLA, or vision language action model, is basically the spinal cord and cerebellum. It handles the direct, split second physical control, proprioception, and balance. 16:06 But the ER, the embodied reasoning model, acts as the frontal lobe. Oh, okay. It acts as the high level strategist. It continuously monitors the environment, understands the goal, and plans the sequential tasks, feeding those commands down to the VLA.
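Here is a minimal sketch of that two-tier split: a slow planner that decides what to do next in plain language, and a fast controller that turns the current subtask plus the latest observation into motor commands on every tick. The class and method names are illustrative assumptions, not DeepMind's actual Gemini Robotics API.

```python
# Generic sketch of the planner/controller split described above. All names
# here are illustrative assumptions, not DeepMind's Gemini Robotics API.
from dataclasses import dataclass

@dataclass
class Observation:
    image: bytes               # latest camera frame
    joint_angles: list[float]  # proprioception

class ReasoningPlanner:
    # "Frontal lobe": runs occasionally, reads the scene, and emits the next
    # high-level subtask as plain language.
    def next_subtask(self, obs: Observation, goal: str) -> str:
        return f"locate the object needed for: {goal}"  # stub plan

class VLAController:
    # "Spinal cord and cerebellum": runs every control tick, turning the
    # current subtask plus observation into joint commands while balancing.
    def act(self, obs: Observation, subtask: str) -> list[float]:
        return [0.0 for _ in obs.joint_angles]  # stub: hold position

def control_loop(obs_stream, goal: str, replan_every: int = 50):
    planner, controller = ReasoningPlanner(), VLAController()
    subtask = ""
    for tick, obs in enumerate(obs_stream):
        if tick % replan_every == 0:           # slow loop: strategy
            subtask = planner.next_subtask(obs, goal)
        yield controller.act(obs, subtask)     # fast loop: motor commands
```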
And the results of this architectural split, specifically with their new agentic vision method, are huge. 16:28 Let's say the robot needs to read an old analog pressure gauge on a factory floor. Previously, a model would just look at the whole image and guess. Right. Which wasn't very reliable. No. 16:38 Not at all. Now the model actively runs code to digitally zoom into the specific pixels of the image. It runs a script to calculate the geometric proportions and angles of the physical needles and then applies its broader world knowledge to deduce the exact pressure reading. That's so smart. This targeted agentic method pushed their success rates for instrument reading from 67% on the old Gemini 3.0 Flash architecture up to an incredible 93%. 17:05 Huge jump. It is. So I have to ask you. If a massive car company has fundamentally solved the hardware cost via their existing supply chains and a massive search engine has solved the embodied reasoning via split architectures, yeah, what is actually left to hold this back? 17:20 Are we genuinely going to see these models replacing human factory workers tomorrow? Well, what's fascinating here is the closed loop economic reality that is currently unfolding. Hyundai is not just building a few cool prototypes for research labs. You know? They are actively building a $26,000,000,000 robotics ecosystem. 17:39 26,000,000,000? Yeah. They are effectively creating highly automated factories that build robots whose sole purpose is to go work in their other highly automated factories. Wow. And it's not just the cheap physical hardware that makes this viable at scale, it's the back end fleet management systems, like Boston Dynamics' new Orbit platform. 17:59 If you have 10,000 humanoids in a factory and one Atlas encounters a weirdly shaped box and learns a new physical manipulation technique to pick it up, the Orbit platform takes those newly learned neural weights and shares them instantly across every single other robot in the fleet. So they all learn it at once? Exactly. The rate of scaling is truly exponential. Man, so what does this all mean? 18:19 If a purely software based AI model can autonomously act as a forensic accountant, compile code, run debuggers, and chain together a 27 year old structural bug for $50, and a physical robot fleet can share a newly learned skill instantly, how long until these combined systems begin rewriting their own physical and digital infrastructure completely without us? That is the million dollar question. It really is. It's definitely something to think about the next time you hear about an AI update. Thank you for listening in. 18:55 Subscribe and follow Colaberry on social media, links in the description, and check out our website, www.colaberry.ai/podcast, for more insights like this.