Updated March 5, 2026
0:00 Welcome to Colaberry AI podcast, brought to you by Colaberry AI Research Labs and Carl Foundation. Today, we're undertaking a detailed examination, a real deep dive into Google's latest advancements in AI video generation. We'll be focusing specifically on Veo 3, their new video model with, interestingly, integrated audio, and also Flow, a new filmmaking tool designed to work with models like Veo, Imagen, and Gemini. So the goal here is really to unpack the technical details of these tools, what makes them tick, and, importantly, what they might mean for the future of how video gets made.
0:36 We'll definitely get into the methods and the results, using some of the, you know, technical vocabulary where it makes sense. Absolutely. It's a really interesting convergence, isn't it? You've got video, image generation, and language models all kinda coming together in one creative pipeline.
0:49 That integration itself is, I think, a big step. Okay. So let's start with Veo 3 then. It's pitched as more than just your standard text-to-video, correct?
0:57 That native audio generation seems like a pretty major development. Precisely. Veo 3 is Google's first swing at a video model with built-in audio synthesis. But beyond that, the specs really emphasize high-fidelity, sort of cinematic output. A key thing they're highlighting is better adherence to prompts.
1:16 So it understands your text description better and translates that into both the visuals and the sound. And there's also a big focus on, well, physics simulation and photorealism, making things look and move like they would in the real world. That suggests some pretty sophisticated stuff under the hood. Right.
1:31 Like complex rendering and understanding object interactions. Exactly. Which brings us neatly to Flow. This sounds like the bridge, the tool to actually use these powerful models for creative work. That's the idea.
1:43 Flow is explicitly framed as this new AI filmmaking tool built, as they say, by and for creatives. So the design thinking seems centered on filmmakers' needs, their actual workflows. Its main job is to help build complex stories using these AI-generated clips and scenes. And, technically, what's really interesting is how it works with Veo, Imagen, and Gemini. It suggests a kind of modular approach.
2:05 And you mentioned Gemini. Its role in processing prompts sounds key for making it user-friendly. Does that mean we're moving towards just describing what we want in plain English? That seems to be the direction, yes. Using Gemini's natural language smarts,
2:21 Flow aims to make prompting way more intuitive. So you can describe your vision, you know, using everyday language, and the models behind the scenes figure out how to generate the visuals and audio. It lowers the barrier to entry, letting creators focus more on the art and less on the technical commands. Okay. Let's get into the specifics within Flow then.
2:39 Camera controls: that sounds pretty important for getting that cinematic feel. Oh, definitely. The camera controls feature gives you really granular control over the virtual camera. We're talking about defining specific motion paths like virtual dolly shots, pans, tilts, zooms. You can set specific camera angles and viewpoints, even simulate different lens types, focal lengths, depth of field.
3:00 It's that direct manipulation that lets you design shots precisely, much like traditional filmmaking, and gives you more expressive power. And what about Scene Builder? That sounds like it tackles a big challenge: keeping things consistent over time in AI video. It does. Scene Builder is really aimed at that temporal consistency problem.
3:18 It lets you, say, extend a shot, adding more frames that logically follow what came before, or maybe reveal more of the scene.
And crucially, it helps you transition between different shots while trying to keep things smooth: maintaining continuous motion, consistent characters, consistent environments. That likely involves some pretty complex temporal modeling, maybe tracking features across frames, that sort of thing. Makes sense. Then there's asset management.
3:45 Seems straightforward, but I imagine it's essential for bigger projects. Absolutely essential. Asset management is basically the organizer. It keeps track of all your project ingredients. That could be stuff you upload yourself or things you generate right inside Flow using Imagen for images, like custom characters or backgrounds, and it also manages the text prompts you're using.
4:04 Having all that organized is just crucial for keeping things consistent and iterating on ideas. Right. Keeping track of all the bits and pieces. And Flow TV, what's the purpose there? Flow TV sounds like a curated showcase, really.
4:17 It shows off content made with Veo and Flow. But the key part, from a learning perspective, is the transparency. It apparently shows the final video and the exact prompts and techniques used to make it. So you can see what works, right?
4:30 You can learn effective prompting, maybe adapt techniques for your own stuff, like looking over someone's shoulder. That's actually pretty useful. Okay. Let's dig a bit more into the technical side, the integration. How exactly is Flow built to work with Veo, and where does Imagen fit into that picture?
4:46 Well, Flow seems to act as an orchestrator. It provides the interface that translates your creative ideas into the specific inputs that Veo's generative video model needs, controlling motion, style, the content itself. And Imagen's text-to-image power is integrated directly within Flow. This lets you create those custom visual ingredients we talked about (characters, objects, scenes) without leaving the tool.
Then those Imagen-generated assets can be pulled straight into Veo to be animated in a video sequence.
5:13 That offers a lot of flexibility. And you mentioned consistency. How does Flow ensure that, say, a character generated with Imagen looks the same across different Veo clips? That's a critical point. Flow has mechanisms built in to ensure that once you create an asset, like a character, you can consistently reference and use it in multiple clips generated by Veo within that same project.
5:35 It also lets you use a static image from Imagen, maybe a background scene, as the starting point for a new animated shot with Veo. This helps build that coherent visual world across a sequence. Technically, it probably involves using visual embeddings or features from the asset to guide the video generation, ensuring that consistency. Okay. Now let's circle back to the native audio in Veo 3.
5:56 That feels like a significant leap for multimodal AI. How does that work technically? Yeah. It's a big deal. Veo 3 generating audio natively means sound is synthesized as part of the video creation process itself.
6:10 It's not just layering audio on top later. The model can apparently generate environmental sounds that match the scene, wind sounds for a forest, street noise for a city, and it can even generate character dialogue based on the context you give it in the prompts. This simultaneous generation suggests a really sophisticated model underneath, one that's learned joint representations of vision and sound, probably using attention mechanisms to link visual elements to their corresponding sounds, making sure they're synchronized and make sense together. But, again, it's important to note, this specific feature, the native audio, is currently tied to the Google AI Ultra plan.
6:45 Right. The top tier. So speaking of access, who can actually use these tools right now? Okay. So Flow is basically the evolution of an earlier experiment called VideoFX.
6:56 Currently, Flow is available in the US for people subscribed to either the Google AI Pro or Google AI Ultra plans. Google has said they plan to expand it to more countries, but it's US only for the moment. And what's the difference between those Pro and Ultra plans in terms of what you get with Flow? The Google AI Pro plan gives you access to the main Flow features, and you get a limit of 100 video generations per month. The Google AI Ultra plan gives you higher usage limits.
7:22 They don't specify exactly how much higher, but more. And crucially, it includes that early access to Veo 3 with the integrated native audio. So it's a tiered system based on usage needs and access to the cutting-edge features. It's also interesting that Google is actively working with filmmakers on this. What's the thinking behind that approach?
7:39 It seems pretty strategic. By collaborating directly with filmmakers, Google gets real-world feedback on how these tools actually fit into, or maybe change, creative workflows. They wanna make sure the tools are genuinely useful for professionals, not just technically impressive in a lab. So giving early access to people like Dave Clark, Henry Daubrez, and Junie Lau, who made short films using these tools, helps Google understand the strengths, the weaknesses, and how AI can genuinely support complex storytelling. It shows what's possible now and where the boundaries still are in practical filmmaking.
8:12 So if we were to summarize the key technical steps forward here, what stands out from Veo 3 and Flow? I'd say the major advancements are in that deeper multimodal integration, especially the native audio in Veo 3. That's a big one. Also, the improvements in prompt understanding and the focus on temporal coherence and physics simulation in the video generation.
Flow represents progress in creating a more intuitive, filmmaker-centric interface, with features like detailed camera control and asset management designed for actual production needs.
8:42 The results we're hearing about from those filmmaker collaborations suggest these tools can indeed produce high-quality cinematic content more efficiently, potentially opening up new creative avenues. That combination of improved realism, better control via Flow, and the audio integration really does feel like a significant step up in AI video tech. It definitely feels like we're watching the next phase of filmmaking tools starting to emerge. I think so. You have these incredibly powerful generative models meeting more user-focused creative interfaces.
9:10 It's really unlocking some unprecedented ways to tell stories with visuals and sound. And, of course, it raises all sorts of interesting questions about AI's future role in the whole creative process, how media gets made. Lots to think about. Thank you for listening in. Subscribe and follow Colaberry on social media (links in the description), and check out our website, www.colaberry.ai/podcast, for more insights like this.
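A note for readers of this transcript: around 6:10 the episode guesses that Veo 3 "probably" uses attention mechanisms to link visual elements to their corresponding sounds. The discussion only speculates about this, so what follows is a minimal sketch of generic scaled dot-product cross-attention, not a description of how Veo works. Every name in it (`cross_attention`, `video_frames`, `sound_values`, `audio_queries`) is hypothetical, and the vectors are toy values written by hand rather than learned features.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention.

    Each query (here: one audio time step) scores every key
    (here: one video frame), and the resulting weights mix the values.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        mixed = [
            sum(wj * v[i] for wj, v in zip(w, values))
            for i in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(mixed)
    return outputs, weights


# Toy, hand-made features (assumption: a real model would learn these).
video_frames = [[1.0, 0.0], [0.0, 1.0]]   # frame 0 ~ "forest", frame 1 ~ "street"
sound_values = [[1.0, 0.0], [0.0, 1.0]]   # value 0 ~ "wind", value 1 ~ "traffic"
audio_queries = [[4.0, 0.0], [0.0, 4.0]]  # two audio time steps

outputs, weights = cross_attention(audio_queries, video_frames, sound_values)
for step, w in enumerate(weights):
    print(f"audio step {step}: attention over frames = {[round(x, 3) for x in w]}")
```

With these toy inputs, the first audio step places most of its weight on the "forest" frame and the second on the "street" frame, which is the kind of visual-to-sound alignment the speakers have in mind.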