Updated March 5, 2026
0:00 Welcome to the Colaberry AI podcast, brought to you by Colaberry AI Research Labs and Carle Foundation. Great to be here. Today, we're doing a technical deep dive. We're looking into Google Cloud AI platforms, specifically focusing on Vertex AI. Yeah.
0:13 And its generative AI capabilities. It's a pretty hot area. Absolutely. We've got some interesting source material: the intro docs for Google Cloud AI and Vertex AI, plus a news piece on Google's AI Studio for text-to-video. Right.
0:27 That text-to-video stuff is definitely cutting edge. So our mission today is really to break down Vertex AI's functions, you know, get into the details of building and deploying Gen AI apps. And also explore the technical side of how it actually generates media like video. Exactly. Okay.
0:42 Okay. Let's kick things off. Vertex AI, it's positioned as this unified platform on Google Cloud. That's the core idea. It's built to streamline the whole workflow for foundation models, which can get pretty complex.
0:54 Streamline how, exactly? Well, Google's aiming for an end-to-end environment. So from the very first experiments, the prototyping... Oh. All the way through to actually deploying these generative AI models into production. And I saw they have a free trial.
1:07 They do. For new customers, it's ninety days of free access to the generative AI tools. It's a good chance to, you know, play around with it. Definitely a good way for folks to get their hands dirty. So let's dig into the core features.
1:19 Access to Gemini models is a big one. Huge. And these aren't just your standard text models. The key thing is their multimodality. Multimodality meaning they handle different types of input.
1:31 Exactly. Gemini is designed from the ground up to natively process and understand text, images, video, even code. All at once? Well, it integrates the understanding. This allows for more sophisticated reasoning and generation that can span across these data types.
1:48 Any examples of that? Sure.
The docs mentioned things like pulling text straight out of an image. Okay. And then immediately structuring that text into, say, a JSON format. Or you upload an image, and it can answer questions about the image content.
2:02 Wow. Okay. That's technically quite sophisticated. You can see the applications there. Definitely.
2:07 It opens up a lot of possibilities. Then there's the Model Garden. Over 200 models? Sounds like a lot. It is.
2:12 It's basically a central hub. You've got Google's own proprietary models, open source ones, and even models from third-party vendors. All accessible how? Through APIs? Yep.
2:23 Via APIs. Makes integration much simpler. And they also highlight Gemma. Gemma. It's a family of lightweight open models.
2:31 The interesting part is they're based on the same tech as Gemini. So kinda democratizing that advanced tech. Sort of. Yeah. Makes powerful AI more accessible even if you don't have massive resources.
2:42 Lets developers pick and choose based on performance needs or constraints. Makes sense. What about prompt design? The docs mentioned a chat interface. Uh-huh.
2:50 Sounds more user friendly? It is. It's part of Vertex AI Studio. You can experiment with prompts, see what works, and you can adjust parameters. Like temperature.
2:58 What does that control, technically speaking? Right. Temperature controls the randomness in the model's output. Lower temperature means more predictable, focused answers. More deterministic. Exactly.
3:09 Higher temperature lets the model take more chances, potentially leading to more creative or, well, unexpected results. So you can kinda dial in the creativity level. That's the idea. Fine-tuning the behavior for your specific application. Okay.
3:23 But what if the base models aren't quite enough? That's where model tuning comes in, using your own data. Precisely. Tuning lets you take a foundation model and train it further on your specific data to optimize it for your task. How does that work under the hood?
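[Editor's sketch for the temperature discussion at 2:58. This illustrates the general technique of temperature-scaled softmax sampling, which is a common way such a parameter is defined; the episode does not say how Vertex AI implements it internally. The function name and the example logits are made up for illustration.]

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature.

    Hypothetical helper for illustration only; not a Vertex AI API.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

low = softmax_with_temperature(logits, 0.2)   # low temperature: mass piles onto the top token
high = softmax_with_temperature(logits, 2.0)  # high temperature: mass spreads across tokens

print("T=0.2:", [round(p, 3) for p in low])
print("T=2.0:", [round(p, 3) for p in high])
```

[At low temperature the top-scoring token takes almost all of the probability, which matches the "more predictable" behavior described; at high temperature the alternatives become much more likely to be sampled.]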
3:38 What are the methods? Vertex AI offers a few advanced techniques. There's adapter tuning. Adapter tuning. Yeah.
3:44 It's parameter efficient. You add small trainable modules to the existing model, so you only update those, not the whole thing. Saves compute. Clever. What else?
3:53 There's reinforcement learning from human feedback, RLHF, which uses human preferences to guide the tuning. The human-in-the-loop approach. Right. To make outputs more helpful or harmless. And for images, you have style tuning or subject tuning to get specific looks or represent certain objects better.
4:12 That's some granular control. Now, connecting these models to the real world, external data, services. That's Vertex AI Extensions. Correct. Extensions are like managed connectors.
4:25 They let the models reach out to, say, your company's database or a third-party API. How does that connection work technically? They handle the query translation, routing, authentication, getting the data back to the model, all that plumbing. So the model can use real-time info or even trigger actions. So the AI isn't just generating, it's interacting.
4:44 Very powerful. And deploying these, MLOps integration sounds key. Absolutely critical for production. Vertex AI uses managed endpoints. They abstract away a lot of the infrastructure headaches.
4:54 Like scaling and resource management. Exactly. Provisioning, optimizing for inference, it's handled. Integrating into an app might just be a few lines of code calling an API endpoint. Even without deep ML knowledge.
5:06 That's the goal. Plus, you get tools for monitoring, versioning, governance, the whole MLOps life cycle support. And the big one, security, especially with proprietary data for tuning. Yeah. Google emphasizes this.
5:19 Crucially, your tuning data, your prompts, the weights of your tuned model, none of that is used to train the original foundation models. So your data stays yours. Right.
And the tuned model artifact lives within your specific Google Cloud environment. Access is controlled via standard Google Cloud IAM.
5:36 That separation is vital for enterprise trust. Okay. So Vertex AI Studio is the main interface for all this, the console tool. It's the central workbench. Yeah.
5:46 A web UI for exploring, prototyping, testing, and they offer free introductory training too. What does that training cover technically? The basics, really. Yeah. Navigating the console, using the prompt interface, kicking off tuning jobs, deploying to endpoints. Gives you a practical foundation.
6:02 Good starting point. Let's dive into the studio workflows. Building with Gemini, how does that look technically? Inside the studio, you're essentially interacting with the Gemini API endpoints through that UI. You craft your prompt, which might include text, code, maybe an image.
6:18 The studio bundles that up and sends it to the model. Gemini processes it all, generates the response, and the studio displays it back to you. And you can tweak the prompt right there. Yep. Iterate, change parameters, see how the model reacts to different kinds of multimodal input. And prompt engineering is key here.
6:34 Getting the structure right. Definitely. Good prompts mean clear instructions, maybe specifying the output format like JSON, providing context. Especially with multimodal inputs, making sure they work together. Right.
6:47 The studio is where you experiment to figure out what prompt structures yield the best results for your specific task: text, code, whatever. Makes sense. We talked about tuning concepts, but how do you actually do it in the studio? What are the settings? Okay.
7:00 So you start by picking your foundation model, then you provide your training dataset, usually input-output pairs. Right. Then you configure hyperparameters, things like learning rate, batch size, number of training steps or epochs. Standard ML training parameters. Pretty much.
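[Editor's sketch for the hyperparameters listed at 7:00. This is a toy mini-batch gradient descent on a one-parameter linear model, written only to show what learning rate, batch size, and epochs do in a training loop. It is not how Vertex AI tuning is implemented, and the function name, data, and values are illustrative assumptions.]

```python
def train_linear(xs, ys, learning_rate, batch_size, epochs):
    """Fit y = w * x with mini-batch gradient descent on mean squared error.

    Hypothetical toy trainer for illustration only; not a Vertex AI API.
    Returns the learned weight and the number of update steps taken.
    """
    w = 0.0
    steps = 0
    for _ in range(epochs):  # one epoch = one full pass over the data
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Gradient of the batch's mean squared error with respect to w.
            grad = sum(2 * (w * x - y) * x for x, y in zip(batch_x, batch_y)) / len(batch_x)
            w -= learning_rate * grad  # learning rate scales each weight change
            steps += 1
    return w, steps


# Made-up input-output pairs that follow y = 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w, steps = train_linear(xs, ys, learning_rate=0.02, batch_size=2, epochs=20)
print(f"learned w = {w:.6f} after {steps} steps")
```

[With four examples and a batch size of two, each epoch is two update steps, so twenty epochs is forty steps. The data here is deliberately not shuffled so the run is repeatable; real training loops usually shuffle each epoch.]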
7:17 Learning rate controls how much the weights change. Batch size is how many examples per iteration. Steps or epochs is how long you train. Finding the right values often takes some trial and error. And you can monitor this.
7:28 Yeah. The studio lets you track the job's progress and evaluate the tuned model's performance on a validation set. Which brings us to evaluation. How does the studio help assess if the model's actually any good? It integrates a Gen AI evaluation service.
7:43 You define your evaluation dataset and the metrics you care about. Metrics like? For text, maybe BLEU or ROUGE scores, perplexity. For images, perhaps FID or Inception Score to check quality and diversity. Okay. The studio runs the model against your eval data, calculates these metrics, and shows you the results.
8:01 So you can compare models or different versions of your tuned model and optimize. That iterative loop. Tune, evaluate, refine. Exactly. It's crucial for getting models production ready.
8:11 Okay. Let's switch gears a bit to that news piece. Vertex AI Media Studio, text to professional-quality video. That sounds ambitious. It is ambitious, but the tech is evolving fast.
8:23 It builds on Vertex AI's foundation but integrates a whole pipeline of specialized models. A pipeline. Okay. Break down that technical workflow for us. What models are involved?
8:33 Right. It starts with Imagen 3. That's Google's image generation model. You give it a text prompt. And it generates the starting image.
8:40 Right. A high-quality image based on the text. Then that image gets fed into Veo 2. Veo 2 is the video generation part? Yes.
8:48 It takes the still image and turns it into a video sequence, but it's more than just animating it. How so? Technically, Veo 2 lets you control things like virtual camera movements, simulate a drone shot, pan across a scene. Yeah. You can also set the frame rate, the duration.
9:03 So quite a bit of directorial control, almost. Getting there.
It even has a magic eraser feature to remove unwanted stuff from the video frames. Okay. That's the visual side.
9:11 What about audio? For voiceovers, it uses Chirp, a text-to-speech model. You give it a script, it generates the voice. Natural sounding. That's the goal.
9:20 And for background music, it uses Lyria. Lyria, developed with YouTube. Yeah. By DeepMind and YouTube. It generates music tracks based on text descriptions, mood, style, et cetera.
9:32 Wow. So Imagen 3 for the picture, Veo 2 for motion control, Chirp for voice, Lyria for music. Exactly. It's a chain of specialized models working together. And this all happens within the Vertex AI Studio workspace.
9:46 That's the key integration point. You manage the whole process (prompt, image gen, video gen, adding audio) from that one interface. It handles passing the data between models. Streamlining that complex workflow. Pretty much.
9:58 Avoids juggling multiple separate tools. The article stressed accessibility for developers and nontechnical users. How does the tech enable that? It's partly the UI design in the studio: visual controls, natural language prompts. But also the underlying platform abstracts away a lot of the complexity. Hiding the complicated bits.
10:16 To an extent. For developers, the APIs are still there if they want deeper integration or programmatic control. And having Gemini in the same environment helps too, for text, image, code. Absolutely. That underlying multimodal capability supports various creative tasks, making it powerful regardless of technical depth.
10:35 It lowers the barrier to entry for content creation. This has been really insightful, a deep dive indeed into Vertex AI. Google's clearly built a very comprehensive, technically layered platform. For sure. From foundation models like Gemini to specialized media tools like Imagen, Veo, Chirp, Lyria, all integrated with MLOps.
10:55 It's a powerful offering. Covers the whole life cycle, seems like.
That's the aim. It's a serious toolkit for anyone wanting to leverage generative AI. Thank you for listening in.
11:03 Subscribe and follow Colaberry on social media, links in the description, and check out our website, www.colaberry.ai/podcast, for more insights like this. So thinking about all this, the technical depth, the integration of Gemini, Imagen, Veo, Chirp, Lyria within Vertex AI... Yeah. As these platforms keep evolving, offering even more fine-grained control over multimodal outputs, what new kinds of content creation or even application development paradigms do you think might emerge? What's the next big shift this enables?