Updated March 5, 2026
0:00 Welcome to Colaberry AI podcast brought to you by Colaberry AI Research Labs and Carl Foundation. Today, we're diving deep into Gemma 3, Google DeepMind's latest family of open models. Yeah. Exciting stuff. These models are, well, they're built with the same core tech behind Google's Gemini 2.0.
0:17 So it gives developers a really interesting chance to work with some cutting-edge AI, but, you know, more accessibly. Absolutely. And our goal today, really, is to unpack what makes Gemma 3 significant, especially from a technical angle. We wanna get into the nitty-gritty for developers thinking about using these. Like, what can you actually do with Gemma 3?
0:36 Exactly. We'll be looking hard at the specs, the architecture, or what we can figure out from the release, and how these choices impact performance and deployment in the real world. Okay. So, Gemma 3, they're calling it a collection of lightweight but still state-of-the-art open models. That sounds like a sweet spot for developers worried about efficiency.
0:54 It really is. The focus on lightweight is key. It means these models are designed to run well on lots of different hardware. We're talking big setups like multi-GPU or TPU clusters, sure, but also single GPUs, even, you know, laptops and phones. That opens up a lot of possibilities.
1:10 And the different sizes, 1B, 4B, 12B, and 27B parameters, that gives developers quite a range, doesn't it? How should you think about picking the right one? Well, the parameter count is basically about the model's capacity. Think of it as, like, how much it knows and how complex its reasoning can be. The smaller ones, 1B and 4B, they're faster, use less compute, great for quick responses, or if you don't have massive hardware.
1:34 But the bigger ones, 12B and 27B, they generally give you better accuracy, handle tougher tasks. It just costs more in terms of resources. Right.
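To make the size trade-off just described a little more concrete, here is a rough sketch in Python that estimates how much memory the weights alone would need at each model size. The nominal parameter counts are taken from the model names, and the precision options and the helper name weight_memory_gb are assumptions made for illustration; real footprints also include activations and the attention cache, so treat these as lower bounds rather than published figures.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Weights only; activations and the attention (KV) cache add more on top.

# Standard storage widths for common numeric formats (assumed options,
# not a statement of which formats any given release ships in).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Nominal sizes taken from the model names; exact counts differ slightly.
MODEL_PARAMS = {"1B": 1e9, "4B": 4e9, "12B": 12e9, "27B": 27e9}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, params in MODEL_PARAMS.items():
        row = ", ".join(
            f"{p}: {weight_memory_gb(params, p):.1f} GB" for p in BYTES_PER_PARAM
        )
        print(f"{name:>3} -> {row}")
```

Under these assumptions the 27B model needs about 54 GB for its weights at 16-bit precision, while the 1B model needs about 2 GB, which is the kind of gap that decides whether a model fits on a single GPU or a laptop.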
So it really boils down to your specific needs and what hardware you've got access to. And the connection to Gemini 2.0 technology, that's pretty significant.
1:50 Can we infer much about the underlying architecture from that? It's a strong hint. While Google hasn't laid out all the blueprints, the Gemini connection points heavily towards a transformer-based architecture. That's sort of the standard for top language models now. Given Gemma 3's focus on efficiency and performance, you can probably bet there are refinements.
2:10 Optimizations to attention mechanisms, maybe clever scaling tricks learned from Gemini. Okay. Let's dig into performance. They're reporting it outperforms other models in its size class, even heavy hitters like Llama 3 405B, on the LMArena leaderboard. What's the technical takeaway there?
2:29 Yeah. That's impressive. Outperforming bigger models on human preference scores like LMArena's, especially given its parameter count, suggests really high parameter efficiency. Basically, the model is doing more with less. The architecture, the training, it's all working together very effectively.
2:44 So for a developer? For you, it could mean getting similar or maybe even better results without needing as much expensive hardware. Lower costs, smaller memory footprint. The fact their 27B model ranks well while needing fewer top-tier GPUs than some competitors really drives that efficiency point home. Multilingual support is another big feature. Native support for over 35 languages, pretrained for over 140.
3:06 How do they usually build that kind of language breadth into a model? Typically, it starts with the pretraining data. You need a massive, diverse dataset covering all those languages. This lets the model learn, you know, shared linguistic patterns, stuff that works across languages.
The out-of-the-box support probably means the tokenizer, how it breaks text into pieces, is well suited for those 35-plus languages, maybe some architectural tweaks too.
3:28 And the 140-plus pretrained means there's a solid foundation you can then fine-tune for even more languages if needed. And what about multimodality? Handling text, images, short videos in the 4B, 12B, and 27B models. How does that integration work under the hood? Usually, you have separate encoders for each modality. So for vision, maybe a CNN or a vision transformer to process the images or video frames.
3:51 These encoders extract features, and then those features get projected into a sort of shared space where they can mix with the text features. The core transformer architecture then processes everything together. That alignment is what lets it, you know, describe an image or answer questions about a picture. Right. Makes sense.
4:07 Now, that 128K-token context window. Yeah. That's substantial. What are the technical implications of having such a large context? Oh, it's huge. It means the model can look back at a lot more preceding text or information when generating its response.
4:22 Think processing long documents, maintaining consistency in really long chats, understanding complex codebases. It's crucial for that. But computationally expensive? Traditionally, yes. The attention mechanism in standard transformers scales quadratically with sequence length.
4:38 So handling 128K tokens efficiently likely means they're using optimized attention methods, things like sparse attention or other tricks to keep the computation manageable while still letting the model see that whole large context. It really improves its grasp of long-range dependencies. Function calling and structured output seem very practical for building real applications. How do those actually work? Okay.
5:01 Function calling. That's about letting the model trigger external tools like APIs.
It's trained to recognize when a request needs info or action from outside, and then it generates a structured call, like the function name and parameters for that tool. So it knows when it doesn't know something and needs to ask. Exactly.
5:18 Or when it needs to do something. And structured output is related. It lets the model format its response predictably, like in JSON. Super important for reliability if you're building, say, an AI agent where another system needs to parse the model's output cleanly. The official quantized versions are interesting too.
5:34 Can you break down what quantization does to these models? Sure. Quantization is basically about reducing the numerical precision of the model's weights and calculations. You go from, say, 32-bit floating-point numbers down to smaller integer formats like 8-bit integers, INT8. And the benefit is?
5:50 Smaller model size, which means less memory needed, and faster calculations, especially on hardware that's good at integer math. You can potentially get a nice speedup and run it on less powerful hardware. There can be a tiny hit to accuracy, but good quantization techniques minimize that. It's about efficiency. Google's also emphasizing the safety work behind Gemma 3.
6:12 Any insights into the technical side of that? Any specific checks or methods? Well, they mentioned rigorous data governance, which is key, making sure the training data itself is vetted. Then there's fine-tuning specifically for safety alignment, basically teaching the model to avoid harmful outputs based on their policies. They also talk about specific evaluations, like checking its potential for misuse in generating info about dangerous substances.
6:36 Reporting low risk there suggests those safety measures are having an effect. And ShieldGemma 2, the image safety checker built on this foundation, how would that work? Right.
So ShieldGemma 2 likely takes the visual features extracted by the core Gemma architecture and feeds them into a specialized classification layer. This layer would be trained on labeled images to spot different kinds of unsafe content: violence, explicit material, et cetera.
6:59 It outputs safety labels based on that classification. And the fact it's customizable means you could probably tweak the sensitivity or maybe even train it on your own specific safety rules. The integration side looks really strong. Hugging Face, Ollama, JAX, PyTorch, Keras, Google AI Edge, NVIDIA platforms. Why is that breadth so important?
7:18 It's absolutely critical for adoption. It lowers the barrier for developers. If you already use Hugging Face Transformers or PyTorch, you can just slot Gemma 3 into your existing workflow. Ollama makes local testing easy. JAX and Keras cater to other parts of the community.
7:34 And the tight integration with NVIDIA means it's ready to run well on the hardware many people already have or use in the cloud. It just makes it easy for developers to get started regardless of their preferred tools. And that ties into the hardware optimization efforts too. NVIDIA GPUs, TPUs, AMD support, even CPUs via gemma.cpp. That covers a lot of ground.
7:53 It really does. Optimizing for specific hardware is essential for getting good performance without breaking the bank. NVIDIA support across their range, from Jetson edge devices up to Blackwell data center chips, is significant. Same for Google's own TPUs. Adding ROCm support for AMD GPUs opens it up further.
8:10 And gemma.cpp for CPUs is crucial for scenarios where you just don't have a GPU or TPU available. It gives you flexibility in deployment, matching the model to your available resources. We're also seeing this Gemmaverse emerge: community projects like SEA-LION v3 for Southeast Asian languages, BgGPT for Bulgarian, OmniAudio. Yeah. That's the beauty of open models.
8:32 It sparks innovation.
These projects show people taking the core Gemma models and adapting them for specific regional needs like SEA-LION, or less common languages like Bulgarian with BgGPT, or even totally different domains, like Nexa AI using it for on-device audio. It shows the foundation is versatile and the community can build really cool, specialized things on top of it. It really helps democratize this tech. So for developers listening who are keen to jump in, what are the best technical starting points?
9:00 Easiest way to just try it out? Google AI Studio. It's browser based. No setup needed. You can play with the models right away.
9:07 From there, grab an API key, and you can use the Google GenAI SDK in your own code. And for more hands-on work? Download the models directly from Hugging Face, Ollama, or Kaggle. Then you can use libraries like Hugging Face Transformers for fine-tuning. You can run that in Google Colab, Vertex AI, or even on your own machine if you have a decent GPU.
9:26 For deployment, you've got options like Vertex AI, Cloud Run with Ollama, or using the NVIDIA NIMs through their API catalog. Lots of pathways depending on what you need. Okay. Well, this deep dive has really shed light on the technical aspects of Gemma 3. We've traced its roots to Gemini 2.0, looked at its focus on efficient performance on various hardware, its impressive multilingual and multimodal skills, and a clear effort towards safety and responsible development.
9:51 Absolutely. The combination of being open source and having this wide ecosystem integration is a big deal. It makes powerful AI much more accessible for developers and researchers, and that burgeoning Gemmaverse just highlights the potential for community-driven progress building on this foundation. Thank you for listening in. Subscribe and follow Colaberry on social media, links in the description, and check out our website, www.colaberry.ai/podcast, for more insights like this.
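The function calling and structured output flow discussed around 5:01 can be sketched from the application side in a few lines of Python. This is only an illustration under stated assumptions: the JSON shape with "name" and "parameters" keys, the get_weather tool, and the dispatch helper are all hypothetical, not a documented Gemma 3 format or API. The idea it shows is the one from the conversation: the model emits a structured call, and ordinary code parses it and runs the matching tool.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool; a real application would call an actual API here."""
    return {"city": city, "forecast": "sunny"}


# Registry of tools the application is willing to run (hypothetical).
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> dict:
    """Route model output: run a tool if it is a JSON call, else pass text through."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        # Not JSON at all, so treat it as an ordinary text answer.
        return {"type": "text", "content": model_output}

    # Valid JSON that is not shaped like a call is also treated as text.
    if not isinstance(call, dict) or "name" not in call:
        return {"type": "text", "content": model_output}

    name = call["name"]
    args = call.get("parameters", {})
    if name not in TOOLS:
        return {"type": "error", "content": f"unknown tool: {name}"}
    return {"type": "tool_result", "content": TOOLS[name](**args)}


if __name__ == "__main__":
    # Simulated model output in the assumed call format.
    simulated = '{"name": "get_weather", "parameters": {"city": "Sofia"}}'
    print(dispatch(simulated))
    print(dispatch("Bulgaria's capital is Sofia."))
```

Keeping an explicit registry of allowed tools means the model can only ever request functions the application chose to expose, and the predictable JSON shape is what makes the parsing step reliable.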