Updated March 5, 2026
0:00 Welcome to the Colaberry AI podcast, brought to you by Colaberry AI Research Labs and Carl Foundation. Today, we're doing a technical deep dive into Microsoft's newly released Phi-4 series of language models. That's right. Our main info comes from the announcements on Hugging Face and the Microsoft Tech Community pages. We're really gonna dig into the methods, the results. Yeah.
0:21 Get into the weeds a bit. Absolutely. So Microsoft built on their initial 14-billion-parameter Phi-4, you know, the one with pretty strong reasoning. Yeah. I remember that one.
0:30 Well, now they've added two new ones. Smaller, more efficient models: Phi-4 Mini Instruct at 3.8 billion parameters. Okay. And Phi-4 Multimodal. That one's 5.6 billion parameters.
0:41 And, technically speaking, it's pretty significant. They're putting them out everywhere. Everywhere. Like Hugging Face, the Azure AI Foundry, GitHub Models, and even Ollama for running locally. So quite a range.
0:52 Okay. Let's start with the Phi-4 Mini Instruct. The specs mention improvements in multilingual support, reasoning, math. What's changed under the hood to make that happen? Sure.
1:02 So Phi-4 Mini Instruct uses an optimized transformer architecture. They've tweaked things like the attention mechanisms and normalization layers compared to the earlier model. [unclear] Well, these changes seem to help it understand context better across more languages. And it improves, you know, the gradient flow during training, and that leads to more solid reasoning and math skills.
1:25 Makes sense. And a really key new feature here is something called function calling. It expands what the model can do beyond just text. Right. Function calling.
1:34 That sounds like a pretty big step for interaction. For the developers listening, how does that work technically? How do you actually use it? Okay.
So function calling basically lets the model output structured JSON data that matches function definitions you set up beforehand.
1:48 So you define an API call, sort of? Exactly. It creates this decoupled way to talk to external APIs or your own custom code. Like, the example they gave shows the model making a call to get Premier League scores. Ah, okay.
2:00 The developer sets up a get Premier League results function, defines the parameters like team name or date. Got it. And then based on what the user asks, the model generates the JSON saying, call this function with these values. And then you take that JSON, run the actual function, get the data? Precisely.
2:17 Get the data, and then you can feed that result back into the model's response, making it much more dynamic. That's pretty neat. Okay. Another big point is the quantized deployment for efficiency. How do they achieve that, and what's the trade-off, if any?
2:30 Right. Quantization. This is done using Microsoft Olive, that's their optimization framework, along with the ONNX GenAI runtime. ONNX.
2:39 Yeah. Essentially, it converts the model's weights and activations from floating-point numbers to lower-precision integers like INT8. Which shrinks the model size. Exactly. Less memory, fewer calculations needed.
2:51 That's what lets you run it on edge devices, you know, Windows PCs, iPhones, Android phones, resource-constrained places. Is there a performance hit? There can be a slight drop in precision. Yeah. But they use careful calibration techniques to keep that performance degradation really minimal.
3:07 Yeah. The fact they showed Phi-4 Mini running on an iPhone 12 Pro, well, that shows these optimization methods really work for on-device inference. Impressive for a model of that size. Alright. Let's shift gears to Phi-4 Multimodal, adding vision and audio.
3:23 That adds a whole new dimension. How does the architecture handle fusing these different inputs? Yeah.
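Editor's sketch of the function-calling loop described earlier in this segment: the developer declares a function, the model emits JSON naming it and its arguments, the application runs the real function, and the result goes back to the model. This is a minimal illustration, not the Phi Cookbook sample. The schema layout, the `name`/`arguments` keys, the `"tool"` role, and the placeholder data are all assumptions; only the function name and its `team_name` and `date` parameters come from the episode.

```python
import json

# Hypothetical tool schema: the function name and parameters mirror the
# episode's example; the exact schema layout a given runtime expects may differ.
TOOLS = [{
    "name": "get_premier_league_results",
    "description": "Look up match results for a team on a given date.",
    "parameters": {
        "team_name": {"type": "string"},
        "date": {"type": "string", "description": "ISO date, e.g. 2025-03-01"},
    },
}]


def get_premier_league_results(team_name: str, date: str) -> dict:
    """Stand-in for a real sports API; returns placeholder data only."""
    return {"team": team_name, "date": date, "result": "placeholder"}


# Registry mapping tool names to the Python callables that implement them.
REGISTRY = {"get_premier_league_results": get_premier_league_results}


def dispatch(model_output: str) -> dict:
    """Parse the model's JSON tool call, validate it, and run the function."""
    call = json.loads(model_output)
    name, args = call["name"], call.get("arguments", {})
    if name not in REGISTRY:
        raise ValueError(f"model requested unknown tool: {name}")
    allowed = next(t for t in TOOLS if t["name"] == name)["parameters"]
    unexpected = set(args) - set(allowed)
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    return REGISTRY[name](**args)


# Simulated model output (assumed format); a real run would take this
# string from the model's response instead.
simulated = (
    '{"name": "get_premier_league_results", '
    '"arguments": {"team_name": "Arsenal", "date": "2025-03-01"}}'
)
tool_result = dispatch(simulated)

# The result is serialized and appended to the conversation so the model
# can phrase the final answer for the user.
followup_message = {"role": "tool", "content": json.dumps(tool_result)}
print(followup_message)
```

Validating the tool name and argument keys before calling anything is the design choice worth keeping: the model's JSON is untrusted input, so the application, not the model, decides what actually runs.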
It's definitely more complex. Phi-4 Multimodal uses separate encoders for each type of input.
3:35 So a vision transformer for images, an audio encoder for sound, and the standard transformer for text. Okay. Separate processing streams initially. Right. These encoders create sort of intermediate representations.
3:46 Then these different representations get fused together using cross-attention mechanisms inside the main transformer. So that's where the magic happens, the fusion. That's how it can reason jointly across text, images, and sound. Like generating code from an image: it uses the vision encoder to understand the UI screenshot or diagram. It combines that visual info with the text prompt you give it. Mhmm.
4:06 And then figures out the code. There's actually sample code for that on the Phi Cookbook GitHub if you wanna see it in action. We'll make sure that link's available. Now the audio part in Phi-4 Multimodal, that opens up a lot of possibilities for interaction. Can you break down the technical side of those audio features?
4:23 Sure. The audio functions cover things like pulling out audio samples, actual voice interaction, and even audio translation. Okay. Behind the scenes, there's a dedicated audio processing pipeline. It handles feature extraction, encoding the audio.
4:38 For instance, the sample code for extracting audio snippets likely involves segmenting the audio and transcribing it. Like for summarizing a meeting recording? Potentially, yeah. Yeah. And the voice interaction demo, like the Siri example they have, that would use automatic speech recognition (ASR) to turn speech into text.
4:55 Standard stuff there. Then the language model processes the text, and it might use text-to-speech (TTS) to generate a spoken reply. Okay. And similarly for translation: ASR for the input language, the model does the translation, and then TTS for the output language. It integrates all these pieces.
5:12 Got it.
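Editor's sketch of the cross-attention fusion mentioned earlier in this segment, where text tokens attend over the outputs of a vision or audio encoder. It shows generic scaled dot-product cross-attention in plain Python and is not taken from the Phi-4 Multimodal code. The learned query, key, and value projections are omitted, and the 2-dimensional toy vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query vector attends over all key/value vectors.

    queries: one vector per text token
    keys, values: one vector per image patch or audio frame
    Returns (outputs, weights), with one row per query.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # Similarity between this text token and every encoder output.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        # Weighted average of the encoder outputs, one value per dimension.
        fused = [sum(wj * v[d] for wj, v in zip(w, values))
                 for d in range(len(values[0]))]
        outputs.append(fused)
        weights.append(w)
    return outputs, weights


# Toy 2-dimensional features; real encoders emit much larger vectors.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

fused, attn = cross_attention(text_tokens, image_patches, image_patches)
for row in attn:
    print([round(w, 3) for w in row])
```

Each printed row is a probability distribution over the image patches, so every text token ends up with a blend of visual features weighted by relevance. That blending is what lets later layers reason over both modalities at once.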
So the original Phi-4 was noted for its reasoning. How well do these smaller multimodal versions hold on to that reasoning ability, especially when they're dealing with images and text together? That's a key point. They seem to preserve those strong reasoning capabilities, likely because they continue to use high-quality, carefully curated training data that emphasizes logic and problem solving.
5:37 So the training data is crucial. Absolutely. And when you combine that reasoning with multimodality, the model can tackle more complex tasks by drawing information from different sources. Yeah. The example of generating structured project code from an image and a prompt really shows this.
5:52 How so? Well, the model isn't just recognizing objects in the image. It's analyzing the visual structure, the elements there, combining that understanding with the text instructions, and then reasoning about the actual code structure, the components needed to build what's requested. So it's cross-modal reasoning, not just stitching things together. Exactly. It demonstrates a deeper level of understanding.
6:12 Again, there's sample code for this advanced reasoning on the Phi Cookbook GitHub if people want to explore it. Okay. Final point then. Considering these are relatively small models, parameter-wise, 3.8B and 5.6B, their performance is apparently comparable to some much larger LLMs. How did they measure this, and what does it really mean for putting them on edge devices?
6:35 Yeah. That efficiency is impressive. The evaluations probably used a whole suite of standard NLP benchmarks. Yeah. Things testing language understanding, reasoning, generation, you know, the usual suspects.
6:45 Like GLUE or SuperGLUE? Likely things in that vein, yes. Yeah. The specific metrics would be in the tech reports and papers they've linked, but the main takeaway is they perform competitively on many tasks despite being much smaller.
6:58 And that efficiency translates directly too.
To suitability for edge deployment. It means you can have capable generative AI running on the device, phones, PCs, IoT systems, even with limited computing power or no constant Internet connection. Which opens up a lot of possibilities. Definitely.
7:14 Real-time, on-device AI becomes much more feasible across many applications. Think smarter assistants on your phone, localized controls in factories or homes, lots of potential. For anyone wanting the nitty-gritty details, check out the Microsoft Phi Cookbook, the Phi-4 Multimodal tech report, and the main Phi-4 paper. Great resources. Thanks for breaking all that down.
7:34 It really feels like this Phi-4 series is a significant step, especially for making AI more efficient and versatile, particularly in those resource-constrained scenarios. I agree. The focus on multimodality combined with that efficiency and strong reasoning, it points towards a very interesting direction for applied generative AI going forward. Thank you for listening in. Subscribe and follow Colaberry on social media, links in the description, and check out our website, www.colaberry.ai/podcast, for more insights like this.
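Editor's sketch of the size argument behind the edge-deployment discussion: converting weights to INT8 and the rough memory this saves for the two parameter counts quoted in the episode. The quantizer below is a generic symmetric, per-tensor scheme chosen for illustration; it is not the procedure Microsoft Olive applies, and it skips the calibration step the speakers mention. The 2-bytes-per-weight baseline (16-bit floats) is an assumption, and the footprint covers weights only, not activations or runtime overhead.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up example weights, not values from any real model.
weights = [0.82, -1.27, 0.003, 0.45, -0.61]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding moves each weight by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print("largest round-trip error:", max_error)


def weight_gigabytes(num_params, bytes_per_param):
    """Storage needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as quoted in the episode.
for name, params in [("Phi-4 Mini Instruct", 3.8e9),
                     ("Phi-4 Multimodal", 5.6e9)]:
    print(name,
          "| 16-bit:", weight_gigabytes(params, 2), "GB",
          "| INT8:", weight_gigabytes(params, 1), "GB")
```

The round-trip error is the "slight drop in precision" the speakers describe: every weight lands on the nearest multiple of the scale, and halving the bytes per weight is what brings the model within reach of phone-class memory.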