Updated February 27, 2026
0:00 Welcome to Colaberry AI podcast brought to you by Colaberry AI Research Labs and Carl Foundation. It's great to be here. You know, usually, when we sit down to talk about a new model release, we're, we're talking about something designed to chat with you. Right. It writes a poem or it summarizes a PDF, something like that. 0:16 Exactly. But today, we're looking at something that is, I mean, a whole different ballgame. Microsoft just dropped a model called Optum. Optum SFT to be specific. And, yeah, it is not interested in chatting. 0:27 It is, designed to solve really hard math problems. And not just any math. This is the kind of math that, you know, actually runs huge parts of the global economy. That's right. This isn't your high school calculus class. 0:39 This is industrial scale optimization, things like logistics, manufacturing, supply chains. I think people often gloss over this part of AI for business. They think it's about writing marketing copy. Sure. But the real engine room of a big company, how you schedule cruise for an airline, or how you route trucks for a delivery service Mhmm. 0:58 That all runs on these things called optimization solvers. Exactly. The industry standard, the big one, is a piece of software called Gurobi. Gurobi is pretty much the gold standard. It is. 1:10 And you could think of Gurobi as, like, this perfect powerful magic box. Yeah. You feed it a very precise mathematical model. Okay. Variables, constraints, and objective, like, minimize cost. 1:22 And it just crunches through billions of options to find the one single best answer. The solver itself is not the bottleneck. The bottleneck is the human, the person trying to explain the problem to the solver. Precisely. That is exactly it. 1:36 We call it the translation problem. A real world problem starts out as a messy email. We need to deliver 500 units by Friday, but truck a is broken. And we can't use route 66. Yeah. 1:48 Turning that mess of natural language into what's called a mixed integer linear program or a MLMPL, it's brutal. And a MLlib isn't something you can just kinda wing. It's a super strict mathematical format. Oh, it's totally unforgiving. You miss one inequality or you define a variable wrong, and the whole thing either crashes or worse, gives you a solution that looks right but is physically impossible. 2:08 So you need this really scarce expert probably with a PhD to do that translation, and that's what Microsoft built Optum to automate. It automates that specific translation layer. And the mission for this deep dive is to really get into the weeds of how they did it because the architecture and the, the whole inference strategy are just fascinating. It's a great case study in building a domain specific AI. This is a specialist tool, not a generalist. 2:32 Okay. So let's start with the architecture. We're looking at a 20,000,000,000 parameter model, but it's not a standard dense model, is it? No. It's not. 2:39 The base model is from the g p t 20 b family, but they built it using a mixture of experts architecture. And MOE. We're seeing MOE everywhere now, Mixtural GPT four. But why use it for, you know, a math coding model? It's all about a trade off between knowledge inference cost. 2:57 Okay. With 20,000,000,000 parameters, you have this massive amount of storage for knowledge. Those are syntax, math concepts, coding patterns. But if you had to activate all 20,000,000,000 for every single token you generate, it would be incredibly slow and expensive. You'd need a huge GPU cluster just to get one line of code out. 3:16 Exactly. With the MOE, the model is smart. It routes the input to specific expert subnetworks. Right. So even though the total count is 20,000,000,000 parameters, only about 3,600,000,000 are actually active for any given token. 3:29 Wow. Okay. So that's a huge efficiency game. You get the brain of a giant model, but the speed of something much smaller. Right. 3:36 Closer to a 3,000,000,000 parameter model at runtime. And that matters because this kind of code is long and detailed. Speed is important. The other spec that just jumped out of me was the context window. A 128,000 tokens. 3:49 Need a massive. Why does a math model need a context window that big? You aren't feeding it a novel. You basically are. Real industrial problems are incredibly dense. 3:57 Think about a factory production schedule. You've got machine capacity, maintenance times, labor laws, material deliveries, order deadlines. The spec document alone could be dozens of pages. And the model needs to hold all of that in its working memory at the same time. If it forgets a constraint from page two when it's writing the code for something on page 40, the whole solution is garbage. 4:19 That 128 k window lets it maintain that global picture. And the output, it's not just an explanation. No. It's executable code. It spits out Python using Groby p, which is the official Groby library. 4:32 It defines the variables, adds the constraints, calls the optimizer, and even writes the code to print the solution. It's end to end. Okay. But anyone who's tried to get an LLM to code knows the big risk is hallucination. It just makes something up. 4:47 With math, that seems even more dangerous. It is. Because the code might run, but the math could be subtly wrong. And this is where their methodology gets really, really interesting. What did they do? 4:57 They used a technique they call class based error analysis. They figured out that the public benchmark datasets for this field, things like OR and STRUCT, were just really noisy. Noisy. How? Like, the training data itself was wrong? 5:10 Exactly. Missing parameters in the problems or even just flat out incorrect correct solutions. If you train a model on that garbage, you're just teaching it to make mistakes. So they had to clean it up first? They did. 5:20 They sorted all the problems into 53 seed classes. Like the classic problem types. Yep. Traveling salesman problem, bin packing, flow shop scheduling, all the greatest hits. Then they ran a base model to see where it failed and brought in actual human experts to analyze those failures. 5:37 Ah, the human in the loop. It's so critical. The experts wrote what they called error descriptions and hint pairs for each class. The best example is for the traveling salesman problem, the TSP. Okay. 5:48 Let's drill into that. What's a common way an AI messes up a TSP? So in a TSP, you have to visit a bunch of cities exactly once. A common failure mode is generating a solution with subtours. Subtours. 6:00 Imagine you have five cities. A valid route is a, b, c, d, e, and back to a. A subtour error would be creating two separate little loops like a, b, a, and c, d, e, c. So the math says every city is visited, but the truck can't actually drive that route. Right. 6:14 It's physically impossible. To prevent that specific error, you need a special kind of constraint called the Miller Tucker Zemlyn constraint. And a general purpose LLM is probably not gonna know that. It almost never does. So the experts wrote a hint. 6:29 If you're solving a TSP, you must include these constraints to eliminate subtours. They used hints like that to regenerate the training data, forcing the model to learn the correct robust formulation. And then use majority voting to clean up the whole dataset. So the model isn't learning from messy web scrapes. It's learning from expert corrected examples. 6:49 And that's the secret sauce. That data quality is what drives the whole performance lift. So let's talk about what happens at at runtime at inference because this is something developers listening will wanna know. It's not just a simple prompt in code out. It feels more agentic. 7:05 It is very agentic. The pipeline is designed to kinda mimic how a human expert would think. It starts with classification. It figures out what kind of problem it's looking at first. Right. 7:15 It reads your prompt. It goes, okay. This looks like a bin packing problem. Once it has the class, it moves to step two, augmentation. And that's where it pulls in those hints we just talked about. 7:25 Plus slightly. It retrieves that cheat sheet for that specific class and sticks it right at the top of the context. So before it even starts writing, it's reminding itself of the common mistakes. It's like a pilot going through a preflight checklist. Check for subtours. 7:40 Check for capacity constraints. That's a great analogy. Yeah. Then comes generation. It writes out its reasoning, the math, and then finally the Python code. 7:49 But they added another layer on top called test time scaling. This is that self consistency idea, right, where you generate multiple answers and vote. Yes. The model doesn't just generate one script. It might generate, say, five or 10 different versions of the code. 8:04 Then it actually executes all of them. And then compares the answers. It takes vote. If eight out of 10 scripts produce a final cost of, say, $500 and two of them error out, it trusts the majority. It's using consensus to weed out the hallucinations. 8:21 But what happens when the code does error out? A simple syntax error could kill the whole process. That's the final piece, multi turn correction. If the generated code fails, the system captures the Gurobi SilverLog or the Python error message. And it feeds the error back into the model. 8:39 It does. It basically says, hey. Your last attempt failed. Here's the error log. Try again. 8:43 The model then reads the error, figures out what it did wrong, and rewrites the code. That is incredibly powerful. It's debugging itself in real time. It has latency for sure. Yeah. 8:53 But in optimization, you'd much rather wait three minutes for the right answer than thirty seconds for the wrong one. So the big question, does all this complexity actually work? What do the results look like? The numbers are really strong. Microsoft is reporting a 20.7% improvement in formulation accuracy on their cleaned up benchmarks compared to the base model without all this stuff. 9:13 And how does it stack up against, you know, the big proprietary models? It performs competitively with models like GPT four mini and even GPT four o on these very specific tasks, and it just blows away other open source models in its size class. That's a huge win for the OpenWeights community. And that brings us to, I think, the most important part of this deep dive, the things to try section. Because this isn't just a paper. 9:37 You can actually download and run this. This is the part that gets me really excited. They released it under the MIT license. We really can't overstate how important that is. It's huge. 9:48 So many open models these days have these restrictive licenses, noncommercial use only or things like that. Yeah. MIT means it is truly genuinely open. You can build a commercial product on top of this. So if I'm a developer at a logistics company, I can take this model, fine tune it on my own data, and deploy it all without paying Microsoft a dime in royalties. 10:09 Exactly. But, and this is a really big but, you need the right hardware. Let's talk specs. What do I need to run this? Can I fire it up on my laptop? 10:17 Yeah. Just snapples. No. No. Put the laptop away. 10:20 This is, this is data center grade hardware territory. Okay. Break it down for us. For inference, just to run it, you need at least 32 gigabytes of GPU VRAM. And you'll really want something with high memory bandwidth. 10:33 So you're looking at an NVIDIA a 100, an h 100, or one of the new b two hundreds? Definitely a a cloud deployment, though. For sure. I mean, they trained it on eight b two hundreds. You might get away with a single a 100 with 80 gigs of VRAM if you're careful, but that VRAM is critical for the huge context. 10:49 Okay. So I've got the GPU. How do I actually serve the model? Their recommendation, and it's a good one, is to use a framework called SG Link. SG Link. 10:57 Structured generation language. It's a high performance serving framework that's really optimized for models with complex inference logic and big context windows. It's very efficient with the memory. And what's the workflow like? Like? 11:08 It's actually surprisingly easy. You install slang and grow by p. You you launch the server, point it at the Microsoft Optim SFT model on Hugging Face. And the best part is that SG Lang exposes an OpenAI compatible API endpoint. Oh, I love when tools do that. 11:25 It makes integration a breeze. You could just use the standard Python OpenAI client library. You just change the base URL to point to your local SG Lang server instead of OpenAI's. So any code I have that talks to GPT four, I just swap out one line, and now it's hitting my own self hosted Optum model? That's it. 11:42 Now for the generation parameters, they have specific recommendations. A temperature point nine and a top p of one point zero. Wait. Point nine. That seems really high for code generation. 11:52 Usually, you want that to be close to zero. It is interesting, isn't it? Yeah. My guess is that it's because of that self consistency, the voting step. They want some diversity in the generated samples, so the voting process has something to work with. 12:04 If the temperature is zero, every sample would be identical. That makes perfect sense. You need variety to find the consensus. Exactly. And you wanna set your max tokens high, something like forty ninety six, because these formulations can get pretty long. 12:17 Now what about dependencies? We mentioned Gurobi. Yes. And this is key. You absolutely must have a valid Gurobi license. 12:25 The model writes code that calls Gurobi. If you run that code on a machine without a license, it's just gonna crash. And Gurobi has free academic licenses, but for commercial use, you're paying for it. Correct. The model is MIT licensed and free, but the solver it talks to is a commercial product. 12:44 Also, make sure your environment is running Python 3.12 or newer. Okay. So we've got the hardware, the software, the license. We have to talk about safety. We are generating and then executing code here. 12:55 This is the giant red warning label on the entire project. Microsoft is very upfront about this. They did no safety guard railing for things like hate speech, but that's not the real risk here. The real risk is that you're letting an AI write a script and then running it on your machine. Yes. 13:10 If the model hallucinates a line of code that says o s dot system r m dash m a r f, and you're running that automatically, well, you just wiped your server. Or it could just create an infinite loop and crash your system. Right. Sandboxing is not optional. It is mandatory. 13:26 You must run this in a secure isolated environment. A Docker container with strict resource limits and no network access is the bare minimum. You have to treat the output as completely untrusted code. Always. Inspect it or run it in a jail. 13:41 And beyond the code safety, there's the domain safety. Right. They specifically warn against using this for safety critical applications. Things like health care scheduling, financial credit scoring. Optimization is amoral. 13:54 If you tell it to minimize costs at a hospital, it might generate a schedule that overworks nurses to the breaking point because on paper, that's mathematically cheaper. The model has no concept of ethics. It only knows the objective function you give it. And that's why you still need a human expert in the loop. This model is an incredible co pilot. 14:11 It democratizes access to the solver. It lets a business analyst do something that used to require a PhD in operations research. But you still need a human to look at the final solution and ask, does this actually make sense in the real world? It bridges that gap between having a messy business problem and having the Gurobi code to solve it. And that gap has been the biggest barrier to entry for this technology for decades. 14:35 It's a really exciting time. If you have the hardware, I'd say pull down the model. Try feeding it a Sudoku puzzle just described in English or a fantasy football problem, and just watch what it does. It's an amazing learning tool. Just, remember to check your Gurobi license first. 14:49 And sandbox the execution. Always sandbox. Thank you for listening in. Subscribe and follow Colaberry on social media links in the description, and check out our website, www.colaberry.a I backslash podcast for more insights like this.