Updated March 5, 2026
0:00 Welcome to the Colaberry AI podcast, brought to you by Colaberry AI Research Labs and Carl Foundation. Today, we're embarking on a detailed exploration of Nari Labs' Dia 1.6B. Right. It's this open-source text-to-speech model that's been getting quite a bit of buzz. Yeah. 0:17 Definitely making waves. Our goal here is really to dive into the, you know, the technical side of things: what makes it tick, what it can actually do. Mhmm. And specifically, how it handles dialogue, that expressive multi-speaker stuff. Exactly. 0:29 And we're pulling information straight from Nari Labs, their docs, the code examples, trying to get a solid understanding. Good plan. So the mission really is to figure out why Dia 1.6B is a significant step, especially for generating, like, real-sounding conversations. Got it. So let's start with the core idea. 0:48 What's the main thing Dia 1.6B does? Well, at its heart, it's designed for super realistic, very expressive dialogue generation straight from a text script. Okay. Dialogue generation. So not just reading text aloud. 1:01 Precisely. It's built to handle that back and forth, the different voices you hear in a real conversation. It's a step beyond your basic TTS. That sounds like a big leap. It is. 1:12 And a huge part of why it's interesting is that it's open source. Right. You mentioned that. Yeah. The model weights, the code to run it, it's all out there. 1:19 Apache 2.0 license. Which means people can actually use it, build on it, experiment freely. Exactly. Yeah. It positions it as a really strong open alternative to, say, ElevenLabs or maybe Sesame CSM-1B, those proprietary systems. 1:34 Okay. That makes sense. So for this level of realism, what's under the hood, technically speaking? Well, first off, the model size is pretty substantial. We're talking 1.6 billion parameters. 1:44 1.6 billion. Wow. So that size is key for capturing the details. That's a big part of it. Yeah. 1:50 Yeah.
It lets the model learn to replicate all those, like, tiny variations and nuances in human speech. Okay. And how does it handle the different speakers in a dialogue? That's done using specific tags right in the input text. 2:03 Things like S1 and S2. Like speaker labels. Exactly. You label who's saying what in the script, and the model uses that to generate distinct voices for each speaker. So the script tells it speaker one says this, speaker two says that, and it creates different voices based on those tags. 2:20 Precisely. It knows who's supposed to be talking. Clever. What else? There's also something called audio conditioning. 2:27 This is pretty interesting. Audio conditioning. What's that? It means you can give the model a short audio clip, like a prompt. Okay. 2:35 And that audio prompt influences the output. Things like the emotion, the tone, the overall prosody of the generated speech. So you can guide the style of the voice with an example sound. Yeah. Exactly. 2:46 It gives you this extra layer of control over how expressive the output is. You can kind of nudge it towards a certain feel. That sounds powerful for getting specific performances. It really is. And related to that, there's voice cloning too. 2:59 Ah, I saw that mentioned. Cloning from a short sample. Yep. They have an example script, voice_clone.py, that shows how you can basically replicate someone's voice from just a brief recording. No extra training needed for that specific function. 3:13 Impressive. And language support, is it multilingual? For now, it's English only. That's what the current model supports. Got it. 3:20 English for now. You know, the speaker tags and audio conditioning offer a lot of control. I also saw it can handle nonverbal sounds. Oh, yeah. That's another neat feature for realism. 3:31 How does that work? Do you just type cough? Kind of. You use specific text tags. Right. Like laughs, coughs, sighs, clears throat, you know, things like that.
3:40 Seriously, you put laughs in the script? And the model generates the laughing sound as part of the audio output. Wow. Yeah. And there's a whole list of these. 3:48 Gasps, singing, mumbles, even beeps or claps sometimes. That adds a whole different dimension. It really does. It makes the synthesized conversations feel much more, well, human. They do add a little note, though, that some of the less common nonverbal tags might give you sort of unexpected results sometimes. 4:06 Right. Always good to know the caveats. Okay. So for people listening who think, I wanna try this, what's the process? How do you run Dia locally? 4:14 It's actually pretty straightforward if you're used to working with code from GitHub. The quick start is basically clone the repository. Standard git clone command. Yep. git clone https://github.com/nari-labs/dia.git. 4:31 Then you cd into the dia directory. Okay. And from there, they suggest using uv, the package manager. So you'd run uv run app.py. uv run app.py. 4:41 What if you don't use uv? There's the classic venv approach too. You know? Create a virtual environment, activate it, install uv inside that with pip, and then run the same uv run app.py command. Gotcha. 4:52 So standard Python project setup. Pretty much. And running that command actually fires up a Gradio user interface. Oh, nice. So there's a graphical way to interact with it too. 5:00 Yeah. Which makes it much easier to just play around with it initially without diving straight into Python code. That's handy. Now I read something interesting about the voice, that the base model isn't fine-tuned on one specific voice. That's right. 5:10 It's a key point. Because it hasn't been tuned to sound like one particular person, if you just run it multiple times without any conditioning... You might get slightly different voices each time. Exactly. The default voice isn't fixed. 5:22 It has some variability. Okay.
So if you need consistency, say, for a character across multiple lines, how do you handle that? Two main ways they suggest. First is using that audio prompt we talked about, the audio conditioning. 5:35 Right. Guide it with a sound sample. Yeah. They mentioned a guide for that was forthcoming, but the Gradio UI lets you experiment with it. The second way is simpler. 5:45 Just fix the random seed. The classic seed fixing? Yeah. Set a specific seed value. Mhmm. And you should get much more consistent vocal output for the same text input across different runs. 5:56 That makes sense for reproducibility. Okay. And for developers wanting to build this into an application, what does the Python library usage look like? It's quite clean, actually. The example shows you import soundfile, that's for saving the audio. Right. 6:11 And you import the Dia class from dia.model. Then you instantiate the model. There's a helper method, Dia.from_pretrained, which you call with nari-labs/Dia-1.6B. Mhmm. That pulls down the model weights. 6:23 Simple enough. Then you just create your text string, making sure to include those S1, S2 tags, and any nonverbal tags, like... The full script with the tags. Exactly. You pass that string to the model.generate method. And that returns? 6:35 It gives you back the audio data, usually as a NumPy array. Okay. Which you can then save to a file using soundfile.write. Just give it the file name, the audio data array, and the sample rate, which is 44.1 kilohertz for this model, so 44,100. Seems pretty developer friendly. So let's switch gears to hardware. 6:53 What kind of machine do you need to run this effectively? Right now, it's definitely geared towards GPUs, NVIDIA GPUs specifically. Makes sense for a model this size. Yeah. You need PyTorch 2.0 or newer and CUDA 12.6. 7:06 They do mention CPU support as planned, but currently, GPU is the way to go for performance. Any specific GPU models mentioned or minimum requirements?
They tested on an A4000. The main thing is the VRAM requirement. Yes. 7:18 How much? The full model takes up about 10 gigabytes of GPU VRAM. 10 GB. Okay. So not insignificant. 7:25 You need a decent card. Definitely. Oh, and one other thing. The very first time you run it, it might take a bit longer because it has to download something called the Descript Audio Codec. A dependency. Makes sense. 7:35 Yeah. It uses that codec internally. Okay. 10 GB VRAM. What about speed? 7:40 How fast does it generate the audio? The reference point they give is using that NVIDIA A4000 GPU. On that, it generates roughly 40 tokens per second. 40 tokens per second. How does that translate to actual audio time? 7:54 They estimate that about 86 tokens make up one second of audio. Okay. So it's generating roughly half a second of audio per second of processing time on that hardware. Something like that. Yeah. 8:05 It's reasonably fast, but maybe not quite real time for very long stretches on that specific GPU. They do note that if your GPU supports it, using torch.compile can speed things up. Right. PyTorch's compilation feature. Yeah. 8:19 That can give you a nice boost. Good to know. And you mentioned the 10 GB VRAM. Any plans to make it lighter? Yes. 8:24 Definitely. They've explicitly mentioned that a quantized version is planned for the future. Ah, quantization. That usually shrinks the model size and memory usage significantly. Exactly. 8:36 That would make it accessible to run on hardware with less VRAM, which would be a big plus. Absolutely. So looking ahead, besides quantization, what else is on the roadmap for Dia? They mentioned a few things. Docker support is planned. 8:50 Oh, good. That simplifies deployment a lot. For sure. Also, continued optimization for inference speed, trying to make it generate audio faster. Always welcome. 9:00 And just generally building on it. Since it's open source, they're inviting contributions from the community too. Makes sense.
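As a rough check on the throughput figures quoted above (about 40 generated tokens per second on the A4000, and about 86 tokens per second of audio), the arithmetic can be sketched in a few lines of Python. The constant and function names are illustrative and not part of Dia, and the numbers are the approximate ones given in the conversation.

```python
# Back-of-the-envelope throughput arithmetic using the approximate figures
# quoted in the episode. All names here are illustrative, not part of Dia.

GENERATED_TOKENS_PER_SECOND = 40  # quoted generation speed on an NVIDIA A4000
TOKENS_PER_AUDIO_SECOND = 86      # quoted number of tokens per second of audio


def real_time_factor(generated_per_s=GENERATED_TOKENS_PER_SECOND,
                     tokens_per_audio_s=TOKENS_PER_AUDIO_SECOND):
    """Return seconds of audio produced per second of processing."""
    return generated_per_s / tokens_per_audio_s


def processing_seconds(audio_seconds,
                       generated_per_s=GENERATED_TOKENS_PER_SECOND,
                       tokens_per_audio_s=TOKENS_PER_AUDIO_SECOND):
    """Return the estimated processing time for a clip of the given length."""
    return audio_seconds * tokens_per_audio_s / generated_per_s


if __name__ == "__main__":
    print(f"{real_time_factor():.2f} s of audio per s of processing")
    print(f"{processing_seconds(10):.1f} s to generate a 10 s clip")
```

With these figures the ratio comes out a little under one half, which is consistent with the "roughly half a second of audio per second of processing" estimate in the conversation.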
Before we wrap up, any ethical considerations mentioned? With powerful voice cloning and generation, that's always a topic. Yes. 9:12 They include an ethical disclaimer. It emphasizes responsible use. Right. Like prohibiting misuse for impersonation, creating deceptive content, or any illegal or malicious purposes. Standard but important guardrails for this kind of tech. Definitely crucial as these models get better and more accessible. 9:30 Absolutely. So wrapping things up, Dia 1.6B really seems like quite a powerful step forward for open-source TTS, especially for dialogue. I'd agree. That focus on multi-speaker conversations, the nonverbal sounds, the conditioning features, all combined with being open. Mhmm. 9:46 It's a compelling package. Yeah. It opens up a lot of possibilities for creators, developers, researchers. For sure. It's a significant asset in the speech synthesis world right now. 9:55 So maybe a final thought for you, our listener. Yeah. Think about the implications. What does it mean when technology this advanced for generating realistic human dialogue becomes widely accessible? Consider the impact on everything from content creation and entertainment to accessibility tools and, of course, that constant need for responsible development and use as things keep moving so fast. 10:15 Thank you for listening in. Subscribe and follow Colaberry on social media, links in the description, and check out our website, www.colaberry.ai/podcast, for more insights like this.
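The Python workflow described around the 6:00 mark can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's reference example: the bracketed `[S1]`/`[S2]` speaker tags, the parenthesised nonverbal tags, the helper names `build_script` and `synthesize`, the output file name, and seeding through `torch.manual_seed` are all assumptions, and the model identifier `nari-labs/Dia-1.6B` is a reading of the name as spoken in the episode. The `Dia.from_pretrained`, `model.generate`, and `soundfile.write` calls follow the description in the conversation. Running `synthesize` requires the `dia` package and a suitable GPU; `build_script` runs on its own.

```python
"""Sketch of the Python usage described in the episode.

Assumptions (not confirmed by the episode): the [S1]/[S2] bracket syntax,
the (laughs)-style nonverbal tag syntax, the helper names below, and the
use of torch.manual_seed as the way to "fix the seed".
"""


def build_script(turns):
    """Join (speaker_number, line) pairs into one tagged script string."""
    return " ".join(f"[S{speaker}] {line.strip()}" for speaker, line in turns)


def synthesize(script, out_path="dialogue.wav", seed=None):
    """Generate audio for a tagged script and save it to out_path."""
    # Imported lazily so build_script works without the heavy dependencies.
    import soundfile as sf
    from dia.model import Dia

    if seed is not None:
        import torch
        torch.manual_seed(seed)  # assumption: seeds PyTorch's global RNG

    model = Dia.from_pretrained("nari-labs/Dia-1.6B")  # downloads the weights
    audio = model.generate(script)    # described as returning a NumPy array
    sf.write(out_path, audio, 44100)  # 44.1 kHz, as stated in the episode
    return out_path


script = build_script([
    (1, "Have you tried the new model?"),
    (2, "Not yet. (laughs) Is it any good?"),
])
print(script)
```

Calling `synthesize(script, seed=42)` with the same script and seed is the kind of setup the conversation suggests for getting a more consistent voice across runs; the seed value here is arbitrary.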