Updated March 5, 2026
0:00 Welcome to Colaberry AI podcast, brought to you by Colaberry AI Research Labs and Carl Foundation. Okay. Let's unpack this. We're diving deep today into some really interesting source material. It's all centered around a specific project from Google AI Edge. 0:13 What we've got here is mostly excerpts from a GitHub repository README. You know, the official description for the Google AI Edge Gallery project. It kinda lays out what it is, what it does, how it works. Now just for clarity, we also have a link, something about Jordan Peterson. But looking at it, that seems totally separate. 0:30 Not related to this Google AI Edge Gallery stuff at all. So, yeah. For this deep dive, we're sticking purely to this GitHub README. Our mission then is pretty straightforward. Let's pull this README apart, figure out what exactly this Google AI Edge Gallery is, why it matters in the, you know, fast-moving world of AI, and really get into the technical nuts and bolts. 0:49 We wanna understand not just what it showcases, the things you, the listener, could actually try, but also how it pulls this off, technically speaking. Absolutely. And what really grabs you from the start is that core technical idea the README puts front and center: on-device ML/GenAI. This isn't your standard cloud AI setup. We're talking about running machine learning, even these quite advanced generative AI models, directly on your device. 1:12 So think your smartphone, your tablet, whatever edge device you have. It's about doing the AI heavy lifting locally without necessarily needing that constant lifeline back to a massive data center. And this gallery project, it's like a tangible working example of how that's possible. Okay. So first impressions from the README. 1:28 It calls this the Google AI Edge Gallery, an experimental app. What exactly is it, based on this description? The name gallery is kinda suggestive. Yeah.
Experimental app is the term they use, and it has a very specific technical goal. 1:42 It's designed to, well, first, showcase these on-device generative AI capabilities. Second, to let people actually experience them. And third, really importantly, to let folks evaluate how they perform. And it does all this using the tech from Google AI Edge. The gallery part fits perfectly. 2:00 Right? It's like walking through different exhibits, different use cases of on-device AI you can actually play with, see it running live right there on the hardware. Right. A collection of demos. Exactly. 2:10 And technically speaking, the README is clear on availability. It says it's currently on Android, and there's a little note, iOS coming soon. So that tells you where the initial engineering effort went, doesn't it? Yeah. Focusing on Android first. 2:23 Mhmm. Makes sense. Dealing with hardware optimization probably isn't simple across platforms. Definitely. Especially when you're aiming for efficiency at this level. 2:30 Okay. So the README highlights some core features, capabilities that seem pretty significant technically. Let's start with the big one, the one that really defines it. Run locally, fully offline. What does that mean in technical terms? 2:45 That phrase, it's more than just a feature. You know? It's the architecture. It's the core differentiator. When it says all processing happens directly on your device, it means the entire AI inference pipeline, taking your input, whether it's text or an image, running it through the model, getting the output, all of that computation happens using just the processor, memory, and maybe specialized chips on your phone or tablet. 3:08 It's a complete departure from the cloud model where you send your stuff off, wait, and get a result back. Right. No data leaving the device. Exactly. And that has some major technical advantages. 3:17 First off, latency. It gets slashed dramatically.
There's no network delay, no waiting for data to travel back and forth. The only limit is how fast your device itself can crunch the numbers. For anything interactive, that's huge. 3:30 It's the difference between feeling smooth and feeling, well, laggy. Yeah. That immediate response. Totally. Then there's privacy and security. 3:37 This is a big one. Since your data, your prompts, your images, whatever, never leaves your device, it's inherently more private. It's not going across the Internet, not sitting on some third-party server. It stays within that secure boundary of your own hardware. Makes sense. 3:51 You control the data. You do. And thirdly, it just works offline. Once you have the models downloaded onto the device, you don't need Wi-Fi. You don't need a cell signal. 4:00 Think about using AI tools on a plane or out in the middle of nowhere or just when you prefer to be disconnected. It opens up a lot of scenarios where cloud AI just wouldn't function. So this local offline power comes directly from Google AI Edge? Precisely. The README calls AI Edge the core APIs and tools for on-device ML. 4:20 So Google's built this whole toolkit, libraries, frameworks, optimized model formats, probably specifically to make it easier for developers to build apps like this gallery, apps that run ML efficiently on edge hardware. The gallery is basically a living demonstration built with those AI Edge tools. It proves the concept. Got it. Okay. 4:37 So the app is this gallery structure letting you try different things. The README lists specific core features, the practical stuff you can actually interact with. Yeah. Let's walk through the tech behind these. First up, choose your model. 4:51 It says you can easily switch between different models from Hugging Face. How does that work, technically? Sounds flexible. Oh, it's definitely flexible, and it points to some significant technical plumbing under the hood.
Hugging Face, as you know, it's this massive hub for ML models. 5:06 So the fact that this app lets you switch between models from Hugging Face means it's not just stuck with one built-in model. It has a way to talk to that external repository. So it has to, like, browse Hugging Face somehow? Yeah. Likely. 5:20 Technically, the app probably needs code to first query or browse Hugging Face for models that are compatible, and compatibility is crucial here. The models need to be in a format that LiteRT, the runtime we'll get to, can actually understand and run. Second, once you pick one, the app has to handle the download securely, reliably, getting that model file onto your device's storage. And these models, especially LLMs or vision models, can be chunky: hundreds of megs, maybe gigs. 5:48 Wow. Yeah. Managing that download on a phone could be tricky. For sure. Think connection drops, storage space. 5:55 Yeah. Not trivial. Third, after it's downloaded, the app needs to be able to dynamically load that model into the inference engine. So switch models without restarting the whole app. It means unloading the old one, loading the new one, maybe setting up the computational graph again on the device hardware. 6:09 And it mentions comparing performance too. Right. That implies the app has built-in technical tools, instrumentation to measure things like how fast each model runs, how much memory it uses, maybe even power draw. For a gallery meant for evaluation, that's key. It lets you technically compare different models from Hugging Face, maybe different sizes or architectures, on your specific phone, and see how they actually perform in the real world. That on-device flexibility and benchmarking is pretty cool, technically. 6:37 Okay. Next feature, Ask Image. You upload an image, ask questions, get descriptions, solve problems. What kind of on-device model are we talking about here? This is a clear example of an on-device vision language model, or VLM.
6:49 That's a type of AI model specifically engineered to handle two different kinds of information at once: visual stuff, like the pixels in an image, and text, like your question. For this to happen fully on the device, the VLM itself, the code, the learned parameters, has to be stored locally and run by the device's engine. So you upload the image, the app gets the pixel data. You type a question, it processes the text. 7:13 Both these streams, visual and textual, get fed into the VLM running right there. The model's internal architecture, often complex neural nets like transformers with special vision and language parts, needs to be super optimized to do that cross-modal reasoning using only the phone's processor. That sounds computationally intensive. Images and language together. It really is. 7:32 Images are dense data. Processing them efficiently on mobile needs optimized pipelines. Language needs its own processing. Then the model has to fuse that understanding and generate a relevant text answer. 7:43 Doing all that quickly enough to feel interactive while juggling the limited memory and power requires clever model design and serious use of hardware accelerators, the GPU, or specialized NPUs if the phone has them. It really showcases running these advanced multimodal models right at the edge. Alright. Then there's the Prompt Lab. This one focuses on single-turn large language model (LLM) use cases, things like summarizing, rewriting, coding, freeform prompts. 8:10 Yeah. The Prompt Lab gets down to the core text in, text out job of an LLM, but running locally. Single turn here has a specific technical meaning. It means each interaction is self-contained. You give it a prompt, summarize text, write some code, whatever, and the LLM processes just that input to give you an output. 8:26 It doesn't remember what you asked before in this mode, like a clean slate every time. Okay. So no memory of the conversation. Exactly.
The main technical hurdle here is just getting an LLM, which can have billions of parameters and needs tons of memory, to run effectively on a phone's limited resources. 8:42 LLMs do a staggering number of calculations, mostly matrix math. Doing that fast on mobile without killing the battery or making the phone overheat needs highly optimized models and runtimes. So techniques to shrink the model or make it faster? Absolutely essential. Things like model quantization are key. 9:00 That means reducing the precision of the numbers the model uses, maybe going from 32-bit floating point down to 8-bit integers or even 4-bit. That drastically cuts down the memory needed and the computational cost. You might also see model pruning, where less important parts of the network are removed, or specialized faster transformer designs. The Prompt Lab basically proves that these big powerful models can be made small and efficient enough for local execution, handling one request at a time. Then right after that, we have AI Chat, which is for multi-turn conversations. 9:30 How is that technically different from the Prompt Lab? Okay. This adds a really significant layer of technical complexity: context management. In a chat, the AI's reply needs to make sense based on what's already been said. 9:43 Right? It needs a memory of the dialogue. So, technically, every time you send a message in AI Chat, the system doesn't just feed your new message to the model. It usually builds a bigger input that includes the conversation history, maybe the last few turns, or a summary, plus your new prompt. This combined longer input goes into the LLM. 10:01 The model now has to process potentially much more text and understand the flow of the conversation. Which sounds like even more work for the device. Exactly. The technical challenges ramp up. You need extra memory just to store the chat history.
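As a rough illustration of the quantization step described above (going from 32-bit floats to 8-bit integers), here is a minimal sketch in plain Python. It uses simple symmetric per-tensor quantization, which is only one of several schemes; the README excerpts discussed here do not say which scheme the gallery's models use. The function names and the 2-billion-parameter figure are hypothetical, chosen only to make the memory arithmetic concrete.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


def model_size_gb(num_params, bits_per_weight):
    """Weight storage only, in gigabytes (10**9 bytes); ignores activations."""
    return num_params * bits_per_weight / 8 / 1e9


weights = [0.8, -1.0, 0.3, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Hypothetical 2-billion-parameter model at three precisions.
sizes = {bits: model_size_gb(2_000_000_000, bits) for bits in (32, 8, 4)}
print(quantized, sizes)
```

Each restored value differs from the original by at most half the scale, which is the precision given up in exchange for a four-fold (8-bit) or eight-fold (4-bit) reduction in weight storage.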
10:14 Processing that longer input sequence for every single turn takes more computation, more time, more power. Getting this right on device requires clever tricks. Maybe the system only keeps a sliding window of recent history, or maybe it uses another model to summarize older parts of the chat, or maybe the LLM itself is designed specifically to handle long context efficiently on mobile hardware. So AI Chat technically demonstrates maintaining that stateful ongoing dialogue locally, which is definitely a step up in sophistication from the single-turn Prompt Lab. The app also provides performance insights. 10:47 It lists real-time benchmarks: TTFT, decode speed, latency. Can you break those down? Why are they so important for on-device AI? Right. These benchmarks give you the hard numbers on how fast and responsive the models are running on your specific device. 11:00 And they are critical for on-device stuff because, like we've been saying, you're working with tight constraints. Performance isn't just nice to have. It hits user experience, battery life, heat, everything. Okay. Let's take them one by one. 11:15 TTFT. That's time to first token. It measures how long it takes from when you send the prompt to when the very first piece of the response appears. Technically, this covers processing your input, setting up the model, and calculating that initial output token. Low TTFT is vital for making an app feel responsive. 11:31 You wanna see it start typing back almost instantly. Right? Otherwise, it feels like it's stuck. Exactly. High TTFT makes it feel sluggish. 11:37 So optimizing that initial calculation is a huge technical focus for on-device AI. Then there's decode speed. This measures how fast the model spits out the rest of the tokens after the first one. It's usually measured in tokens per second. Once that first token is out, the model keeps generating token by token until it's done.
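The sliding-window option mentioned above for chat history can be sketched as follows. This is a minimal illustration, not how the gallery app is known to work: the function name is hypothetical, and counting tokens by splitting on whitespace is a stand-in for a real tokenizer.

```python
def build_prompt(history, new_message, max_tokens,
                 count_tokens=lambda text: len(text.split())):
    """Keep the most recent turns that fit the token budget, then add the new message.

    history is a list of earlier turns, oldest first. Whitespace word counts
    stand in for real tokenizer counts (an assumption for this sketch).
    """
    budget = max_tokens - count_tokens(new_message)
    kept = []
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return "\n".join(kept + [new_message])


history = [
    "user: hi there",
    "assistant: hello how can I help",
    "user: summarize this text please",
]
prompt = build_prompt(history, "user: make it shorter", max_tokens=10)
print(prompt)
```

With a budget of 10 "tokens", only the most recent earlier turn survives alongside the new message. Older turns are dropped, which bounds memory and per-turn compute at the cost of forgetting early context.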
11:55 Decode speed tells you the throughput of that process. Higher speed means the text flows faster onto the screen. On device, getting good decode speed means making those core model calculations, especially matrix math, run super efficiently on the phone's specific hardware accelerators, minimizing gaps between tokens. Okay. So TTFT is the start time. 12:14 Decode speed is the running speed. What about latency? Latency is often the umbrella term for the total time from input to the complete output. Sometimes people use it loosely, but in generative models, it can mean the whole duration. High latency means a long wait. 12:28 And for on-device AI, every millisecond of computation burns battery and generates heat. So minimizing all aspects of latency, both the TTFT and the time spent decoding, is a primary technical goal for the runtime and model optimizations. And the app shows these in real time. That seems really useful for evaluation. Hugely useful. 12:46 It's a significant technical feature. It lets anyone, developers, researchers, even just curious users, load different models, maybe using choose your model or the next feature we'll talk about, and see exactly how they perform on their hardware. It gives you concrete data to compare models or see the impact of different optimization techniques. It makes the hidden performance details of on-device inference totally transparent and measurable. Speaking of other features, there's one that definitely sounds aimed at developers: bring your own model. 13:14 Testing local LiteRT .task models. Very technical phrasing. What does this let people do? Yeah. This one's squarely aimed at the folks actually building or tweaking on-device models. 13:27 It essentially provides a test bed, a sandbox right inside the gallery app for trying out custom models they've created.
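The three benchmarks walked through above (TTFT, decode speed, and total latency) can be derived from nothing more than the time a request was sent and the time each token arrived. The sketch below is an assumption about how such numbers could be computed; the README excerpts do not give the app's exact definitions, and the function and key names are hypothetical.

```python
def summarize_run(request_time, token_times):
    """Compute TTFT, decode speed, and total latency from timestamps in seconds.

    request_time: when the prompt was submitted.
    token_times: arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_time          # time to first token
    latency = token_times[-1] - request_time      # input to complete output
    decode_time = token_times[-1] - token_times[0]
    decoded = len(token_times) - 1                # tokens after the first one
    decode_speed = decoded / decode_time if decode_time > 0 else 0.0
    return {
        "ttft_s": ttft,
        "decode_tokens_per_s": decode_speed,
        "latency_s": latency,
    }


# Five tokens: the first arrives after 0.5 s, then one every 0.25 s.
stats = summarize_run(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(stats)
```

Under these definitions, total latency is the TTFT plus the decode time, which is why improving either one shortens the overall wait.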
The key technical requirements mentioned are that the models need to be in this specific .task file format, and they have to be compatible with LiteRT. Okay. So what are LiteRT and .task? Right. 13:44 LiteRT pops up again in the technology highlights section. It's described as the lightweight runtime for optimized model execution. So this is the actual software engine in the gallery app that takes a model file and runs its calculations on the device hardware. Lightweight is crucial because on mobile, memory and CPU cycles are precious. 14:04 The runtime itself needs to be lean. Optimized means it's smart about using the hardware efficiently: leveraging the CPU's special instructions, pushing work to the GPU, or ideally using dedicated NPUs, neural processing units, if the phone has them, to make the model run fast without draining the battery. And the .task format, that's likely the specific file package LiteRT expects. It probably bundles the model structure, its weights, maybe some configuration info that LiteRT needs to load and run it efficiently. Developers would need tools, probably from Google AI Edge, to convert their models from standard formats like TensorFlow Lite into this specific .task format. So a developer could train their own model, convert it to .task, load it into the gallery? 14:44 Exactly. They put their custom .task file on their phone, load it using this feature in the gallery app, then they can interact with it, maybe through the Prompt Lab interface if it's a text model, and critically use those performance insights we just talked about. They can measure the TTFT, decode speed, and latency of their own model running on a real device using Google's optimized LiteRT engine. It turns the gallery into a really practical technical tool for iterating on and benchmarking custom on-device AI models. Gotcha. 15:13 And the last feature listed is developer resources with quick links to model cards and source code.
Yeah. These are pointers for people who wanna go deeper technically. Model cards are basically standardized data sheets for AI models. They provide metadata, technical details about the architecture, training data, how it's meant to be used, limitations, performance specs, really important documentation if you're gonna use or evaluate a model seriously. 15:42 And access to the source code is super valuable, technically. It lets you see exactly how they built the app. How does it use the Google AI Edge APIs? How does it integrate LiteRT? How does it load models, measure performance, build the UI? 15:54 For anyone wanting to build their own on-device AI app using these tools, studying this gallery code is like getting a practical working blueprint. It shows you how it's done. Okay. That covers the features, the things to try. Let's shift to the under the hood section in the README, the foundational tech making it all work. 16:10 Right. The building blocks. First, it lists Google AI Edge again. As we said, that's the big umbrella. The core APIs and tools for on-device ML. 16:20 It's Google's whole technical offering for enabling this kind of development. The gallery app is built using this framework. It's the foundation. Then LiteRT. We've mentioned it a few times. 16:29 Yeah. And it's probably the most crucial piece for actually running the models. Mhmm. The lightweight runtime for optimized model execution. We talked about why lightweight matters, minimal resource use on the device, and optimized is about speed and efficiency. 16:44 LiteRT is engineered to squeeze maximum performance from the specific hardware. It knows how to use the CPU efficiently, how to offload work to the GPU using graphics APIs for compute, and critically, how to leverage any specialized NPUs, neural processing units, the device might have. These NPUs are custom built just for AI math.
LiteRT's job is to figure out the best way to run the model's calculations across all these available hardware units, minimizing time and power. It's doing complex stuff like graph optimization, fusing operations together. 17:13 That's where the on-device speed comes from. Okay. Then there's a specific LLM Inference API. Why a separate API just for LLMs? Good question. 17:23 It suggests that running these huge large language models locally has unique technical challenges beyond general ML tasks, warranting its own dedicated interface. LLMs are just beasts in terms of size and the sheer amount of computation needed, especially for things like the attention mechanism. So the specialized API probably handles LLM-specific technical details, things like efficiently loading massive model files that might not even fit in RAM all at once, maybe using memory mapping or streaming parts of the model. It likely includes highly optimized code for the specific operations common in transformer models, tuned for mobile chips. It probably manages the whole token-by-token generation loop and integrates advanced techniques like quantization or sparsity that work well for LLMs. 18:05 Essentially, it provides a cleaner, higher-level way for developers to run LLMs on device without getting bogged down in all the low-level optimization details. It abstracts away a lot of that complexity. Makes sense. And lastly, Hugging Face integration is listed again as a technology. Right. 18:19 Reinforcing that connecting to Hugging Face isn't just a user feature. It's a core technical component, a specific piece of engineering connecting their on-device runtime, LiteRT, to that huge external model ecosystem. Okay. That paints a clear picture of the stack. Now, for someone listening who actually wants to try this... Yeah. 18:45 What does the README say about getting started? The get started section is pretty direct, technically speaking. Step one, download the app via the latest APK. APK.
That's the Android app file. 18:56 Right? Not from the Play Store. Exactly. APK is Android package kit. It's the raw installation file. 19:02 So you're downloading it directly, maybe from the GitHub releases page. Step two is install and explore, but here's the key technical bit. Mhmm. It points you to the project Wiki for detailed guides, because installing an APK directly often requires enabling specific settings on your phone, like allowing installs from unknown sources, and there might be other permissions or setup steps. The Wiki is where you'll find those necessary technical instructions to actually get the app installed and running successfully. 19:30 It's the detailed manual for the APK. And what about the project status? Is it considered finished? Oh, definitely not. The README is very upfront about this. 19:39 It clearly states this is an experimental alpha release. From a technical perspective, alpha means it's early days. You should expect bugs. Performance might be uneven across different phones. Features might change drastically or be incomplete. 19:52 It's really intended for testing, for gathering that early feedback, not for relying on in production. Knowing it's alpha sets the right technical expectations. And being alpha, that user feedback must be gold for the developers. Right? Right. Especially with all the different Android hardware out there. 20:07 Absolutely crucial. On-device AI performance is so tied to the specific chip in the phone. The README explicitly asks for feedback. Found a bug? Report it here. 20:15 Have an idea? Suggest a feature. Bug reports tell them what's breaking on which devices. Feature requests show what people actually want to do with this tech. This feedback directly shapes the technical direction: improving LiteRT, refining the LLM API, fixing the gallery app itself. 20:31 It's vital for maturing an experimental project like this.
The README also mentions the license: licensed under the Apache License, Version 2.0. This is a standard permissive open source license. Technically, this means the source code for the gallery app is available on GitHub for anyone to look at, use, modify, and even redistribute as long as they follow the license terms, like keeping the copyright notice. It promotes technical transparency and lets the community learn from or even contribute to the project. 20:59 And there are other links too. Yeah. The useful links section points to more technical resources: the project Wiki, obviously, for setup; a Hugging Face LiteRT Community link, probably a forum or group for discussing LiteRT and compatible models; an LLM Inference guide for Android, which sounds like deeper technical docs for that specific API; and the main Google AI Edge documentation, the full reference for the entire framework. Lots of pathways there for deeper technical exploration. Oh, and it lists the project contributors, Jing Jin and Yasir Modak. Good to acknowledge the engineers behind the work. 21:33 And one final technical detail, the language used is Kotlin, 100%. Kotlin's the modern language Google promotes for Android development. So the whole app is built using Kotlin, leveraging its features for building Android apps. Okay. Pulling it all back together then, it really seems like this gallery app is designed to take all that complex underlying tech, AI Edge, LiteRT, the APIs, and make it tangible through these things to try. 21:56 That's exactly it. It's a bridge. On one side, you have the deep technical engineering challenges of making powerful AI run efficiently on, you know, resource-constrained phones. That's the work in AI Edge, LiteRT, the LLM API. On the other side, you have users, developers, researchers who wanna see and experiment with what's possible now with AI on their own device.
22:16 The features, choosing models, asking questions about images, the Prompt Lab, AI Chat, bringing your own model, seeing the performance metrics, these are the concrete ways you interact with the results of all that low-level technical work. The app makes these sophisticated capabilities, running diverse models offline, handling multimodal input, managing conversations, benchmarking performance, accessible. You can feel how fast an LLM responds locally or see a VLM describe a picture without hitting the cloud. It's a hands-on demonstration of the technical state of the art for on-device AI using Google's tools. So after digging into the technical details of this gallery via the README, what's a final provocative thought or question that leaves you with? 22:55 Something for our listeners to chew on. Well, this whole deep dive really hammers home a potential shift, doesn't it? We've seen technically how complex AI models capable of generation, understanding images and text, can be optimized to run entirely locally, offline, with measurable speed. So the question I think it raises is, if this level of sophisticated AI can technically live and execute entirely on your personal device without needing the cloud for every interaction, what new kinds of applications or user experiences does that truly unlock? What becomes practical now that just wasn't feasible when latency, privacy concerns, or the need for constant connectivity were barriers because everything relied on sending data out? 23:35 Think about real-time creative tools or truly private personal assistants or powerful accessibility features that work instantly anywhere. The technical foundation demonstrated here, putting capable AI right onto the hardware in your pocket, suggests a future where AI isn't just a service you connect to, but an inherent always-on capability of the device itself. It really makes you wonder what entirely new things become possible when that local AI power is readily available.
Thank you for listening in. Subscribe and follow Colaberry on social media links in the description, and check out our website, www.colaberry.ai backslash podcast for more insights like this.