new local model for mobile outperforms gemma and qwen
minicpm-v 4.6 is a 1.3b vision-language model designed to run entirely on-device, no cloud required. it works on ios, android, and huawei's harmonyos, and on desktop via ollama or llama.cpp
despite its size, it beats larger models on standard multimodal benchmarks, while using 19x fewer tokens (the units of data the model processes when reading inputs and generating outputs). fewer tokens means less compute, lower latency, and longer battery life on mobile
the efficiency comes from a new image encoding architecture called llava-uhd v4. instead of processing high-resolution images at full cost all the way through, it compresses visual information early inside the vision encoder, cutting total mathematical operations (flops) by 55%
it also switches between two compression modes – 4x for accuracy-sensitive tasks, 16x for speed-sensitive ones – within the same model
time-to-first-token (the delay before the model starts responding) sits at 75ms even on large high-res images, which is 2.2x faster than qwen3.5-0.8b. throughput – how many tokens it can generate per second – is about 1.5x higher than qwen on the same hardware
it supports quantized formats like gguf and awq, meaning the model weights can be compressed further for even leaner deployment on consumer gpus or phone chips
so quietly, ai is optimizing itself into your pocket. models are getting smaller, faster, and more capable all at once – and minicpm-v 4.6 is a good example of where that trend is heading

1/5 MiniCPM-V 4.6 (1.3B) is now live 🚀🚀
— OpenBMB (@OpenBMB) May 11, 2026
High-res visual processing, optimized for consumer-grade and mobile hardware. We’ve leveraged the latest LLaVA-UHD v4 technique to cut vision encoding costs by 55%, enabling native edge deployment with extreme efficiency.
🔥 Beats… pic.twitter.com/tshq21OguO
Nick Trenkler