minicpm-v 4.6 beats gemma and qwen on-device with 1.3b params


new local model for mobile outperforms gemma and qwen

minicpm-v 4.6 is a 1.3b vision-language model designed to run entirely on-device, no cloud required. it works on ios, android, and huawei's harmonyos, and on desktop via ollama or llama.cpp
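for the desktop route, here is a minimal sketch of sending an image to a locally running ollama server over its rest api. the `/api/generate` endpoint and its `model`, `prompt`, `images`, and `stream` fields are ollama's own; the model tag below is a placeholder (an assumption, not a confirmed name), and the helper function names are ours

```python
import base64
import json
import urllib.request

# assumption: placeholder tag; check the ollama model library for the real name
MODEL_TAG = "minicpm-v-4.6-placeholder"
OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default local endpoint


def build_request(image_bytes: bytes, prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request with one attached image."""
    payload = {
        "model": model,
        "prompt": prompt,
        # ollama expects images as base64-encoded strings
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def describe_image(path: str, prompt: str = "describe this image") -> str:
    """Send the image to a local ollama server and return the model's reply."""
    with open(path, "rb") as f:
        req = build_request(f.read(), prompt)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

calling `describe_image("photo.jpg")` only works once the server is running and a vision model has been pulled under the tag you pass in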

despite its size, it beats both qwen3.5-0.8b and the larger gemma4-e2b on standard multimodal benchmarks, while using 19x fewer tokens (the units of data the model processes when reading inputs and generating outputs). fewer tokens means less compute, lower latency, and longer battery life on mobile

the efficiency comes from a new image encoding architecture called llava-uhd v4. instead of processing high-resolution images at full cost all the way through, it compresses visual information early inside the vision encoder, cutting total floating-point operations (flops) by 55%
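to see why compressing early matters, here is a toy cost model, not the actual llava-uhd v4 design. it assumes a plain transformer encoder where each layer costs roughly `12·n·d²` multiply-adds for projections and mlp plus `2·n²·d` for attention. the layer count, width, token count, and compression point are all made up for illustration

```python
def layer_cost(n_tokens: int, dim: int) -> int:
    """Rough multiply-add count for one transformer layer (toy estimate)."""
    projections_and_mlp = 12 * n_tokens * dim * dim  # qkv/out projections + 4x mlp
    attention = 2 * n_tokens * n_tokens * dim        # attention scores + weighted sum
    return projections_and_mlp + attention


def encoder_cost(n_tokens: int, dim: int, layers: int, compress_at: int, factor: int) -> int:
    """Total cost when the token count drops by `factor` after `compress_at` layers."""
    before = compress_at * layer_cost(n_tokens, dim)
    after = (layers - compress_at) * layer_cost(n_tokens // factor, dim)
    return before + after


# all values below are hypothetical, chosen only to make the effect visible
N_TOKENS, DIM, LAYERS = 4096, 1024, 24
late = encoder_cost(N_TOKENS, DIM, LAYERS, compress_at=LAYERS, factor=4)  # compress after the encoder
early = encoder_cost(N_TOKENS, DIM, LAYERS, compress_at=6, factor=4)      # compress a quarter of the way in
saving = 1 - early / late
print(f"early compression saves {saving:.0%} of encoder compute in this toy setup")
```

the number this prints depends entirely on the made-up settings and is not the reported 55%. the point is only that every layer after the compression step runs on fewer tokens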

it also switches between two compression modes – 4x for accuracy-sensitive tasks, 16x for speed-sensitive ones – within the same model
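a quick bit of arithmetic shows what the two modes mean for token counts. the image size and 14-pixel patch size here are assumptions for illustration, and the function is a simplification that treats compression as merging a fixed number of patches into one token

```python
def visual_tokens(width: int, height: int, patch: int, compression: int) -> int:
    """Visual tokens left after merging `compression` patches into one token."""
    patches = (width // patch) * (height // patch)
    return patches // compression


# hypothetical 1344x1344 input with 14-pixel patches: 96 * 96 = 9216 patches
for mode, ratio in (("accuracy (4x)", 4), ("speed (16x)", 16)):
    print(mode, visual_tokens(1344, 1344, patch=14, compression=ratio))
# prints 2304 tokens for the 4x mode and 576 for the 16x mode
```

under these assumptions the 16x mode hands the language model a quarter as many visual tokens as the 4x mode, which is where the speed gain would come from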

time-to-first-token (the delay before the model starts responding) sits at 75ms even on large high-res images, which is 2.2x faster than qwen3.5-0.8b. throughput – how many tokens it can generate per second – is about 1.5x higher than qwen on the same hardware
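if you want to check these two metrics on your own hardware, a minimal sketch looks like this. it works on any python iterable that yields tokens as they are generated; `fake_stream` is a stand-in we made up so the example runs without a model

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, token_count) for a token stream."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            first = time.perf_counter()  # moment the first token arrived
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    decode_time = end - first
    # throughput over the decode phase only, excluding the wait for the first token
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, tps, count


def fake_stream(n: int = 20, first_delay: float = 0.075, step: float = 0.005) -> Iterator[str]:
    """Stand-in generator; replace with a real model's streaming output."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(step)
        yield f"tok{i}"


ttft, tps, count = measure_stream(fake_stream())
print(f"ttft: {ttft * 1000:.0f} ms, throughput: {tps:.0f} tokens/s over {count} tokens")
```

measuring throughput only after the first token keeps the two numbers separate, so a slow image-encoding step shows up in ttft rather than dragging down tokens per second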

it supports quantized formats like gguf and awq, meaning the model weights can be stored at lower numerical precision, shrinking the model further for even leaner deployment on consumer gpus or phone chips
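for a rough sense of what quantization buys at this size, here is the back-of-envelope math for 1.3b parameters. it counts weights only and ignores per-format overhead such as scales and metadata, so real gguf or awq files will differ somewhat

```python
PARAMS = 1.3e9  # parameter count from the article


def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes), ignoring overhead."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{name}: ~{weight_size_gb(PARAMS, bits):.2f} gb")
# 16-bit: ~2.60 gb, 8-bit: ~1.30 gb, 4-bit: ~0.65 gb
```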

so quietly, ai is optimizing itself into your pocket. models are getting smaller, faster, and more capable all at once – and minicpm-v 4.6 is a good example of where that trend is heading

Benchmark table of MiniCPM-V 4.6, Qwen3.5-0.8B, and Gemma4-e2B across MMStar, MMBench, MathVista, MMMU, OCRBench, DocVQA, and other vision tasks.
