gemma 4 e2b qat: google squeezes an ai model to 0.84 gb

Nick Trenkler 7 Jun 2026 1 min read

google keeps shrinking their open models

they released new versions of the gemma 4 models yesterday, optimized using a technique called quantization-aware training (qat)

how is qat different from normal compression?

normal compression, i.e. post-training quantization (ptq), quantizes the model after training is finished, which often hurts accuracy. qat bakes quantization into the training process itself, so the model learns to stay accurate even at smaller sizes.
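to make the difference concrete, here is a minimal sketch in plain python. it is not google's training code: the function names, the 4-bit symmetric scheme, and the toy linear model are all assumptions for illustration. ptq would call `quantize_symmetric` once on finished weights. qat instead runs the forward pass through `fake_quantize` at every training step and passes gradients straight through the rounding (the straight-through estimator), so the float weights drift toward values that survive rounding.

```python
def quantize_symmetric(values, bits=4):
    """Round-to-nearest symmetric quantization. Returns (ints, scale)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(v) for v in values) or 1.0    # avoid a zero scale
    scale = max_abs / qmax
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale


def fake_quantize(values, bits=4):
    """Quantize, then immediately dequantize back to floats."""
    ints, scale = quantize_symmetric(values, bits)
    return [i * scale for i in ints]


def qat_step(weights, x, target, lr=0.05, bits=4):
    """One SGD step on a toy model y = sum(w_q * x) with squared-error loss.

    Hypothetical helper: the forward pass sees quantized weights, and the
    gradient w.r.t. the quantized weights is applied to the float weights
    as if rounding were the identity (straight-through estimator).
    """
    w_q = fake_quantize(weights, bits)
    pred = sum(wq * xi for wq, xi in zip(w_q, x))
    error = pred - target
    new_weights = [w - lr * 2 * error * xi for w, xi in zip(weights, x)]
    return new_weights, error ** 2


# ptq: quantize once, after the fact
ints, scale = quantize_symmetric([0.7, -0.32, 0.12])
print(ints, scale)

# qat: quantization sits inside the training loop
w = [0.7, -0.32, 0.12]
for _ in range(20):
    w, loss = qat_step(w, x=[1.0, 2.0, -1.0], target=0.5)
print(w, loss)
```

the key design point is that the float weights are kept during training and only the forward pass is quantized, so small gradient updates are not lost to rounding.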

the smallest model, gemma 4 e2b, has been squeezed down to 0.84 gb of memory in text-only mode. what's its normal size?

- full size gemma 4 e2b (bf16): 11.4 gb
- quantized gemma 4 e2b (q4_0 / 4-bit): 2.9 gb
- mobile version (qat): 1.1 gb
- mobile text-only version (qat): 0.84 gb

going from 11.4 gb (bf16) to 0.84 gb is roughly a 93% reduction
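a quick arithmetic check on the sizes listed above, using only the numbers from this post (the dictionary keys and the `reduction_pct` helper are just labels chosen for this sketch). measured against the bf16 baseline, the 1.1 gb mobile build comes out around 90% smaller and the 0.84 gb text-only build a bit under 93% smaller.

```python
# sizes in gb, copied from the list above
sizes_gb = {
    "bf16": 11.4,
    "q4_0": 2.9,
    "qat mobile": 1.1,
    "qat mobile text-only": 0.84,
}


def reduction_pct(baseline_gb, new_gb):
    """Percentage by which new_gb is smaller than baseline_gb."""
    return (1 - new_gb / baseline_gb) * 100


baseline = sizes_gb["bf16"]
for name, size in sizes_gb.items():
    pct = reduction_pct(baseline, size)
    print(f"{name:22s} {size:5.2f} gb  {pct:5.1f}% smaller than bf16")
```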

what they did to get there:

- pre-calculated activation scaling instead of doing it on the fly
- structured data to fit mobile chip architecture natively
- compressed token generation layers to 2-bit while keeping reasoning layers higher precision
- optimized the vocabulary and short-term memory (kv cache) to allow longer conversations in less ram
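two of these ideas are easy to sketch in plain python. this is an illustration under stated assumptions, not google's implementation: the function names are made up, 8-bit activations are assumed, and the parameter counts in the memory estimate are invented round numbers, not gemma's real layer layout. the first part contrasts an on-the-fly activation scale (scan every tensor at inference time) with a pre-calculated one (picked once from calibration data and reused). the second part estimates weight memory when different layer groups use different bit widths.

```python
def dynamic_scale(activations, bits=8):
    """On the fly: scan each incoming tensor for its max before quantizing."""
    qmax = 2 ** (bits - 1) - 1
    return (max(abs(a) for a in activations) or 1.0) / qmax


def calibrate_static_scale(calibration_batches, bits=8):
    """Pre-calculated: choose one scale offline from sample data and reuse it."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(a) for batch in calibration_batches for a in batch)
    return (max_abs or 1.0) / qmax


def quantize(activations, scale, bits=8):
    """Round to integers and clip to the representable range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(a / scale))) for a in activations]


def weight_memory_gb(layer_groups):
    """layer_groups: list of (param_count, bits_per_weight) pairs."""
    total_bits = sum(params * bits for params, bits in layer_groups)
    return total_bits / 8 / 1e9


# static scale: computed once, no per-token max scan at inference time
static = calibrate_static_scale([[0.5, -1.27], [0.9, 0.3]])
print(quantize([0.5, 2.0], static))   # 2.0 is outside the calibrated range, so it clips

# mixed precision with hypothetical parameter counts (not gemma's real split)
print(weight_memory_gb([(1_000_000_000, 2), (1_000_000_000, 4)]))
```

the trade-off with a static scale is visible in the example: inference skips the per-tensor scan, but any activation larger than what calibration saw gets clipped.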

at 0.84 gb, gemma 4 e2b qat is now the most lightweight model in its class. but here's the interesting part – qat is a technique, not a gemma exclusive. apply it to something like minicpm 5 1b (currently 2 gb) and you could shrink it even further

Comparison table of Gemma 4 E2B QAT vs MiniCPM 5 1B, Qwen 3.5 2B, and LFM 2.5 1.2B showing params, size in GB, and AAI index scores
