what changed under the hood:
• vision. replaced the encoder with a lightweight embedding module (single matrix multiply + positional embedding + normalization). the llm backbone now handles visual processing directly
• audio. encoder removed entirely. raw audio signal is projected straight into the same token space as text
• inference. ships with multi-token prediction (mtp) drafters for speculative decoding, cutting latency
benchmarks (gemma 3 27b / gemma 4 12b / gemma 4 26b):
- gpqa diamond: 44 / 78.8 / ~80
- bbeh: 18 / 53 / 62 mmlu pro: 67 / 77.2 / 78
- livecodebench: 28 / 72 / 76
- docvqa: 83 / 94.9 / 93
- infovqa: 60 / 88.4 / 90
- mmmu pro: 65 / 69.1 / 72
runs locally on consumer laptops with 16gb vram or unified memory – including macbook m-series
demo source: google
Gemma 4 12B delivers great performance with a small memory footprint and a novel architecture. pic.twitter.com/aUyX7lzj9f
— Google (@Google) June 3, 2026
Addy Crezee