Skip to content

zyphra's zaya1-8b beats models 15x its size on math

pulse Hooded figure in a Zyphra sweatshirt holds a glowing chip aloft beside benchmark papers and sparking, half-broken servers.


zyphra just released open weights that beat models 15x their size on math benchmarks

their new model is called zaya1-8b. mixture-of-experts, only 760m active params out of 8.4b total. apache 2.0, fully open.

the benchmarks don't make sense for something this small:

• aime '26: 89.1 – elite us math olympiad problems
• hmmt feb '26: 71.6 – harvard-mit math tournament
• imo-answerbench: 59.3 – international math olympiad level
• livecodebench-v6: 65.8 – fresh coding problems, post-training-cutoff
• gpqa-diamond: 71.0 – phd-level bio/chem/physics questions

it's beating models with 10x more active parameters on hard math

the scaling comparison is the real story

zaya1-8b (0.7b parameters active) vs mistral-small-4 (119b parameters total):
- aime: 89.1 vs 86.4
- hmmt: 71.6 vs 70.6
a model that fits on your phone winning against a 119b param model on competition math

how? a few genuine technical innovations:

• most models are pretrained on raw text, then reasoning is added in post-training via fine-tuning. zyphra included reasoning traces during pretraining itself – so the model learned to think step-by-step from the ground up, not as a patch on top

• post-training runs four sequential rounds of trial-and-error learning. the model attempts problems, gets scored on how well it did, and updates itself based on that score – no human labeling needed. first round warms up basic reasoning, second throws increasingly hard problems at it, third focuses purely on math and code, fourth polishes how it behaves and follows instructions. each round specializes the previous one instead of trying to do everything at once

"markovian rsa" – when the model generates multiple candidate answers and picks the best one, each candidate carries a long chain of reasoning behind it. normally that eats through your context window fast. zyphra's fix: instead of keeping the full reasoning history for each candidate, only keep the last chunk. the model forgets the early scratch work but remembers where it ended up. same quality, fraction of the memory cost.

the old playbook was simple: more parameters = smarter model. throw more compute, hire more engineers, train bigger

zaya1-8b breaks that assumption. if a 65-person startup can match 119b-parameter models by being smarter about how they train, parameter count stops being a moat. architecture and training methodology are

Benchmark table comparing Zaya1-8B against Nvidia N3 Nano 30B, Qwen3-Next 80B, and Mistral-Small-4 119B on AIME, HMMT, LCB-v6, and GPQA-D scores.

Stay in the loop

Get the latest AI news delivered to your inbox weekly

Thanks for subscribing!