tiny 7M-param model beats openai's o3 on arc prize

Nick Trenkler 1 May 2026 2 min read

Suited man drives a chisel into a cracked stone wall carved with Scale, More, Params as green light bursts through behind.

turns out we've been building ai wrong – a @ycombinator visiting partner explained on a recent podcast episode of decoded

yc partners francois chaubard (@FrancoisChauba1) and ankit gupta (@agupta) broke down two research papers that challenge everything we assumed about ai scaling. here are the takeaways for builders, by thehype

1. about the "bigger is better" myth:

• two new research papers – hierarchical reasoning model (hrm) and tiny recursive model (trm) – showed that a tiny model can beat o3, one of openai's most powerful reasoning models, on hard puzzle benchmarks. the headline comparison is arc prize, a benchmark testing abstract visual reasoning and problem solving. the tiny model was trained on roughly a thousand puzzles. o3 was trained on virtually the entire internet. as recounted in the episode, o3 scored literally zero and the tiny one scored 87%; that 87% figure matches the trm paper's reported sudoku-extreme accuracy, while its reported arc-agi-1 score is about 45%. and while o3 likely runs on hundreds of billions of parameters, the tiny model had just 7 million.

• the real advantage comes from smart architecture choices, not raw parameter count – specifically, letting the model loop and refine its own thinking rather than just going deeper with more layers and parameters

• small specialized models trained on a specific narrow task can dramatically outperform giant general-purpose ones, which changes how you should think about scoping ai features in your product
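to make the looping idea concrete, here is a minimal sketch of why reusing one block adds computation steps without adding parameters. the layer width, step count, and function names are illustrative assumptions, not values from the hrm or trm papers.

```python
# toy parameter count: stacking distinct layers vs looping one shared layer.
# width and steps are made-up illustration values, not taken from hrm/trm.

def dense_layer_params(width: int) -> int:
    """weights (width x width) plus one bias per output unit."""
    return width * width + width


def stacked_params(width: int, depth: int) -> int:
    """depth distinct layers, each with its own weights."""
    return depth * dense_layer_params(width)


def looped_params(width: int, loops: int) -> int:
    """one layer applied `loops` times; the weights are shared,
    so the count does not depend on `loops`."""
    return dense_layer_params(width)


width, steps = 512, 16
print("stacked:", stacked_params(width, steps))
print("looped: ", looped_params(width, steps))
```

both variants run 16 sequential steps of computation, but the stacked version pays for 16 sets of weights while the looped one pays for a single set.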

2. about the real limits of current ai tools:

• tools like chatgpt do a single computational pass through a fixed stack of layers per token – meaning there's a hard, provable ceiling on how much sequential reasoning fits into each token, however large the model gets

• chain of thought prompting (when ai shows its step-by-step thinking) helps but is ultimately bounded by what exists in the training data – the model cannot discover genuinely new reasoning paths, it can only recombine what humans have already written

• for problems that require many sequential steps of logic – complex planning, constraint satisfaction (think sudoku: you can't fill everything at once, each step depends on the last), multi-step analysis – current llms will silently fail or hallucinate rather than admitting they've hit their limit
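the sudoku point can be shown with a toy example. the sketch below solves a 4x4 sudoku by repeatedly filling only the cells that are already forced, and counts how many passes that takes. the puzzle, the helper names, and the "naked single" strategy are assumptions chosen for illustration; this is ordinary constraint propagation, not how hrm or trm work internally.

```python
# toy 4x4 sudoku (2x2 boxes, 0 = empty) solved by repeated "naked single"
# passes. the puzzle and helper names are illustrative assumptions.

def candidates(grid, r, c):
    """digits 1-4 not already used in the cell's row, column, or box."""
    used = set(grid[r]) | {grid[i][c] for i in range(4)}
    br, bc = 2 * (r // 2), 2 * (c // 2)
    used |= {grid[i][j] for i in range(br, br + 2) for j in range(bc, bc + 2)}
    return {1, 2, 3, 4} - used


def solve_in_passes(grid):
    """fill every cell that has exactly one candidate, judged against the
    grid as it stood at the start of the pass; repeat until stuck or done.
    returns the final grid and the number of passes that made progress."""
    grid = [row[:] for row in grid]
    passes = 0
    while True:
        snapshot = [row[:] for row in grid]
        forced = {}
        for r in range(4):
            for c in range(4):
                if snapshot[r][c] == 0:
                    cand = candidates(snapshot, r, c)
                    if len(cand) == 1:
                        forced[(r, c)] = cand.pop()
        if not forced:
            return grid, passes
        for (r, c), digit in forced.items():
            grid[r][c] = digit
        passes += 1


puzzle = [
    [1, 0, 0, 4],
    [0, 0, 0, 0],
    [0, 0, 0, 3],
    [4, 0, 2, 0],
]
solved, passes = solve_in_passes(puzzle)
for row in solved:
    print(row)
print("passes needed:", passes)
```

on this puzzle only three cells are forced at the start; the rest become forced only after earlier cells are filled, which is the "each step depends on the last" property in miniature.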

3. about what's actually new and worth watching:

• recursive models like hrm and trm solve problems by maintaining a working memory that gets refined in loops, similar to how a human would iterate on a hard problem rather than answering instantly

• when chatgpt reasons step by step, it has to write out every thought as actual words – slow, limited, and constrained to patterns seen in human text. hrm and trm instead think silently inside the model, in raw numerical vectors. it's like the difference between someone who must narrate every thought out loud and someone who thinks quietly and only speaks when ready – the silent version is richer, faster, and figures out its own reasoning strategy without being shown how

• the most promising near-term direction is combining the broad knowledge of large llms with small recursive reasoning models on top – letting the big model handle language and context while the small one handles hard logical reasoning

• this space is still wide open – even the researchers admit they don't fully understand why these techniques work, which means there's real first-mover opportunity for builders willing to experiment now
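the "working memory refined in loops" idea can be sketched as two nested loops: an inner loop that only updates a hidden state, and an outer loop that commits a revised answer. everything below is a toy under stated assumptions: the scalar state, the update rules, and the names `refine_latent`, `refine_answer`, and `solve` are invented for illustration and are not the hrm or trm update equations, which use learned neural networks.

```python
# toy latent-refinement loop. x is the problem, y the current answer, and z a
# hidden scratch value. the update rules are invented for illustration and are
# not the hrm or trm update equations.

def refine_latent(x: float, y: float, z: float) -> float:
    """nudge the hidden state halfway toward the current error (x - y)."""
    return z + 0.5 * ((x - y) - z)


def refine_answer(y: float, z: float) -> float:
    """revise the answer using the hidden state."""
    return y + z


def solve(x: float, outer_steps: int = 20, inner_steps: int = 3) -> float:
    y, z = 0.0, 0.0
    for _ in range(outer_steps):
        for _ in range(inner_steps):  # think silently: only z changes
            z = refine_latent(x, y, z)
        y = refine_answer(y, z)       # then commit a revised answer
    return y


answer = solve(42.0)
print(answer)
```

the same two small update rules are reused on every iteration, and nothing is written out as text along the way; only the final `y` is read off as the answer.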

A 7-million parameter model outperforming models a thousand times its size on tasks like ARC Prize. That's what recursive reasoning unlocks.

In this episode of Decoded, YC's @agupta and @FrancoisChauba1 break down two recent papers on recursive AI models, HRMs and TRMs, that are… pic.twitter.com/slZh2sfHlE
— Y Combinator (@ycombinator) May 1, 2026
