opus 4.7 is worse than opus 4.6 at spotting fakes

Nick Trenkler 17 Apr 2026 3 min read

Cover image for an analysis of why Claude Opus 4.7 accepts fake concepts more often than Opus 4.6 on BullshitBench

peter gostev reported that opus 4.7 performs worse than the opus 4.6 family on bullshitbench – a benchmark that measures whether models can detect nonsense and refuse to answer it.

BullshitBench: Opus 4.7 did WORSE than Opus 4.6 family. The 'Max' thinking version did worse than non-thinking - 74% 'pushback' vs 83% for non-thinking.

As always, code, data etc is on github pic.twitter.com/3lKlx7PDt7
— Peter Gostev (@petergostev) April 17, 2026

even worse:

opus 4.7 (non-thinking): 83% pushback
opus 4.7 (max thinking): 74%

yes – the thinking version is worse than the base model.this is not noise.

this is a signal. so thehype went deeper.

what we found on bullshitbench

we started with the github data.

Overall BullshitBench v2 ranking where Claude Opus 4.7 drops to 6th place behind Sonnet 4.6, Opus 4.5 High, and Opus 4.6 variants

opus 4.7 ranks 6th overall, behind:

sonnet 4.6 (high)
opus 4.5 (high)
sonnet 4.6
opus 4.6 (high)
opus 4.6

first reaction? maybe 4.7 just got worse at “everyday reasoning”.

that would fit the narrative – yesterday we wrote that opus models are shifting away from coding toward general use.

Social media post preview showing an article titled “opus 4 to 4.7: the quiet shift no one’s pointing at,” with a glowing star-like symbol under a neon crown and the number 4.7 displayed below.

but this wasn’t it.

the real surprise: it’s not just “general reasoning”

we broke it down by domain.

Software domain ranking where Claude Sonnet 4.6 leads at 92.5% and Opus 4.7 drops to 4th place behind older Opus 4.5

in software bullshit detection (arguably the strongest domain for opus), results look like this:

sonnet 4.6 (high)
sonnet 4.6
opus 4.5 (high)
opus 4.7

read that again.

opus 4.7 is worse than opus 4.5 at detecting fake technical concepts

that’s not a small regression. that’s inversion.

domain breakdown: where opus 4.7 actually fails

BullshitBench detection rates by domain showing Opus 4.7 dropping to 66.7% in medical — the weakest spot across Claude models

from the data:

strong areas (still holding up)

physics (~93%) – near perfect
finance (~86–87%) – solid
software (~85%) – decent, but not leading

weak spots (this is new)

medical: ~66.7% – biggest drop
legal: ~80% – mid, not top-tier

compare that to sonnet 4.6 high:

consistently 80-100% across domains
especially stronger in software + medical

so opus 4.7 isn’t “bad” overall. it’s:

less consistent + more likely to engage with nonsense in high-risk domains (medical, legal)

that’s exactly where you don’t want hallucinations.

what actually breaks: looking at real prompts

we checked example questions and model responses.

case 1: fake legal framework

Opus 4.7 hedges on a fake M&A legal term and continues answering, while Sonnet 4.6 stops and reframes before responding

prompt:

“differential indemnity decomposition” in m&a

this term doesn’t exist.

opus 4.7:

partially flags it
then continues building a full answer anyway

sonnet 4.6:

clearly says:
this is not standard terminology
sets boundaries before answering

difference:

4.7 = hedges, then complies
4.6 = stops, reframes, controls output

case 2: fake medical framework

Opus 4.7 flags a fake medical framework but continues with a full clinical breakdown, while Sonnet 4.6 questions the premise first

prompt:

“differential axis convergence analysis” in rheumatology

again — sounds smart, but meaningless.

opus 4.7:

says it’s not standard
then proceeds with a detailed clinical breakdown anyway

sonnet 4.6:

explicitly questions the premise
keeps stronger epistemic boundaries

pattern repeats: opus 4.7 cannot resist continuing once it starts.

case 3: fake engineering metric

Opus 4.7 treats the fake engineering metric millihalsteads as real and reasons around it, while Opus 4.6 cleanly rejects it

prompt:

“millihalsteads per cyclomatic branch”

completely made-up metric.

opus 4.7:

pushes back
but still treats it like a real internal metric and reasons around it

opus 4.6:

cleaner rejection:
this metric doesn’t make sense
explains why combination is invalid

again: 4.7 – reinterpret and answer, 4.6 – reject and explain.

conclusion: opus 4.7 is probably overrated

opus 4.7 might be a step forward in capability – and a step back in judgment.

BullshitBench v2 detection rate over time across releases, showing Claude models from Anthropic leading over OpenAI GPT and Google Gemini

and that’s a dangerous trade. because users don’t just need answers. they need models that can tell when a question is bullshit.

right now, that’s exactly what’s breaking.

opus 4.7 is worse than opus 4.6 at spotting fakes

what we found on bullshitbench

the real surprise: it’s not just “general reasoning”

domain breakdown: where opus 4.7 actually fails

what actually breaks: looking at real prompts

conclusion: opus 4.7 is probably overrated

Read next

gpt 5.6 sol pro vs claude fable 5 vs grok 4.5 vs glm 5.2

grok 4.5 vs fable 5 vs gpt 5.5 vs glm 5.2

muse image vs gpt image 2 vs nano banana 2 vs reve 2.0

what we found on bullshitbench

the real surprise: it’s not just “general reasoning”

domain breakdown: where opus 4.7 actually fails

what actually breaks: looking at real prompts

conclusion: opus 4.7 is probably overrated

Stay in the loop

Read next

gpt 5.6 sol pro vs claude fable 5 vs grok 4.5 vs glm 5.2

grok 4.5 vs fable 5 vs gpt 5.5 vs glm 5.2

muse image vs gpt image 2 vs nano banana 2 vs reve 2.0