peter gostev reported that opus 4.7 performs worse than the opus 4.6 family on bullshitbench – a benchmark that measures whether models can detect nonsense and refuse to answer it.
BullshitBench: Opus 4.7 did WORSE than Opus 4.6 family. The 'Max' thinking version did worse than non-thinking - 74% 'pushback' vs 83% for non-thinking.
— Peter Gostev (@petergostev) April 17, 2026
As always, code, data etc is on github pic.twitter.com/3lKlx7PDt7
even worse:
- opus 4.7 (non-thinking): 83% pushback
- opus 4.7 (max thinking): 74%
yes – the thinking version is worse than the base model.this is not noise.
this is a signal. so thehype went deeper.
what we found on bullshitbench
we started with the github data.

opus 4.7 ranks 6th overall, behind:
- sonnet 4.6 (high)
- opus 4.5 (high)
- sonnet 4.6
- opus 4.6 (high)
- opus 4.6
first reaction? maybe 4.7 just got worse at “everyday reasoning”.
that would fit the narrative – yesterday we wrote that opus models are shifting away from coding toward general use.

but this wasn’t it.
the real surprise: it’s not just “general reasoning”
we broke it down by domain.

in software bullshit detection (arguably the strongest domain for opus), results look like this:
- sonnet 4.6 (high)
- sonnet 4.6
- opus 4.5 (high)
- opus 4.7
read that again.
opus 4.7 is worse than opus 4.5 at detecting fake technical concepts
that’s not a small regression. that’s inversion.
domain breakdown: where opus 4.7 actually fails

from the data:
strong areas (still holding up)
- physics (~93%) – near perfect
- finance (~86–87%) – solid
- software (~85%) – decent, but not leading
weak spots (this is new)
- medical: ~66.7% – biggest drop
- legal: ~80% – mid, not top-tier
compare that to sonnet 4.6 high:
- consistently 80-100% across domains
- especially stronger in software + medical
so opus 4.7 isn’t “bad” overall. it’s:
less consistent + more likely to engage with nonsense in high-risk domains (medical, legal)
that’s exactly where you don’t want hallucinations.
what actually breaks: looking at real prompts
we checked example questions and model responses.
case 1: fake legal framework

prompt:
“differential indemnity decomposition” in m&a
this term doesn’t exist.
opus 4.7:
- partially flags it
- then continues building a full answer anyway
sonnet 4.6:
- clearly says:
this is not standard terminology - sets boundaries before answering
difference:
- 4.7 = hedges, then complies
- 4.6 = stops, reframes, controls output
case 2: fake medical framework

prompt:
“differential axis convergence analysis” in rheumatology
again — sounds smart, but meaningless.
opus 4.7:
- says it’s not standard
- then proceeds with a detailed clinical breakdown anyway
sonnet 4.6:
- explicitly questions the premise
- keeps stronger epistemic boundaries
pattern repeats: opus 4.7 cannot resist continuing once it starts.
case 3: fake engineering metric

prompt:
“millihalsteads per cyclomatic branch”
completely made-up metric.
opus 4.7:
- pushes back
- but still treats it like a real internal metric and reasons around it
opus 4.6:
- cleaner rejection:
this metric doesn’t make sense - explains why combination is invalid
again: 4.7 – reinterpret and answer, 4.6 – reject and explain.
conclusion: opus 4.7 is probably overrated
opus 4.7 might be a step forward in capability – and a step back in judgment.

and that’s a dangerous trade. because users don’t just need answers. they need models that can tell when a question is bullshit.
right now, that’s exactly what’s breaking.
Nick Trenkler