opus 4.7 is worse than opus 4.6 at spotting fakes


peter gostev reported that opus 4.7 performs worse than the opus 4.6 family on bullshitbench – a benchmark that measures whether models can detect nonsense and refuse to answer it.

even worse:

  • opus 4.7 (non-thinking): 83% pushback
  • opus 4.7 (max thinking): 74%

yes – the thinking version is worse than the base model. this is not noise.

this is a signal. so thehype went deeper.
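to put the gap between the two opus 4.7 runs in numbers, here is a minimal sketch that uses only the two pushback rates quoted above. the dictionary keys and the helper name `gap_in_points` are ours, for illustration only. they are not part of bullshitbench.

```python
# Pushback rates as quoted in this article (percent of nonsense prompts challenged).
pushback = {
    "opus 4.7 (non-thinking)": 83.0,
    "opus 4.7 (max thinking)": 74.0,
}


def gap_in_points(rates, baseline, variant):
    """Return variant minus baseline, in percentage points (hypothetical helper)."""
    return rates[variant] - rates[baseline]


gap = gap_in_points(pushback, "opus 4.7 (non-thinking)", "opus 4.7 (max thinking)")
print(f"max thinking vs non-thinking: {gap:+.1f} points")  # -9.0 points
```

a negative value means the thinking run pushed back less often than the non-thinking run.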


what we found on bullshitbench


we started with the github data.

Overall BullshitBench v2 ranking where Claude Opus 4.7 drops to 6th place behind Sonnet 4.6, Opus 4.5 High, and Opus 4.6 variants

opus 4.7 ranks 6th overall, behind:

  1. sonnet 4.6 (high)
  2. opus 4.5 (high)
  3. sonnet 4.6
  4. opus 4.6 (high)
  5. opus 4.6

first reaction? maybe 4.7 just got worse at “everyday reasoning”.

that would fit the narrative – yesterday we wrote that opus models are shifting away from coding toward general use.

related: “opus 4 to 4.7: the quiet shift no one’s pointing at”

but this wasn’t it.


the real surprise: it’s not just “general reasoning”

we broke it down by domain.

Software domain ranking where Claude Sonnet 4.6 leads at 92.5% and Opus 4.7 drops to 4th place behind older Opus 4.5

in software bullshit detection (arguably the strongest domain for opus), results look like this:

  1. sonnet 4.6 (high)
  2. sonnet 4.6
  3. opus 4.5 (high)
  4. opus 4.7

read that again.

opus 4.7 is worse than opus 4.5 at detecting fake technical concepts

that’s not a small regression. that’s an inversion.


domain breakdown: where opus 4.7 actually fails

BullshitBench detection rates by domain showing Opus 4.7 dropping to 66.7% in medical – the weakest spot across Claude models

from the data:

strong areas (still holding up)

  • physics (~93%) – near perfect
  • finance (~86–87%) – solid
  • software (~85%) – decent, but not leading

weak spots (this is new)

  • medical: ~66.7% – biggest drop
  • legal: ~80% – mid, not top-tier

compare that to sonnet 4.6 high:

  • consistently 80-100% across domains
  • especially stronger in software + medical

so opus 4.7 isn’t “bad” overall. it’s:

less consistent + more likely to engage with nonsense in high-risk domains (medical, legal)

that’s exactly where you don’t want hallucinations.
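the “less consistent” claim can be checked with a few lines of arithmetic. this is a rough sketch, not the benchmark’s own tooling: the rates are the approximate opus 4.7 figures listed above, finance uses the midpoint of the quoted ~86–87% range (our assumption), and the 80% floor is borrowed from the low end of the sonnet 4.6 (high) range. the function name `summarize` is hypothetical.

```python
# Approximate opus 4.7 detection rates by domain, as quoted in this article.
# Assumption: finance is quoted as ~86-87%, so its midpoint (86.5) is used.
opus_47 = {
    "physics": 93.0,
    "finance": 86.5,
    "software": 85.0,
    "legal": 80.0,
    "medical": 66.7,
}


def summarize(rates, floor=80.0):
    """Return the weakest domain, the best-to-worst spread, and domains below `floor`."""
    weakest = min(rates, key=rates.get)
    spread = round(max(rates.values()) - min(rates.values()), 1)
    below = sorted(domain for domain, rate in rates.items() if rate < floor)
    return weakest, spread, below


weakest, spread, below = summarize(opus_47)
print(f"weakest: {weakest}, spread: {spread} points, below floor: {below}")
```

with these inputs, medical is the only domain under the floor, and the distance between the best and worst domain is about 26 points.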


what actually breaks: looking at real prompts


we checked example questions and model responses.


case 1: fake legal framework

Opus 4.7 hedges on a fake M&A legal term and continues answering, while Sonnet 4.6 stops and reframes before responding

prompt:

“differential indemnity decomposition” in m&a

this term doesn’t exist.

opus 4.7:

  • partially flags it
  • then continues building a full answer anyway

sonnet 4.6:

  • clearly says:
    this is not standard terminology
  • sets boundaries before answering

difference:

  • opus 4.7 = hedges, then complies
  • sonnet 4.6 = stops, reframes, controls output


case 2: fake medical framework

Opus 4.7 flags a fake medical framework but continues with a full clinical breakdown, while Sonnet 4.6 questions the premise first

prompt:

“differential axis convergence analysis” in rheumatology

again – it sounds smart, but it’s meaningless.

opus 4.7:

  • says it’s not standard
  • then proceeds with a detailed clinical breakdown anyway

sonnet 4.6:

  • explicitly questions the premise
  • keeps stronger epistemic boundaries

pattern repeats: opus 4.7 cannot resist continuing once it starts.


case 3: fake engineering metric

Opus 4.7 treats the fake engineering metric millihalsteads as real and reasons around it, while Opus 4.6 cleanly rejects it

prompt:

“millihalsteads per cyclomatic branch”

completely made-up metric.

opus 4.7:

  • pushes back
  • but still treats it like a real internal metric and reasons around it

opus 4.6:

  • cleaner rejection:
    this metric doesn’t make sense
  • explains why combination is invalid

again: opus 4.7 – reinterpret and answer, opus 4.6 – reject and explain.
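the distinction running through all three cases can be written down as a tiny labeling rule. this is only a sketch of how we read the responses, not bullshitbench’s actual grading scheme: the `Outcome` labels, the two boolean flags, and the hand-entered observations are all our own assumptions based on the descriptions above.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels for this article's reading of the responses.
    REJECT_AND_EXPLAIN = "reject and explain"
    HEDGE_THEN_COMPLY = "hedge, then comply"
    ACCEPT = "accept"
    OTHER = "other"


def label(flags_premise: bool, builds_on_premise: bool) -> Outcome:
    """Map two hand-judged observations about a response to an outcome label."""
    if flags_premise:
        return Outcome.HEDGE_THEN_COMPLY if builds_on_premise else Outcome.REJECT_AND_EXPLAIN
    return Outcome.ACCEPT if builds_on_premise else Outcome.OTHER


# (model, case, flags_premise, builds_on_premise), hand-labeled from the three cases above.
observations = [
    ("opus 4.7", "legal", True, True),
    ("sonnet 4.6", "legal", True, False),
    ("opus 4.7", "medical", True, True),
    ("sonnet 4.6", "medical", True, False),
    ("opus 4.7", "engineering", True, True),
    ("opus 4.6", "engineering", True, False),
]

tally = Counter((model, label(flags, builds)) for model, _, flags, builds in observations)
for (model, outcome), count in tally.items():
    print(f"{model}: {outcome.value} x{count}")
```

in every case here both models flag the fake premise. what separates them is whether the answer is then built on top of it.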


conclusion: opus 4.7 is probably overrated


opus 4.7 might be a step forward in capability – and a step back in judgment.

BullshitBench v2 detection rate over time across releases, showing Claude models from Anthropic leading over OpenAI GPT and Google Gemini

and that’s a dangerous trade. because users don’t just need answers. they need models that can tell when a question is bullshit.

right now, that’s exactly what’s breaking.
