
opus 4.8 beats mythos on agentic computer use

Illustration of a man in a lab coat labeled Opus 4.8 writing benchmark scores in a book marked OSWorld in a server room

the osworld-verified benchmark tests ai models on real computer tasks – opening apps, filling forms, navigating browsers. each task is pass/fail, so the score is simply the share of tasks the model completed successfully out of the total
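a minimal sketch of how a pass/fail score like this is computed. the function name `pass_rate` and the five example results are made up for illustration; they are not osworld-verified's actual harness or data:

```python
def pass_rate(results):
    """Return the percentage of tasks that passed, given a list of booleans."""
    if not results:
        raise ValueError("need at least one task result")
    # True counts as 1 and False as 0, so sum() is the number of passed tasks
    return 100.0 * sum(results) / len(results)


# hypothetical run: 5 tasks, 4 of them completed successfully
example_results = [True, True, False, True, True]
print(f"{pass_rate(example_results):.1f}%")  # prints 80.0%
```

there is no partial credit in this scheme: a task that gets 90% of the way there still counts as a fail.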

83.4% vs 79.6% – a gap of nearly 4 percentage points on pass/fail tasks isn't noise. it's a pattern
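the gap between the two scores can be read two ways: as percentage points (a plain subtraction) or as a relative improvement over the lower score. a quick sketch of both, using only the two numbers quoted above; the variable names are illustrative:

```python
opus_score = 83.4    # % of tasks passed, as quoted above
mythos_score = 79.6  # % of tasks passed, as quoted above

# absolute gap in percentage points
point_gap = round(opus_score - mythos_score, 1)

# relative gap: how much higher the first score is, as a share of the second
relative_gap = round(100 * (opus_score - mythos_score) / mythos_score, 1)

print(f"{point_gap} points, {relative_gap}% relative")  # prints 3.8 points, 4.8% relative
```

whether a gap this size is statistically meaningful also depends on how many tasks the benchmark contains, which the numbers here don't show.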

public models are closing the gap faster than anyone expected. if anthropic doesn't open mythos to the mass market soon, the window closes

and according to the opus 4.8 release notes, anthropic is already doing that: mythos goes public in weeks

Benchmark table comparing Opus 4.8 and Mythos on agentic coding, multidisciplinary reasoning, and agentic computer use scores
