Skip to content

how tokenspeed made 397b qwen3.5 run at 580 tps?

analysis Illustrated AI chip on wheels racing with flames and 580 tps text, representing Tokenspeed's inference speed breakthrough

580 tokens per second is genuinely crazy. for context, if qwen3.5 ran at this speed by default, it would rank #2 on artificial analysis's speed leaderboard – sitting between mercury 2 (903 tok/s) and granite 4.0 h small (337 tok/s). and unlike those two, qwen3.5 is a 397b-parameter flagship model, not a small or specialized one

Speed leaderboard table with Qwen3.5 via Tokenspeed highlighted at 580 tps, ranked between Mercury 2 and Granite 4.0

how they did it

Throughput vs latency chart for Qwen3.5-397B on B200 GPUs showing best per-user latency at 580 tok/s across TP configurations

tokenspeed is an open-source inference engine (the software layer that runs the model) built specifically for agent-style workloads. it stacks several tricks:

  • reuse past work – agent conversations repeat the same context constantly (system prompts, history). tokenspeed caches it and skips redoing 90%+ of repeated processing
  • point to data instead of copying it – during speculative decoding (predicting likely next words in parallel to go faster), the old method copied huge chunks of internal state every step. tokenspeed just updates a pointer (a reference saying "the answer is over there") instead of moving the data
  • run things in parallel – gpus can handle multiple jobs at once. tokenspeed splits independent work across parallel lanes so nothing waits its turn
  • bundle small operations – fuses many tiny gpu calls into single bigger ones, cutting overhead
  • keep the gpu fed – restructured the workflow so the cpu (regular processor) never pauses to wait on the gpu
  • split workload across machines – input reading and answer generation can run on separate servers, connected by rdma (direct memory transfers between machines, skipping normal os overhead)

but it's not cheap

NVIDIA DGX B200 product page showing server hardware described as the foundation for AI factory infrastructure

this result was achieved on 8 nvidia b200 gpus – the latest top-tier hardware. pricing roughly looks like:

  • cloud rental: $2.25/hour (3-year reserved) up to ~$14.24/hour on-demand, per gpu
  • buying outright: $30,000–$40,000 per gpu

so the setup behind 580 tok/s costs somewhere between ~$140/hour to rent or ~$240k–$320k to own – and that's just the gpus, not the surrounding infrastructure


the bigger lesson

this is the interesting part: the model itself didn't get smarter or faster. qwen3.5 is the same model running everywhere else. what changed is the software wrapping it and the hardware underneath it. the 580 tok/s result is a cocktail – model + inference engine + gpu generation + parallelism strategy – and any single ingredient missing kills the numberwhich means ai performance isn't really about the model anymore. it's about the whole stack working together. and most of that stack is still earlyso the real takeaway: we haven't seen what these models can actually do yet. today's benchmarks reflect today's stack – not the model's ceiling. as engines mature, hardware iterates, and optimization patterns spread to cheaper gpus, the same models will run dramatically faster and cheaper. the true power of current ai is still locked behind infrastructure that hasn't caught up yet

Stay in the loop

Get the latest AI news delivered to your inbox weekly

Thanks for subscribing!