
the gap between benchmarks and actually useful

mar 2026


every few weeks there’s a new model at the top of the leaderboards. the numbers are genuinely impressive — reasoning benchmarks, coding evals, multilingual performance. the rate of improvement is real.

and then you use it for something specific — something that matters for your actual work — and the gap between benchmark performance and day-to-day usefulness becomes very apparent very quickly.

i think this is partly a problem of benchmark design. benchmarks are optimised to be measurable and comparable, which means they tend to test discrete, closed-form problems with clear correct answers. real work is messier. it’s about ambiguity tolerance, consistency across a long session, knowing when to say ‘i don’t know’, and adapting to context that doesn’t fit neatly into a prompt.

there’s also a distribution shift problem. the benchmarks are public, which means they’re increasingly in training data. a model can improve benchmark scores without improving general capability — it just needs to get better at the specific patterns those benchmarks test.

the things i find myself caring about more than benchmark position: does the model stay useful as the conversation gets longer? does it handle ambiguous instructions gracefully, or confidently go off in the wrong direction? does it push back when i’m asking for something that doesn’t make sense?

those are harder to measure, which is exactly why they’re not on the leaderboards. but they’re the things that actually determine whether i reach for a tool every day or eventually stop opening it.
