may 2026
there’s a pattern i keep seeing. a team builds something with an LLM, ships it, and celebrates because the engagement numbers look good. users are interacting with it. session lengths are up. the product is working.
except it isn’t, really. the thing about AI products is that engagement doesn’t mean what it meant for web apps. someone can be highly engaged with a product that’s consistently giving them subtly wrong answers; they just don’t know it yet.
for a search engine or a social feed, time-on-site and click-through rates are decent proxies for value delivery. the loop is tight. you search, you find or you don’t, you come back or you don’t. the signal is noisy but it’s there.
for an AI product, the loop is much longer. someone asks a question, gets a confident-sounding answer, acts on it, and discovers weeks later that the answer was incomplete or just wrong. by then the engagement data looks great. the business metrics are green. but trust is quietly eroding.
the metrics we actually need are harder to collect. how often does a user correct the model? how often do they verify output before acting? how often do they leave feeling less certain than when they arrived? those are the signals that tell you whether the thing is working.
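to make that concrete, here’s a minimal sketch of what measuring two of those signals might look like, assuming you log an event when an answer is shown, when a user corrects it, and when they verify it externally. the event names and log shape are hypothetical; the point is that these rates come from instrumenting the feedback loop, not from session length. (the third signal, leaving less certain than you arrived, you mostly have to ask users about.)

```python
# a minimal sketch: trust signals from a hypothetical interaction log.
# the event names ("answer", "user_correction", "verified_externally")
# are placeholders; substitute whatever your product actually records.
from collections import Counter

def trust_signals(events):
    """rates of corrections and verifications per answer shown."""
    counts = Counter(e["type"] for e in events)
    answers = counts["answer"] or 1  # guard against an empty log
    return {
        "correction_rate": counts["user_correction"] / answers,
        "verification_rate": counts["verified_externally"] / answers,
    }

if __name__ == "__main__":
    log = [
        {"session": "s1", "type": "answer"},
        {"session": "s1", "type": "user_correction"},
        {"session": "s2", "type": "answer"},
        {"session": "s2", "type": "verified_externally"},
        {"session": "s3", "type": "answer"},
    ]
    print(trust_signals(log))
    # -> about 0.33 each: one correction, one verification, three answers
```

a rising correction rate under flat engagement is exactly the divergence the raw numbers hide.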
the teams i’ve seen navigate this well share one thing: they’re obsessed with failure modes, not just success cases. they run adversarial tests. they talk to users who churned. they try to break their own thing before users do.
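and “break your own thing before users do” can be as unglamorous as a standing list of trap prompts run against the system. a toy sketch of that loop, with a deliberately gullible stub standing in for the real model; the prompts, the stub, and the heuristic check are all illustrative, not a real harness:

```python
# a toy red-team loop: run prompts designed to elicit confident
# nonsense and collect the ones where the system doesn't push back.
# the stub model, trap list, and heuristic check are illustrative only.

def red_team(ask, trap_prompts, pushes_back):
    """return the trap prompts whose answers sail past the check,
    i.e. the failure modes you'd rather find before your users do."""
    return [p for p in trap_prompts if not pushes_back(ask(p))]

# prompts with a false premise: a good answer contests the premise
TRAPS = [
    "when did einstein win his second nobel prize?",
    "list the three moons of venus.",
]

def fake_model(prompt):
    # stand-in for the system under test; gullible on one trap
    if "einstein" in prompt:
        return "einstein's second nobel prize came in 1935."
    return "venus has no moons, so there are none to list."

def contests_premise(answer):
    # crude heuristic: does the answer push back at all?
    return any(w in answer.lower() for w in ("no ", "none", "didn't", "did not"))

if __name__ == "__main__":
    for prompt in red_team(fake_model, TRAPS, contests_premise):
        print("confident nonsense for:", prompt)
    # -> confident nonsense for: when did einstein win his second nobel prize?
```

the harness itself is trivial. what matters is the habit: the trap list grows every time a user finds a failure you didn’t.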
measurement shapes what you optimise for. if you’re measuring the wrong thing, you’ll build the wrong thing, very efficiently, for a long time before you notice.