Lessons from shipping an AI agent
The gap between a notebook agent and a production agent is wider than it looks. In the notebook, every prompt is a love letter. In production, every prompt is a contract that someone will eventually misuse, ignore, or sue you over.
Three things surprised me. First: latency budgets dwarf clever prompting. A two-call agent that runs in 800ms beats a five-call agent that finishes in 4 seconds. Users don’t read agent transcripts; they read clocks. Second: retrieval quality is the entire game. A model is only as smart as the chunks you hand it, and indexing decisions you made on day one will haunt you for months. Third, graceful degradation: the agent should know when to give up and route to a human, and it should do that loudly enough that you notice when it starts happening more often.
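The "loudly enough" part of that third lesson can be made concrete with a sliding-window counter over handoff events. This is a minimal sketch, not our production code; the class name, window size, and alert threshold are all illustrative assumptions.

```python
import time
from collections import deque
from typing import Optional


class HandoffMonitor:
    """Tracks how often the agent gives up and routes to a human.

    `window_s` and `alert_ratio` are illustrative defaults, not
    values from the post.
    """

    def __init__(self, window_s: float = 3600.0, alert_ratio: float = 0.2):
        self.window_s = window_s
        self.alert_ratio = alert_ratio
        # (timestamp, handed_off) pairs inside the sliding window.
        self.events: deque = deque()

    def record(self, handed_off: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, handed_off))
        # Evict events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

    def handoff_rate(self) -> float:
        if not self.events:
            return 0.0
        return sum(1 for _, h in self.events if h) / len(self.events)

    def should_alert(self) -> bool:
        # "Loud" failure: page someone once the rate crosses the threshold,
        # instead of letting quiet escalations pile up in a dashboard.
        return self.handoff_rate() > self.alert_ratio
```

Wiring `should_alert` to an actual pager is left out; the point is that the handoff path emits a rate you can threshold, not just a log line.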
We built rubric-based eval harnesses, not because they’re scientifically pure, but because they catch the regressions that wreck users’ weeks. The boring infrastructure pays the rent.
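A rubric harness in this spirit can be as small as a named list of pass/fail checks run over the agent's output and gated in CI. The sketch below assumes that shape; the check names and the example rubric are invented for illustration, not taken from our system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    """One rubric item: a name plus a pass/fail predicate on the output."""
    name: str
    passes: Callable[[str], bool]


def score(output: str, rubric: List[Check]) -> Dict[str, bool]:
    """Run every check; the per-check breakdown is what makes regressions legible."""
    return {c.name: c.passes(output) for c in rubric}


# Hypothetical rubric for a refund-policy answer.
refund_rubric = [
    Check("mentions_refund_window", lambda o: "30 days" in o),
    Check("no_legal_promises", lambda o: "guarantee" not in o.lower()),
    Check("cites_policy", lambda o: "policy" in o.lower()),
]

result = score(
    "Refunds are accepted within 30 days per our returns policy.",
    refund_rubric,
)
# Regression gate: fail the build if any check regresses.
assert all(result.values())
```

The value is less in any single check than in the diff between runs: when a prompt change flips `no_legal_promises` from passing to failing across the suite, you find out before users do.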
thanks for reading —j.