Wiring a language model into a product is a five-minute demo. Keeping that feature reliable, affordable, and correct once real users hit it is a different job entirely — and it's almost all backend engineering, not prompt engineering.

The happy path hides every problem that matters. Providers return 529s and timeouts under load. A generation that fails halfway can leave an orphaned record the UI still tries to render. And every single call costs money you're accountable for, whether or not the user is paying. None of that shows up in the demo; all of it shows up in production.

Reliability is a layer, not a prompt

The fix is a small, boring reliability layer wrapped around the model call. Retries with exponential backoff absorb transient rate-limit and overload errors. Just as important, any record created for an in-flight generation is rolled back if the call ultimately fails, so a user never ends up staring at a half-written message that will never complete.
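To make that concrete, here is a minimal sketch of such a layer, not a definitive implementation. Everything named in it is an assumption for illustration: `TransientModelError` stands in for whatever rate-limit or overload exception your provider's client raises, `call_model` is any function that takes a prompt and returns text, and `store` is a plain dict standing in for a real database table.

```python
import random
import time


class TransientModelError(Exception):
    """Hypothetical stand-in for a provider's rate-limit or overload error."""


def call_with_backoff(call_model, prompt, max_attempts=4, base_delay=0.5,
                      sleep=time.sleep):
    """Call the model, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call_model(prompt)
        except TransientModelError:
            if attempt == max_attempts - 1:
                raise  # Out of retries: let the caller decide what to do.
            # Delay doubles on each attempt; jitter keeps clients from
            # retrying in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


def generate_with_rollback(call_model, store, record_id, prompt,
                           **retry_options):
    """Create a pending record, fill it on success, remove it on failure."""
    store[record_id] = {"status": "pending", "text": None}
    try:
        text = call_with_backoff(call_model, prompt, **retry_options)
    except Exception:
        # Roll back the in-flight record so the UI never renders an orphan.
        store.pop(record_id, None)
        raise
    store[record_id] = {"status": "complete", "text": text}
    return text
```

Only errors known to be transient are retried; anything else fails fast, and every failure path removes the pending record. With a real database, the create and the rollback would more likely be a transaction or a status update than a dict operation.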

Framing model calls as an unreliable external dependency — the same way you'd treat a payment gateway or a third-party API — is the mental shift that makes AI features production-grade. You already know how to make flaky dependencies safe; a model is just another one.

Money is a first-class concern

Because generations cost money, usage has to be metered against a quota before it's spent, not after. Premium generations sit behind a check for a valid subscription, so a lapsed or free account can't quietly run up a bill. Getting this ordering right (check, then spend, then record) is what keeps unit economics from inverting the moment the feature gets popular.
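The check, spend, record ordering can be sketched in a few lines. This is an illustration under stated assumptions rather than a real billing system: `Account`, `QuotaExceeded`, `SubscriptionRequired`, and the flat `cost` per generation are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


class QuotaExceeded(Exception):
    """Hypothetical error: the account has no quota left for this call."""


class SubscriptionRequired(Exception):
    """Hypothetical error: premium generation without an active subscription."""


@dataclass
class Account:
    has_active_subscription: bool
    quota_remaining: int
    used: int = 0


def metered_generate(account, call_model, prompt, cost=1, premium=False):
    # 1. Check: refuse before any money is spent.
    if premium and not account.has_active_subscription:
        raise SubscriptionRequired("premium generation needs a subscription")
    if account.quota_remaining < cost:
        raise QuotaExceeded("quota exhausted")

    # 2. Spend: the only line that costs money.
    text = call_model(prompt)

    # 3. Record: charge the quota only for a call that actually completed.
    account.quota_remaining -= cost
    account.used += cost
    return text
```

A lapsed account is turned away before the model is ever called, and a failed call never reaches the recording step. One caveat the sketch ignores: with concurrent requests, the check and the charge would need to be atomic (for example, a reservation made inside a database transaction), or two requests could both pass the check against the same remaining quota.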

Cache at the session boundary

Finally, a surprising amount of traffic is near-duplicate within a session. Caching at that boundary means many requests that look new can be served without a fresh, expensive round-trip to the model. It's the cheapest reliability and cost win available, and it compounds as usage grows.
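A session-scoped cache can be very small. The sketch below assumes that "near-duplicate" means prompts that differ only in case and whitespace, which is the simplest possible definition and probably narrower than what a real product needs. `SessionCache`, `normalize`, and `call_model` are hypothetical names for illustration.

```python
import hashlib


def normalize(prompt):
    """Collapse case and whitespace so trivially different prompts match."""
    return " ".join(prompt.lower().split())


class SessionCache:
    """In-memory cache keyed by (session id, normalized prompt hash)."""

    def __init__(self):
        self._entries = {}

    def _key(self, session_id, prompt):
        digest = hashlib.sha256(normalize(prompt).encode("utf-8")).hexdigest()
        return (session_id, digest)

    def get_or_generate(self, session_id, prompt, call_model):
        key = self._key(session_id, prompt)
        if key not in self._entries:
            # Cache miss: pay for exactly one round-trip to the model.
            self._entries[key] = call_model(prompt)
        return self._entries[key]

    def end_session(self, session_id):
        """Drop everything cached for a session once it is over."""
        self._entries = {
            key: value
            for key, value in self._entries.items()
            if key[0] != session_id
        }
```

Keying on the session id keeps one user's results from leaking into another's, and clearing entries when the session ends bounds memory without needing an eviction policy.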