What we actually measure before shipping OpenAI-style features: tokens, retries, user-visible failures, and when to say no in the UI.
Most production issues with LLM features are not "the model was dumb." They are surprise bills, runaway loops, and users staring at a spinner because timeouts were never defined.
Start with hard caps: max tokens per request, max requests per user per day, and a server-side kill switch. Log prompt length and completion tokens on every call so you can graph cost per feature, not just per month on the vendor invoice.
Retries need rules. If the API returns 429, exponential backoff is fine. If the user taps "try again" twelve times, you should show a message and stop burning credits. Same for streaming: if the connection drops, decide whether you partial-save or discard, and make that behavior consistent.
Guardrails are product decisions. Content filters, PII stripping, and "refuse to answer" paths should be tested like any other acceptance criteria. A green demo in ChatGPT does not mean your wrapper handles edge cases when someone pastes a 40-page PDF.
Finally, ship a fallback. Offline message, cached summary, or human handoff beats a blank error. Users forgive limits; they don't forgive silent failure.
“Good systems are mostly boring parts wired together so they don’t fall over on a Tuesday night.”
Jafrix System