On June 16, OpenAI shipped something with a boring name and a big implication for anyone building on top of language models. They called it Deployment Simulation. Before a new candidate model goes out, they replay real past conversations through it and compare how it behaves against the version already in production. The goal is simple. Catch the regressions before users do.
If you ship features powered by an LLM, that idea should sound familiar and a little uncomfortable. Familiar because it is just regression testing. Uncomfortable because most teams building on these models do not do it.
The upgrade you cannot see coming
Here is the trap. A new model drops. The benchmarks look great. It is cheaper, faster, and scores higher on every public eval. You swap the model string in your config, run a few manual prompts, everything looks fine, and you ship.
Then the support tickets start. The new model got better in aggregate and worse at the three things your product actually depends on. Maybe it is more verbose and your UI assumed short answers. Maybe it stopped returning valid JSON on edge cases. Maybe its refusal behaviour changed and now it declines a request your old prompt handled fine.
None of that shows up in a benchmark. It shows up in your traffic.
What replaying real conversations actually catches
Public evals measure general capability. They tell you almost nothing about your prompts, your tools, your formatting contract, or the strange way your users phrase things. A model can climb the leaderboard and quietly break the one workflow that pays your bills.
Replaying real conversations flips the question. Instead of asking "is this model smarter", you ask "does this model do my job at least as well as the one I am already running". Those are different questions, and only the second one keeps your product stable.
In practice, a replay surfaces the things that hurt:
- Format drift in JSON, markdown, or response length
- Tone and verbosity shifts that break your UI assumptions
- Changed refusal patterns on requests that used to work
- Different tool calls or argument shapes
- Latency and cost deltas on your real prompt sizes, not toy ones
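A few of these signals are cheap to check mechanically. The sketch below is one possible way to compare a baseline response with a candidate response for the same input. It covers only JSON validity, new refusals, and length growth. The `REFUSAL_MARKERS` phrases, the `drift` function, and the 1.5x length threshold are illustrative assumptions, not part of any vendor's tooling, so tune them to your own product.

```python
import json
from dataclasses import dataclass

# Hypothetical refusal markers; replace with phrases your models actually emit.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to", "i am unable to")


@dataclass
class Signals:
    valid_json: bool
    length: int
    refused: bool


def extract_signals(text: str) -> Signals:
    """Reduce one model response to a few cheap, comparable signals."""
    try:
        json.loads(text)
        valid_json = True
    except json.JSONDecodeError:
        valid_json = False
    lowered = text.lower()
    refused = any(marker in lowered for marker in REFUSAL_MARKERS)
    return Signals(valid_json=valid_json, length=len(text), refused=refused)


def drift(baseline: str, candidate: str, max_length_ratio: float = 1.5) -> list[str]:
    """List the ways the candidate response regressed against the baseline."""
    old, new = extract_signals(baseline), extract_signals(candidate)
    problems = []
    # The baseline parsed as JSON but the candidate does not.
    if old.valid_json and not new.valid_json:
        problems.append("json_broken")
    # The baseline answered but the candidate appears to decline.
    if not old.refused and new.refused:
        problems.append("new_refusal")
    # The candidate is much longer than the baseline (assumed threshold).
    if old.length > 0 and new.length > old.length * max_length_ratio:
        problems.append("length_growth")
    return problems
```

A candidate that wraps the same JSON in a friendly preamble would be flagged for both `json_broken` and `length_growth` here, which is exactly the kind of change a leaderboard score hides. Tool-call and latency comparisons follow the same pattern but depend on how your own client reports them.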
You do not need OpenAI's setup to do this
The principle is cheap to copy even if the infrastructure is not.
- Log the inputs and outputs of real interactions, sanitised. A few hundred is enough to start.
- Build a fixture set that mixes everyday cases with your nastiest edge cases.
- On every model change, run the candidate over that set and diff the output against the current model.
- Score what matters to you: schema validity, length, exact tool calls, or a rubric graded by an LLM judge.
- Gate the upgrade on that diff, not on the launch post.
This is roughly half a day of engineering. It pays for itself the first time it stops a bad rollout from reaching a customer.
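As a rough illustration of the steps above, here is a minimal sketch of a replay and gate. It assumes each logged fixture stores the prompt together with the output the current model produced, and that the candidate model is wrapped in a plain function from prompt to text. The `Fixture`, `replay`, and `gate` names, the JSON-validity check, and the zero-regression default are all hypothetical choices made for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Fixture:
    prompt: str
    baseline_output: str  # what the current production model returned (logged)


def is_valid_json(text: str) -> bool:
    """Example check: does the response parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def replay(
    fixtures: list[Fixture],
    candidate: Callable[[str], str],  # hypothetical wrapper around your model client
    check: Callable[[str], bool] = is_valid_json,
) -> dict:
    """Run the candidate over logged prompts and diff pass/fail against the baseline."""
    regressions, fixes = [], []
    for fixture in fixtures:
        old_ok = check(fixture.baseline_output)
        new_ok = check(candidate(fixture.prompt))
        if old_ok and not new_ok:
            regressions.append(fixture.prompt)  # worked before, fails now
        elif new_ok and not old_ok:
            fixes.append(fixture.prompt)  # failed before, works now
    return {"total": len(fixtures), "regressions": regressions, "fixes": fixes}


def gate(report: dict, max_regression_rate: float = 0.0) -> bool:
    """Allow the upgrade only if the regression rate stays within the budget."""
    if report["total"] == 0:
        return False  # an empty fixture set is not evidence of safety
    return len(report["regressions"]) / report["total"] <= max_regression_rate
```

In a CI job, `gate(replay(fixtures, candidate))` returning `False` would block the config change that swaps the model string. Swapping `is_valid_json` for a length check, a tool-call comparison, or a judge-based rubric changes what counts as a regression without changing the harness.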
The takeaway
Model upgrades are not free wins. They are dependency changes. You would never bump a major version of a library in production without running your tests, so treat the model the same way. The teams that stay stable through the next year of weekly model releases will not be the ones chasing every shiny launch. They will be the ones who can confidently swap a model on Friday because they replayed Thursday's traffic first.