Context windows and decaying constraints
I left a chat open for three days while iterating on a legacy feature. Early in the thread I wrote "Django 3.2, Python 3.8, synchronous codebase" and pasted a couple of real model definitions. By the second day the assistant was suggesting URL path converters and async views. It sounded plausible until I tried to apply the patch and the test suite crashed with import errors. Recent tokens dominate attention. The model kept following the latest snippets and examples rather than the constraints I stated at the top of the thread, so earlier assumptions quietly disappeared.
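In hindsight, a tiny stack guard in CI would have turned that drift into an immediate, legible failure instead of a confusing pile of import errors. Here is a minimal sketch, assuming the versions stated in the thread; the file name and exact checks are illustrative.

```python
# ci_stack_guard.py -- fail fast when generated code assumes a newer stack.
# The versions mirror the constraints stated in the thread; adjust for your project.
import sys

import django


def check_stack() -> None:
    # Python 3.8 only: suggestions that need newer syntax should die here, not in review.
    assert sys.version_info[:2] == (3, 8), f"expected Python 3.8, got {sys.version_info[:2]}"
    # Django 3.2 only: patches written against newer APIs get rejected at CI time.
    assert django.VERSION[:2] == (3, 2), f"expected Django 3.2, got {django.VERSION[:2]}"


if __name__ == "__main__":
    check_stack()
    print("stack matches the stated constraints")
```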
Small mismatches compound into dangerous runs
I once accepted a generated migration that looked fine in diff form but used the singular table name I had used in an example three messages earlier. The migration ran on staging and altered the wrong table. No single suggestion was egregious. Each one leaned on a tiny, hidden assumption: a variable name, a default flag, an environment. Those assumptions stacked until the CI job applied a migration we did not want. The model does not understand side effects; it mirrors whatever context is easiest to predict.
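Two cheap defences would have caught that particular migration: pin table names explicitly and assert them in CI. A minimal sketch, assuming a Django project; the app label, model, and table names here are illustrative, not from the incident.

```python
# billing/models.py -- pin the table name so naming cannot drift silently.
from django.db import models


class Invoice(models.Model):
    amount = models.DecimalField(max_digits=10, decimal_places=2)

    class Meta:
        # An explicit db_table means a generated migration cannot quietly
        # inherit a singular/plural variant from an example in the chat.
        db_table = "billing_invoices"


# billing/tests/test_schema.py -- fail CI when model tables drift from the known schema.
from django.apps import apps

EXPECTED_TABLES = {"billing_invoices"}  # keep in sync with staging/production


def test_billing_tables_match_allowlist():
    actual = {m._meta.db_table for m in apps.get_app_config("billing").get_models()}
    unexpected = actual - EXPECTED_TABLES
    assert not unexpected, f"table names drifted from the known schema: {unexpected}"
```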
Tool calls fail and the assistant fills gaps
Connecting the model to our log search made things worse before they got better. One query returned a truncated log because of a timeout. The model read the fragment and proposed a code change that addressed a problem the full log did not show. It then confidently wrote a patch, and our automation tried to apply it. The patch did not include any checks on the log shape because the assistant never saw the timeout error. From the outside that looked like bad reasoning. Under the hood it was a missing tool response and no negative path in the prompt.
After a few incidents I added explicit checks on tool outputs. Every tool call now returns a status code and a length field. If a response is partial the assistant is asked to reply with a specific phrase that our orchestrator detects. It is boring plumbing: validate, log, and refuse to act on partial data. That small change prevented a number of follow-on mistakes where the model would invent values to complete a narrative.
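For concreteness, this is roughly the shape of that plumbing. The field names, status convention, and sentinel phrase are assumptions for the sketch, not our exact code.

```python
# Orchestrator-side validation: refuse to let the model act on partial tool output.
from dataclasses import dataclass

PARTIAL_SENTINEL = "TOOL_RESPONSE_INCOMPLETE"


@dataclass
class ToolResponse:
    status: int           # e.g. 0 = ok, non-zero = tool-side failure or timeout
    declared_length: int  # length the tool claims it returned
    body: str             # what actually arrived


def validate(resp: ToolResponse) -> str:
    """Hand the model the body only if it is complete; otherwise hand it a sentinel."""
    if resp.status != 0:
        return f"{PARTIAL_SENTINEL}: tool reported status {resp.status}; do not propose changes."
    if len(resp.body) < resp.declared_length:
        return (f"{PARTIAL_SENTINEL}: got {len(resp.body)} of {resp.declared_length} bytes; "
                "do not propose changes.")
    return resp.body


def assistant_may_act(assistant_reply: str) -> bool:
    # The orchestrator refuses to apply patches when the reply echoes the sentinel.
    return PARTIAL_SENTINEL not in assistant_reply
```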
Practical guardrails I actually put in place
I stopped treating long threads as a single session and started treating them as mutable documents. Every few turns I post a terse "Current constraints" block with exact versions, project paths, and a tiny example that must be referenced. If the model produces something that ignores that block it is a trigger to reset the thread. We also snapshot prompts and run the generated code in a sandboxed test runner before allowing any deployable artifact out of the pipeline. These are manual frictions but they catch the kinds of drift that a fluent answer will hide.
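The re-posting habit is easy to automate once you accept it is needed. A rough sketch follows; the constraint text, turn interval, and drift heuristics are examples rather than prescriptions.

```python
# Keep the constraints inside the recency window instead of trusting turn one.
CONSTRAINTS_BLOCK = """Current constraints
- Django 3.2, Python 3.8, synchronous codebase
- project paths: the paths pasted at the top of this thread
- reference example: the model definitions pasted above; do not rename fields
"""

REPOST_EVERY_N_TURNS = 4


def with_constraints(prompt: str, turn: int) -> str:
    # Re-send the block periodically so it competes with the latest snippets.
    if turn % REPOST_EVERY_N_TURNS == 0:
        return f"{CONSTRAINTS_BLOCK}\n{prompt}"
    return prompt


def violates_constraints(reply: str) -> bool:
    # Crude drift detector: async suggestions are a reset trigger on this codebase.
    return "async def" in reply or "asgiref" in reply
```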
For verification I now run two cross-checks in parallel. For quick comparisons I send the same prompt to several models in a shared workspace; where their answers disagree, a hidden assumption usually surfaces. I documented that workflow in our team notes and sometimes run it in a dedicated multi-model chat. For anything that touches runtime behavior I ask the assistant to cite docs or tests, then verify those claims with a focused research pass that traces each one back to a source. Together, those two patterns catch a lot of hidden drift.
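The multi-model comparison is simple enough to script. A sketch, assuming a hypothetical call_model(name, prompt) wrapper around whatever clients you already have; the model names are placeholders.

```python
# Run one prompt against several models and diff the answers to surface assumptions.
import difflib
from typing import Callable

MODELS = ["model_a", "model_b", "model_c"]  # placeholders for whatever you actually run


def cross_check(prompt: str, call_model: Callable[[str, str], str]) -> str:
    answers = {name: call_model(name, prompt) for name in MODELS}
    baseline_name, baseline = next(iter(answers.items()))
    report = []
    for name, answer in answers.items():
        if name == baseline_name:
            continue
        diff = "\n".join(difflib.unified_diff(
            baseline.splitlines(), answer.splitlines(),
            fromfile=baseline_name, tofile=name, lineterm="",
        ))
        # A non-empty diff usually points at an assumption the prompt never stated.
        report.append(diff or f"{name}: agrees with {baseline_name}")
    return "\n\n".join(report)
```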
Changing how I accept answers
I treat assistant output as a draft that needs machine and human checks. That sounds dull but it prevented another outage: a proposed retry loop that assumed idempotent operations. Tests flagged non-idempotence and the patch never reached production. I now insist on trivial, runnable examples in prompts and require logging around any automated change. The model still helps me iterate faster. What changed is how we verify, where we log, and how quickly we reset context when the thread starts to wobble.
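The idempotence check that stopped that retry patch fits in a dozen lines. A sketch, with an in-memory ledger and request ids standing in for the real operation.

```python
# A retry is only safe if repeating the same request changes nothing.
def apply_charge(ledger: dict, seen: set, request_id: str, order_id: str, amount: int) -> None:
    # Deduplicate by request id so a retried call cannot double-charge the order.
    if request_id in seen:
        return
    seen.add(request_id)
    ledger[order_id] = ledger.get(order_id, 0) + amount


def test_retrying_the_same_request_is_a_no_op():
    ledger, seen = {}, set()
    apply_charge(ledger, seen, "req-42", "order-1", 100)
    apply_charge(ledger, seen, "req-42", "order-1", 100)  # simulated retry after a timeout
    # Without the request-id guard this would read 200, and the proposed
    # retry loop would never have been safe to merge.
    assert ledger["order-1"] == 100
```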