
The Agent Must Not Close Its Own Ticket

May 2026 · Obsta Labs

AI coding agents are becoming night-shift workers

Tools like Google Jules, Devin, OpenHands, Cursor, and Copilot are pushing in the same direction: repo-connected agents that can plan work, edit code, run checks, and hand back a pull request or patch. The dominant product narrative in AI-assisted development right now is the autonomous coding agent.

That is not a criticism. Execution is the right first problem to solve.

This is real. It works. Models are good enough that an asynchronous agent can produce meaningful code while you sleep.

The question that doesn't have a good answer yet is what happens at the end of the night.

The real risk is not bad code; it is premature closure

A few weeks ago I lost six hours debugging a problem that came down to a single missing environment variable on Cloud Run. The variable had been silently absent for forty-four consecutive days. Nothing alerted. The deploy workflow showed green. Tests passed. Customer-facing failures looked exactly like a healthy running service.

Around the same time, a budget coding agent finished a small task inside a temporary worktree. The commit existed locally. The push to the remote failed silently: wrong active account, repository not found, no useful error surfaced. The work-tracking system was updated to mark the ticket "done" because the commit existed. A subsequent audit found eighteen unmerged branches across five repositories for work that was already marked complete.
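The check that was missing in that second incident is small. Below is a minimal sketch, assuming the tracker only needs to know whether a commit is reachable from a branch on the remote. The function names and the injectable `run` parameter are hypothetical; `git fetch` and `git branch --remotes --contains` are standard git commands.

```python
import subprocess
from typing import Callable, Sequence


def git_output(args: Sequence[str]) -> str:
    """Run a git command and return stdout; raises CalledProcessError on a non-zero exit."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def commit_is_on_remote(
    sha: str,
    remote: str = "origin",
    run: Callable[[Sequence[str]], str] = git_output,
) -> bool:
    """Return True only if a remote-tracking branch of `remote` contains `sha`.

    Any git failure (wrong account, repository not found, git missing) counts
    as "not on the remote" instead of being swallowed.
    """
    try:
        # Refresh remote-tracking refs so we check the remote, not a stale local view.
        run(["fetch", "--quiet", remote])
        out = run(["branch", "--remotes", "--contains", sha])
    except (subprocess.CalledProcessError, OSError):
        return False
    prefix = remote + "/"
    return any(line.strip().startswith(prefix) for line in out.splitlines())
```

A tracker that called something like this before accepting "done" would have left those tickets open, because a commit that exists only in a local worktree is contained in no remote-tracking branch.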

In both cases, code was not the problem. Closure was the problem. Something declared the work done before it was actually done.

The agents involved did not lie. They completed the local thing they could observe, then the surrounding system treated that as global completion. The path of least resistance, when nobody asks for evidence, is to assume the work is finished.

This pattern scales badly. A single agent in a single afternoon can produce a handful of false closures and a human can clean them up. A fleet of agents working through the night, each declaring its own work done, produces a backlog of silent failures that a morning review cannot keep up with.

Human software teams already separate author from reviewer

When humans write production code in any organization that has been bitten more than once, the closure step is not the author's decision. The author opens a pull request. A different human reviews it. A possibly different human approves the deployment. The system structurally prevents the same person from being both author and approver of the same change. CODEOWNERS files, branch protection rules, four-eyes principles in regulated industries, evidence chains for SOX or HIPAA — all of these encode the same insight:

The actor who produced the work is not the right actor to declare it complete.

This is not a comment on competence. It is structural. The author has too much investment in the work to be the right reviewer of the work.

When the author is an AI agent, the same principle applies. It applies more strongly, in fact, because the agent has no durable accountability for being right — only a task objective that rewards finishing. An LLM asked to review its own diff and produce a summary will produce a plausible summary regardless of whether the diff is correct. There is no built-in friction.

Headless AI development needs the same separation, structurally

The discipline that works for human teams has to be ported to agent teams, but the porting is not trivial. Two humans have persistent identities, social context, professional reputation. Two LLM invocations have none of those things — they have a session ID and a model name. So the role separation has to be enforced by the surrounding system, not by the actors themselves.

A workable shape looks roughly like this:

executor    -> writes code, full freedom inside bounded scope
reviewer    -> different identity, cites files/diff/tests inspected
system      -> verifies review is not a paraphrase of executor summary
authority   -> gates delete/deploy/secret/migration by policy, not model tier
closure     -> refuses to mark work done without verifiable evidence

None of these layers is research-grade on its own. They are normal disciplines in well-run human software organizations. What's missing is a coherent place to put them when the actors are agents.
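To make the "system" layer concrete, here is a deliberately crude sketch of a paraphrase check. It assumes plain character-level similarity from Python's standard `difflib` is an acceptable first filter; the function name and the 0.8 threshold are hypothetical, and a real system would likely need something stronger than string matching.

```python
from difflib import SequenceMatcher


def looks_like_paraphrase(
    executor_summary: str, review_text: str, threshold: float = 0.8
) -> bool:
    """Flag a review whose text is nearly the same as the executor's own summary.

    This is character-level similarity only. It catches copy-and-tweak reviews,
    not reviews that restate the summary in different words.
    """
    # Normalize case and whitespace so trivial reformatting does not hide a copy.
    a = " ".join(executor_summary.lower().split())
    b = " ".join(review_text.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

A check like this is only a tripwire. The stronger signal is whether the review cites things the executor summary never mentioned, such as specific files, hunks, or test names.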

The unit is not a prompt; it is a work order with evidence

In the agent-centric framing of the current market, the unit of work is the prompt — "build me a thing" — and the output is a diff or a pull request. The disciplines above don't fit naturally into that frame. There is no place to attach a reviewer's findings to a prompt. There is no closure transition on a pull request that you can structurally gate.

The unit that does fit is the work order — a persistent record with a lifecycle, an acceptance criterion, an explicit close transition, and room to attach evidence. A work order can be open, in progress, under review, blocked for specific reasons, and closed. Each transition is a place where the system can ask: what evidence supports this transition?
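As an illustration, the lifecycle can be sketched as a small state machine in which every transition must carry evidence. The states come from the paragraph above; the allowed-transition table, class names, and field names are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    OPEN = "open"
    IN_PROGRESS = "in progress"
    UNDER_REVIEW = "under review"
    BLOCKED = "blocked"
    CLOSED = "closed"


# Assumed transition table: closed is reachable only from under review.
ALLOWED = {
    State.OPEN: {State.IN_PROGRESS},
    State.IN_PROGRESS: {State.UNDER_REVIEW, State.BLOCKED},
    State.UNDER_REVIEW: {State.IN_PROGRESS, State.BLOCKED, State.CLOSED},
    State.BLOCKED: {State.IN_PROGRESS},
    State.CLOSED: set(),
}


class TransitionRefused(Exception):
    """Raised when a transition is not allowed or carries no evidence."""


@dataclass
class WorkOrder:
    title: str
    acceptance_criterion: str
    state: State = State.OPEN
    history: list = field(default_factory=list)

    def transition(self, target: State, evidence: str) -> None:
        """Move to `target`, recording the evidence offered for the move."""
        if target not in ALLOWED[self.state]:
            raise TransitionRefused(
                f"{self.state.value} -> {target.value} is not allowed"
            )
        if not evidence.strip():
            raise TransitionRefused("every transition needs evidence")
        self.history.append((self.state, target, evidence))
        self.state = target
```

In this sketch the evidence is a free-text string, which is enough to show where the question gets asked. What counts as sufficient evidence for each transition is a separate policy decision.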

In particular, the under review → closed transition is where the closure discipline lives. The work order does not close unless an evidence bundle exists, the bundle's hashes verify against the captured artifacts and commit references, the executor and reviewer identities are distinct, the review output cites the files, diff hunks, tests, or checks it actually inspected, and any authority-gated commands are accounted for in the bundle.

If any of these checks fail, the transition fails. The work order stays under review. The agent does not get to declare the work done by emitting a confident final message.
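A rough sketch of such a close gate follows, assuming the evidence bundle is an in-memory record with SHA-256 hashes recorded at capture time. All names and fields here are hypothetical, and commit-reference verification is left out to keep the example short.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvidenceBundle:
    executor_id: str
    reviewer_id: str
    artifacts: dict            # artifact name -> captured bytes
    recorded_hashes: dict      # artifact name -> SHA-256 hex recorded at capture
    review_citations: list     # files, hunks, or tests the review says it inspected
    inspectable_items: set     # items that exist in the diff or the test run
    gated_commands_run: list = field(default_factory=list)
    gated_commands_approved: set = field(default_factory=set)


def close_gate_failures(bundle: Optional[EvidenceBundle]) -> list:
    """Return the reasons the work order cannot close; an empty list means it may."""
    if bundle is None:
        return ["no evidence bundle"]
    failures = []
    if not bundle.artifacts:
        failures.append("bundle contains no artifacts")
    for name, content in bundle.artifacts.items():
        if hashlib.sha256(content).hexdigest() != bundle.recorded_hashes.get(name):
            failures.append(f"hash mismatch: {name}")
    if bundle.executor_id == bundle.reviewer_id:
        failures.append("executor and reviewer identities are not distinct")
    if not bundle.review_citations:
        failures.append("review cites nothing it inspected")
    for item in bundle.review_citations:
        if item not in bundle.inspectable_items:
            failures.append(f"review cites unknown item: {item}")
    for command in bundle.gated_commands_run:
        if command not in bundle.gated_commands_approved:
            failures.append(f"unapproved gated command: {command}")
    return failures
```

The function returns reasons instead of a boolean so that a work order held under review can say why it is still there.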

This is the layer that I think is missing from current AI-assisted development products. Not the agent itself — the closure discipline around the agent.

The line

If you are designing AI-assisted development for production use — not for demo videos, but for code that real customers will run — the structural rule I keep returning to is:

The agent may write code while you sleep. It must not decide that the work is done while you sleep.

The agent is allowed to be capable. It is allowed to be fast. It is allowed to be unsupervised inside a bounded task. It is not allowed to be the actor that closes its own ticket.

Whether this becomes a named discipline in the next year, or gets discovered the hard way after a series of highly public incidents involving silently closed work, is the open question. I am writing this note because I have already had the smaller version of those incidents inside my own workflow, and the lesson I keep extracting is the same: closure must be harder than execution.

If you have built or seen a coding agent product that ships an actual evidence-backed close gate — not just a review step, but a structural transition where closure fails on missing evidence — I would like to hear about it.

This post is about agent-side drift: the executor declaring its own work done. The companion post, "When Execution Becomes Cheap, Direction Becomes Scarce," is about the symmetric human-side problem: the operator with infinite capability and no vector. Both failures share the same shape: anchoring against drift in a system that has no natural friction.