Why safety tools should classify the environment before they classify intent.
A guard blocked me from doing safety work.
I was hardening a piece of my own software — running a battery of hostile inputs against a function that handles sensitive data, the way you stress-test a lock by trying to pick it. Standard defensive practice. You attack your own code so that no one else's attack is the first one it ever sees.
The tool refused. The request, it decided, looked like an attack: the words were aggressive, the target was a sensitive component, and that combination tripped the wire. It didn't ask the only question that mattered — whose code was I attacking. It was mine. I had every authority over it. The "attack" was the safest, most boring thing in the world: a developer testing their own lock.
That is the whole problem in one moment. The guard wasn't stupid. It was under-informed. It judged the text and ignored everything around the text.
Here is the thesis the rest of this essay defends:
Harm is not a property of text alone. It is a property of text, actor, object, and authority.
"Break into this system" is a crime or a job depending entirely on whether the system is yours and whether you were asked. "Delete everything" is a disaster or a Tuesday depending on whose data it is. The exact same words carry opposite consequences, and the difference lives entirely in the context the words point at, not in the words.
Most safety tools classify the words. They take a string, score how dangerous it looks, and decide. They are, in effect, trying to read intent directly off the surface of the language. And intent, read off the surface, is genuinely hard to classify — which is why these tools are tuned the way they are.
If you build a guard, the math pushes you in one direction. Missing a truly dangerous request is catastrophic — it's the headline, the breach, the harm you can't take back. A false positive, by comparison, feels cheap: someone gets blocked, shrugs, rephrases, moves on.
So you tune for recall. You would rather stop a hundred harmless things than let one dangerous thing through. It is the correct local decision, and it produces a predictable global result: the guard fires constantly on benign input, because under that much recall pressure its decision boundary collapses onto the cheapest available signal — surface patterns. A dangerous-sounding verb near a sensitive-sounding noun. That's enough.
This is why a sophisticated, learned classifier can feel like a crude keyword matcher when it blocks you. Under enough recall pressure, the two become hard to tell apart at the edges. The intelligence is real; it just isn't being spent on your case.
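To put rough numbers on that recall pressure, here is a small back-of-the-envelope sketch. Every figure in it (the base rate, the recall, the false-positive rate) is an illustrative assumption, not a measurement of any real guard.

```python
def guard_outcomes(total, base_rate, recall, false_positive_rate):
    """Return (true_blocks, false_blocks, precision) for a text-only guard."""
    dangerous = total * base_rate
    benign = total - dangerous
    true_blocks = dangerous * recall              # dangerous requests caught
    false_blocks = benign * false_positive_rate   # benign requests blocked
    precision = true_blocks / (true_blocks + false_blocks)
    return true_blocks, false_blocks, precision


# Illustrative assumptions only: 1 request in 1,000 is truly dangerous,
# the guard catches 99% of those and misfires on 5% of benign requests.
true_blocks, false_blocks, precision = guard_outcomes(
    total=100_000, base_rate=0.001, recall=0.99, false_positive_rate=0.05
)
print(f"true blocks:  {true_blocks:.0f}")
print(f"false blocks: {false_blocks:.0f}")
print(f"precision:    {precision:.1%}")
```

Under these made-up numbers the guard stops 99 dangerous requests and about 4,995 harmless ones, so roughly 2% of its blocks are warranted. The exact figures do not matter; the shape does: when dangerous requests are rare, a high-recall guard spends most of its blocks on benign users.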
And the cost doesn't vanish. It moves. Every false block lands on a user, costs them a minute and a little trust, and those minutes compound across every guarded surface they touch in a day. Nobody's budget has a line item called false positives, so nobody is incentivized to drive the number down. The user pays it quietly, everywhere. Call it the false-positive tax.
You cannot tune your way out of it. Tuning the classifier can reduce the symptoms, but it cannot fix a missing axis. If the tool is asking the wrong question, a better answer to that question doesn't help.
The guards aren't dumb. They are asking the wrong question.
They ask: is this text dangerous? The answerable question is: is this text dangerous in the place it's aimed?
So split the judgment in two.
First, classify the environment — what the request actually touches. Is the target the user's own resource or someone else's? Which domain does it belong to? This is frequently a fact, not a guess: identity, ownership, a path, a domain, a credential. You can often look it up rather than infer it.
Second, classify the intent — and yes, this part stays fuzzy. Reading what someone means from what they wrote is the genuinely hard problem, and no reframing makes it easy.
Then join the two, deterministically. Aggressive intent against the user's own resource is often legitimate — provided the scope and authority are real. It is a developer testing their own lock. Aggressive intent against a resource that isn't theirs is the thing you were built to stop. The join is a small, legible rule, not another opaque model.
The asymmetry is the whole point. Environment is often knowable. Intent is often fuzzy. Classify them separately. When you fold them into one verdict, the fuzzy axis drags down the knowable one, and you lose the cheapest, most certain signal you had.
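The split-then-join idea can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `Ownership`, `Verdict`, `JOIN_RULE`, and `join` are hypothetical, the 0.5 threshold is arbitrary, and the table entries beyond the two cases described above (own resource, someone else's resource) are assumptions.

```python
from enum import Enum


class Ownership(Enum):
    OWN = "own"          # looked up and verified as the actor's resource
    FOREIGN = "foreign"  # looked up and verified as someone else's
    UNKNOWN = "unknown"  # the lookup failed or was not possible


class Verdict(Enum):
    ALLOW = "allow"
    REVIEW = "review"
    BLOCK = "block"


# The join is a plain lookup table: (environment fact, aggressive intent?) -> verdict.
JOIN_RULE = {
    (Ownership.OWN, True): Verdict.ALLOW,        # a developer testing their own lock
    (Ownership.OWN, False): Verdict.ALLOW,
    (Ownership.FOREIGN, True): Verdict.BLOCK,    # the case the guard exists to stop
    (Ownership.FOREIGN, False): Verdict.REVIEW,  # assumption: friendly wording, foreign target
    (Ownership.UNKNOWN, True): Verdict.REVIEW,   # assumption: no fact to lean on, so escalate
    (Ownership.UNKNOWN, False): Verdict.ALLOW,   # assumption
}


def join(ownership: Ownership, intent_score: float, threshold: float = 0.5) -> Verdict:
    """Combine a looked-up environment fact with a fuzzy intent score."""
    aggressive = intent_score >= threshold
    return JOIN_RULE[(ownership, aggressive)]
```

The fuzzy part stays confined to `intent_score`; the rule that combines it with the environment is small enough to read, audit, and argue about line by line.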
Concretely, most of the painful false positives share a shape: the intent looks alarming, and the one signal that would clear it is a fact about the environment the tool never checked.
| What gets blocked | The environment signal that would clear it |
|---|---|
| Fuzzing your own parser | repo ownership · local project |
| Deleting your own staging database | cloud identity · resource tag |
| Testing a system inside an authorized scope | program scope · active window |
| And the inverse: "test my site" against an address that isn't yours | verified resource ownership (absent in this case) |
The last row matters as much as the first three. The same axis that clears legitimate work is the axis that catches the genuinely dangerous request the intent-only guard would wave through, because "test my site" sounds friendly. Environment cuts both ways. That is what makes it the load-bearing signal, not a loophole.
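Two of the signals in the table are cheap to look up. The sketch below shows one possible shape for those checks; the function names and the `owner` and `env` tag keys are hypothetical, and a real system would read them from its own source of truth (a repository host, a cloud provider's tagging API) rather than from values the requester supplies.

```python
from pathlib import Path


def is_inside_project(target: str, project_root: str) -> bool:
    """True if target resolves to a location at or below project_root."""
    root = Path(project_root).resolve()
    resolved = Path(target).resolve()  # normalizes ".." and follows symlinks
    return resolved == root or root in resolved.parents


def is_own_staging(resource_tags: dict, actor_id: str) -> bool:
    """True if the resource is tagged as staging and owned by the actor.

    The "owner" and "env" keys are assumed tag names for this sketch.
    """
    return (
        resource_tags.get("owner") == actor_id
        and resource_tags.get("env") == "staging"
    )
```

Resolving the path before comparing matters: a target written as `project/../elsewhere` sits outside the project even though the string begins with the project's name. Both checks answer with a fact, which is what lets the same signal clear the first rows of the table and catch the last one.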
This is where the clean model meets the real world, and the hard case deserves to be stated plainly rather than buried.
"This resource is mine" is usually a checkable fact. "I am authorized to test this resource that isn't mine" is not — and that case is everywhere serious work happens: bug-bounty scopes, client engagements, shared infrastructure, contractors with delegated access, staging mirrors, a pentest window that opened an hour ago and closes tonight. In all of these, authority is real but external to the system doing the judging.
The reframe does not make this disappear. What it does is change the kind of problem you're left with. Intent-in-a-vacuum is unanswerable — there is no fact of the matter to check. "Does this actor hold authority over this object, right now, within this scope?" is answerable in principle: it points at a permission, a signed scope, a delegation, an expiry. Hard, often un-automated, sometimes requiring a human — but bounded and checkable, which the intent question never was. The win is not that the hard case vanishes. It is that you've traded an impossible question for a difficult one.
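To make "bounded and checkable" concrete, here is a minimal sketch of the authority question as a lookup over grants. The `Grant` type, its fields, and the wildcard target pattern are all hypothetical, and the sketch assumes the grants were already verified (for example by checking a signature) before they reach this function.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Grant:
    actor: str
    target_pattern: str   # shell-style pattern, e.g. "*.staging.example.com"
    not_before: datetime
    not_after: datetime


def holds_authority(grants, actor: str, target: str, now: datetime) -> bool:
    """Does this actor hold authority over this target, right now, within scope?"""
    return any(
        g.actor == actor
        and fnmatchcase(target, g.target_pattern)
        and g.not_before <= now <= g.not_after
        for g in grants
    )


# A pentest window that opened an hour ago and closes in eight hours.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window = Grant(
    actor="alice",
    target_pattern="*.staging.example.com",
    not_before=now - timedelta(hours=1),
    not_after=now + timedelta(hours=8),
)
```

With this grant, `holds_authority([window], "alice", "api.staging.example.com", now)` is true, and it turns false for a different actor, a target outside the pattern, or a time after the window closes. Getting trustworthy grants into the list is the hard, often manual part; the check itself is ordinary code.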
None of this is new — it's just shelved in a different building.
Access control has always treated this as the central question. An action is permitted when the actor has authority over the object. "Aggressive verb against your own object" is not a hard case in security; it is usually the definition of authorized work. The capability and the authorization are evaluated together, and neither is sufficient alone.
What's new is only the port. Content and safety classification grew up in their own tradition, and that tradition mostly classifies the message and stops there. It rarely asks who the actor is, what object the message points at, or whether authority exists between them. Borrowing the authorization model from access control and bringing it into the place where we judge language — that's the move. Not an invention. A relocation.
Even a guard with both axes will be wrong sometimes. The environment can be misread; the intent can be misjudged. So the last design choice is what the guard does when it fires.
The wrong answer is what happened to me: silent refusal, no explanation, no recourse. A guard that blocks without showing its reasoning, and offers no way through, makes the user a suspect with no appeal.
The better answer is a mirror. Show the evidence — here is what tripped, and why. Suggest a path — if this is authorized, confirm it this way. And let an authorized user proceed, with that override authenticated and logged.
That last clause is load-bearing, and it's where the hard case from earlier comes back to bite. An override can't be a checkbox that says "I'm authorized." That just rebuilds the false-positive problem as a false-negative one, where anyone clears any block by clicking. The override has to be an authorization event: tied to a verified identity, scoped, recorded, attributable after the fact. Deny by default; verify authority where you can; and where you genuinely can't verify it in the moment, make the override expensive enough to be rare and logged enough to be reviewable. The point of the mirror is not to let everyone through. It is to turn a silent refusal into a visible, accountable decision.
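One way to picture the mirror is as a block that carries its own evidence, plus an override that always leaves a record. The sketch below is illustrative only: `Block`, `AuditLog`, and `request_override` are hypothetical names, and `verified_identity` is assumed to come from a real authentication step, never from text the user typed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Block:
    evidence: tuple       # what tripped, in plain language
    suggested_path: str   # how an authorized user can proceed


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, **event):
        # Timestamp every entry so it is attributable after the fact.
        self.entries.append({"at": datetime.now(timezone.utc).isoformat(), **event})


def request_override(block: Block, verified_identity: Optional[str],
                     scope: Optional[str], log: AuditLog) -> bool:
    """Deny by default; grant only with a verified identity and a stated scope."""
    granted = bool(verified_identity) and bool(scope)
    log.record(
        event="override_granted" if granted else "override_denied",
        identity=verified_identity,
        scope=scope,
        evidence=list(block.evidence),
    )
    return granted
```

Both outcomes are logged, so a denied attempt is as reviewable as a granted one. The sketch leaves out what makes an override "expensive enough to be rare"; that is a policy decision (a second approver, a short expiry) rather than something code can settle.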
The guards aren't dumb. They are asking the wrong question.
Judge the text, the actor, the object, and the authority together, and stop pretending the words alone carry the harm.
The guards we build will keep getting more capable. The temptation will be to make them smarter at the question they already ask. The more useful move is to change the question.
This is the design principle behind the systems we build at Obsta Labs.