
Your AI Explored Seven Architectures. You Only Saw One.

The missing layer between token accounting and decision observability.

March 2026 · Obsta Labs

AI didn't make software development cheaper. It made the cost visible.

Your session telemetry shows tokens in, tokens out, total cost. That's accounting. It tells you nothing about what the model actually decided — or how many alternatives it explored before committing.

An 8-million-token session and a 4-million-token session might contain the same number of decisions. Or a session could burn 12 million tokens and commit to nothing. Token counts measure volume. They don't measure structure. The missing layer is decision boundaries — when did the model converge, and how wide did it search before it got there?

If tokens are decision receipts, we're reading the total at the bottom. We haven't looked at the line items.

8.4M reasoning tokens → 4 decisions
Average 2.1M tokens per commitment — the cost of convergence, not generation

Decision boundaries

Somewhere inside every reasoning session, the model stops exploring and starts executing. It shifts from comparing architectures to writing one. From generating hypotheses to testing a specific approach. That transition is a decision commit — the moment exploration collapses into a single path.

A real session doesn't contain one decision. It contains a sequence of them, separated by stretches of exploration. The pattern looks something like: explore for 1.2 million tokens, commit, implement for 400K, explore again for 2.2 million, commit again. The implementation phases are cheap. The exploration phases are where the tokens go.
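That phase structure can be sketched as data. A minimal illustration, using the figures from the example above (the final implementation phase's token count is an assumed figure, not from the text):

```python
# A session as alternating exploration/implementation phases.
# One decision commit closes each exploration stretch.
phases = [
    ("explore", 1_200_000),
    ("implement", 400_000),
    ("explore", 2_200_000),
    ("implement", 300_000),  # assumed figure for illustration
]

# One commit per exploration stretch
decisions = sum(1 for kind, _ in phases if kind == "explore")
explore_tokens = sum(t for kind, t in phases if kind == "explore")
implement_tokens = sum(t for kind, t in phases if kind == "implement")

print(decisions)          # 2 commits
print(explore_tokens)     # 3,400,000 tokens spent exploring
print(implement_tokens)   # 700,000 tokens spent implementing
```

Even in this toy version, the asymmetry is visible: exploration consumes nearly five times the tokens of implementation.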

No model currently emits an explicit decision signal. There is no DECISION_COMMIT marker in the token stream. But the boundary is detectable. Reasoning entropy drops. The model shifts from generating questions to generating code. Hypothesis branching narrows. The signals are indirect, but they're consistent — and they're already measurable from the outside, without access to model internals.

The boundary leaves observable traces in the transcript. Tool usage shifts from exploration — search, read, compare — to execution: write, patch, test. Hypothesis language disappears and imperative language appears. The model stops asking "what if" and starts emitting diffs.
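One way to operationalize that trace: classify each tool call as exploratory or executive, slide a window over the sequence, and flag the points where the mix flips. A sketch only — the tool categories follow the article, but the window size and threshold are illustrative assumptions, not calibrated values:

```python
# Detect decision-commit boundaries from a transcript's tool calls.
EXPLORE = {"search", "read", "compare"}
EXECUTE = {"write", "patch", "test"}

def commit_boundaries(tool_calls, window=4, threshold=0.75):
    """Indices where a mostly-exploratory window flips to mostly-executing."""
    boundaries = []
    exploring = True
    for i in range(len(tool_calls) - window + 1):
        chunk = tool_calls[i:i + window]
        exec_ratio = sum(1 for t in chunk if t in EXECUTE) / window
        if exploring and exec_ratio >= threshold:
            boundaries.append(i)      # exploration collapsed into execution
            exploring = False
        elif not exploring and exec_ratio <= 1 - threshold:
            exploring = True          # a new exploration stretch began
    return boundaries

calls = ["search", "read", "compare", "read",
         "write", "patch", "test", "write",
         "search", "compare", "read", "search",
         "write", "test", "patch", "write"]
print(commit_boundaries(calls))  # [3, 11] — two commits detected
```

Two exploration stretches, two detected commits. A production version would combine this signal with the linguistic ones (hypothesis language disappearing, imperative language appearing) rather than rely on tool calls alone.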

This is the metric that Decision Economics pointed toward but couldn't yet deliver: cost per decision. Not cost per token. Not cost per line of code. The cost of each intellectual commitment the system made on your behalf.

Total tokens: 14.2M → Decisions: 7
~2M tokens per commitment — a KPI no dashboard tracks yet

The invisible variable

Decision boundaries tell you when the model committed. They don't tell you how hard it searched before committing. Two sessions can each burn 8 million reasoning tokens and arrive at the same decision count, but one explored two approaches while the other explored seven.

Reasoning is a tree, not a line. When a model works through an architecture problem, it doesn't follow a single path. It considers approach A with a Redis cache, approach A with Postgres, approach B using event sourcing, approach C with a message queue. Then it eliminates most of them. We see the surviving branch. The exploration width is invisible.

This hidden variable is the branch factor — how many alternatives the model evaluated before converging. It explains why identical token costs can produce wildly different reasoning quality.

A session with a branch factor of 2 explored narrowly and converged quickly. Efficient, but possibly premature — it may have missed better approaches. A session with a branch factor of 7 explored the design space thoroughly, spending most of the same token budget on branches it ultimately discarded. Both show the same number on your bill. One was an efficient search. The other was thrashing.

Same cost. Completely different reasoning. Token bills hide the shape of exploration.

Thrashing looks like a branch factor above 6 — the model keeps reconsidering architecture without committing. Premature convergence looks like a branch factor near 1 — it locked in the first approach without considering alternatives. Healthy exploration sits between 2 and 4: enough breadth to find the right solution, enough discipline to stop searching.
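The bands above translate directly into a classifier. A sketch using the article's illustrative thresholds; the label for the 4–6 range is an assumption, since the text doesn't name that band:

```python
# Classify a session's exploration health from its average branch factor.
# Thresholds follow the text: thrash above 6, premature convergence near 1,
# healthy between 2 and 4. The "wide" band (4-6) is an assumed label.
def exploration_health(branch_factor: float) -> str:
    if branch_factor > 6:
        return "thrashing"              # reconsidering without committing
    if branch_factor < 2:
        return "premature convergence"  # locked in the first approach
    if branch_factor <= 4:
        return "healthy"                # enough breadth, enough discipline
    return "wide"                       # thorough, borderline expensive

print(exploration_health(1.1))  # premature convergence
print(exploration_health(3.1))  # healthy
print(exploration_health(7.0))  # thrashing
```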

What decision telemetry looks like

Combine decision boundaries with branch factor and the session dashboard changes fundamentally. Instead of token counts and cost summaries, it shows the structure of reasoning.

Tokens: reasoning 8.4M · output 620K
Reasoning tree: avg branch factor 3.1 · max 7 · depth 14
Decisions: 4 commits · avg 2.1M tokens to commit · largest exploration 3.4M
Compactions: 3 events · 2.1M tokens dropped

Once reasoning structure is visible, new metrics appear naturally. Decision density: decisions per million tokens — how efficiently the session produced commitments. Exploration cost: tokens burned before the first commit — the price of initial orientation. Thrash index: exploration tokens divided by implementation tokens — whether the model is converging or spinning.

And the unified metric that ties them together:

Decision Efficiency = decisions / (tokens × branch factor)
Not "how much did you spend" but "how efficiently did the system decide"
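The definitions above translate into a few lines of arithmetic. A sketch computed from one session's telemetry — the headline figures mirror the example dashboard, while the explore/implement split and first-commit cost are assumed inputs for illustration:

```python
# Session telemetry (headline figures from the example dashboard)
reasoning_tokens = 8_400_000
decisions = 4
avg_branch_factor = 3.1

# Assumed inputs, not from the text
explore_tokens = 6_000_000       # assumed split of the reasoning total
implement_tokens = 2_400_000
first_commit_tokens = 1_200_000  # tokens burned before the first commit

MILLION = 1_000_000

# Direct translations of the definitions in the text
decision_density = decisions / (reasoning_tokens / MILLION)  # per 1M tokens
exploration_cost = first_commit_tokens                       # initial orientation
thrash_index = explore_tokens / implement_tokens             # converging vs spinning
decision_efficiency = decisions / (reasoning_tokens * avg_branch_factor)

print(round(decision_density, 2))    # 0.48 decisions per million tokens
print(thrash_index)                  # 2.5
print(f"{decision_efficiency:.2e}")  # 1.54e-07
```

The absolute value of decision efficiency is meaningless on its own; like most telemetry, it only becomes useful compared across sessions, models, or task types.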

This is not theoretical. The individual signals — entropy shifts, code generation transitions, hypothesis narrowing — are already observable in session telemetry. The model doesn't need to expose its internal reasoning tree. Aggregated metrics like average branch factor are safe to surface: they reveal the shape of exploration without exposing proprietary chain-of-thought internals.

The moment companies start tracking decision boundaries and branch factor, engineering dashboards stop looking like CPU monitoring and start looking like research lab notebooks. That's a fundamental shift — from measuring compute to measuring cognition.

Why this keeps happening

This architecture — intent declaration, decision runtime, risk enforcement, execution — is not new. It appears independently in every domain where automation outpaces human reasoning.

NASA mission control built it in the 1960s. Decisions under uncertainty during spaceflight were too expensive to make without traceability, so they instrumented the decision process itself: problem statement, options evaluated, tradeoffs compared, final commitment logged. The Apollo 13 recovery wasn't saved by telemetry about fuel or temperature. It was saved by the decision log — knowing which options had been considered, which rejected, and why.

High-frequency trading firms built it in the 2000s. When algorithms can execute thousands of trades per second, the risk isn't computation — it's an unsafe decision propagating instantly. So they added strategy definition, decision monitoring, and execution gates. Every order passes through hard risk controls before it reaches the exchange. Knight Capital lost $440 million in 45 minutes when this layer failed.

Kubernetes built it in 2014. Desired state declared as a spec. Controllers continuously evaluating drift between spec and reality. Admission controllers enforcing policy before anything executes. Observability across the entire runtime.

The pattern is always the same: intent → decision runtime → risk enforcement → execution → verification. Different domains, different decades, identical architecture. Not because they copied each other, but because the problem shape demands it. Any system where automation outpaces human review eventually builds decision governance.

AI-assisted engineering is the latest domain to hit this threshold. Agents produce decisions faster than developers can review them. Context windows decay. Reasoning drifts. Actions compound. The first generation of developer tooling measured what you built — lines of code, test coverage, deployment frequency. The next generation will measure how you decided to build it.