For seventeen days, our billing endpoint returned valid Stripe checkout URLs. Health checks passed. CORS preflight passed. The API responded in under 200ms. Every monitor was green.
Every button on the product page was dead.
A commit had introduced a multiline string literal into the page's inline JavaScript — a single unterminated quote, three lines where there should have been one. The browser's JS parser hit the syntax error, aborted the script, and silently moved on. The browser logged a syntax error in DevTools, but nothing visible changed for the customer. No click handler attached. No fetch to the billing API. The page loaded, the pricing section rendered, the buttons looked clickable. They did nothing.
The billing service was healthy the entire time. It was ready to create checkout sessions, process webhooks, mint license keys, and send emails. Nobody asked it to.
We found the bug when someone tried to buy.
We had monitoring. We checked that the billing endpoint returned 200. We verified the Stripe API key was valid. We asserted that CORS headers were configured for every product origin. We confirmed the database was writable and the email provider was delivering.
Every one of those checks passed every five minutes for seventeen days.
The problem is that none of them tested what the customer actually does. A customer does not POST to your checkout API. A customer clicks a button on a page. If the JavaScript that wires the button to the API never executes, your API metrics show zero errors — because they show zero requests.
A healthy endpoint with no traffic is indistinguishable from a healthy endpoint with satisfied customers, unless you are also watching traffic volume. We were not.
A selling flow is not one thing. It is a chain:
We had probes on steps 5, 7, 8, 9, and 10. Steps 1 through 4 were unmonitored. The break was at step 2. Everything downstream was structurally intact and operationally useless.
This is a general failure mode, not specific to billing. Any system where the health of a backend component implies the health of a user-facing flow has the same blind spot. Your login endpoint can return 200 while your login page has a broken form submission handler. Your search API can be fast while the search bar's event listener was removed in a CSS refactor. The component is healthy. The flow is dead.
If your customer clicks a button, your monitoring should prove the button is clickable. If your customer fills a form, your monitoring should prove the form submits.
The distinction matters because it changes what you build. Component monitoring asks: “is this service healthy?” Flow monitoring asks: “can a customer complete this action?” The first question has clean, cacheable answers. The second question is harder, more fragile, and more honest.
We stopped treating “the endpoint works” as evidence that “customers can buy.”
The fix was not one thing. It was a set of independent checks, each covering a different link in the chain. A syntax check that proves the page JavaScript parses. A static check that proves the pricing buttons still have their billing wires. A server-side smoke test that proves the billing API can create and retrieve a checkout session. A runtime probe that watches the selling flow, not only the service behind it.
A single end-to-end probe is useful, but it is not enough by itself. When it fails, you still need to know which link broke. Independent probes for independent failure modes make the signal diagnosable.
The seventeen-day outage happened because our monitoring was correlated. Every check derived its confidence from the same source: the backend is healthy. When the failure mode was upstream of the backend — a three-line JavaScript syntax error — all checks passed unanimously and incorrectly.
A noisy failure is an incident. A silent failure is a leak. Incidents get postmortems, fixes, and structural prevention. Leaks get discovered in monthly revenue reconciliation, or when a customer emails to ask why the buy button does nothing, or — in the worst case — never.
We do not know exactly how much revenue this cost. That is part of the problem with silent failures. We can prove that visitors to the affected pricing page saw buttons that did nothing and had no obvious way to tell us. Some percentage of those visitors may have been ready to pay. All of them bounced.
The operational cost of detecting this failure was near zero — a syntax check that runs in under a second. The business cost of not detecting it was unbounded.
Revenue systems should fail loudly before customers silently bounce.
The unit of reliability is not the endpoint. It is the customer action.