Shipped a feature Friday. Woke up Saturday to three broken integrations.

May 7, 2026

A late-week deploy. A refactor to the part of the stack that decides whether a draft is allowed to publish. The next morning, drafts that were supposed to ship had not shipped. Deploy logs were green. The health endpoint was green. The work product was missing. This piece is for senior operators who run autonomous pipelines and want the exact observability change that turns a Saturday-morning incident into a Friday-evening rollback.

Reading the logs to find the cause took longer than it should have, because the failure mode was perfectly silent. No 500. No alert. No queue backed up. Just refusals firing on schedule, indistinguishable from a healthy system that happened to have nothing to ship. Every observability decision flows from the question this incident forced: did anything change in the world as a result of the work the system reported it did?

What broke when we shipped Friday and woke up Saturday to broken integrations

The refactor added a final voice scan at the publish boundary. The scan worked: it correctly flagged drafts that had been marked clean under an older scanner. The bug was in the rejection path. The code that should have re-queued the rejected drafts for rewriting was wrapped in a try/catch that swallowed errors. The drafts kept being rejected. Nothing was ever re-queued. Posts sat in "scheduled" forever, the dispatcher refused them on every tick, and the system reported success on each refusal.

The dispatcher logged "refusal: voice scan failed" on every tick. Standard log format. The refusal was correct. The follow-up action was missing. From the dispatcher's point of view, the work was done: it made the right call about each draft. From the world's point of view, no posts shipped, no re-queues happened, and the entire publishing pipeline was a no-op for sixteen hours.

The shape of a green-and-broken state

Green-and-broken is the natural state of any system where the trigger and the work product live on different layers. The trigger fires, the framework reports success, and the work product is missing because the layer that produces it is silently in an error path. The dispatcher reported truth as it understood it. The understanding was incomplete.

This is not a bug in the dispatcher. It is a bug in the observability contract. The dispatcher was instrumented for "did I make a decision" instead of "did the decision result in something happening." Most autonomous systems are instrumented the first way by default. The fix is small. The discipline is harder.

Why try/catch on Friday is the worst kind of refactor

A try/catch that swallows errors is the most common silent-failure pattern in autonomous code. It is added under deploy pressure, it covers a real edge case, and it converts an observable failure into an unobservable one. Every deploy that adds a try/catch on a hot path needs an explicit answer to the question: what does the system do about the caught error, and where does the answer get logged?
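A minimal sketch of the difference, in Python (where the equivalent construct is try/except). The function names, the `requeue_for_rewrite` helper, and the action strings are hypothetical and stand in for whatever the real rejection path calls; the helper always fails here so that both handlers hit the error path.

```python
import logging

logger = logging.getLogger("dispatcher")


def requeue_for_rewrite(draft_id: str) -> None:
    """Hypothetical re-queue call; it always fails here to simulate the bug."""
    raise ConnectionError("rewrite queue unreachable")


def handle_rejection_silent(draft_id: str) -> None:
    # Anti-pattern: the error is swallowed, so the caller sees success
    # and nothing records that the re-queue never happened.
    try:
        requeue_for_rewrite(draft_id)
    except Exception:
        pass


def handle_rejection_observable(draft_id: str) -> str:
    # Same catch, but the handler answers both questions: what the system
    # did about the error, and where that answer is logged.
    try:
        requeue_for_rewrite(draft_id)
        return "requeued"
    except Exception:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception("requeue failed for draft %s", draft_id)
        return "requeue_failed"
```

The second handler still catches the error, so the hot path keeps running. The difference is that the outcome is returned and logged instead of discarded, which gives a later check something to count.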

The framework: three log fields that turn silent loops into something a human can fix

Every refusal in an autonomous pipeline needs three fields in the log. Without all three, you are operating blind on the most failure-prone class of work in the stack.

The reason. Why did this artifact get refused? Voice scan failure. Schema mismatch. Missing field. Rate-limit ceiling. The reason is the easy field; most systems already log it.

The artifact id. Which specific draft, order, row, or file got refused. Without the id you can count refusals but you cannot trace them. A refusal count without ids is a metric without a debug path.

What the system did about it. Re-queued. Routed to manual review. Dropped on the floor. Marked as permanently failed. This is the field most teams skip, and it is the one that turns a silent loop into something a human can fix. Without it, you see refusals firing on schedule and assume the system is working. With it, you can answer the only question that matters: did anything change as a result?

The third field is the heartbeat for the rejection path. The other two fields tell you the system noticed the problem. The third tells you whether the system did anything useful with the noticing.
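One way the three fields might look as a structured log record, sketched in Python. The field names, the `RefusalRecord` type, and the set of allowed actions are assumptions chosen for illustration, not a format the pipeline described here is known to use.

```python
import json
import logging
from dataclasses import asdict, dataclass

logger = logging.getLogger("dispatcher.refusals")

# Assumed vocabulary for the third field; adjust to the real rejection paths.
ALLOWED_ACTIONS = {"requeued", "manual_review", "dropped", "permanently_failed"}


@dataclass(frozen=True)
class RefusalRecord:
    reason: str        # why the artifact was refused
    artifact_id: str   # which draft, order, row, or file
    action_taken: str  # what the system did about it

    def __post_init__(self) -> None:
        # Refuse to build a record that is missing any of the three fields.
        if not self.reason or not self.artifact_id:
            raise ValueError("reason and artifact_id are required")
        if self.action_taken not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action_taken: {self.action_taken!r}")


def log_refusal(record: RefusalRecord) -> str:
    """Emit one JSON line per refusal and return it for inspection."""
    line = json.dumps({"event": "refusal", **asdict(record)}, sort_keys=True)
    logger.info(line)
    return line
```

Making `action_taken` a required, validated field means a refusal path cannot log a decision without also stating the follow-up, which is the property the third field exists to enforce.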

Runbook: the post-deploy health check that watches outputs

1. Identify every autonomous loop the deploy touched. A loop is anything that runs without a human in the request path. Crons, dispatchers, agent webhooks, scheduled flows. List them before you ship.
2. For each loop, write down the expected output in the next window. "Drafts published in the next four hours." "Orders created on this device class in the next hour." "Rows written by this agent in the next tick." Specific. Counted.
3. Set a post-deploy health-check window that matches the cadence. If the loop runs every five minutes, the health check fires at fifteen minutes post-deploy, looks at the last three windows, and confirms expected output is in the world.
4. Watch the OUTPUT, not the deploy logs. Deploy logs say the deploy worked. The work product says the system is doing the job. The two are not the same, and conflating them is exactly how Saturday-morning incidents happen.
5. If the post-deploy health check fails, roll back before the silence compounds. The cost of a rollback at fifteen minutes is small. The cost at sixteen hours is a brand-level incident.
6. Add the three log fields (reason, artifact id, action taken) to every refusal path the deploy touched. The added cost is minutes per path. The payback is one incident avoided.
7. Run a deliberate refusal in staging. Confirm the post-deploy check catches it. If it does not, the check is wrong, not the system.
8. Make the post-deploy check part of the deploy gate, not a separate ritual. The gate is what makes Friday deploys safe, not the calendar.
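Step 3 can be sketched as a small function that buckets output timestamps into cadence windows after the deploy. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes you can query the timestamps of real work products (published drafts, created orders, written rows) from wherever they land.

```python
from datetime import datetime, timedelta
from typing import Iterable


def outputs_per_window(
    output_times: Iterable[datetime],
    deploy_time: datetime,
    cadence: timedelta,
    windows: int = 3,
) -> list[int]:
    """Count outputs in each of the first `windows` cadence windows after deploy."""
    counts = [0] * windows
    for t in output_times:
        offset = t - deploy_time
        # Ignore outputs from before the deploy or after the checked span.
        if timedelta(0) <= offset < cadence * windows:
            counts[offset // cadence] += 1
    return counts


def post_deploy_check(counts: list[int], expected_min_per_window: int = 1) -> bool:
    """Pass only if every window produced at least the expected output."""
    return all(c >= expected_min_per_window for c in counts)
```

For a loop on a five-minute cadence, calling `outputs_per_window` fifteen minutes after the deploy with `cadence=timedelta(minutes=5)` covers the three windows the runbook describes. A `False` from `post_deploy_check` is the rollback signal in step 5.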

When this is wrong: trade-offs and edge cases

The standard advice is "don't deploy on Friday." That is not actually the lesson here. The lesson is that any change that touches an autonomous loop needs a post-deploy health check that watches the output, not the deploy logs. The day of the week is downstream of the discipline. A Tuesday deploy with no post-deploy check is the same risk as a Friday deploy with no post-deploy check. The calendar is a coping mechanism for missing instrumentation.

There is a real cost to over-instrumenting. Logging every action taken on every refusal is noisy on high-throughput pipelines. The fix is to emit the action as a structured, queryable log field rather than free text, and to aggregate it for the operator panel. The signal is the count of refusals where the action was "dropped on the floor." That number should be near zero on every healthy day.
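A rough sketch of that aggregation in Python. It assumes each refusal is logged as one JSON line with an `event` field set to `"refusal"` and an `action_taken` field; both names are illustrative, not a known format.

```python
import json
from collections import Counter
from typing import Iterable


def action_counts(log_lines: Iterable[str]) -> Counter:
    """Tally refusals by the action the system took."""
    counts: Counter = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Non-JSON lines are counted rather than silently skipped.
            counts["unparseable"] += 1
            continue
        if isinstance(record, dict) and record.get("event") == "refusal":
            # A refusal with no action field is itself a finding.
            counts[record.get("action_taken", "missing")] += 1
    return counts
```

The operator panel then only needs the `dropped` and `missing` counts. Both should sit near zero on a healthy day.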

Post-deploy checks also fail when the comparison is wrong. A check that says "did anything ship in the next window" is weaker than "did the same volume ship as the last comparable window." Slow drifts pass the absolute check and fail the relative one. Match the comparison to the failure mode you are trying to catch.
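The two comparisons can be sketched side by side. The function names and the 50% tolerance are assumptions for illustration; the right tolerance depends on how much the loop's volume normally varies between comparable windows.

```python
def absolute_check(shipped: int) -> bool:
    """Did anything ship in the window?"""
    return shipped > 0


def relative_check(shipped: int, baseline: int, tolerance: float = 0.5) -> bool:
    """Did roughly the same volume ship as in the last comparable window?

    `tolerance` is the fraction of the baseline the volume may fall short by.
    """
    if not 0.0 <= tolerance <= 1.0:
        raise ValueError("tolerance must be between 0 and 1")
    return shipped >= baseline * (1.0 - tolerance)
```

With a baseline of 40 drafts in the last comparable window, a window that ships 3 passes `absolute_check` and fails `relative_check`. That gap is the slow drift the absolute comparison cannot see.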

What success looks like

When you build autonomous systems, you stop being able to trust green checks. The system can be green and broken at the same time because the unit of work is no longer the request and response. The unit of work is whether something happened in the world that should have happened. A short outage you catch is a cheap lesson. A long stretch of silent rejection is a brand incident. The difference is one well-named log field and a post-deploy health check that watches outputs, not status codes.

On the architectures we run on /websites-cro and /email-and-sms, post-deploy health checks are part of the deploy gate. They cost a single-digit number of minutes per deploy. They pay back the first time a refactor enters a try/catch trap and the check catches it before the dispatcher runs through a full quiet window. The change is qualitative but meaningful: Saturday-morning incidents go from a regular event to an outlier.

FAQ

Should I just stop deploying on Friday? No. The calendar rule is a coping mechanism for missing instrumentation. Add the post-deploy health check that watches outputs and any day of the week is the same risk. The discipline scales. The calendar does not.

What is the smallest version of this I can ship? Three log fields on every refusal path (reason, artifact id, action taken) plus a single post-deploy check that confirms expected output in the next cadence window. Two evenings of work. Pays back on the first caught incident.

Does this apply to deploys that don't touch autonomous loops? Less so. A static-site deploy with no async work behind it has the same shape as the deploy log: green means green. The discipline matters specifically where the trigger and the work product live on different layers.

How is this different from monitoring? Monitoring runs continuously. The post-deploy health check runs once after a deploy, on the specific loops the deploy touched, with the comparison set to "last comparable window." It is a focused subset of monitoring tied to the change you just made.

What if the rollback itself is risky? Then the architecture is wrong upstream of this article. Every autonomous loop should ship with a rollback path that is reversible inside one cadence window. If the rollback is risky, the loop is more brittle than the deploy gate, and the deploy gate is not your biggest problem.

If you want a 30-minute review of your post-deploy gates on autonomous loops, the calendar is here: arthea.ai/book.
