How we added safety gates to a browser-automation rebuild

We had a plan to rebuild our browser-automation infrastructure. It looked solid on first pass. Then we ran it through an adversarial review, and it fell apart.

The original framing was optimistic. It assumed things would work. It didn't measure where we started. It didn't have real rollback paths. It ignored known failure modes, like the fact that our OAuth persistence is flaky.

Here's what v2 of the plan looks like after that review.

The pre-flight phase: measure first, break things second

The biggest gap in v1 was that we had no baseline. We couldn't answer "did we actually improve?" because we'd never captured what "before" looked like.

v2 now starts with a mandatory pre-flight phase (P1 through P6) that forces baseline metric collection before any implementation work touches the infrastructure. The steps include:

Current spend. How much are we burning per day right now?
Orphan rate. How many stale contexts are sitting in memory?
Recording storage costs. What's the disk footprint of our session logs?
Feature verification. Can we actually do what we're about to build?
Cost cap decision. What's the hard ceiling we're willing to hit before sessions get closed?

If you can't measure it at the start, you can't know if the work mattered. That's the move.
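
One way to make the baseline concrete is to write it to disk as a single snapshot that later phases can diff against. This is a minimal sketch, not our actual tooling: the `Baseline` class, its field names, and `save_baseline` are all hypothetical, and feature verification is left out because it is a yes/no check rather than a metric.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Baseline:
    """One pre-flight snapshot. Every field name here is hypothetical."""
    captured_at: str             # ISO-8601 timestamp of the measurement
    daily_spend_usd: float       # current spend per day
    orphan_contexts: int         # stale contexts still sitting in memory
    total_contexts: int          # all contexts, stale or not
    recording_storage_gb: float  # disk footprint of session logs
    daily_cost_cap_usd: float    # the agreed hard ceiling

    @property
    def orphan_rate(self) -> float:
        # Guard against an empty fleet so the ratio is always defined.
        if self.total_contexts == 0:
            return 0.0
        return self.orphan_contexts / self.total_contexts


def save_baseline(baseline: Baseline, path: str) -> None:
    """Persist the snapshot so "after" can be compared with "before"."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(asdict(baseline), fh, indent=2)
```

The point of freezing the dataclass is that a baseline should never be edited after the fact; a new measurement is a new snapshot.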

Honest framing: OAuth breaks in ways we need to detect

The original plan glossed over a real problem. Third-party SSO logins and session cookies survive in our browser contexts only 40-60% of the time. Sessions die silently. Agents keep running against stale logins and fail downstream.

v2 adds two explicit mechanisms to catch this:

OAuth-expiry detection. A new sentinel check runs daily per context. When it detects a redirect to a third-party login page, it posts a re-seed prompt to the team chat. This is not a fix. It's an early warning that something broke.

Cleanup without the semantic mismatch. v1 tried to hook cleanup into our agent runtime's session-stop event. That's a category error: the session-stop event fires on code-execution bounds, not on session problems. v2 runs cleanup as a scheduled system job every 10 minutes. It's boring, but it works.
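
The sentinel check can be sketched as a comparison between the URL a context ends up on and a list of known login hosts. This is an illustration under assumptions: `LOGIN_HOSTS`, `landed_on_login`, and `check_context` are hypothetical names, the two hosts are only examples of identity-provider domains, and `notify` stands in for whatever posts to the team chat.

```python
from urllib.parse import urlparse

# Hypothetical list of identity-provider hosts; replace with the
# providers your contexts actually authenticate against.
LOGIN_HOSTS = frozenset({"accounts.google.com", "login.microsoftonline.com"})


def landed_on_login(final_url: str, login_hosts=LOGIN_HOSTS) -> bool:
    """True if the URL the context ended up on belongs to a login host."""
    host = (urlparse(final_url).hostname or "").lower()
    return host in login_hosts


def check_context(context_id: str, final_url: str, notify) -> bool:
    """Post a re-seed prompt if the context was bounced to a login page.

    `notify` is any callable that accepts a message string.
    Returns True when an alert was sent.
    """
    if not landed_on_login(final_url):
        return False
    host = urlparse(final_url).hostname
    notify(f"Context {context_id} was redirected to {host}; re-seed needed.")
    return True
```

Matching on the hostname rather than the full URL keeps the check stable when a provider changes its login path or query parameters.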

Cost runaway: two-minute polling with hard stops

We've all seen a single bad query turn into a $10k bill overnight. v2 adds a cost-cap enforcer that polls every 2 minutes:

Max-concurrent limit. How many sessions can we run at once?
Daily-cap limit. What's the total spend we'll allow in a day?

When either threshold is hit, it posts an alert to the team chat and optionally closes the newest sessions. This is not elegant, but it's predictable. The bill stops.
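
The decision the enforcer makes on each poll fits in two small functions. This is a sketch, assuming a hypothetical `Session` record and that running exactly at the concurrency limit is allowed while reaching the daily cap counts as a breach; how many sessions to close on a daily-cap breach is a policy choice left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical record of one active browser session."""
    session_id: str
    started_at: float  # Unix timestamp


def check_caps(active, spend_today_usd, max_concurrent, daily_cap_usd):
    """Return a list of human-readable breach messages (empty if none)."""
    breaches = []
    # Assumption: exactly max_concurrent sessions is still within limits.
    if len(active) > max_concurrent:
        breaches.append(
            f"max-concurrent exceeded: {len(active)} > {max_concurrent}"
        )
    # Assumption: reaching the cap counts as hitting it.
    if spend_today_usd >= daily_cap_usd:
        breaches.append(
            f"daily cap reached: ${spend_today_usd:.2f} >= ${daily_cap_usd:.2f}"
        )
    return breaches


def newest_sessions(active, count):
    """Pick the `count` most recently started sessions as close candidates."""
    if count <= 0:
        return []
    return sorted(active, key=lambda s: s.started_at, reverse=True)[:count]
```

A scheduler would call `check_caps` every 2 minutes, post any messages to chat, and pass the excess count to `newest_sessions` only when closing is enabled.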

Dry-run gates: 24 hours of watching logs before we go live

v1 had no gate between "we think this will work" and "this is now running every 10 minutes in production."

v2 requires the cleanup job to run with a dry-run flag for 24 hours first. The team manually inspects the logs to make sure it's doing what we said it would do. Only after manual sign-off does it get promoted to a live scheduled job.
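
A dry-run flag is cheap to build if the selection logic and the destructive call are kept apart. The sketch below assumes contexts are tracked as a mapping from ID to last-activity timestamp and that `close` is whatever function actually tears a context down; both are hypothetical, as is the function name.

```python
import logging

log = logging.getLogger("cleanup")


def cleanup_stale(contexts, now, max_idle_seconds, close, dry_run=True):
    """Close contexts idle for longer than max_idle_seconds.

    contexts: dict mapping context ID -> last-activity Unix timestamp.
    close:    callable that closes one context by ID.
    dry_run:  when True, only log what would be closed.
    Returns the list of context IDs that were selected.
    """
    stale = [
        cid for cid, last_active in contexts.items()
        if now - last_active > max_idle_seconds
    ]
    for cid in stale:
        idle = now - contexts[cid]
        if dry_run:
            log.info("DRY-RUN would close %s (idle %.0fs)", cid, idle)
        else:
            log.info("closing %s (idle %.0fs)", cid, idle)
            close(cid)
    return stale
```

Defaulting `dry_run` to True means a missing flag fails safe: the job has to be told explicitly to delete anything.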

This slows us down by a day. It's worth it.

We don't ship untested infrastructure changes. We ship changes that we've watched fail safely first.

Rollback: a single command to restore

v1 had no rollback procedure. If something went wrong, the answer was "delete the scripts and restart services manually." That's not a plan.

Phase D in v2 adds rollback infrastructure:

Pre-prune rollback testing. Before we remove any browser profiles, we test the restore path on a staging context.
A single restore command. It idempotently restores all deleted profiles and configs. Run it once. Run it ten times. Same result.
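
Idempotence here mostly comes down to one rule: never touch what is already in place. A minimal sketch, assuming profiles and configs are backed up as top-level entries in a directory; `restore_profiles` and the directory layout are hypothetical.

```python
import shutil
from pathlib import Path


def restore_profiles(backup_dir: Path, live_dir: Path) -> list:
    """Copy every backed-up entry that is missing from live_dir.

    Entries that already exist are skipped, so repeated runs converge
    on the same state. Returns the names restored by this run.
    Limitation: a copy interrupted halfway leaves a partial entry that
    later runs will skip; a real command would need to handle that.
    """
    live_dir.mkdir(parents=True, exist_ok=True)
    restored = []
    for src in sorted(backup_dir.iterdir()):
        dest = live_dir / src.name
        if dest.exists():
            continue  # already in place: nothing to do
        if src.is_dir():
            shutil.copytree(src, dest)
        else:
            shutil.copy2(src, dest)
        restored.append(src.name)
    return restored
```

The first run returns the names it restored; every later run returns an empty list and changes nothing.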

What we dropped

A pre-router project was in the original scope. It looked related. It wasn't. The pre-router solves one problem (routing efficiency). The browser-automation work solves another (browser collision and token cost). v2 descopes the pre-router work and moves it to a follow-up. Bundling the two would have been scope creep, not focus.

Verification: 12 explicit checks, not aspirational intent

The original had a 10-item verification checklist that was mostly "does this feel right?" v2 expands it to 12 items, each an explicit, machine-readable check. Among them:

Cost cap logs show the polling loop is alive.
The OAuth monitor posts alerts when redirects are detected.
Rollback dry-run completes without errors.
The pre-prune audit has checked all repos and config files for references to soon-to-be-removed profiles.

If you can't verify it in a log or a console, it's not verified.
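
"Machine-readable" means each check reduces to a function that returns True or False. As an illustration, here is how the first check might look, assuming the last poll time can be read from the cost-cap log. The function names and the two-interval tolerance are assumptions, not part of the plan.

```python
from datetime import datetime, timedelta


def polling_loop_alive(last_poll: datetime, now: datetime,
                       interval: timedelta = timedelta(minutes=2),
                       missed_allowed: int = 2) -> bool:
    """True if the most recent poll is at most `missed_allowed` intervals old."""
    return now - last_poll <= interval * missed_allowed


def run_checks(checks: dict) -> dict:
    """Run named zero-argument checks and return name -> pass/fail."""
    return {name: bool(check()) for name, check in checks.items()}
```

A results dictionary like this can be logged as-is, which keeps the verification itself visible in a log.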

The session timeout change

One small move with outsized impact: the session timeout dropped from 1 hour to 30 minutes. Shorter-lived sessions mean fewer orphans. Fewer orphans mean lower storage cost and faster cleanup. We kept keep-alive enabled so active sessions don't drop.

This is the kind of micro-change that feels obvious after someone says it, but it wasn't in the original plan.

The infrastructure doc we almost didn't write

The write-back protocol doc is boring. It documents one flow: how recording URLs get written back to our task system. It respects the existing write-review and write-execution gates instead of bypassing them with direct writes.

No one wants to read it. Everyone building against this infrastructure needs to read it. v2 treats documentation as a safety gate, not an afterthought.