How we added safety gates to a browser-automation rebuild
We had a plan to rebuild our BB (browser-based automation) infrastructure. It looked solid on first pass. Then we ran it through an adversarial review, and it fell apart.
The original framing was optimistic. It assumed things would work. It didn't measure where we started. It didn't have real rollback paths. It ignored known failure modes, such as our flaky OAuth persistence.
Here's what v2 of the plan looks like after that review.
The pre-flight phase: measure first, break things second
The biggest gap in v1 was that we had no baseline. We couldn't answer "did we actually improve?" because we'd never captured what "before" looked like.
v2 now starts with a mandatory pre-flight phase (P1 through P6) that forces baseline metric collection before any implementation touches the infrastructure:
- Current BB spend. How much are we burning per day right now?
- Orphan rate. How many stale contexts are sitting in memory?
- Recording storage costs. What's the disk footprint of our session logs?
- BB API feature verification. Can we actually do what we're about to build?
- Cost cap decision. What's the hard ceiling we're willing to hit before sessions get closed?
If you can't measure it at the start, you can't know whether the work mattered. That's why the baseline comes first.
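A minimal sketch of how the pre-flight gate could be enforced in code, assuming the baseline is captured as a single record. The field names and both helper functions are hypothetical, chosen for illustration only; they are not taken from the actual plan.

```typescript
// Hypothetical shape for the pre-flight baseline. Field names are illustrative.
interface Baseline {
  dailySpendUsd?: number;        // current BB spend per day
  orphanedContexts?: number;     // stale contexts found at capture time
  recordingStorageGb?: number;   // disk footprint of session recordings
  apiFeaturesVerified?: boolean; // result of the BB API feature verification
  dailyCapUsd?: number;          // the agreed hard spending ceiling
}

const REQUIRED_FIELDS: (keyof Baseline)[] = [
  "dailySpendUsd",
  "orphanedContexts",
  "recordingStorageGb",
  "apiFeaturesVerified",
  "dailyCapUsd",
];

// Returns the names of baseline fields that have not been captured yet.
function missingBaselineFields(b: Baseline): (keyof Baseline)[] {
  return REQUIRED_FIELDS.filter((k) => b[k] === undefined);
}

// Implementation may start only when every field is present and the
// API feature verification actually passed.
function preflightPassed(b: Baseline): boolean {
  return missingBaselineFields(b).length === 0 && b.apiFeaturesVerified === true;
}
```

The point of returning the missing field names, rather than a bare boolean, is that the gate can say exactly which measurement is still outstanding.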
Honest framing: OAuth breaks in ways we need to detect
The original plan glossed over a real problem. Google SSO and Clay session cookies persist in our Contexts only 40-60% of the time. Sessions die silently. Agents keep running against stale logins and fail downstream.
v2 adds two explicit mechanisms to catch this:
OAuth-expiry detection. A new sentinel check (bb-oauth-monitor.ts) runs daily per Context. When it detects a Google or Clay login redirect, it posts a Discord re-seed prompt to the team. This is not a fix. It's an early warning that something broke.
Cleanup without the semantic mismatch. v1 tried to hook cleanup into Claude Code's SessionStop event. That's a category error: SessionStop fires on code-execution boundaries, not on session problems. v2 runs cleanup as a Windows Task Scheduler job every 10 minutes. It's boring, but it works.
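The redirect check at the heart of the OAuth monitor could look roughly like the sketch below. It assumes the monitor knows which URL it asked for and which URL the browser finally landed on. The function names are hypothetical, and the Clay hostname is a placeholder assumption, not the real login host.

```typescript
// Hostnames that mean "you were bounced to a login page".
// accounts.google.com is Google's sign-in host; the Clay entry is a
// placeholder assumption and would need to be replaced with the real one.
const LOGIN_HOSTS: string[] = ["accounts.google.com", "login.clay.example"];

// Extracts the lowercase hostname from an absolute http(s) URL, or null.
function hostnameOf(url: string): string | null {
  const m = /^https?:\/\/(?:[^\/?#@]*@)?([^\/?#:]+)/i.exec(url);
  return m ? m[1].toLowerCase() : null;
}

// True when the browser ended up on a known login host that is different
// from the host that was originally requested.
function isLoginRedirect(requestedUrl: string, finalUrl: string): boolean {
  const requested = hostnameOf(requestedUrl);
  const landed = hostnameOf(finalUrl);
  if (landed === null) return false;
  return landed !== requested && LOGIN_HOSTS.indexOf(landed) !== -1;
}
```

A positive result here only says the session needs re-seeding. Posting the Discord prompt is left out of the sketch because it depends on how the team's webhook is configured.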
Cost runaway: two-minute polling with hard stops
We've all seen a single bad query turn into a $10k bill overnight. v2 adds a cost cap enforcer (bb-cost-cap.ts) that polls every 2 minutes:
- MAX_CONCURRENT limit. How many sessions can we run at once?
- DAILY_CAP limit. What's the total spend we'll allow in a day?
When either threshold is hit, it posts a Discord alert and optionally closes the newest sessions. This is not elegant, but it's predictable. The bill stops.
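One way to keep the enforcer predictable is to make the decision a pure function that each 2-minute poll calls. The sketch below is written under assumptions: the type names, the closeOnBreach flag, and the choice to treat every session as a close candidate once the daily cap is reached are ours for illustration, not confirmed details of bb-cost-cap.ts.

```typescript
interface SessionInfo { id: string; startedAt: number } // startedAt in epoch ms
interface CapConfig { maxConcurrent: number; dailyCapUsd: number; closeOnBreach: boolean }
interface CapDecision { breaches: string[]; sessionsToClose: string[] }

// Decides what a single poll should do. It performs no I/O, so the same
// inputs always produce the same decision.
function evaluateCaps(
  sessions: SessionInfo[],
  spendTodayUsd: number,
  cfg: CapConfig
): CapDecision {
  const breaches: string[] = [];
  let candidates: SessionInfo[] = [];
  const newestFirst = sessions.slice().sort((a, b) => b.startedAt - a.startedAt);

  if (sessions.length > cfg.maxConcurrent) {
    breaches.push("MAX_CONCURRENT");
    // Close only as many of the newest sessions as needed to get back under the limit.
    candidates = newestFirst.slice(0, sessions.length - cfg.maxConcurrent);
  }
  if (spendTodayUsd >= cfg.dailyCapUsd) {
    breaches.push("DAILY_CAP");
    // Assumption: once the daily budget is spent, every session is a candidate.
    candidates = newestFirst;
  }
  return {
    breaches,
    sessionsToClose: cfg.closeOnBreach ? candidates.map((s) => s.id) : [],
  };
}
```

A non-empty breaches list would trigger the Discord alert; sessionsToClose stays empty unless closing is switched on, which mirrors the "optionally closes" behavior.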
Dry-run gates: 24 hours of watching logs before we go live
v1 had no gate between "we think this will work" and "this is now running every 10 minutes in production."
v2 requires bb-cleanup.ts to run with a --dry-run flag for 24 hours first. The team manually inspects the logs to make sure it's doing what we said it would do. Only after manual sign-off does it get promoted to a live Task Scheduler job.
This slows us down by a day. It's worth it.
We don't ship untested infrastructure changes. We ship changes that we've watched fail safely first.
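The --dry-run gate described above can be sketched as a single code path that logs what it would do and skips the destructive call. This is an illustration under assumptions: runCleanup, its parameters, and the log format are hypothetical, and the real bb-cleanup.ts may be structured differently.

```typescript
interface CleanupResult { dryRun: boolean; closed: string[]; log: string[] }

// Walks the orphan list once. In dry-run mode it only records what it
// would have closed; otherwise it calls closeSession for each orphan.
function runCleanup(
  orphanIds: string[],
  args: string[],
  closeSession: (id: string) => void
): CleanupResult {
  const dryRun = args.indexOf("--dry-run") !== -1;
  const closed: string[] = [];
  const log: string[] = [];
  for (const id of orphanIds) {
    if (dryRun) {
      log.push(`[dry-run] would close ${id}`);
    } else {
      closeSession(id);
      closed.push(id);
      log.push(`closed ${id}`);
    }
  }
  return { dryRun, closed, log };
}
```

In a real script, args would typically come from process.argv.slice(2) in Node.js. Keeping the dry-run and live paths in the same loop means the logs reviewed during the 24-hour window come from the same selection logic that later runs for real.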
Rollback: a single command to restore
v1 had no rollback procedure. If something went wrong, the answer was "delete the scripts and restart services manually." That's not a plan.
Phase D in v2 adds rollback infrastructure:
- Pre-prune rollback testing. Before we remove any MCP profiles, we test the restore path on a staging Context.
- bb-restore.sh. A single command that idempotently restores all deleted profiles and configs. Run it once. Run it ten times. Same result.
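The idempotency property is worth spelling out. bb-restore.sh is a shell script, so the sketch below is not its implementation; it is a small TypeScript model of the same idea, with a hypothetical in-memory store standing in for profiles and configs on disk.

```typescript
// Hypothetical model: profile name -> config file contents.
type ProfileStore = { [name: string]: string };

// Copies every backed-up profile that is missing from, or different in,
// the live store. Returns the names actually written on this run.
function restoreProfiles(backup: ProfileStore, live: ProfileStore): string[] {
  const written: string[] = [];
  for (const name of Object.keys(backup)) {
    if (live[name] !== backup[name]) {
      live[name] = backup[name];
      written.push(name);
    }
  }
  return written;
}
```

The first run writes whatever is missing. Every later run finds nothing to change and returns an empty list, which is what "run it once, run it ten times, same result" means in practice.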
What we dropped
NemoClaw pre-router work was in the original scope. It looked related, but it isn't. The pre-router is a routing-efficiency problem. BB infrastructure is a browser-collision and token-cost problem. v2 descopes the pre-router work and moves it to a follow-up, so this plan stays focused on one problem.
Verification: 12 explicit checks, not aspirational intent
The original had a 10-item verification checklist that was mostly "does this feel right?" v2 expands it to 12 items, each an explicit, machine-readable check, including:
- Cost cap logs show the polling loop is alive.
- OAuth monitor posts Discord alerts when redirects are detected.
- Rollback dry-run completes without errors.
- Pre-prune audit greps all repos and config files for references to soon-to-be-removed profiles.
If you can't verify it in a log or a console, it's not verified.
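As an example of what "machine-readable" can mean, the pre-prune audit from the list above could be reduced to a function that returns every remaining reference. This is a sketch under assumptions: the function name and input shape are hypothetical, and file contents are passed in as strings so the example stays self-contained.

```typescript
interface ProfileReference { file: string; line: number; profile: string }

// Scans file contents for any mention of a profile that is about to be
// removed. An empty result is the pass condition for the audit.
function findProfileReferences(
  files: { [path: string]: string },
  profiles: string[]
): ProfileReference[] {
  const hits: ProfileReference[] = [];
  for (const path of Object.keys(files)) {
    const lines = files[path].split("\n");
    lines.forEach((text, i) => {
      for (const profile of profiles) {
        if (text.indexOf(profile) !== -1) {
          hits.push({ file: path, line: i + 1, profile }); // 1-based line numbers
        }
      }
    });
  }
  return hits;
}
```

A check like this either returns zero hits or it names the file and line that still depends on the profile, so there is nothing left to judge by feel.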
The session timeout change
One small move with outsized impact: session timeout dropped from 1 hour to 30 minutes. Shorter-lived sessions mean fewer orphans. Fewer orphans mean lower storage cost and faster cleanup. We kept keepAlive enabled so active sessions don't drop.
This is the kind of micro-change that feels obvious after someone says it, but it wasn't in the original plan.
The infrastructure doc we almost didn't write
WRITEBACK_PROTOCOL.md is boring. It documents one flow: how recording URLs get written back to ClickUp. It respects the existing pm-write-reviewer and pm-write-agent gates instead of bypassing them with direct writes.
No one wants to read it. Everyone building against this infrastructure needs to read it. v2 treats documentation as a safety gate, not an afterthought.