We had a plan to rebuild our BB (browser-based automation) infrastructure. It looked solid on first pass. Then we ran it through an adversarial review, and it fell apart.
The original framing was optimistic. It assumed things would work. It didn't measure where we started. It didn't have real rollback paths. It ignored known failure modes, like the fact that our OAuth persistence is flaky.
Here's what v2 looks like after the rebuild.
The biggest gap in v1 was that we had no baseline. We couldn't answer "did we actually improve?" because we'd never captured what "before" looked like.
v2 now starts with a mandatory pre-flight phase (P1 through P6) that forces baseline metric collection before any implementation touches the infrastructure:
If you can't measure it at the start, you can't know if the work mattered. That's the move.
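As a rough illustration of why the baseline matters, here is a minimal sketch of a before/after comparison. The metric names and numbers are made up for the example; they are not the actual P1 through P6 measurements.

```typescript
// Hypothetical metric snapshot: metric name -> numeric value.
type Metrics = Record<string, number>;

interface MetricDelta {
  name: string;
  before: number;
  after: number;
  changePct: number | null; // null when the baseline value is 0
}

// Compare a current snapshot against the captured baseline.
// Throws if a baseline metric was not measured again, so gaps are loud.
function compareToBaseline(baseline: Metrics, current: Metrics): MetricDelta[] {
  return Object.keys(baseline).map((name) => {
    const before = baseline[name];
    const after = current[name];
    if (after === undefined) {
      throw new Error(`No current value for baseline metric "${name}"`);
    }
    const changePct = before === 0 ? null : ((after - before) / before) * 100;
    return { name, before, after, changePct };
  });
}

// Illustrative values only.
const baseline: Metrics = { orphanedSessions: 40, authSuccessRatePct: 50 };
const current: Metrics = { orphanedSessions: 10, authSuccessRatePct: 75 };
const deltas = compareToBaseline(baseline, current);
```

Without the `baseline` snapshot there is nothing to pass as the first argument, which is the whole point of the pre-flight phase.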
The original plan glossed over a real problem. Google SSO and Clay session cookies persist in our Contexts at only 40-60% reliability. Sessions die silently. Agents keep running against stale logins and fail downstream.
v2 adds two explicit mechanisms to catch this:
OAuth-expiry detection. A new sentinel check (bb-oauth-monitor.ts) runs daily per Context. When it detects a Google or Clay login redirect, it posts a Discord re-seed prompt to the team. This is not a fix; it's an early warning that something broke.
Cleanup without the semantic mismatch. v1 tried to hook cleanup into Claude Code's SessionStop event. That's a category error: SessionStop fires on code-execution bounds, not on session problems. v2 runs cleanup as a Windows Task Scheduler job every 10 minutes. It's boring, but it works.
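The OAuth-expiry check described above might look something like the sketch below. This is not the actual contents of bb-oauth-monitor.ts. The function names are hypothetical, and while accounts.google.com is Google's sign-in host, the Clay login URL here is an assumption that would need to be confirmed against a real expired session.

```typescript
// Patterns that suggest the browser was bounced to a login page.
// The Clay host and path are assumed, not verified.
const LOGIN_PATTERNS: { provider: string; host: string; pathPrefix: string }[] = [
  { provider: "google", host: "accounts.google.com", pathPrefix: "/" },
  { provider: "clay", host: "app.clay.com", pathPrefix: "/login" }, // assumption
];

// Given the URL the page ended up on after navigation, return the provider
// whose login page it matches, or null if it does not look like a login redirect.
function detectLoginRedirect(finalUrl: string): string | null {
  let parsed: URL;
  try {
    parsed = new URL(finalUrl);
  } catch {
    return null; // not a parseable URL, so not treated as a redirect
  }
  for (const p of LOGIN_PATTERNS) {
    if (parsed.hostname === p.host && parsed.pathname.startsWith(p.pathPrefix)) {
      return p.provider;
    }
  }
  return null;
}

// Build the text of the re-seed prompt (posting it to Discord is left out).
function buildReseedPrompt(contextId: string, provider: string): string {
  return `Context ${contextId}: ${provider} session expired, please re-seed the login.`;
}
```

Keeping detection as a pure function of the final URL means it can be tested without launching a browser; the daily job only has to navigate, read the resulting URL, and call it.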
We've all seen a single bad query turn into a $10k bill overnight. v2 adds a cost cap enforcer (bb-cost-cap.ts) that polls every 2 minutes:
When either threshold is hit, it posts a Discord alert and optionally closes the newest sessions. This is not elegant, but it's predictable. The bill stops.
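A minimal sketch of the decision logic such an enforcer could run on each poll. The two thresholds used here (concurrent sessions and daily spend) are assumptions made for illustration, as are all type and function names; this is not the actual bb-cost-cap.ts.

```typescript
interface SessionInfo {
  id: string;
  startedAt: number; // epoch milliseconds
}

// Hypothetical caps; the real thresholds may differ.
interface CostCaps {
  maxConcurrentSessions: number;
  maxDailySpendUsd: number;
}

interface CapDecision {
  breached: string[];        // which caps were hit; non-empty means "send an alert"
  sessionsToClose: string[]; // newest-first session ids, empty unless closing is enabled
}

function evaluateCostCap(
  sessions: SessionInfo[],
  dailySpendUsd: number,
  caps: CostCaps,
  closeNewest: boolean,
): CapDecision {
  const breached: string[] = [];
  if (sessions.length > caps.maxConcurrentSessions) breached.push("concurrent-sessions");
  if (dailySpendUsd >= caps.maxDailySpendUsd) breached.push("daily-spend");

  let sessionsToClose: string[] = [];
  if (closeNewest && breached.length > 0) {
    // Assumed policy: close enough sessions to get back under the session cap,
    // or a single session if only the spend cap was hit.
    const excess = Math.max(sessions.length - caps.maxConcurrentSessions, 1);
    sessionsToClose = [...sessions]
      .sort((a, b) => b.startedAt - a.startedAt) // newest first
      .slice(0, excess)
      .map((s) => s.id);
  }
  return { breached, sessionsToClose };
}
```

Closing newest-first is the behavior the plan describes; how many sessions to close per poll is a policy choice this sketch guesses at.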
v1 had no gate between "we think this will work" and "this is now running every 10 minutes in production."
v2 requires bb-cleanup.ts to run with a --dry-run flag for 24 hours first. The team manually inspects the logs to make sure it's doing what we said it would do. Only after manual sign-off does it get promoted to a live Task Scheduler job.
This slows us down by a day. It's worth it.
We don't ship infrastructure changes. We ship infrastructure changes that we've watched fail safely first.
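To make the dry-run gate concrete, here is a sketch of how a cleanup script could honor a --dry-run flag. The session shape, the idle threshold, and the closeSession callback are all hypothetical; the real bb-cleanup.ts may select sessions differently.

```typescript
interface IdleSession {
  id: string;
  lastActiveAt: number; // epoch milliseconds
}

// True when the flag is present, e.g. parseDryRun(process.argv.slice(2)) under Node.
function parseDryRun(argv: string[]): boolean {
  return argv.includes("--dry-run");
}

// Walk the sessions and either close the idle ones or only log what would happen.
// Returns the log lines so a human can review them before sign-off.
function runCleanup(
  sessions: IdleSession[],
  now: number,
  maxIdleMs: number,
  dryRun: boolean,
  closeSession: (id: string) => void,
): string[] {
  const log: string[] = [];
  for (const s of sessions) {
    const idleMs = now - s.lastActiveAt;
    if (idleMs <= maxIdleMs) continue; // still active enough, leave it alone
    if (dryRun) {
      log.push(`[dry-run] would close ${s.id} (idle ${idleMs} ms)`);
    } else {
      closeSession(s.id);
      log.push(`closed ${s.id} (idle ${idleMs} ms)`);
    }
  }
  return log;
}
```

The dry-run and live paths share the same selection logic, so the 24 hours of logs show exactly which sessions the live job would have touched.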
v1 had no rollback procedure. If something went wrong, the answer was "delete the scripts and restart services manually." That's not a plan.
Phase D in v2 adds rollback infrastructure:
NemoClaw pre-router work was in the original scope. It looked related. It wasn't. It's a separate problem (routing efficiency). BB infrastructure is a separate problem (browser collision and token cost). v2 descopes the pre-router work and moves it to a follow-up. Scope creep is not focus.
The original had a 10-item verification checklist that was mostly "does this feel right?" v2 expands it to 12 items with explicit, machine-readable checks:
If you can't verify it in a log or a console, it's not verified.
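One way to read "machine-readable" is that every checklist item becomes a function returning pass or fail plus the evidence it looked at. The sketch below assumes that shape; the check shown is a hypothetical example, not one of the 12 items.

```typescript
interface CheckResult {
  name: string;
  passed: boolean;
  evidence: string; // what was inspected, so the result can be audited
}

type Check = () => CheckResult;

// Run every check and report whether the whole list passed.
function runChecks(checks: Check[]): { allPassed: boolean; results: CheckResult[] } {
  const results = checks.map((check) => check());
  return { allPassed: results.every((r) => r.passed), results };
}

// Hypothetical check: the given log lines contain no "[error]" entries.
function noErrorsInLog(logLines: string[]): Check {
  return () => {
    const errors = logLines.filter((line) => line.includes("[error]"));
    return {
      name: "cleanup-log-has-no-errors",
      passed: errors.length === 0,
      evidence:
        errors.length === 0
          ? `${logLines.length} lines scanned, 0 errors`
          : errors[0],
    };
  };
}
```

A check that cannot produce an `evidence` string is a sign the item is still a "does this feel right?" judgment in disguise.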
One small move with outsized impact: session timeout dropped from 1 hour to 30 minutes. Shorter-lived sessions mean fewer orphans. Fewer orphans mean lower storage cost and faster cleanup. We kept keepAlive enabled so active sessions don't drop.
This is the kind of micro-change that feels obvious after someone says it, but it wasn't in the original plan.
WRITEBACK_PROTOCOL.md is boring. It documents one flow: how recording URLs get written back to ClickUp. It respects the existing pm-write-reviewer and pm-write-agent gates instead of bypassing them with direct writes.
No one wants to read it. Everyone building against this infrastructure needs to read it. v2 treats documentation as a safety gate, not an afterthought.