Why we rebuilt our browser agent on local infrastructure

In May, we hit a wall. Our browser-automation agents were running on a managed cloud browser service, and when we tried to push structured output through our local model layer, the whole thing fell apart. That single failure forced a question we'd been deferring: should we build our own browser fleet?

We decided yes. But not for the reasons you'd expect. The real insight came from benchmarking.

The architecture decision: browser local, AI calls hosted

We settled on the same pattern the managed services use internally: Chromium runs locally on our own infrastructure with a CDP (Chrome DevTools Protocol) debugging port exposed. Our existing browser-automation MCP server connects to it remotely over CDP. This reuses proven code instead of duplicating server logic.

Phase 1 spins up a single slot. Phase 2 scales to three concurrent slots, each bound to its own persistent profile. Our hardware has 128GB of unified memory, which gives headroom for 3-4x concurrent inference with bandwidth-shared degradation.
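As a rough illustration of the slot layout, the sketch below builds a launch command per slot. `--remote-debugging-port`, `--user-data-dir`, and `--no-first-run` are standard Chromium switches, and `/json/version` is the standard CDP discovery endpoint. The base port, the `slot-N` directory naming, the binary name, and both function names are assumptions made for this example, not our actual configuration.

```python
from pathlib import Path

# Assumed starting port; 9222 is only a common convention for CDP.
BASE_PORT = 9222


def chromium_argv(slot: int, profile_root: Path, binary: str = "chromium") -> list[str]:
    """Build the launch command for one slot with its own persistent profile."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    profile_dir = profile_root / f"slot-{slot}"  # hypothetical naming scheme
    return [
        binary,
        f"--remote-debugging-port={BASE_PORT + slot}",  # one CDP port per slot
        f"--user-data-dir={profile_dir}",               # one profile per slot
        "--no-first-run",
    ]


def cdp_version_url(slot: int, host: str = "127.0.0.1") -> str:
    """HTTP endpoint a client can query to discover the slot's WebSocket URL."""
    return f"http://{host}:{BASE_PORT + slot}/json/version"
```

Giving each slot its own port and profile directory means cookies and sessions never cross slots, which is the property the persistent-profile binding is meant to provide.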

We're not running local models for the agent reasoning. That path (a local serving stack) has near-zero public success reports with our automation framework, and a hardware mismatch made it a separate spike we deferred.

What the benchmarking revealed about model selection

Here's where we formed our opinion: the model you assume is the default for browser agents probably isn't. We'd assumed one default. We were wrong.

The managed platforms don't all default to the same model, and the documented workhorse across public browser-agent implementations is a different, cheaper model with strict JSON schema support and the most real-world mileage. So we stopped assuming and benchmarked.

We tested three providers. The harness runs 15 fixtures (10 happy paths plus 5 failure modes: timeouts, cookie overlays, anti-bot challenges, JSON schema edge cases, empty results), 20 runs each. Winner selection ranks each provider on three dimensions:

Success rate with a Wilson 95% confidence interval
Cost broken down by cached vs. uncached tokens
Latency percentiles (p50/p95/p99)
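For the first dimension, the Wilson score interval can be computed directly from the success and trial counts. This is a minimal sketch using the standard formula; the function name is ours for this example, and `z = 1.959964` is the usual two-sided 95% normal quantile.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (95% by default)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With 15 fixtures at 20 runs each, a provider contributes 300 trials, so a call looks like `wilson_interval(successes, 300)`. Unlike the naive normal approximation, the interval stays inside [0, 1] and does not collapse to zero width when a provider passes every run.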

A model has to rank top-2 in all three dimensions to make the cut. We pinned model snapshot IDs and fixture content hashes so we can replay the benchmark at 30/60/90 days and detect drift. Total spend cap per provider: $5.

The model that wins on success rate might lose on cost; the one that wins on cost might have tail latencies that spike. You need all three.
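The top-2-in-every-dimension rule reduces to a set intersection. The sketch below is one way to express it. The `ProviderResult` fields, the choice of a single number per dimension (point success rate, total cost, p95 latency), and the tie-breaking by sort order are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProviderResult:
    # Hypothetical summary record; one row per benchmarked provider.
    name: str
    success_rate: float   # higher is better
    cost_usd: float       # lower is better
    p95_latency_s: float  # lower is better


def top_two(results: list[ProviderResult],
            key: Callable[[ProviderResult], float],
            reverse: bool = False) -> set[str]:
    """Names of the two best providers on one dimension."""
    ranked = sorted(results, key=key, reverse=reverse)
    return {r.name for r in ranked[:2]}


def finalists(results: list[ProviderResult]) -> list[str]:
    """Providers that rank top-2 on success, cost, and latency simultaneously."""
    keep = (
        top_two(results, key=lambda r: r.success_rate, reverse=True)
        & top_two(results, key=lambda r: r.cost_usd)
        & top_two(results, key=lambda r: r.p95_latency_s)
    )
    return [r.name for r in results if r.name in keep]
```

With three providers the intersection can hold zero, one, or two names, so the rule can legitimately return no winner. That outcome is a signal to revisit the fixtures or the thresholds rather than to pick the least-bad option.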

Multi-agent execution: why we didn't use subprocess-per-task

This is the architectural decision worth talking about directly.

When we moved to three concurrent slots, the obvious path was to spawn a subprocess per task, the way a stdio MCP server works. But that costs 5-20 seconds of startup tax per task. At scale, that adds up. More importantly, it wastes the hardware we just bought.

Instead, we built a long-running runner daemon per worker that exposes JSON-RPC over a Unix socket. The daemon stays alive and keeps a persistent connection to Chromium over CDP. A lockfile mediates concurrent access when both the MCP server and the runner daemon attach to the same Chromium instance.
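To make the daemon shape concrete, here is a minimal sketch of JSON-RPC 2.0 dispatch served over a Unix socket with `asyncio.start_unix_server`. The newline-delimited framing, the `ping` method, and the `rpc_method` registry are assumptions made for this example. The real runner also holds the CDP connection and the lockfile, which are omitted here.

```python
import asyncio
import json

METHODS = {}  # hypothetical registry: method name -> callable


def rpc_method(fn):
    """Register a function as a JSON-RPC method under its own name."""
    METHODS[fn.__name__] = fn
    return fn


@rpc_method
def ping():
    return "pong"


def _error(req_id, code, message) -> bytes:
    body = {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}
    return json.dumps(body).encode() + b"\n"


def handle_request(raw: bytes) -> bytes:
    """Turn one request line into one response line (JSON-RPC 2.0 error codes)."""
    try:
        req = json.loads(raw)
    except ValueError:  # covers malformed JSON and invalid UTF-8
        return _error(None, -32700, "Parse error")
    if not isinstance(req, dict) or not isinstance(req.get("method"), str):
        return _error(None, -32600, "Invalid Request")
    req_id = req.get("id")
    fn = METHODS.get(req["method"])
    if fn is None:
        return _error(req_id, -32601, "Method not found")
    params = req.get("params", {})
    try:
        result = fn(**params) if isinstance(params, dict) else fn(*params)
    except Exception as exc:  # report handler failures instead of killing the daemon
        return _error(req_id, -32000, str(exc))
    return json.dumps({"jsonrpc": "2.0", "id": req_id, "result": result}).encode() + b"\n"


async def serve(socket_path: str) -> None:
    """Long-running loop: one newline-delimited request per line, per client."""
    async def on_client(reader, writer):
        while line := await reader.readline():
            writer.write(handle_request(line))
            await writer.drain()
        writer.close()

    server = await asyncio.start_unix_server(on_client, path=socket_path)
    async with server:
        await server.serve_forever()
```

Because the process stays up between tasks, the startup cost is paid once per worker rather than once per task, which is the whole point of avoiding subprocess-per-task.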

Why not refactor the existing MCP server to handle multi-task concurrency? Because that server is per-agent-instance by design. Changing it would be unjustified scope creep. The runner daemon is new, purposeful, and can outlive any single task.

Task flow looks like this:

A task backlog with custom fields: task type, target, schema ID, audit template, max seconds, dry-run required, priority
A poller writes work items to a local SQLite work queue (WAL mode)
The worker daemon atomically claims a task via BEGIN IMMEDIATE; UPDATE ... RETURNING with an idempotency key and an attempt UUID
JSON-RPC dispatch to the runner daemon
Result written back to SQLite, a comment posted to the task, a team-chat message sent (with any auth tokens redacted)
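The claim step in the flow above can be sketched with the standard-library `sqlite3` module. The table layout, column names, and status values are assumptions made for this example, and `UPDATE ... RETURNING` needs SQLite 3.35 or newer. The connection uses `isolation_level=None` so the explicit `BEGIN IMMEDIATE` is not wrapped in an implicit transaction.

```python
import sqlite3
import uuid

# Hypothetical queue table; the real schema carries more fields.
SCHEMA = """
CREATE TABLE IF NOT EXISTS work_queue (
    id INTEGER PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    task_type TEXT NOT NULL,
    priority INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'pending',
    attempt_id TEXT,
    worker TEXT
)
"""


def open_queue(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path, isolation_level=None)  # manual transactions
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(SCHEMA)
    return conn


def enqueue(conn: sqlite3.Connection, idempotency_key: str, task_type: str,
            priority: int = 0) -> bool:
    """Insert a work item; returns False if the key was already queued."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO work_queue (idempotency_key, task_type, priority) "
        "VALUES (?, ?, ?)",
        (idempotency_key, task_type, priority),
    )
    return cur.rowcount == 1


def claim_next(conn: sqlite3.Connection, worker: str):
    """Atomically claim the highest-priority pending item, or return None."""
    attempt_id = str(uuid.uuid4())
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before selecting
    try:
        rows = conn.execute(
            """
            UPDATE work_queue
               SET status = 'claimed', attempt_id = ?, worker = ?
             WHERE id = (SELECT id FROM work_queue
                          WHERE status = 'pending'
                          ORDER BY priority DESC, id
                          LIMIT 1)
            RETURNING id, idempotency_key, task_type, attempt_id
            """,
            (attempt_id, worker),
        ).fetchall()  # drain the statement before committing
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return rows[0] if rows else None
```

`BEGIN IMMEDIATE` makes a second worker wait for the write lock instead of reading the same pending row, and the unique idempotency key keeps the poller from queueing the same task twice.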

A registry-backed schema system in our codebase provides JSON Schema files for extract validation, multi-step audit templates with variable interpolation, and canary URLs for auth-check probing.

Auth-check runs per slot on a 5-minute interval, using a debounced three-state machine (green/yellow/red). We alert only on transitions into red; a recovery from yellow back to green stays silent. That keeps signal-to-noise high.
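A debounced three-state monitor can be quite small. In this sketch, the `red_after` threshold of three consecutive failed probes and the rule that a single successful probe returns the slot to green are assumptions. The post does not specify the debounce parameters, and the class and method names are hypothetical.

```python
from enum import Enum


class AuthState(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


class AuthMonitor:
    """Per-slot auth state with debouncing: RED needs consecutive failures."""

    def __init__(self, red_after: int = 3):  # assumed debounce threshold
        if red_after < 1:
            raise ValueError("red_after must be at least 1")
        self.red_after = red_after
        self.state = AuthState.GREEN
        self.failures = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe; return True only when this probe moved us into RED."""
        previous = self.state
        if probe_ok:
            self.failures = 0
            self.state = AuthState.GREEN
        else:
            self.failures += 1
            self.state = (
                AuthState.RED if self.failures >= self.red_after else AuthState.YELLOW
            )
        return self.state is AuthState.RED and previous is not AuthState.RED
```

One instance per slot is enough. The caller sends an alert only when `observe` returns True, so a single flaky probe that recovers on the next interval never pages anyone.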

What ships now, what waits

Phase 2 ships two task types: extract (read-only data pulls) and audit (multi-page workflows). We've built the safe-write guard and unit-tested it, but act (write actions) and script (versioned procedures) are Phase 3. We're running synthetic trials against sandbox records to prove the guard works, but nothing goes to production yet.

Multi-target tasks that need a planner above the worker daemons also wait for Phase 3.

How we measure success

Phase 1 runs 14 days side-by-side with the managed service:

Success rate ≥ 0.95
p95 latency ≤ 1.5x the managed baseline
Re-login incidents ≤ 1

Phase 2 runs 14 days in production (read-only scope only):

Per-task success ≥ 0.93
p95 latency ≤ 45 seconds (roughly 2x single-slot baseline)
Escalations ≤ 3/day (an escalation is an auth red state, a system error, or a failure not acknowledged within 4 hours)
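The Phase 2 thresholds above can be checked mechanically at the end of the trial. This sketch uses the nearest-rank method for p95 and reads "≤ 3/day" as "no single day exceeds 3". Both choices are assumptions, as are the function names and input shapes.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile (integer arithmetic avoids float rounding)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def phase2_gate(outcomes: list[bool],
                latencies_s: list[float],
                escalations_per_day: list[int]) -> dict[str, bool]:
    """Evaluate the three Phase 2 criteria; every value must be True to pass."""
    if not outcomes or not escalations_per_day:
        raise ValueError("need at least one task outcome and one day of data")
    return {
        "success": sum(outcomes) / len(outcomes) >= 0.93,
        "latency": p95(latencies_s) <= 45.0,
        "escalations": max(escalations_per_day) <= 3,
    }
```

Returning one flag per criterion, rather than a single boolean, makes it obvious which threshold failed when the trial does not pass.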

We also run weekly manual checks for platform-side security signals (MFA challenges, security emails, rate-limit banners). Any such signal halts the trial so we can re-evaluate the session strategy. The read-only constraint keeps the blast radius small.

The daily operator loop: a morning dashboard check for slot status and queue depth, an auto-posted daily summary with a ping-budget breakdown, backlog grooming, per-task notifications, escalation-only pings, and a short end-of-day review. Accepted and rejected tasks feed the Phase 3 improvement loop.

The one thing we can't design for

We graded this plan 9/10 after an adversarial review. The one point we can't engineer is the target platform's reaction to three concurrent sessions over 14 days. That's observable, not designable in advance. We'll measure it and adapt.
