We just finished validating a three-slot concurrent architecture: three browser automation agents running in parallel on a single shared memory pool. The results landed us in an awkward middle ground: every probe succeeded, but p95 latency came in 119 to 134 milliseconds over the ceiling we set for ourselves.
Here's what happened.
The Test and the Numbers
We ran 30 synthetic probe attempts split evenly across three concurrent slots, each hitting the same simple extraction workflow. All 30 succeeded. No timeouts, no failures, no resource exhaustion despite three simultaneous browser automation processes competing for 128GB of unified memory.
But here's where it gets tight. We'd set a latency gate at 3088ms for p95 performance (exactly 2.0× our single-agent baseline of 1544ms). The three slots came in at:
- conc-1: 3212ms (+124ms over gate, 4.0% overage)
- conc-2: 3207ms (+119ms over gate, 3.9% overage)
- conc-3: 3222ms (+134ms over gate, 4.3% overage)
All three exceeded the threshold. Marginally. Symmetrically.
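The gate arithmetic is simple enough to reproduce. The sketch below recomputes each slot's overage against the 2.0× gate using the figures quoted above; the function and variable names are illustrative, not taken from our test harness.

```python
# Figures come from the write-up; all names here are illustrative.
BASELINE_P95_MS = 1544
GATE_MULTIPLIER = 2.0
P95_BY_SLOT_MS = {"conc-1": 3212, "conc-2": 3207, "conc-3": 3222}


def gate_report(p95_by_slot, baseline_ms, multiplier):
    """Return {slot: (overage_ms, overage_pct, passed)} relative to the gate."""
    gate_ms = baseline_ms * multiplier
    report = {}
    for slot, p95_ms in p95_by_slot.items():
        overage_ms = p95_ms - gate_ms
        # Overage is expressed as a percentage of the gate, not of the observation.
        overage_pct = round(100.0 * overage_ms / gate_ms, 1)
        report[slot] = (overage_ms, overage_pct, p95_ms <= gate_ms)
    return report


for slot, (ms, pct, passed) in gate_report(
    P95_BY_SLOT_MS, BASELINE_P95_MS, GATE_MULTIPLIER
).items():
    print(f"{slot}: {ms:+.0f}ms ({pct:+.1f}%) {'PASS' if passed else 'FAIL'}")
```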
The remarkable thing wasn't that we failed the gate; it was how evenly the slowdown was distributed across all three slots.
Why the Symmetry Matters
Median latencies clustered in a 15ms spread (3146-3161ms). The p95 values scattered across another 15ms band (3207-3222ms). That tightness is a green flag for the architecture itself. If one slot starved while others performed well, that would signal a scheduling pathology, evidence that the concurrency model itself was broken. Instead, Ollama's parallelism distributed compute resources fairly across all three slots.
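One way to turn "no slot starved" into a check rather than an eyeball judgment is to compare each slot's p95 against the fastest slot's. The sketch below does that with the p95 values reported above. The 5% tolerance is a hypothetical cutoff chosen for illustration, not a threshold from our plan.

```python
P95_BY_SLOT_MS = {"conc-1": 3212, "conc-2": 3207, "conc-3": 3222}


def spread_ms(latency_by_slot):
    """Gap between the slowest and fastest slot."""
    values = list(latency_by_slot.values())
    return max(values) - min(values)


def starved_slots(latency_by_slot, tolerance_pct=5.0):
    """Slots whose latency exceeds the fastest slot's by more than tolerance_pct.

    tolerance_pct is an assumed cutoff for illustration only.
    """
    fastest = min(latency_by_slot.values())
    return [
        slot
        for slot, latency in latency_by_slot.items()
        if 100.0 * (latency - fastest) / fastest > tolerance_pct
    ]


print(spread_ms(P95_BY_SLOT_MS))      # gap across the three slots, in ms
print(starved_slots(P95_BY_SLOT_MS))  # empty list means no slot lagged the others
```

A slot that was being starved by the scheduler would show up in the second list well before it showed up as a timeout.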
The degradation followed a clean, near-linear pattern:
- Single slot: 1544ms (baseline)
- Two slots: ~2215ms (1.43× degradation)
- Three slots: ~3214ms (2.08× degradation)
If the system were hitting a resource cliff, we'd expect super-linear scaling, something like 2.5× or worse for three slots. Instead, the second slot added roughly 670ms and the third roughly 1,000ms. The increments are uneven, but the total stayed close to the 2.0× gate rather than the 2.5× or worse a cliff would produce, suggesting predictable behavior within the memory bandwidth constraints we're working with.
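To make the scaling pattern easy to inspect, the sketch below derives the degradation ratio and the marginal latency added by each slot from the three figures listed above. The names are illustrative, and the two- and three-slot values are the approximate figures quoted in this post.

```python
# Approximate latency per slot count, as quoted above (ms).
LATENCY_BY_SLOTS_MS = {1: 1544, 2: 2215, 3: 3214}


def scaling_table(latency_by_slots):
    """Return rows of (slots, latency_ms, ratio_vs_single, marginal_ms)."""
    baseline = latency_by_slots[1]
    rows = []
    previous = None
    for slots in sorted(latency_by_slots):
        latency = latency_by_slots[slots]
        ratio = round(latency / baseline, 2)
        # Marginal cost: latency added by going from (slots - 1) to slots.
        marginal = None if previous is None else latency - previous
        rows.append((slots, latency, ratio, marginal))
        previous = latency
    return rows


for slots, latency, ratio, marginal in scaling_table(LATENCY_BY_SLOTS_MS):
    added = "n/a" if marginal is None else f"+{marginal}ms"
    print(f"{slots} slot(s): {latency}ms, {ratio}x baseline, {added}")
```

Looking at the marginal column alongside the ratio is a useful habit: the ratio shows where we sit relative to the gate, while the marginal cost shows whether each extra slot is getting more expensive.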
The Queue Dynamics Tell a Story
When we broke down attempt-level performance, the first attempts in each slot revealed queue entry timing:
- conc-1's first attempt: 1439ms (near baseline, entered inference first)
- conc-2's first attempt: 2570ms (queued behind conc-1)
- conc-3's first attempt: 3160ms (full concurrent load by the time it started)
After those initial attempts, all three slots locked into a stable 3100-3200ms steady state. The system reached equilibrium. No cascading degradation. No surprises on subsequent runs.
The Decision Point
We built the Phase 2 plan with a hard rule: "Proceed only if all thresholds pass over the 14-day trial." By strict interpretation, a 4% overage is a 4% overage, and we don't advance. But the plan also flags "observed-not-engineered" risks, a reminder that synthetic probes on example.com don't always predict real production behavior.
Three paths forward:
- Accept the overage. Real HubSpot workflows spend 60-80% of their time waiting (network, page load), not running inference. If inference latency is only part of the overall task duration, the 4% slowdown might disappear into measurement noise in production.
- Downscope to two slots. The 2-slot validation passed cleanly with ~1.43× degradation and substantial margin under threshold. Two agents still beat single-agent throughput (about 1.4× on this probe, since two tasks finish in 1.43× the time) and eliminate the threshold risk.
- Re-evaluate the threshold. The 2.0× multiplier was somewhat arbitrary. A 2.08× observed degradation might be acceptable depending on what the actual end-to-end user workflows require.
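The first option rests on a dilution argument that is worth putting numbers on. The sketch below assumes that everything outside the 60-80% waiting time is inference, and that waiting time is unaffected by concurrency; both are simplifications for illustration, and the function name is hypothetical.

```python
def end_to_end_overage_pct(inference_overage_pct, inference_share):
    """Overage as a percentage of total task time.

    Assumes only the inference portion of the task slows down and the
    waiting portion (network, page load) is unchanged by concurrency.
    """
    if not 0.0 <= inference_share <= 1.0:
        raise ValueError("inference_share must be between 0 and 1")
    return inference_overage_pct * inference_share


# Waiting 60-80% of the time leaves a 40% to 20% inference share.
for waiting_share in (0.6, 0.8):
    inference_share = 1.0 - waiting_share
    diluted = end_to_end_overage_pct(4.0, inference_share)
    print(f"waiting {waiting_share:.0%}: ~{diluted:.1f}% end-to-end overage")
```

Under those assumptions, a 4% inference overage shrinks to somewhere between roughly 0.8% and 1.6% of total task time, which is the range where network variability could plausibly hide it. Whether the assumptions hold is exactly what real workflows need to tell us.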
What We're Doing Next
We're running a hybrid approach: deploying two production slots immediately (low-risk, clean validation) while continuing to monitor three-slot performance with real HubSpot workflows instead of synthetic probes. Real 30k-token portal pages with network variability will show us whether this 4% latency difference matters at all.
The test did what it was supposed to do: it surfaced a boundary case and forced an architectural decision. Sometimes passing cleanly is easier than being this close and having to decide whether close is good enough.