
Why we deployed 2 concurrent slots instead of 3

Two active inference containers and one on standby

We built a 3-container inference fleet and validated it under load. The results forced a choice: deploy all three slots and accept degraded performance, or stay conservative, keep the infrastructure, and measure real production workloads before committing.

We chose the second path. Here's why.

The Concurrency Test

We ran a systematic load test on both 2-slot and 3-slot deployments using a synthetic worst-case probe. The results were unambiguous.

The 2-slot run passed its success and latency gates. Both slots hit 10/10 success rates (we required ≥95%), and p95 latency stayed at 2207-2223ms. That's 1.44× baseline, well under our 2702ms gate. Throughput measured ~1.35× single-slot against a ≥1.4× target: a slight miss that we judged borderline but acceptable.

The 3-slot run hit a wall. Perfect 30/30 success across all three slots, but p95 latency landed at 3207-3222ms, about 4% over our 3088ms threshold. Throughput was ~1.40×, below the ≥1.7× target. Marginal failures, but failures nonetheless.

  • 2-slot validation. 100% success, 1.44× latency tax, pass on success and latency, throughput just under target (~1.35× vs ≥1.4×) and accepted as borderline.
  • 3-slot validation. 100% success, 2.08× latency tax, 4% latency overage, 18% throughput shortfall.
  • The decision. Deploy 2 active slots, keep the third container running for standby and future re-evaluation.
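The pass/fail logic above can be sketched as a small gate check. This is a minimal illustration, not our actual test harness: the `RunResult` and `evaluate` names are hypothetical, and only the thresholds and measurements are taken from the numbers quoted above (using the worst per-slot p95 for each run).

```python
from dataclasses import dataclass

MIN_SUCCESS = 0.95  # required success rate from the text

# Gates quoted in the text, keyed by number of concurrent slots.
GATES = {
    2: {"p95_ms": 2702, "throughput_x": 1.4},
    3: {"p95_ms": 3088, "throughput_x": 1.7},
}

@dataclass(frozen=True)
class RunResult:  # hypothetical container for one load-test run
    slots: int
    success_rate: float   # fraction of successful attempts
    worst_p95_ms: float   # worst per-slot p95 latency
    throughput_x: float   # throughput relative to single-slot

def evaluate(run: RunResult) -> dict:
    """Compare one run against the gates for its slot count."""
    gate = GATES[run.slots]
    return {
        "success": run.success_rate >= MIN_SUCCESS,
        "latency": run.worst_p95_ms <= gate["p95_ms"],
        "throughput": run.throughput_x >= gate["throughput_x"],
        # Positive means over the latency gate, negative means headroom.
        "latency_overage_pct": 100 * (run.worst_p95_ms / gate["p95_ms"] - 1),
    }

two_slot = evaluate(RunResult(slots=2, success_rate=1.0, worst_p95_ms=2223, throughput_x=1.35))
three_slot = evaluate(RunResult(slots=3, success_rate=1.0, worst_p95_ms=3222, throughput_x=1.40))

for name, result in (("2-slot", two_slot), ("3-slot", three_slot)):
    print(name, result)
```

A strict check like this flags the 2-slot throughput as below target as well; treating that miss as acceptable was a judgment call rather than something the gates themselves allow.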

Why the third slot struggled

The pattern is predictable. We measured ~700ms of latency added per concurrent slot. That tracks with bandwidth, not compute.

Our inference hardware uses unified memory: 273GB/s of LPDDR5X bandwidth shared across N concurrent inference processes. At 3-concurrent, an even split leaves each slot roughly 91GB/s of effective bandwidth. At that point, inference becomes prefill-bandwidth-bound instead of compute-bound. The GPU is fast enough, but memory can't feed it data fast enough.

The good news: our local inference runtime does achieve real parallelism. If it were queueing requests serially, 2-slot would degrade 2.0× and 3-slot would degrade 3.0×. Our actual 1.44× and 2.08× degradation factors prove concurrent inference is happening. The bandwidth ceiling is the constraint, not slot count.
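The arithmetic behind both points is short enough to write down. The sketch below assumes bandwidth splits evenly across slots, which is a simplification, and the function names are illustrative only.

```python
TOTAL_BANDWIDTH_GBPS = 273.0  # shared LPDDR5X bandwidth quoted above

# p95 degradation factors vs single-slot, as reported by the probe.
OBSERVED_FACTOR = {2: 1.44, 3: 2.08}

def per_slot_bandwidth(slots: int) -> float:
    """Bandwidth available to each slot under an assumed even split."""
    return TOTAL_BANDWIDTH_GBPS / slots

def serial_factor(slots: int) -> float:
    """Degradation we would expect if requests were fully serialized."""
    return float(slots)

for slots, observed in OBSERVED_FACTOR.items():
    print(
        f"{slots} slots: {per_slot_bandwidth(slots):.0f} GB/s per slot, "
        f"observed {observed:.2f}x vs serial {serial_factor(slots):.1f}x"
    )
```

Both observed factors sit below the serial line, which is what we would expect from real but bandwidth-limited parallelism.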

The production workload picture

Here's where the synthetic probe diverges from reality.

Our probe is pathological for concurrency: inference is only ~33% of the total attempt time (roughly 0.5s of inference against 1.0s of browser automation and network wait). When inference latency doubles due to queueing, wallclock time only increases ~33%. The browser-wait portion doesn't degrade with inference concurrency because it runs in parallel.

Our Phase 1 production workloads show different numbers. End-to-end p95 was ~22s, with ~14s spent in inference (64% inference fraction). Browser wait and network overhead totaled ~8s.

Apply the measured 1.48× inference tax from our 2-slot testing to that 14s: roughly 20.7s of inference time. The ~8s browser-wait portion runs concurrently, untouched. Total projected p95: ~28.7s. That's about 1.3× end-to-end degradation despite a 1.48× inference tax.
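The projection is a simple two-part model: scale only the inference portion by the tax and leave the rest alone. A minimal sketch follows, with `project_p95` as a hypothetical helper name. It assumes the non-inference portion is completely unaffected by concurrency, which is the same assumption the paragraph above makes.

```python
def project_p95(inference_s: float, other_s: float, inference_tax: float) -> tuple[float, float]:
    """Return (projected end-to-end seconds, degradation vs baseline).

    Only the inference portion is scaled by the tax. Browser wait and
    network overhead are assumed not to change under concurrency.
    """
    baseline = inference_s + other_s
    projected = inference_s * inference_tax + other_s
    return projected, projected / baseline

# Production figures from the text: ~14s inference, ~8s browser/network wait.
prod_total, prod_ratio = project_p95(inference_s=14.0, other_s=8.0, inference_tax=1.48)

# Synthetic probe from the text: ~0.5s inference, ~1.0s wait, inference doubling.
probe_total, probe_ratio = project_p95(inference_s=0.5, other_s=1.0, inference_tax=2.0)

print(f"production: {prod_total:.1f}s ({prod_ratio:.2f}x)")
print(f"probe:      {probe_total:.1f}s ({probe_ratio:.2f}x)")
```

The same function covers both cases, which makes it easy to see how much the end-to-end ratio depends on the inference fraction of a workload.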

We required ≤45s for Phase 2. We're well inside that envelope.

Three containers, two active dispatch

We kept all three containers built and running. Containers are low overhead. And keeping the third one available bought us operational flexibility without rebuild cost.

Our dispatch layer enforces a maximum of 2 concurrent inference requests. That keeps us in the validated, clean-pass region.
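One common way to enforce a cap like this is a counting semaphore in front of the inference call. The sketch below is not our dispatch layer: `Dispatcher`, `submit`, and `fake_inference` are hypothetical names, and the sleep stands in for a real inference request.

```python
import asyncio

MAX_ACTIVE = 2  # validated concurrency limit from the load test

class Dispatcher:
    """Hypothetical dispatcher that caps in-flight inference requests."""

    def __init__(self, max_active: int = MAX_ACTIVE) -> None:
        self._slots = asyncio.Semaphore(max_active)
        self.in_flight = 0
        self.peak = 0  # highest concurrency actually observed

    async def submit(self, job):
        # Callers beyond the cap wait here until a slot frees up.
        async with self._slots:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await job()
            finally:
                self.in_flight -= 1

async def fake_inference(request_id: int) -> int:
    await asyncio.sleep(0.01)  # placeholder for a real inference call
    return request_id

async def main():
    dispatcher = Dispatcher()
    jobs = [dispatcher.submit(lambda i=i: fake_inference(i)) for i in range(6)]
    results = await asyncio.gather(*jobs)
    return results, dispatcher.peak

results, peak = asyncio.run(main())
print(results, "peak concurrency:", peak)
```

Six requests are submitted at once, but no more than two ever run at the same time; the rest queue at the semaphore.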

The third slot serves three purposes in production:

  • Warm standby. If either active container fails health checks, the third is promoted to active without container spin-up delay.
  • Dedicated audit runs. Long-running multi-step audit templates can dispatch to the third slot without competing against primary agent tasks. We accept higher latency in exchange for no queue wait.
  • Measurement checkpoint. After 30 days of production data, we'll run a follow-up test on real workflows to see if actual 3-concurrent scaling beats what the worst-case probe predicted.

We stay conservative in deployment while leaving the architectural door open for future optimization.

When we'll flip to 3-active

We documented concrete criteria for promoting the third slot to primary dispatch. If follow-up measurements on actual workflows show:

  • ≤1.5× p95 degradation vs single-slot baseline, AND
  • Throughput speedup ≥2.0×

...then the third slot activates. This is a data-driven trigger, not guesswork. It prevents premature commitment to 3-concurrent while keeping the door open.
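Written as code, the trigger is a two-condition check. The function name below is illustrative; the thresholds are the ones listed above.

```python
MAX_P95_DEGRADATION = 1.5     # vs single-slot baseline
MIN_THROUGHPUT_SPEEDUP = 2.0  # vs single-slot baseline

def should_activate_third_slot(p95_degradation: float, throughput_speedup: float) -> bool:
    """Both criteria must hold before the third slot joins primary dispatch."""
    return (
        p95_degradation <= MAX_P95_DEGRADATION
        and throughput_speedup >= MIN_THROUGHPUT_SPEEDUP
    )

# The synthetic probe's 3-slot numbers (2.08x latency, ~1.40x throughput) fail both.
probe_decision = should_activate_third_slot(p95_degradation=2.08, throughput_speedup=1.40)
print("activate on probe numbers:", probe_decision)
```

Real-workflow measurements would be plugged in the same way once the 30-day data is available.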

What we deferred

Our original Phase 2 plan proposed setting an explicit parallelism config to enable higher concurrency. The probe results made that unnecessary.

The runtime's default auto-tuning already provides real parallelism. If it were queueing serially, we'd see 2.0× and 3.0× degradation. We don't. We see 1.44× and 2.08×. The bandwidth ceiling, not slot limits, is the bottleneck. Reconfiguring for explicit parallelism is deferred unless future bottleneck analysis proves otherwise.
