We built an 8-state task queue for autonomous agents

We just approved the full architecture for an autonomous task-orchestration platform: a centralized task queue for running long workflows through autonomous agents with structured human review cycles. It's the first system we've designed explicitly to handle agent autonomy at scale: not just single-shot completions, but multi-step work that can loop back to humans without losing context.

The five concepts that shape everything

We started by asking: what objects do we actually need to track? A task isn't just a task. It arrives as part of a Plan, which belongs to a Project; it gets picked up by a Worker and reviewed by a Reviewer. Those five first-class concepts (Task, Plan, Project, Worker, Reviewer) became the spine of the whole system.

Then we mapped an 8-state lifecycle that a task moves through from intake to done:

intake → ready → working → awaiting_internal_review → awaiting_client_review → done
Side states: blocked_external, failed_needs_triage, cancelled
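
A minimal sketch of that lifecycle as a guarded state machine. The listing names nine identifiers in total (six on the main path, three side states), and the sketch enumerates them exactly as written. Only the main path comes from the post; the edges into and out of the side states, and the review-to-working loops, are assumptions made for illustration.

```python
from enum import Enum


class TaskState(str, Enum):
    # Main path, in the order listed above.
    INTAKE = "intake"
    READY = "ready"
    WORKING = "working"
    AWAITING_INTERNAL_REVIEW = "awaiting_internal_review"
    AWAITING_CLIENT_REVIEW = "awaiting_client_review"
    DONE = "done"
    # Side states.
    BLOCKED_EXTERNAL = "blocked_external"
    FAILED_NEEDS_TRIAGE = "failed_needs_triage"
    CANCELLED = "cancelled"


S = TaskState

# ASSUMPTION: every edge that is not on the main path is a guess.
ALLOWED = {
    S.INTAKE: {S.READY, S.CANCELLED},
    S.READY: {S.WORKING, S.BLOCKED_EXTERNAL, S.CANCELLED},
    S.WORKING: {
        S.AWAITING_INTERNAL_REVIEW,
        S.BLOCKED_EXTERNAL,
        S.FAILED_NEEDS_TRIAGE,
        S.CANCELLED,
    },
    S.AWAITING_INTERNAL_REVIEW: {S.AWAITING_CLIENT_REVIEW, S.WORKING, S.CANCELLED},
    S.AWAITING_CLIENT_REVIEW: {S.DONE, S.WORKING, S.CANCELLED},
    S.BLOCKED_EXTERNAL: {S.READY, S.CANCELLED},
    S.FAILED_NEEDS_TRIAGE: {S.READY, S.CANCELLED},
    S.DONE: set(),       # terminal
    S.CANCELLED: set(),  # terminal
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return the new state, or raise if the edge is not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the allowed edges in one table means the scheduler, the workers, and the CRM sync can all reject an impossible move the same way.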

Each state transition triggers a corresponding stage update in a dedicated CRM tickets pipeline. We added a small set of custom properties to each ticket (a task ID, a task URL, and a last-event ID) so the CRM and our queue stay in sync without constant API thrashing. One stage per state. No ambiguity.
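
One way the last-event ID can cut down on API calls, sketched under assumptions: event IDs are taken to be increasing integers, and the stage IDs and property names below are placeholders, not the real CRM fields.

```python
from dataclasses import dataclass
from typing import Optional

STATES = [
    "intake", "ready", "working",
    "awaiting_internal_review", "awaiting_client_review", "done",
    "blocked_external", "failed_needs_triage", "cancelled",
]

# ASSUMPTION: placeholder stage IDs, one per state.
STAGE_FOR_STATE = {state: f"stage_{state}" for state in STATES}


@dataclass
class TicketSnapshot:
    """What we last wrote to the ticket's custom properties."""
    task_id: str
    task_url: str
    last_event_id: int


def build_ticket_update(
    ticket: TicketSnapshot, new_state: str, event_id: int
) -> Optional[dict]:
    """Return the properties to write, or None if the ticket is already current."""
    if event_id <= ticket.last_event_id:
        return None  # duplicate or stale event: skip the API call entirely
    return {
        "pipeline_stage": STAGE_FOR_STATE[new_state],
        "last_event_id": event_id,
    }
```

A retried or replayed event produces `None`, so the connector can drop it before it ever reaches the CRM.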

We needed to know exactly where every task was, what happened to it, and why it went there, without building a second database.

Lock contention almost killed us

Here's where the design got real. We were building the event log as part of the main agent-runs table. Workers stream JSON events as they execute ("started step X", "received response Y", "writing result Z"), and streaming means a lot of writes. When every event lands in a single row, on a table that might be tracking 1,000 rows at once, every write rewrites that row under a lock. One slow write blocks everyone queued behind it.

We almost lost an entire shift of task dispatch to that bottleneck before we split the events into their own append-only table. Now the agent-runs table stays relatively static (status column only), and the event log is a pure append stream. No more lock contention. No more queue backing up because one worker got slow.

That decision rippled into the schema: a dozen tables on a dedicated Postgres instance, with the event log in its own append-only table.
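
A sketch of the split, using Python's built-in sqlite3 as a stand-in so it runs anywhere. The production schema is on Postgres; the table and column names here are assumptions.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE agent_runs (
    id      INTEGER PRIMARY KEY,
    task_id TEXT NOT NULL,
    status  TEXT NOT NULL          -- the only column that gets updated
);
CREATE TABLE agent_run_events (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id  INTEGER NOT NULL REFERENCES agent_runs(id),
    payload TEXT NOT NULL          -- one JSON event per row, never updated
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def append_event(conn: sqlite3.Connection, run_id: int, event: dict) -> int:
    """Insert a new event row; existing rows are never touched."""
    cur = conn.execute(
        "INSERT INTO agent_run_events (run_id, payload) VALUES (?, ?)",
        (run_id, json.dumps(event)),
    )
    conn.commit()
    return cur.lastrowid


def set_status(conn: sqlite3.Connection, run_id: int, status: str) -> None:
    """The rare update: only when the run changes status."""
    conn.execute("UPDATE agent_runs SET status = ? WHERE id = ?", (status, run_id))
    conn.commit()
```

Streaming writes become inserts into the event table, and the runs table only sees an update when a status actually changes.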

Three tiers of compute, deployed right now

We can't afford to send every task to a top-tier frontier model. We also can't rely on a single vendor. So we stratified compute:

A subscription coding-agent SDK for workers and client-facing output scrubbing
A fast hosted model API as overflow when subscription capacity maxes out
A local open-weights model running on our own GPU for structured-rubric reviewers, routers, classifiers, and digest generators
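
The routing rule those three tiers imply can be sketched as a small function. The role names and the capacity flag are hypothetical; how capacity is actually detected is not covered here.

```python
from enum import Enum


class Tier(Enum):
    SUBSCRIPTION_SDK = "subscription_sdk"
    HOSTED_API = "hosted_api"
    LOCAL_MODEL = "local_model"


# ASSUMPTION: role labels are illustrative names for the jobs listed above.
LOCAL_ROLES = {"reviewer", "router", "classifier", "digest"}
SDK_ROLES = {"worker", "scrubber"}


def pick_tier(role: str, subscription_has_capacity: bool) -> Tier:
    """Choose a compute tier for a job based on its role and current capacity."""
    if role in LOCAL_ROLES:
        return Tier.LOCAL_MODEL
    if role in SDK_ROLES:
        # Overflow to the hosted API only when the subscription is maxed out.
        if subscription_has_capacity:
            return Tier.SUBSCRIPTION_SDK
        return Tier.HOSTED_API
    raise ValueError(f"unknown role: {role}")
```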

The AI reviewer runs in a bounded auto-revision loop: at most 2 cycles before we escalate to a human stamp. The loop is tight. The reviewer returns a verdict against the rubric: on a pass, the task moves to awaiting_client_review; on a fail, it goes back to the worker with feedback. If it still fails after 2 revisions, we park it for internal review and a human decides.
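
The loop can be sketched in a few lines. The `review` and `revise` callables are hypothetical stand-ins for the reviewer model and the worker, and the sketch assumes "parked for internal review" maps to the awaiting_internal_review state.

```python
from typing import Callable, Tuple

MAX_REVISIONS = 2


def review_loop(
    draft: str,
    review: Callable[[str], Tuple[bool, str]],  # returns (passed, feedback)
    revise: Callable[[str, str], str],          # returns a new draft
) -> Tuple[str, str, int]:
    """Return (next_state, final_draft, revisions_used)."""
    revisions = 0
    while True:
        passed, feedback = review(draft)
        if passed:
            return "awaiting_client_review", draft, revisions
        if revisions >= MAX_REVISIONS:
            # Still failing after the allowed revisions: a human decides.
            return "awaiting_internal_review", draft, revisions
        draft = revise(draft, feedback)
        revisions += 1
```

With a reviewer that never passes, the worker revises twice, the third verdict fails, and the task is parked for a human.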

Safety first: dry-run, token budgets, and guardrails

We built guardrails hard into the system because autonomous agents writing to systems at scale is... scary.

A dry-run flag. Workers emit WOULD_DO instead of actual writes. You can watch the entire workflow execute without touching a database.
A daily token budget cap. The scheduler halts on breach. We know exactly how much we're spending per day.
A per-task token cap. A single task that runs over budget gets flagged for triage and escalates to a human.
Explicit dossier guardrails. Worker system prompts treat user-note and reviewer-note content as data, never as instructions. The goal: a prompt injection hidden in a feedback comment gets read as text, not followed as a command.
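
A rough sketch of the first three guardrails. The class, the return labels, and the `do_write` callback are all hypothetical; the post only fixes the WOULD_DO event name and the two caps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TokenBudget:
    daily_cap: int
    per_task_cap: int
    spent_today: int = 0
    spent_by_task: Dict[str, int] = field(default_factory=dict)

    def record(self, task_id: str, tokens: int) -> str:
        """Record usage; return 'ok', 'triage_task', or 'halt_scheduler'."""
        self.spent_today += tokens
        self.spent_by_task[task_id] = self.spent_by_task.get(task_id, 0) + tokens
        if self.spent_today > self.daily_cap:
            return "halt_scheduler"  # daily breach stops all dispatch
        if self.spent_by_task[task_id] > self.per_task_cap:
            return "triage_task"     # only this task escalates to a human
        return "ok"


def perform_write(
    action: str,
    payload: dict,
    dry_run: bool,
    do_write: Callable[[str, dict], None],
) -> dict:
    """Emit a WOULD_DO event in dry-run mode; otherwise perform the write."""
    if dry_run:
        return {"type": "WOULD_DO", "action": action, "payload": payload}
    do_write(action, payload)
    return {"type": "DID", "action": action, "payload": payload}
```

Routing every side effect through one function like `perform_write` is what makes the dry-run flag trustworthy: there is a single place where a real write can happen.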

Day 1: internal CRM work, single worker

We intentionally constrained scope to something we control: activating our own CRM (Marketing Hub setup, Service Hub configuration, checklist completion). One worker type, single-worker concurrency, real observability before we expand to client work.

Workers are dispatched through an async subprocess that wraps the agent CLI in streaming-JSON mode. Each worker gets a dossier (chronological history: prior runs, reviewer feedback, manual notes) injected as markdown context. So when a worker picks up a task that already failed once, it can see exactly what happened and why.
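
A minimal sketch of the dispatch pattern with asyncio. The real agent CLI and its flags are not shown here; a tiny Python child process that prints one JSON object per line stands in for it, and the one-event-per-line framing is an assumption.

```python
import asyncio
import json
import sys
from typing import List

# Stand-in for the agent CLI: prints one JSON event per line, then exits.
FAKE_AGENT = (
    "import json\n"
    "for step in ('started', 'responded', 'wrote_result'):\n"
    "    print(json.dumps({'event': step}), flush=True)\n"
)


async def run_worker(argv: List[str]) -> List[dict]:
    """Launch the child process and parse its stdout as newline-delimited JSON."""
    proc = await asyncio.create_subprocess_exec(
        *argv, stdout=asyncio.subprocess.PIPE
    )
    assert proc.stdout is not None
    events = []
    async for raw in proc.stdout:  # yields one line at a time as it arrives
        line = raw.strip()
        if line:
            events.append(json.loads(line))
    returncode = await proc.wait()
    if returncode != 0:
        raise RuntimeError(f"worker exited with code {returncode}")
    return events


def demo() -> List[dict]:
    return asyncio.run(run_worker([sys.executable, "-c", FAKE_AGENT]))
```

In a real dispatcher each parsed event would be appended to the event log as it arrives rather than collected in a list.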

Five phases, two weeks

We're shipping this in five phases, after a Phase 0 for environment checks and the design doc:

Phase 0. Verify the host environment (service lingering, agent auth, GPU availability, port availability, network egress, browser-automation slot discovery). Write the design doc.
Phase 1. Database schema, scheduler tick (every 30s, an advisory lock prevents overlaps), worker dispatcher, local-reviewer integration.
Phase 2. CRM connector with outbound write queue and chat notifications.
Phase 3. Stamp surface (either a CRM UI extension card or a deep-link fallback).
Phase 4. Full checklist worker with context enrichment from our retrieval layer (client brief, SOPs, session knowledge).
Phase 5. Verification shakedown: seed tasks in dry-run mode, then live mode, then failure injection (reviewer rejection loop, local-model downtime, agent auth loss, prompt injection resistance, concurrent dispatch lock contention).
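
Phase 1's overlap guard can be sketched with Postgres session-level advisory locks. `pg_try_advisory_lock` and `pg_advisory_unlock` are real Postgres functions; the lock key, the psycopg-style `%s` placeholder, and the stub cursor (included only so the flow can be tried without a database) are assumptions.

```python
from typing import Callable

LOCK_KEY = 815001  # ASSUMPTION: arbitrary application-chosen key


def run_tick(cur, do_tick: Callable[[], None]) -> bool:
    """Run one scheduler tick unless another tick holds the lock.

    Returns True if the tick ran, False if it was skipped.
    """
    cur.execute("SELECT pg_try_advisory_lock(%s)", (LOCK_KEY,))
    if not cur.fetchone()[0]:
        return False  # previous tick still running: skip, do not queue
    try:
        do_tick()
        return True
    finally:
        cur.execute("SELECT pg_advisory_unlock(%s)", (LOCK_KEY,))


class StubCursor:
    """In-memory stand-in for a database cursor; `held` models another session's lock."""

    def __init__(self) -> None:
        self.held = False
        self._result = None

    def execute(self, sql: str, params=()) -> None:
        if "pg_try_advisory_lock" in sql:
            acquired = not self.held
            self.held = True
            self._result = (acquired,)
        elif "pg_advisory_unlock" in sql:
            self._result = (self.held,)
            self.held = False

    def fetchone(self):
        return self._result
```

Because the try variant returns immediately instead of waiting, a tick that arrives while the previous one is still running is skipped, and the next 30-second tick picks up the work.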

The architecture lives on a separate database instance and container network from our existing task platform. Our other task and sync pipelines don't change. Two systems, same team, no collision.

By the end of Phase 5, agents will pull from the queue, complete work end-to-end (including spawning sub-agents), and only queue items for human review when necessary. We'll have organizational visibility into agent activity and API usage across everyone on the team.
