# System Overview
TBD Agents separates concerns across four main components: the FastAPI API, Celery workers, Redis, and MongoDB.
## High-Level Architecture
```mermaid
graph TB
    subgraph Clients
        Dashboard([Dashboard])
        CLI([curl / CLI])
        Apps([Applications])
    end
    subgraph "API Layer"
        FastAPI[FastAPI API<br/>Port 8000]
    end
    subgraph "Message Broker"
        Redis[(Redis<br/>Broker + Pub/Sub)]
    end
    subgraph "Worker Pool"
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
    end
    subgraph External
        SDK[Copilot SDK<br/>JSON-RPC CLI]
        Models[Copilot Models API]
        MCP1[MCP Server: Jira]
        MCP2[MCP Server: Datadog]
        MCPN[MCP Server: ...]
    end
    subgraph Storage
        Mongo[(MongoDB)]
    end

    Dashboard & CLI & Apps -->|HTTP + Auth| FastAPI
    FastAPI -->|Enqueue Tasks| Redis
    FastAPI -->|Subscribe Events| Redis
    FastAPI -->|SSE Stream| Dashboard & CLI & Apps
    FastAPI -->|Read/Write| Mongo
    Redis -->|Deliver Tasks| W1 & W2 & W3
    W1 & W2 & W3 -->|Publish Events| Redis
    W1 & W2 & W3 -->|Persist State| Mongo
    W1 & W2 & W3 -->|SDK Session| SDK
    SDK --> Models
    SDK --> MCP1 & MCP2 & MCPN
```
## Components
### FastAPI API
The API layer handles authentication, CRUD for agents/skills/MCP servers/workflows, and SSE streaming. It does not run agent logic — that's dispatched to workers.
- Endpoints serve REST requests and validate GitHub PAT tokens
- SSE endpoint — `GET /api/workflows/{id}/stream` subscribes to a Redis pub/sub channel and streams events to the client
- Prompt dispatch — `POST /api/workflows/{id}/prompt` enqueues a Celery task and returns `201` immediately
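The dispatch path can be sketched in-process. This is a minimal model, not the service's code: the real queue is Redis via Celery's `run_agent_task.delay()`, and the auth check validates a GitHub PAT; `dispatch_prompt` and the response shape are illustrative.

```python
import queue

# Stand-in for the Celery/Redis task queue.
task_queue: "queue.Queue[dict]" = queue.Queue()

def dispatch_prompt(workflow_id: str, prompt: str, github_token: str) -> dict:
    """Model of POST /api/workflows/{id}/prompt: enqueue, then return at once."""
    if not github_token:
        return {"status_code": 401, "detail": "missing token"}
    # Enqueue the agent task; the API never runs agent logic itself.
    task_queue.put({"workflow_id": workflow_id,
                    "prompt": prompt,
                    "github_token": github_token})
    return {"status_code": 201, "detail": "task enqueued"}
```

The point of the sketch is the shape of the contract: the endpoint's only jobs are validation and enqueueing, so the HTTP response never waits on the agent loop.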
### Celery Workers
Workers execute the actual agent loop. Each worker:
- Receives a task from the Redis queue containing `(workflow_id, prompt, github_token)`
- Initialises its own MongoDB connection via Beanie/Motor
- Loads the Workflow, Agent, MCP servers, and Skills from the database
- Creates a Copilot SDK session with the agent's configuration
- Runs the SDK agentic loop — the SDK handles planning, tool calls, and response generation
- Publishes real-time events (logs, message deltas, usage stats, status changes) to Redis pub/sub
- Persists final state (messages, logs, usage, status) to MongoDB
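The steps above can be condensed into a toy sketch, with a plain dict standing in for MongoDB, a callback standing in for Redis pub/sub, and an injected function standing in for the Copilot SDK loop. Every name here is illustrative, not the worker's actual code.

```python
def run_agent_task(workflow_id, prompt, db, publish, run_sdk_loop):
    """Toy version of the worker steps: load, run the loop, publish, persist."""
    workflow = db[workflow_id]                        # load Workflow + Agent + MCPs + Skills
    publish({"type": "status", "data": {"status": "running"}})
    result = run_sdk_loop(workflow, prompt, publish)  # SDK drives planning + tool calls
    workflow["messages"].append({"role": "assistant", "content": result})
    workflow["status"] = "completed"                  # persist final state
    publish({"type": "status", "data": {"status": "completed"}})
```

Injecting `publish` and `run_sdk_loop` mirrors the real decoupling: the worker only orchestrates; the SDK owns the loop and Redis owns event delivery.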
Key Celery settings:
| Setting | Value | Why |
|---|---|---|
| `worker_prefetch_multiplier` | `1` | Agent tasks are long-running; don't hoard |
| `task_acks_late` | `True` | Re-queue tasks if a worker crashes |
| `task_reject_on_worker_lost` | `True` | Return tasks to the queue on shutdown |
### Redis
Redis serves two roles:
- Celery broker/backend — task queue for dispatching agent work and storing task results
- Event bus (pub/sub) — workers publish events to `workflow:events:{id}` channels; the FastAPI SSE endpoint subscribes and relays events to clients
This decoupling enables multi-process and multi-node scaling — any worker can publish events that any API instance can stream.
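A minimal sketch of the channel convention, assuming one channel per workflow. In production `publish` would be `redis_client.publish(...)`; here it is injected, so any broker (or a test double) can stand in — which is exactly the property that lets any worker reach any API instance.

```python
import json

def channel_for(workflow_id: str) -> str:
    # One pub/sub channel per workflow, per the convention above.
    return f"workflow:events:{workflow_id}"

def publish_event(publish, workflow_id: str, event: dict) -> None:
    # Serialize the event and hand it to whichever broker was injected.
    publish(channel_for(workflow_id), json.dumps(event))
```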
### MongoDB
Stores all persistent state: agents, MCP servers, skills, workflows (including messages, logs, and usage stats). Each worker initialises its own Motor/Beanie connection on startup.
## Request Flow
```mermaid
sequenceDiagram
    participant Client
    participant API as FastAPI API
    participant Redis
    participant Worker as Celery Worker
    participant SDK as Copilot SDK
    participant MCP as MCP Servers
    participant Mongo as MongoDB

    Client->>API: POST /prompt (auth + payload)
    API->>API: Validate auth + workflow state
    API-->>Client: 201 Accepted
    API->>Redis: run_agent_task.delay()
    Client->>API: GET /stream (SSE)
    Redis->>Worker: Deliver task
    Worker->>Mongo: Load Workflow + Agent + MCPs + Skills
    Worker->>SDK: build_client(token) → session
    Worker->>SDK: session.send(prompt)
    loop Agent Loop
        SDK->>MCP: Tool calls
        MCP-->>SDK: Tool results
        SDK->>Worker: Events (deltas, usage, logs)
        Worker->>Redis: Publish to channel
        Redis->>API: Event delivered
        API-->>Client: SSE event
    end
    Worker->>Mongo: Persist final state
    Worker->>Redis: Publish status: completed
    Redis->>API: Final event
    API-->>Client: SSE: status completed
```
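The relay step at the end of this flow has to frame each event for the `text/event-stream` wire format. Standard SSE framing is an optional `event:` field, a `data:` field, and a blank line terminating each event; whether the API sets the `event:` field this way is an assumption, and `to_sse` is an illustrative name.

```python
import json

def to_sse(event: dict) -> str:
    # One SSE frame: "event:" line, "data:" line, blank-line terminator.
    return f"event: {event['type']}\ndata: {json.dumps(event)}\n\n"
```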
## Event Bus Protocol
Events are published to the Redis channel `workflow:events:{workflow_id}`:
```json
{
  "type": "log | message | message_delta | usage | status",
  "data": { "..." },
  "timestamp": "2026-04-10T12:00:00+00:00"
}
```
| Event Type | Payload | Description |
|---|---|---|
| `log` | `{event, detail}` | Agent lifecycle events |
| `message` | `{role, content}` | Complete assistant/tool message |
| `message_delta` | `{delta}` | Streaming token fragment |
| `usage` | `{total_in, total_out, cost...}` | Cumulative usage stats |
| `status` | `{status, current_turn}` | Workflow state changes |
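A hypothetical validator for the envelope above, checking the event type and the minimum payload keys listed in the table. The key sets come from the table (`usage` is left open since its fields vary); the function itself is illustrative, not part of the codebase.

```python
# Minimum payload keys per event type, per the table above.
REQUIRED_KEYS = {
    "log": {"event", "detail"},
    "message": {"role", "content"},
    "message_delta": {"delta"},
    "usage": set(),            # usage fields vary (total_in, total_out, cost...)
    "status": {"status", "current_turn"},
}

def is_valid_event(event: dict) -> bool:
    etype = event.get("type")
    if etype not in REQUIRED_KEYS or "timestamp" not in event:
        return False
    # Payload must contain at least the required keys for its type.
    return REQUIRED_KEYS[etype] <= set(event.get("data", {}).keys())
```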
## Hooks & Error Recovery
The agent engine uses the Copilot SDK's hooks system for fine-grained control:
| Hook | Behaviour |
|---|---|
| `on_pre_tool_use` | Logs tool invocation; denies if max turns exceeded |
| `on_post_tool_use` | Logs result; injects goal reminder past 50% of turns |
| `on_error_occurred` | Retries recoverable errors up to 2×, then aborts |
| `on_session_end` | Logs session end reason |
The permission handler enforces max turns by counting tool calls and returning `denied-by-rules` when the limit is exceeded.
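That enforcement can be sketched as a small counter. The `denied-by-rules` result string comes from the text above; the class, method, and return values are illustrative stand-ins, not the Copilot SDK's actual hook API.

```python
class MaxTurnsHandler:
    """Counts tool calls and denies once the configured limit is exceeded."""

    def __init__(self, max_turns: int):
        self.max_turns = max_turns
        self.turns = 0

    def on_pre_tool_use(self, tool_name: str) -> str:
        self.turns += 1
        if self.turns > self.max_turns:
            return "denied-by-rules"   # limit hit: refuse further tool calls
        return "allowed"
```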