Skip to content

System Overview

TBD Agents separates concerns across API, workers, model/tool integrations, Redis, and pluggable storage.


High-Level Architecture

graph TB
    subgraph Clients
        Dashboard([Dashboard])
        CLI([curl / CLI])
        Apps([Applications])
    end

    subgraph API Layer
        FastAPI[FastAPI API<br/>Port 8000]
    end

    subgraph Message Broker
        Redis[(Redis<br/>Broker + Pub/Sub)]
    end

    subgraph Worker Pool
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
    end

    subgraph External
        SDK[Copilot SDK<br/>JSON-RPC CLI]
        Models[Copilot Models API]
        MCP1[MCP Server: Jira]
        MCP2[MCP Server: Datadog]
        MCPN[MCP Server: ...]
    end

    subgraph Storage
        Docs[(MongoDB or PostgreSQL)]
        Vectors[(Qdrant or pgvector)]
    end

    Dashboard & CLI & Apps -->|HTTP + Auth| FastAPI
    FastAPI -->|Enqueue Tasks| Redis
    FastAPI -->|Subscribe Events| Redis
    FastAPI -->|SSE Stream| Dashboard & CLI & Apps
    FastAPI -->|Read/Write| Docs

    Redis -->|Deliver Tasks| W1 & W2 & W3
    W1 & W2 & W3 -->|Publish Events| Redis
    W1 & W2 & W3 -->|Persist State| Docs
    W1 & W2 & W3 -->|Semantic retrieval| Vectors
    W1 & W2 & W3 -->|SDK Session| SDK

    SDK --> Models
    SDK --> MCP1 & MCP2 & MCPN

Components

FastAPI API

The API layer handles authentication, CRUD for agents/skills/MCP servers/workflows, and SSE streaming. It does not run agent logic — that's dispatched to workers.

  • Endpoints serve REST requests and validate GitHub PAT tokens
  • SSE endpoint (GET /api/workflows/{id}/stream) subscribes to a Redis pub/sub channel and streams events to the client
  • Prompt dispatchPOST /api/workflows/{id}/prompt enqueues a Celery task and returns 201 immediately

Celery Workers

Workers execute the actual agent loop. Each worker:

  1. Receives a task from the Redis queue containing (workflow_id, prompt, github_token)
  2. Initialises the configured document-store connection (DB_BACKEND=mongo or postgres)
  3. Loads the Workflow, Agent, MCP servers, Skills, knowledge, and memory from the database
  4. Creates a Copilot SDK session with the agent's configuration
  5. Runs the SDK agentic loop — the SDK handles planning, tool calls, and response generation
  6. Publishes real-time events (logs, message deltas, usage stats, status changes) to Redis pub/sub
  7. Persists final state (messages, logs, usage, status) to the configured document store

Key Celery settings:

Setting Value Why
worker_prefetch_multiplier 1 Agent tasks are long-running; don't hoard
task_acks_late True Re-queue tasks if a worker crashes
task_reject_on_worker_lost True Return tasks to the queue on shutdown

Redis

Redis serves two roles:

  1. Celery broker/backend — task queue for dispatching agent work and storing task results
  2. Event bus (pub/sub) — workers publish events to workflow:events:{id} channels; the FastAPI SSE endpoint subscribes and relays events to clients

This decoupling enables multi-process and multi-node scaling — any worker can publish events that any API instance can stream.

Document and Vector Storage

The document store is MongoDB by default and PostgreSQL when DB_BACKEND=postgres. It stores agents, MCP servers, skills, workflows, schedules, task executions, tokens, providers, guardrails, knowledge metadata, and memories.

Semantic memory and knowledge retrieval use Qdrant by default or pgvector when VECTOR_STORE_BACKEND=pgvector.


Request Flow

sequenceDiagram
    participant Client
    participant API as FastAPI API
    participant Redis
    participant Worker as Celery Worker
    participant SDK as Copilot SDK
    participant MCP as MCP Servers
    participant Store as Document Store

    Client->>API: POST /prompt (auth + payload)
    API->>API: Validate auth + workflow state
    API-->>Client: 201 Accepted
    API->>Redis: run_agent_task.delay()

    Client->>API: GET /stream (SSE)

    Redis->>Worker: Deliver task
    Worker->>Store: Load Workflow + Agent + MCPs + Skills
    Worker->>SDK: build_client(token) → session
    Worker->>SDK: session.send(prompt)

    loop Agent Loop
        SDK->>MCP: Tool calls
        MCP-->>SDK: Tool results
        SDK->>Worker: Events (deltas, usage, logs)
        Worker->>Redis: Publish to channel
        Redis->>API: Event delivered
        API-->>Client: SSE event
    end

    Worker->>Store: Persist final state
    Worker->>Redis: Publish status: completed
    Redis->>API: Final event
    API-->>Client: SSE: status completed

Event Bus Protocol

Events published to Redis channel workflow:events:{workflow_id}:

{
  "type": "log | message | message_delta | usage | status",
  "data": { "..." },
  "timestamp": "2026-04-10T12:00:00+00:00"
}
Event Type Payload Description
log {event, detail} Agent lifecycle events
message {role, content} Complete assistant/tool message
message_delta {delta} Streaming token fragment
usage {total_in, total_out, cost...} Cumulative usage stats
status {status, current_turn} Workflow state changes

Hooks & Error Recovery

The agent engine uses the Copilot SDK's hooks system for fine-grained control:

Hook Behaviour
on_pre_tool_use Logs tool invocation; denies if max turns exceeded
on_post_tool_use Logs result; injects goal reminder past 50% turns
on_error_occurred Retries recoverable errors up to 2×, then aborts
on_session_end Logs session end reason

The permission handler enforces max turns by counting tool calls and returning denied-by-rules when the limit is exceeded.