Observability & Monitoring
TBD Agents ships a full MELT (Metrics, Events, Logs, Traces) stack powered by OpenTelemetry, Prometheus, Grafana, Loki, and Tempo.
```
┌───────────────┐      ┌───────────────┐
│  FastAPI app  │      │ Celery worker │
│ :8000/metrics │      │ :9101/metrics │
└───────┬───────┘      └───────┬───────┘
        │ OTLP gRPC            │ OTLP gRPC
        ▼                      ▼
     ┌─────────────────────────┐
     │  OTel Collector :4317   │
     │  Prometheus :8889       │
     └──────┬──────────┬───────┘
            │          │
            ▼          ▼
  ┌───────────┐ ┌──────┐ ┌──────┐
  │Prometheus │ │Tempo │ │ Loki │
  │   :9090   │ │:3200 │ │:3100 │
  └─────┬─────┘ └──┬───┘ └──┬───┘
        │          │        │
        ▼          ▼        ▼
     ┌──────────────────────────┐
     │      Grafana :3000       │
     │     admin / copilot      │
     └──────────────────────────┘
```
Quick Start

All observability services start automatically with Docker Compose (`docker compose up -d`). Open Grafana at http://localhost:3000; the default credentials are `admin` / `copilot`.
Custom Prometheus Metrics

The app defines 14 custom metrics in `app/observability.py`. All are recorded by the agent engine during task execution and exposed via `/metrics` (app) and `:9101/metrics` (worker).
Counters

| Metric | Labels | Description |
|---|---|---|
| `copilot_hub_tokens_total` | `direction`, `model` | Total tokens consumed (input, output, cache_read, cache_write) |
| `copilot_hub_cost_dollars_total` | `model` | Estimated LLM cost in USD |
| `copilot_hub_premium_requests_total` | `model` | Premium API requests consumed |
| `copilot_hub_agent_tasks_total` | `status`, `model`, `reasoning_effort` | Agent tasks by outcome |
| `copilot_hub_tool_calls_total` | `tool_name` | Tool invocations by name |
| `copilot_hub_mcp_connections_total` | `server_name` | MCP server connections initiated |
| `copilot_hub_repo_sync_total` | `status` | Repository sync operations (success/failure) |
Histograms

| Metric | Labels | Buckets | Description |
|---|---|---|---|
| `copilot_hub_agent_task_duration_seconds` | `model`, `status` | 1s – 30min | Task execution time |
| `copilot_hub_tool_calls_per_task` | `model` | 1 – 200 | Tool calls per task |
| `copilot_hub_cost_per_task_dollars` | `model` | $0.001 – $10 | Cost distribution per task |
| `copilot_hub_repo_sync_duration_seconds` | — | 0.5s – 60s | Repo sync duration |
Gauges

| Metric | Description |
|---|---|
| `copilot_hub_agent_tasks_active` | Currently running agent tasks |
| `copilot_hub_sse_connections_active` | Active SSE connections |
| `copilot_hub_celery_queue_length` | Tasks waiting in the Celery queue |
Scrape Targets

Prometheus is configured with three scrape jobs (`observability/prometheus.yml`):

| Job | Target | Metrics |
|---|---|---|
| `fastapi` | `app:8000/metrics` | FastAPI instrumentator + custom app metrics |
| `celery-worker` | `worker:9101/metrics` | All custom metrics recorded in worker processes |
| `otel-collector` | `otel-collector:8889` | OTel Collector internal metrics |
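For reference, a minimal sketch of what one of these jobs could look like in `observability/prometheus.yml` — only the job name and target come from the table above; the scrape interval is an assumption:

```yaml
scrape_configs:
  - job_name: celery-worker
    scrape_interval: 15s        # assumed; not specified in this document
    static_configs:
      - targets: ["worker:9101"]   # metrics_path defaults to /metrics
```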
Grafana Dashboards
Two pre-provisioned dashboards are available in Grafana.
Overview (`copilot-hub-overview`)
Six rows covering the full system:
- API Overview — Request rate, latency percentiles, error rate
- Agent Executions — Active tasks, completion rate, duration percentiles
- Token & Cost — Input/output tokens, cost accumulation, premium requests
- Tool Calls — Call rate by tool, top tools (24h bar chart)
- MCP Servers — Connection rate, repo sync operations + p95 duration
- System Resources — Celery queue depth, active SSE connections, in-flight HTTP
LLM & Cost Analytics (`copilot-hub-llm`)
Deep dive into model usage and spend:
- Summary stats — Total tokens, cost, premium requests, success rate (24h)
- Token usage — Input vs output over time, cache hit rates
- Cost analysis — Cost over time, cost-per-task distribution
- Model performance — Duration by model, tasks by status/reasoning effort
- Traces — Recent distributed traces via Tempo
- Logs — Combined app + worker logs via Loki
Template variables (model, job) are not yet wired into panel
queries. They will be added in a future iteration once baseline
metric labels stabilise.
Alerting Rules
Prometheus evaluates alerting rules from observability/alert-rules.yml.
All rules are pre-configured — no Alertmanager is required for rule
evaluation, but you should add one to receive notifications.
| Alert | Condition | Severity | Description |
|---|---|---|---|
| `HighTaskFailureRate` | >25% failure rate for 5m | critical | Agent tasks are failing at a high rate |
| `NoTaskCompletions` | Tasks submitted but none complete for 15m | warning | Workers may be stuck |
| `CostSpikeHourly` | >$50 in 1 hour | warning | Unexpected LLM spend |
| `CostSpikeDaily` | >$500 in 24 hours | critical | Major cost incident |
| `CeleryQueueBacklog` | >10 tasks queued for 5m | warning | Processing falling behind |
| `CeleryQueueCritical` | >50 tasks queued for 2m | critical | Workers overwhelmed or down |
| `WorkerDown` | Metrics endpoint unreachable for 2m | critical | Celery worker unreachable |
| `AppDown` | Metrics endpoint unreachable for 2m | critical | FastAPI app unreachable |
| `SlowTaskExecution` | p95 duration >5 min for 10m | warning | Tasks taking unusually long |
| `HighSSEConnections` | >100 active SSE for 5m | warning | Possible connection leak |
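As an illustration, a rule such as `HighTaskFailureRate` could be expressed in `observability/alert-rules.yml` roughly as follows. The exact PromQL and the `status` label value are assumptions, not the shipped rule:

```yaml
groups:
  - name: copilot-hub
    rules:
      - alert: HighTaskFailureRate
        # Assumed expression: failed fraction of all agent tasks over 5m
        expr: |
          sum(rate(copilot_hub_agent_tasks_total{status="failure"}[5m]))
            / sum(rate(copilot_hub_agent_tasks_total[5m])) > 0.25
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Agent tasks are failing at a high rate
```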
Adding Alertmanager

To receive alert notifications, add Alertmanager to your `docker-compose.yml`:

```yaml
alertmanager:
  image: prom/alertmanager:v0.27.0
  volumes:
    - ./observability/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
  ports:
    - "9093:9093"
```
Then add to prometheus.yml:
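The standard Prometheus stanza for this looks like the following (the `alertmanager:9093` target assumes the Compose service name used above):

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```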
Distributed Tracing
OpenTelemetry auto-instruments FastAPI, HTTPX, and Celery. Traces flow through the OTel Collector to Tempo.
- App service name: `tbd-agents-api`
- Worker service name: `tbd-agents-worker`
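Assuming the standard OpenTelemetry environment-variable configuration, the Compose wiring for each service looks roughly like this (values inferred from the service names and collector port above):

```yaml
environment:
  OTEL_SERVICE_NAME: tbd-agents-api          # tbd-agents-worker for the worker
  OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
```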
View traces in Grafana → Explore → Tempo, or use the embedded trace panel in the LLM Analytics dashboard.
Correlating Logs and Traces
Promtail extracts trace_id from log lines. In Grafana's Loki explorer,
click a trace ID to jump directly to the corresponding trace in Tempo.
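A hedged sketch of the Promtail side of this: a `regex` pipeline stage that pulls `trace_id` out of each log line and promotes it to a label. The log format and regex are assumptions about this project's log output:

```yaml
pipeline_stages:
  - regex:
      # Assumes logs contain "trace_id=<32 hex chars>"
      expression: 'trace_id=(?P<trace_id>[0-9a-f]{32})'
  - labels:
      trace_id:
```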
Troubleshooting
No metrics in Prometheus

- Check the targets page: http://localhost:9090/targets
- Ensure `app:8000/metrics` and `worker:9101/metrics` show as UP
- If the worker target is DOWN, verify `WORKER_METRICS_PORT=9101` is set
Dashboards show "No data"
- Metrics only appear after at least one agent task has been executed
- Check the time range in Grafana (default is last 6h for Overview, 24h for LLM Analytics)
- Run a test workflow to generate initial data
High Celery queue length

- Check worker logs: `docker compose logs worker --tail=100`
- Scale workers: `docker compose up -d --scale worker=3` (the worker metrics port is exposed only to the Docker network, so scaling works without host-port conflicts)
Traces not appearing in Tempo

- Verify `OTEL_EXPORTER_OTLP_ENDPOINT` is set in both app and worker
- Check OTel Collector logs: `docker compose logs otel-collector --tail=50`
- Ensure Tempo is receiving data: http://localhost:3200/ready