
Observability

DocuGardener ships with a pre-wired Prometheus + Grafana stack. Production deployments get dashboards and alert rules with no extra setup. Local dev is opt-in: uncomment two service blocks in the Compose file and restart.

What's included

  • Prometheus — scrapes /metrics on the FastAPI backend (port 8000). Metrics are emitted via prometheus_fastapi_instrumentator plus custom counters and gauges.
  • Grafana — pre-provisioned datasource, three dashboards (Overview, LLM Usage, Queue Health), and two alert rules. No manual setup needed.
  • Provisioning — datasources, dashboards, and alerting rules are loaded automatically from docker/grafana/provisioning/ at container startup.

Enabling in local dev

The Prometheus and Grafana services are commented out in docker/docker-compose.yml by default to keep the dev environment lightweight. To enable them, uncomment the prometheus and grafana service blocks, then restart:

# In docker/docker-compose.yml — uncomment these two service blocks:
#   prometheus:
#   grafana:

make dev-up

Grafana is then available at http://localhost:3004.

Default credentials: admin / admin. Change the password immediately or set GF_SECURITY_ADMIN_PASSWORD in your .env before starting.

Accessing in production

In production (docker/docker-compose.prod.yml) both services are enabled by default. Grafana is intentionally not exposed on a public host port — access it through an SSH tunnel:

ssh -L 3000:localhost:3000 user@your-vps-ip

Then open http://localhost:3000 in your browser.
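To confirm the tunnel works before opening a browser, a small script can probe Grafana's health endpoint. This is a minimal sketch, assuming the tunnel forwards to local port 3000 as in the command above and that your Grafana version serves /api/health with a "database" field. The function names are hypothetical and not part of DocuGardener.

```python
import json
import urllib.request


def parse_health(body: bytes) -> bool:
    """Return True when the health payload reports a reachable database."""
    try:
        return json.loads(body).get("database") == "ok"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def grafana_is_healthy(base_url: str = "http://localhost:3000", timeout: float = 3.0) -> bool:
    """Probe Grafana through the SSH tunnel; False on any connection problem."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return resp.status == 200 and parse_health(resp.read())
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors.
        return False


if __name__ == "__main__":
    print("Grafana reachable:", grafana_is_healthy())
```

A False result usually means the tunnel is not up or Grafana is not listening on the forwarded port.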

If you need to expose Grafana publicly (e.g. behind your reverse proxy), set GF_SERVER_ROOT_URL to your public URL and add appropriate authentication in front of it. Never expose Grafana without authentication.

Metrics reference

All custom metrics are exposed in Prometheus format at http://<backend>:8000/metrics. Standard FastAPI request metrics are also included via prometheus_fastapi_instrumentator.

| Metric name | Labels | Description |
| --- | --- | --- |
| docugardener_webhooks_total | event_type, success | Counter. Webhook events received from GitHub, labelled by event type (e.g. pull_request) and whether processing succeeded. |
| docugardener_jobs_completed_total | status | Counter. Analysis jobs completed, labelled by terminal status (e.g. RESOLVED, FAILED, CLEAN). |
| docugardener_llm_requests_total | provider, model, success | Counter. LLM API calls made, labelled by provider (e.g. anthropic, openai), model name, and success flag. |
| docugardener_llm_tokens_total | provider, model, type | Counter. Tokens consumed, where type is either input or output. Use this to track spend per provider. |
| docugardener_active_tenants | (none) | Gauge. Number of tenants that have processed at least one job in the last 30 days. |
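As an illustration of tracking token usage per provider outside Grafana, the sketch below parses Prometheus text-format output and sums docugardener_llm_tokens_total by provider and token type. The sample text, the model names, and all values are made up for the example. In practice you would fetch the text from http://<backend>:8000/metrics. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Illustrative sample only: model names and values are invented.
SAMPLE_METRICS = """\
# HELP docugardener_llm_tokens_total Tokens consumed
# TYPE docugardener_llm_tokens_total counter
docugardener_llm_tokens_total{provider="anthropic",model="model-a",type="input"} 1200.0
docugardener_llm_tokens_total{provider="anthropic",model="model-a",type="output"} 300.0
docugardener_llm_tokens_total{provider="openai",model="model-b",type="input"} 500.0
docugardener_active_tenants 4.0
"""

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def tokens_by_provider(metrics_text: str) -> dict:
    """Sum docugardener_llm_tokens_total per (provider, type) pair."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None or match["name"] != "docugardener_llm_tokens_total":
            continue
        labels = dict(_LABEL_RE.findall(match["labels"] or ""))
        totals[(labels.get("provider"), labels.get("type"))] += float(match["value"])
    return dict(totals)


if __name__ == "__main__":
    for (provider, kind), total in sorted(tokens_by_provider(SAMPLE_METRICS).items()):
        print(f"{provider:10s} {kind:6s} {total:.0f}")
```

Because the metric is a counter, the totals are cumulative since the backend process started. For per-period usage, compare two scrapes or use a rate query in Prometheus.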

Pre-configured dashboards

Three dashboards are auto-loaded from docker/grafana/provisioning/dashboards/ at startup. They are read-only by default — duplicate a dashboard in Grafana before modifying it.

Overview

High-level health of the system: webhook receive rate, job completion rate, error rate, and active tenant count. The first dashboard to check when something feels off.

LLM Usage

Token consumption broken down by provider and model, input vs. output token split, and request success/failure rates per model. Useful for BYOK cost attribution.

Queue Health

RQ queue depth over time for the high and default priority queues, worker throughput (jobs completed per minute), and stale job detection lag. The alert rules below fire based on these signals.

Alert rules

Two alert rules are provisioned from docker/grafana/provisioning/alerting/alerts.yml. They fire to the docugardener-ops contact point — configure the delivery webhook URL in your .env (see below).

| Rule name | Condition | Severity |
| --- | --- | --- |
| RQ Queue Stuck | Queue depth > 0 for more than 5 consecutive minutes with no worker activity. | critical |
| Worker Silent | Queue is non-empty but no jobs have completed in the last 10 minutes. Indicates a crashed or hung worker. | critical |
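To make the two conditions concrete, here is a rough sketch that evaluates them over a list of time-ordered samples. It is not how the provisioned rules are implemented (those are Grafana queries in alerts.yml). The Sample type, the function names, and the use of an unchanged completed-jobs counter as a stand-in for "no worker activity" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    ts: float            # sample time in seconds
    queue_depth: int     # jobs waiting in the queue
    jobs_completed: int  # cumulative completed-jobs counter


def _window(samples, now, window_s):
    """Samples inside [now - window_s, now], or [] if history is too short."""
    if not samples or samples[0].ts > now - window_s:
        return []
    return [s for s in samples if now - window_s <= s.ts <= now]


def queue_stuck(samples, now, window_s=300):
    """Queue depth stayed above zero for the whole window with no completions."""
    recent = _window(samples, now, window_s)
    return (
        len(recent) >= 2
        and all(s.queue_depth > 0 for s in recent)
        and recent[-1].jobs_completed == recent[0].jobs_completed
    )


def worker_silent(samples, now, window_s=600):
    """Queue is non-empty now and nothing completed during the window."""
    recent = _window(samples, now, window_s)
    return (
        len(recent) >= 2
        and recent[-1].queue_depth > 0
        and recent[-1].jobs_completed == recent[0].jobs_completed
    )


if __name__ == "__main__":
    # One sample per minute for ten minutes; the queue never drains.
    stalled = [Sample(ts=t, queue_depth=3, jobs_completed=10) for t in range(0, 601, 60)]
    print(queue_stuck(stalled, now=600), worker_silent(stalled, now=600))
```

Both functions return False when the available history does not cover the full window, which avoids firing right after a restart.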

Configuring alert delivery

The docugardener-ops contact point sends a JSON POST to whatever URL you provide in GRAFANA_ALERT_WEBHOOK_URL. This makes it compatible with Slack incoming webhooks, PagerDuty event routing endpoints, Discord webhooks, or any generic webhook receiver.

Example — posting alerts to a Slack channel:

# .env
GRAFANA_ALERT_WEBHOOK_URL=https://hooks.slack.com/services/T00000000/B00000000/XXXX

If this variable is not set, the contact point exists but has no destination — alerts fire internally in Grafana (visible in the Alerting tab) but are not delivered externally. This is acceptable for dev; set it before going to production.
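If you point the variable at your own receiver, the handler only needs to parse the JSON body. The sketch below turns a payload into one line per alert. The field names (alerts, status, labels, alertname, severity) follow the shape Grafana's webhook notifier commonly sends, but treat them as an assumption and check a real payload from your Grafana version. The function name and the example payload are hypothetical.

```python
import json


def summarize_alerts(body: bytes) -> list[str]:
    """Turn a webhook JSON body into one human-readable line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        name = labels.get("alertname", "unknown")
        severity = labels.get("severity", "unknown")
        # Fall back to the top-level status when the alert has none.
        status = alert.get("status", payload.get("status", "unknown"))
        lines.append(f"[{status}] {name} (severity={severity})")
    return lines


if __name__ == "__main__":
    # Hand-written example payload, not captured from a real deployment.
    example = json.dumps({
        "receiver": "docugardener-ops",
        "status": "firing",
        "alerts": [
            {"status": "firing",
             "labels": {"alertname": "Worker Silent", "severity": "critical"}},
        ],
    }).encode()
    for line in summarize_alerts(example):
        print(line)
```

Using .get with defaults keeps the handler tolerant of fields that are missing or renamed across Grafana versions.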

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| GRAFANA_ALERT_WEBHOOK_URL | unset | Webhook URL for the docugardener-ops Grafana contact point. Slack, PagerDuty, Discord, or any webhook receiver. |
| GF_SECURITY_ADMIN_PASSWORD | admin | Grafana admin password. Must be changed in production because the default is publicly known. |
| GF_SERVER_ROOT_URL | unset | Set to your public-facing URL only if you expose Grafana through a reverse proxy (e.g. https://grafana.example.com). Leave unset when using the SSH tunnel approach. |
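A pre-deploy check can catch the two settings that most often go wrong. This is a minimal sketch based only on the defaults listed above. The function name is hypothetical and DocuGardener does not ship such a check as far as this page states.

```python
import os


def check_observability_env(env, production=True):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # The documented default password is "admin", so unset counts as default.
    if env.get("GF_SECURITY_ADMIN_PASSWORD", "admin") == "admin":
        problems.append(
            "GF_SECURITY_ADMIN_PASSWORD is unset or still the default 'admin'"
        )
    # Without a webhook URL, alerts stay inside Grafana; fine for dev only.
    if production and not env.get("GRAFANA_ALERT_WEBHOOK_URL"):
        problems.append(
            "GRAFANA_ALERT_WEBHOOK_URL is unset; alerts will not be delivered externally"
        )
    return problems


if __name__ == "__main__":
    for problem in check_observability_env(os.environ):
        print("WARNING:", problem)
```

Passing the environment as a plain mapping keeps the function easy to test and lets you run it against a parsed .env file as well as os.environ.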