Observability

Who is this page for?

Operators running GAME and contributors adding instrumentation. Pairs with Operations (how to run the stack) and Configuration Reference (every knob).

The four signals

GAME exposes the usual observability signals plus a domain-specific one (strategy execution traces):

| Signal | Where |
| --- | --- |
| Metrics | Prometheus /metrics (HTTP + DSL metrics). |
| Logs | Structured stdout (JSON in prod/stage) + persisted Logs audit trail. |
| Errors | Sentry (when SENTRY_DSN is set). |
| Traces (domain) | Sampled StrategyExecutionLog rows for custom-strategy runs. |

Metrics (Prometheus)

When METRICS_ENABLED=true (the default), the app mounts prometheus_fastapi_instrumentator at /metrics. It is wired after CORS but before the routers, so it observes every request but does not sit behind router-level auth. In other words, /metrics itself is unguarded at the app level and must be protected at the ingress (or disabled) in production.

You get out of the box:

  • HTTP metrics - request counts, durations, and status classes (status codes are grouped; untemplated paths and /metrics are excluded).

  • DSL metrics - custom metrics defined in app/engine/dsl_metrics.py, which live in the default prometheus_client registry and are therefore exported automatically:

| Metric | Meaning |
| --- | --- |
| dsl_execution_duration_seconds | Histogram of custom-strategy execution wall-clock. |
| dsl_execution_nodes_total | AST nodes visited per run (cost signal). |
| dsl_execution_errors_total | Failed strategy executions, by error type. |
| dsl_execution_log_dropped_total | Execution-log rows dropped because the persistence queue was full (see below). Non-zero = the sink is saturated, not that scoring is at risk. |

The bundled Compose stack ships a pre-configured Prometheus that scrapes these without extra wiring.
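For a quick check without Prometheus, the /metrics payload can be read directly. The following is a minimal sketch that sums sample values per metric name from a text-exposition payload. The sample text and the error_type label name are illustrative assumptions (the actual label names live in app/engine/dsl_metrics.py), and the parser assumes samples carry no trailing timestamp.

```python
from collections import defaultdict

# Illustrative payload only; the "error_type" label name is an assumption.
SAMPLE = """\
# HELP dsl_execution_errors_total Failed strategy executions
# TYPE dsl_execution_errors_total counter
dsl_execution_errors_total{error_type="timeout"} 3.0
dsl_execution_errors_total{error_type="validation"} 1.0
dsl_execution_log_dropped_total 0.0
"""


def sum_by_metric(exposition: str) -> dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals: dict[str, float] = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # A sample is "name{labels} value" or "name value"; the value is the last token.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        totals[name] += float(value)
    return dict(totals)


totals = sum_by_metric(SAMPLE)
```

In practice the payload would come from an HTTP GET against /metrics; a non-zero dsl_execution_log_dropped_total in the result is the saturation signal described in the table above.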

Logging

Logging is configured at startup (app/main.py):

  • Format - plain text in dev; structured JSON in prod/stage (via python-json-logger), with renamed fields (timestamp/level/logger) ready for ingestion.

  • Level - LOG_LEVEL env var (default INFO).

  • Scope - root plus the uvicorn/gunicorn loggers, all to stdout (so a container platform collects them).
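The format switch above can be sketched with the standard library alone. This is not the code in app/main.py (which uses python-json-logger); it is a stand-in that produces the same renamed fields, and the environment names "prod" and "stage" as string values are assumptions.

```python
import io
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Stdlib stand-in for python-json-logger with timestamp/level/logger fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def configure_logging(env: str, level: str = "INFO", stream=None) -> logging.Handler:
    """Attach a single stdout handler to the root logger; JSON in prod/stage."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    if env in ("prod", "stage"):  # assumed environment names
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
    return handler


# Usage: capture one record in memory instead of stdout.
buffer = io.StringIO()
configure_logging("prod", "INFO", stream=buffer)
logging.getLogger("game.demo").info("strategy %s scored", "abc")
record = json.loads(buffer.getvalue())
```

The sketch configures only the root logger; loggers that propagate to root inherit the handler.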

On top of stdout logging, the audit trail (AuditLogger / app/util/add_log.py) writes structured Logs rows tagged with the module, level, message, api_key, oauth_user_id, and a correlation id - so a request can be reconstructed from the database, not just the log stream.

The “Network Error” trap

A dashboard “Network Error” with no HTTP status is almost always a backend 500 whose body the browser dropped. The middleware ordering ensures the 500 does carry CORS headers; check the API logs (docker logs GAME_API_DEV) for the real traceback. See Architecture.

Error tracking (Sentry)

Set SENTRY_DSN to enable Sentry. Configuration (app/main.py):

  • SENTRY_ENVIRONMENT and SENTRY_RELEASE tag events.

  • send_default_pii=True and traces_sample_rate=1.0 are set; continuous profiling auto-starts. Review these for your privacy/cost posture before enabling in production - full-rate tracing and PII capture are convenient in staging but may be too much at scale.
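One way to make those two settings reviewable per environment is to derive the Sentry options from environment variables. The sketch below is hypothetical: SENTRY_SEND_PII and SENTRY_TRACES_SAMPLE_RATE are not variables GAME reads today, and the defaults simply mirror the values described above. The resulting dict uses keyword names that sentry_sdk.init accepts.

```python
import os
from typing import Mapping, Optional


def sentry_options(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Build keyword arguments for sentry_sdk.init, or None when Sentry is off."""
    dsn = env.get("SENTRY_DSN")
    if not dsn:
        return None  # no DSN means Sentry stays disabled
    return {
        "dsn": dsn,
        "environment": env.get("SENTRY_ENVIRONMENT"),
        "release": env.get("SENTRY_RELEASE"),
        # Hypothetical overrides; defaults match the current hard-coded values.
        "send_default_pii": env.get("SENTRY_SEND_PII", "true").lower() == "true",
        "traces_sample_rate": float(env.get("SENTRY_TRACES_SAMPLE_RATE", "1.0")),
    }


options = sentry_options(
    {
        "SENTRY_DSN": "https://key@example.invalid/1",
        "SENTRY_ENVIRONMENT": "prod",
        "SENTRY_SEND_PII": "false",
        "SENTRY_TRACES_SAMPLE_RATE": "0.1",
    }
)
```

With the SDK installed, the dict would be passed as sentry_sdk.init(**options) when it is not None.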

Strategy execution traces

The domain-specific signal. Every production run of a custom strategy is handled by the singleton DslExecutionObserver:

  1. It emits the DSL Prometheus metrics above.

  2. It persists a StrategyExecutionLog row for every failed run, and for successful runs with probability DSL_EXECUTION_LOG_SAMPLE_RATE (default 0.05 = 5%). The sample rate never applies to errors.

A persisted row carries status, latency, node count, caseName, error code, and a bounded node-by-node trace (DSL_EXECUTION_LOG_TRACE_LIMIT, default 200 entries; tail-truncated because the early nodes usually explain why a rule matched).
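The persistence rule and the trace bound can be sketched as two small functions. The function names are hypothetical and the real DslExecutionObserver may structure this differently; the defaults mirror the documented values.

```python
import random
from typing import Callable


def should_persist(
    is_error: bool,
    sample_rate: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Errors always qualify; successes qualify with probability sample_rate."""
    return is_error or rng() < sample_rate


def truncate_trace(trace: list, limit: int = 200) -> list:
    """Keep the first `limit` entries and drop the tail."""
    return trace[:limit]


# Usage: a 500-node trace is cut down to its first 200 entries.
kept = truncate_trace(list(range(500)))
```

Passing rng explicitly makes the sampling decision deterministic in tests, for example should_persist(False, 0.05, rng=lambda: 0.04).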

Off the hot-path by design

The DB write does not block scoring. The observer enqueues the row onto a bounded in-process queue (DSL_EXECUTION_LOG_QUEUE_MAXSIZE, default 1000) drained by a background worker:

  • Scoring pays only the enqueue, never the DB round-trip.

  • If the database falls behind and the queue fills, rows are dropped (and counted by dsl_execution_log_dropped_total) rather than applying backpressure to scoring.

  • On graceful shutdown the lifespan hook flushes the queue (observer.aclose()) so buffered rows are not lost.
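The three properties above (non-blocking enqueue, drop on overflow, flush on shutdown) can be illustrated with asyncio.Queue. ExecutionLogSink and write_row are hypothetical names for this sketch; the real observer's internals may differ.

```python
import asyncio


class ExecutionLogSink:
    """Hypothetical stand-in for the observer's persistence path."""

    def __init__(self, write_row, maxsize: int = 1000):
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self._write_row = write_row  # async callable standing in for the DB write
        self.dropped = 0             # stand-in for dsl_execution_log_dropped_total
        self._worker = None

    def start(self) -> None:
        self._worker = asyncio.create_task(self._drain())

    def submit(self, row) -> bool:
        """Called from the scoring path: never awaits, so it never blocks."""
        try:
            self._queue.put_nowait(row)
            return True
        except asyncio.QueueFull:
            self.dropped += 1  # drop instead of applying backpressure
            return False

    async def _drain(self) -> None:
        while True:
            row = await self._queue.get()
            try:
                await self._write_row(row)
            except Exception:
                pass  # a real worker would log the failure
            finally:
                self._queue.task_done()

    async def aclose(self) -> None:
        """Flush buffered rows, then stop the worker."""
        await self._queue.join()
        if self._worker is not None:
            self._worker.cancel()
            try:
                await self._worker
            except asyncio.CancelledError:
                pass


async def demo():
    written = []

    async def write_row(row):
        await asyncio.sleep(0)  # pretend to do I/O
        written.append(row)

    sink = ExecutionLogSink(write_row, maxsize=2)
    # Submit before the worker starts so the third row overflows the queue.
    results = [sink.submit(i) for i in range(3)]
    sink.start()
    await sink.aclose()
    return results, written, sink.dropped


results, written, dropped = asyncio.run(demo())
```

In the demo the third submit is rejected and counted, while the two buffered rows are still written during aclose().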

So a non-zero drop rate is an alert that the sink is saturated. Lower DSL_EXECUTION_LOG_SAMPLE_RATE, raise DSL_EXECUTION_LOG_QUEUE_MAXSIZE, or speed up the DB. Scoring itself is never blocked by the sink.

Why both metrics and traces? Metrics tell you a strategy got slow or started erroring in aggregate; the sampled traces let the strategy author and the on-call engineer look back weeks later at which rule did what on a specific run, without replaying production traffic.

KPIs & operational telemetry

| Source | Content |
| --- | --- |
| KpiMetrics | Daily rollups: total requests, success/error rate, average latency, active users, retention, average interactions per user. |
| ApiRequests | Per-request records (endpoint, status, response time, type). |
| UptimeLogs | Periodic uptime samples. |

Surfaced through the dashboard and KPI endpoints:

GET /api/v1/kpi/health_check
GET /api/v1/dashboard/summary
GET /api/v1/dashboard/summary/logs
GET /api/v1/strategies/custom/{id}/metrics   # per-strategy aggregates
GET /api/v1/strategies/custom/compare        # A/B comparison

What to alert on

| Signal | Why it matters |
| --- | --- |
| dsl_execution_errors_total rising | A published strategy is failing - users aren’t being scored as intended. |
| dsl_execution_duration_seconds p99 near 500 ms | Strategies are approaching the wall-clock limit; some events may be rejected. |
| dsl_execution_log_dropped_total > 0 | The trace sink is saturated; you’re losing audit visibility (scoring is fine). |
| HTTP 5xx rate | Backend errors; correlate with Sentry and the Logs audit trail. |
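The p99 check on dsl_execution_duration_seconds relies on estimating a quantile from cumulative histogram buckets. The sketch below uses the same linear-interpolation approach as Prometheus's histogram_quantile function. The bucket boundaries, the counts, and the 80% warning threshold are illustrative assumptions, not values taken from GAME.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound_seconds, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # beyond the last finite bucket
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative cumulative counts; real bucket boundaries may differ.
BUCKETS = [(0.1, 900), (0.25, 980), (0.5, 995), (math.inf, 1000)]
WALL_CLOCK_LIMIT_S = 0.5  # the 500 ms limit from the table above

p99 = estimate_quantile(0.99, BUCKETS)
near_limit = p99 >= 0.8 * WALL_CLOCK_LIMIT_S  # 80% threshold is an assumption
```

Because the estimate is interpolated, its precision depends on how finely the buckets are spaced around the 500 ms limit.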