Observability

Who is this page for?

Operators running GAME and contributors adding instrumentation. Pairs with Operations (how to run the stack) and Configuration Reference (every knob).

The four signals

GAME exposes the usual observability signals plus a domain-specific one (strategy execution traces):

Signal	Where
Metrics	Prometheus `/metrics` (HTTP + DSL metrics).
Logs	Structured stdout (JSON in prod/stage) + persisted `Logs` audit trail.
Errors	Sentry (when `SENTRY_DSN` is set).
Traces (domain)	Sampled `StrategyExecutionLog` rows for custom-strategy runs.

Metrics (Prometheus)

When METRICS_ENABLED=true (the default), the app mounts prometheus_fastapi_instrumentator at /metrics. It is wired after CORS but before the routers, so it observes every request but does not sit behind router-level auth. As a result, /metrics itself is unguarded at the app level and must be protected at the ingress (or disabled) in production.

You get out of the box:

HTTP metrics - request counts, durations, and status classes (status codes are grouped; untemplated paths and /metrics are excluded).
DSL metrics - custom metrics (counters and a histogram) defined in app/engine/dsl_metrics.py. They live in the default prometheus_client registry and are therefore exported automatically:

Metric	Meaning
`dsl_execution_duration_seconds`	Histogram of custom-strategy execution wall-clock.
`dsl_execution_nodes_total`	AST nodes visited per run (cost signal).
`dsl_execution_errors_total`	Failed strategy executions, by error type.
`dsl_execution_log_dropped_total`	Execution-log rows dropped because the persistence queue was full (see below). A non-zero value means the trace sink is saturated; scoring itself is not at risk.

The bundled Compose stack ships a pre-configured Prometheus that scrapes these without extra wiring.
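For a quick check without a full Prometheus server, the text served at /metrics can be read directly. The following is a minimal sketch, assuming the standard Prometheus text exposition format. The helper name `sum_samples`, the sample payload, and the `error_type` label are hypothetical; only the metric names come from the table above.

```python
def sum_samples(exposition: str, metric: str) -> float:
    """Sum every sample of `metric` found in a Prometheus text-format payload."""
    total = 0.0
    for raw in exposition.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            name, _, remainder = line.partition("{")
            # Simplification: assumes no "}" inside a quoted label value.
            _, _, remainder = remainder.partition("}")
        else:
            name, _, remainder = line.partition(" ")
        if name.strip() != metric:
            continue
        fields = remainder.split()
        if fields:
            # The first field after the name/labels is the sample value.
            total += float(fields[0])
    return total


# Hypothetical payload, shaped like a /metrics response.
SAMPLE = """\
# HELP dsl_execution_log_dropped_total Execution-log rows dropped.
# TYPE dsl_execution_log_dropped_total counter
dsl_execution_log_dropped_total 3.0
dsl_execution_errors_total{error_type="timeout"} 2.0
dsl_execution_errors_total{error_type="validation"} 5.0
"""

dropped = sum_samples(SAMPLE, "dsl_execution_log_dropped_total")
errors = sum_samples(SAMPLE, "dsl_execution_errors_total")
```

In practice the payload would come from an HTTP GET against the app's /metrics endpoint; summing across label sets gives the same total a dashboard would show for the unlabelled series.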

Logging

Logging is configured at startup (app/main.py):

Format - plain text in dev; structured JSON in prod/stage (via python-json-logger), with renamed fields (timestamp/level/logger) ready for ingestion.
Level - LOG_LEVEL env var (default INFO).
Scope - root plus the uvicorn/gunicorn loggers, all to stdout (so a container platform collects them).
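The format switch described above can be sketched with the standard library alone. This is not the code in app/main.py: the app uses python-json-logger, whereas this sketch swaps in a hand-written `logging.Formatter` so it runs without dependencies. The names `JsonFormatter` and `configure`, and the `env` parameter used to select the format, are assumptions for illustration.

```python
import json
import logging
import sys
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with renamed fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure(logger: logging.Logger, env: str, level: str, stream: TextIO) -> None:
    """Attach a single stream handler: JSON in prod/stage, plain text otherwise."""
    handler = logging.StreamHandler(stream)
    if env in ("prod", "stage"):
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
    logger.handlers[:] = [handler]
    logger.setLevel(level.upper())


if __name__ == "__main__":
    # Root logger to stdout, as a container platform would expect.
    configure(logging.getLogger(), env="prod", level="INFO", stream=sys.stdout)
    logging.getLogger("game.demo").info("scored %d events", 3)
```

Passing the stream in as a parameter keeps the sketch testable; the real configuration writes to stdout and also covers the uvicorn/gunicorn loggers.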

On top of stdout logging, the audit trail (AuditLogger / app/util/add_log.py) writes structured Logs rows tagged with the module, level, message, api_key, oauth_user_id, and a correlation id - so a request can be reconstructed from the database, not just the log stream.

The “Network Error” trap

A dashboard “Network Error” with no HTTP status is almost always a backend 500 whose body the browser dropped. The middleware ordering ensures the 500 does carry CORS headers; check the API logs (docker logs GAME_API_DEV) for the real traceback. See Architecture.

Error tracking (Sentry)

Set SENTRY_DSN to enable Sentry. Configuration (app/main.py):

SENTRY_ENVIRONMENT and SENTRY_RELEASE tag events.
Data collection is privacy/cost-conservative by default and configurable per environment: SENTRY_SEND_DEFAULT_PII (default false - no user ids, client IP, headers or bodies on events), SENTRY_TRACES_SAMPLE_RATE (default 0.1) and SENTRY_PROFILING_ENABLED (default false). Raise them deliberately - full-rate tracing and PII capture are convenient in staging, but in production they can be too costly, and PII capture is a GDPR concern. See Configuration Reference for the full list.
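The defaults listed above can be captured in a small settings loader. This is a sketch under stated assumptions, not the code in app/main.py: `SentrySettings`, `load_sentry_settings`, and the accepted boolean spellings are hypothetical, while the variable names and default values come from this page.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as "true"; the app may accept others.
    return value.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class SentrySettings:
    dsn: Optional[str]
    environment: Optional[str]
    release: Optional[str]
    send_default_pii: bool
    traces_sample_rate: float
    profiling_enabled: bool

    @property
    def enabled(self) -> bool:
        # Sentry is only active when a DSN is configured.
        return self.dsn is not None


def load_sentry_settings(env: Mapping[str, str]) -> SentrySettings:
    """Read the SENTRY_* variables, falling back to the conservative defaults."""
    return SentrySettings(
        dsn=env.get("SENTRY_DSN") or None,
        environment=env.get("SENTRY_ENVIRONMENT"),
        release=env.get("SENTRY_RELEASE"),
        send_default_pii=_as_bool(env.get("SENTRY_SEND_DEFAULT_PII", "false")),
        traces_sample_rate=float(env.get("SENTRY_TRACES_SAMPLE_RATE", "0.1")),
        profiling_enabled=_as_bool(env.get("SENTRY_PROFILING_ENABLED", "false")),
    )
```

Calling `load_sentry_settings(os.environ)` at startup would yield values that map naturally onto the `dsn`, `environment`, `release`, `send_default_pii` and `traces_sample_rate` arguments of `sentry_sdk.init`; how SENTRY_PROFILING_ENABLED is translated for the SDK is not stated on this page.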

Strategy execution traces

This is the domain-specific signal. Every production run of a custom strategy is handled by the singleton DslExecutionObserver:

It emits the DSL Prometheus metrics above.
It persists a StrategyExecutionLog row on every error, and on successful runs with probability DSL_EXECUTION_LOG_SAMPLE_RATE (default 0.05 = 5%). Errors are always kept regardless of the rate.
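The persistence rule above (always keep errors, sample successes) reduces to a few lines. This is a minimal sketch; the function name `should_persist` and the injected random generator are assumptions, and the real observer may make the decision differently.

```python
import random


def should_persist(is_error: bool, sample_rate: float, rng: random.Random) -> bool:
    """Decide whether a run's StrategyExecutionLog row is written."""
    if is_error:
        # Errors bypass sampling entirely.
        return True
    # rng.random() is in [0.0, 1.0), so rate 0.0 keeps nothing and 1.0 keeps all.
    return rng.random() < sample_rate


# With the default rate of 0.05, roughly 1 in 20 successful runs is kept.
rng = random.Random(42)
kept = sum(should_persist(False, 0.05, rng) for _ in range(10_000))
```

Injecting the generator keeps the decision deterministic in tests while production code can pass a shared `random.Random()` instance.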

A persisted row carries status, latency, node count, caseName, error code, and a bounded node-by-node trace (DSL_EXECUTION_LOG_TRACE_LIMIT, default 200 entries). The trace is truncated at the tail, so the earliest entries are kept, because the early nodes usually explain why a rule matched.
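Tail truncation of the trace can be illustrated as follows. This is a sketch only: `bound_trace` and the returned `truncated` flag are hypothetical, and whether the persisted row records that truncation happened is not stated here.

```python
def bound_trace(trace: list[dict], limit: int = 200) -> tuple[list[dict], bool]:
    """Keep the first `limit` trace entries and report whether any were cut."""
    truncated = len(trace) > limit
    # Slicing from the front drops the tail and preserves the early nodes.
    return trace[:limit], truncated


# A hypothetical 250-node run keeps nodes 0..199.
full_trace = [{"node": i} for i in range(250)]
kept_entries, was_truncated = bound_trace(full_trace)
```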

Off the hot-path by design

The DB write does not block scoring. The observer enqueues the row onto a bounded in-process queue (DSL_EXECUTION_LOG_QUEUE_MAXSIZE, default 1000) drained by a background worker:

Scoring pays only the enqueue, never the DB round-trip.
If the database falls behind and the queue fills, rows are dropped (and counted by dsl_execution_log_dropped_total) rather than applying backpressure to scoring.
On graceful shutdown the lifespan hook flushes the queue (observer.aclose()) so buffered rows are not lost.
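The enqueue, drop, and flush behaviour described above can be sketched with `asyncio.Queue`. This illustrates the pattern and is not the DslExecutionObserver implementation: the class name `ExecutionLogSink`, the `write_row` callable, and the `dropped` attribute (standing in for dsl_execution_log_dropped_total) are assumptions.

```python
import asyncio


class ExecutionLogSink:
    """Bounded queue drained by a background worker; drops rows when full."""

    def __init__(self, write_row, maxsize: int = 1000) -> None:
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self._write_row = write_row  # async callable standing in for the DB insert
        self._worker = None
        self.dropped = 0  # stand-in for dsl_execution_log_dropped_total

    def start(self) -> None:
        self._worker = asyncio.create_task(self._drain())

    def submit(self, row: dict) -> bool:
        """Called from the scoring path: never awaits, so it never blocks."""
        try:
            self._queue.put_nowait(row)
            return True
        except asyncio.QueueFull:
            self.dropped += 1  # drop instead of applying backpressure
            return False

    async def _drain(self) -> None:
        while True:
            row = await self._queue.get()
            try:
                await self._write_row(row)
            except Exception:
                pass  # a real sink would log the failed write
            finally:
                self._queue.task_done()

    async def aclose(self) -> None:
        """Flush buffered rows, then stop the worker (requires start())."""
        await self._queue.join()
        if self._worker is not None:
            self._worker.cancel()
            try:
                await self._worker
            except asyncio.CancelledError:
                pass


async def demo() -> tuple[list[dict], int]:
    written: list[dict] = []

    async def fake_db_write(row: dict) -> None:
        await asyncio.sleep(0)  # pretend to be a DB round-trip
        written.append(row)

    sink = ExecutionLogSink(fake_db_write, maxsize=2)
    sink.start()
    # Three submits before the worker gets a turn: the third finds the queue full.
    for i in range(3):
        sink.submit({"run": i})
    await sink.aclose()
    return written, sink.dropped


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The key design point is that `submit` uses `put_nowait` rather than `await put(...)`, which is what keeps a slow database from slowing the scoring path.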

So a non-zero drop rate is an alert that the sink is saturated - lower DSL_EXECUTION_LOG_SAMPLE_RATE, raise DSL_EXECUTION_LOG_QUEUE_MAXSIZE, or speed up the DB - but scoring itself is never slowed by the sink.

Why both metrics and traces? Metrics tell you a strategy got slow or started erroring in aggregate; the sampled traces let the strategy author and the on-call engineer look back weeks later at which rule did what on a specific run, without replaying production traffic.

KPIs & operational telemetry

Source	Content
`KpiMetrics`	Daily rollups: total requests, success/error rate, average latency, active users, retention, average interactions per user.
`ApiRequests`	Per-request records (endpoint, status, response time, type).
`UptimeLogs`	Periodic uptime samples.

Surfaced through the dashboard and KPI endpoints:

GET /api/v1/kpi/health_check
GET /api/v1/dashboard/summary
GET /api/v1/dashboard/summary/logs
GET /api/v1/strategies/custom/{id}/metrics   # per-strategy aggregates
GET /api/v1/strategies/custom/compare        # A/B comparison

What to alert on

Signal	Why it matters
`dsl_execution_errors_total` rising	A published strategy is failing - users aren’t being scored as intended.
`dsl_execution_duration_seconds` p99 near 500 ms	Strategies are approaching the wall-clock limit; some events may be rejected.
`dsl_execution_log_dropped_total` > 0	The trace sink is saturated; you’re losing audit visibility (scoring is fine).
HTTP `5xx` rate	Backend errors; correlate with Sentry and the `Logs` audit trail.
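The p99 row in the table depends on estimating a quantile from histogram buckets. The sketch below mirrors the linear-interpolation approach Prometheus uses for histograms, written in plain Python so the arithmetic is visible. The bucket boundaries and counts are hypothetical, as are the function name `estimate_quantile` and the 80% warning margin; only the 500 ms limit comes from the table above.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound_seconds, count) buckets."""
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets for dsl_execution_duration_seconds.
BUCKETS = [
    (0.05, 600.0),
    (0.1, 850.0),
    (0.25, 960.0),
    (0.5, 995.0),
    (1.0, 1000.0),
    (math.inf, 1000.0),
]

LIMIT_SECONDS = 0.5  # wall-clock limit named in the table above
p99 = estimate_quantile(0.99, BUCKETS)
near_limit = p99 >= 0.8 * LIMIT_SECONDS  # assumed warning margin
```

With these sample buckets the 99th-percentile rank (990 of 1000) lands in the 0.25 to 0.5 s bucket, so the estimate is 0.25 + 0.25 × 30/35, about 0.464 s, which is close enough to the limit to warrant attention. In a running stack the same estimate is normally computed by Prometheus itself with `histogram_quantile` over the `_bucket` series.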