Observability
Who is this page for?
Operators running GAME and contributors adding instrumentation. Pairs with Operations (how to run the stack) and Configuration Reference (every knob).
The four signals
GAME exposes the usual observability signals plus a domain-specific one (strategy execution traces):
| Signal | Where |
|---|---|
| Metrics | Prometheus |
| Logs | Structured stdout (JSON in prod/stage) + persisted |
| Errors | Sentry (when …) |
| Traces (domain) | Sampled |
Metrics (Prometheus)
When METRICS_ENABLED=true (the default), the app mounts
prometheus_fastapi_instrumentator at /metrics. It is wired after
CORS but before the routers, so it observes every request that passes
through the middleware stack yet does not sit behind router-level auth.
In other words, /metrics itself is unguarded at the app level and must be
protected at the ingress (or disabled) in production.
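The wiring order described above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of app/main.py: the `metrics_enabled` helper, the set of accepted falsy strings, and the wildcard CORS policy are hypothetical, while the `Instrumentator` options shown are the library's own switches for grouping status codes, ignoring untemplated paths, and excluding handlers.

```python
import os
from typing import Mapping


def metrics_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical flag parser: METRICS_ENABLED defaults to true."""
    value = env.get("METRICS_ENABLED", "true").strip().lower()
    return value not in ("false", "0", "no", "off")


def create_app():
    # Imported lazily so the flag helper above has no third-party dependencies.
    from fastapi import FastAPI
    from fastapi.middleware.cors import CORSMiddleware
    from prometheus_fastapi_instrumentator import Instrumentator

    app = FastAPI()

    # 1. CORS first (placeholder policy, assumed for the sketch).
    app.add_middleware(CORSMiddleware, allow_origins=["*"])

    # 2. Metrics next: instrument every request and expose /metrics.
    if metrics_enabled():
        Instrumentator(
            should_group_status_codes=True,   # 2xx / 4xx / 5xx classes
            should_ignore_untemplated=True,   # skip paths with no route template
            excluded_handlers=["/metrics"],   # do not measure the scrape itself
        ).instrument(app).expose(app, endpoint="/metrics", include_in_schema=False)

    # 3. Routers last; their auth dependencies never apply to /metrics.
    # app.include_router(...)
    return app
```

Because nothing in this sketch guards /metrics, the protection has to come from outside the app, for example an ingress rule that only admits the Prometheus scraper.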
You get out of the box:
- HTTP metrics - request counts, durations, and status classes (status codes are grouped; untemplated paths and /metrics are excluded).
- DSL metrics - custom metrics defined in app/engine/dsl_metrics.py, which live in the default prometheus_client registry and are therefore exported automatically:
| Metric | Meaning |
|---|---|
| | Histogram of custom-strategy execution wall-clock. |
| | AST nodes visited per run (cost signal). |
| | Failed strategy executions, by error type. |
| | Execution-log rows dropped because the persistence queue was full (see below). A non-zero value means the sink is saturated, not that scoring is at risk. |
The bundled Compose stack ships a pre-configured Prometheus that scrapes these without extra wiring.
Logging
Logging is configured at startup (app/main.py):
- Format - plain text in dev; structured JSON in prod/stage (via python-json-logger), with renamed fields (timestamp/level/logger) ready for ingestion.
- Level - LOG_LEVEL env var (default INFO).
- Scope - root plus the uvicorn/gunicorn loggers, all to stdout (so a container platform collects them).
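A minimal sketch of this setup, using only the standard library. The `JsonFormatter` class here is a stand-in for python-json-logger, not the formatter the app actually uses, and the `env` argument and the exact list of server logger names are assumptions made for illustration.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Stdlib stand-in for python-json-logger: one JSON object per record."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),  # renamed from asctime
            "level": record.levelname,             # renamed from levelname
            "logger": record.name,                 # renamed from name
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(env: str, level: str = "INFO") -> logging.Handler:
    """Send root and server loggers to one stdout handler."""
    handler = logging.StreamHandler(sys.stdout)
    if env in ("prod", "stage"):
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )

    # "" is the root logger; the server logger names are assumed.
    for name in ("", "uvicorn", "uvicorn.error", "uvicorn.access",
                 "gunicorn.error", "gunicorn.access"):
        logger = logging.getLogger(name)
        logger.handlers = [handler]
        logger.setLevel(level.upper())
        logger.propagate = False  # avoid emitting the same record twice
    return handler
```

At startup this would be called as something like `configure_logging(env, os.environ.get("LOG_LEVEL", "INFO"))`, where `env` comes from whatever setting names the deployment environment.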
On top of stdout logging, the audit trail (AuditLogger /
app/util/add_log.py) writes structured Logs rows tagged with the
module, level, message, api_key, oauth_user_id, and a correlation id -
so a request can be reconstructed from the database, not just the log stream.
The “Network Error” trap
A dashboard “Network Error” with no HTTP status is almost always a backend
500 whose body the browser dropped. The middleware ordering ensures the
500 does carry CORS headers; check the API logs
(docker logs GAME_API_DEV) for the real traceback. See
Architecture.
Error tracking (Sentry)
Set SENTRY_DSN to enable Sentry. Configuration (app/main.py):
- SENTRY_ENVIRONMENT and SENTRY_RELEASE tag events.
- send_default_pii=True and traces_sample_rate=1.0 are set; continuous profiling auto-starts. Review these for your privacy/cost posture before enabling in production - full-rate tracing and PII capture are convenient in staging but may be too much at scale.
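One way to make that review concrete is to drive the two sensitive settings from the environment instead of hard-coding them. The sketch below is a suggestion, not what app/main.py does today: SENTRY_SEND_PII and SENTRY_TRACES_SAMPLE_RATE are hypothetical variable names, and the conservative defaults are assumptions. The keyword arguments passed to `sentry_sdk.init` are standard SDK options.

```python
import os
from typing import Any, Mapping, Optional


def sentry_options(env: Mapping[str, str]) -> Optional[dict[str, Any]]:
    """Build sentry_sdk.init() kwargs, or None when Sentry is disabled."""
    dsn = env.get("SENTRY_DSN")
    if not dsn:
        return None
    return {
        "dsn": dsn,
        "environment": env.get("SENTRY_ENVIRONMENT"),
        "release": env.get("SENTRY_RELEASE"),
        # Hypothetical knobs with conservative defaults (assumed names).
        "send_default_pii": env.get("SENTRY_SEND_PII", "false").lower() == "true",
        "traces_sample_rate": float(env.get("SENTRY_TRACES_SAMPLE_RATE", "0.1")),
    }


def init_sentry(env: Mapping[str, str] = os.environ) -> bool:
    """Initialise Sentry if a DSN is configured; return whether it was enabled."""
    options = sentry_options(env)
    if options is None:
        return False
    import sentry_sdk  # imported lazily so the app starts without the SDK

    sentry_sdk.init(**options)
    return True
```

With this shape, staging can opt back into full-rate tracing and PII capture through its environment, while production keeps the cheaper defaults.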
Strategy execution traces
The domain-specific signal. Every production run of a custom strategy is
handled by the singleton DslExecutionObserver:
- It emits the DSL Prometheus metrics above.
- It persists a StrategyExecutionLog row on every error, and on successful runs with probability DSL_EXECUTION_LOG_SAMPLE_RATE (default 0.05 = 5%). Errors are always kept regardless of the rate.
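The persistence rule amounts to a small decision function. The sketch below is illustrative only: the function name, the `"error"` status string, and the injectable `rng` parameter are assumptions, not the observer's actual interface.

```python
import random
from typing import Callable


def should_persist(
    status: str,
    sample_rate: float = 0.05,                 # DSL_EXECUTION_LOG_SAMPLE_RATE
    rng: Callable[[], float] = random.random,  # injectable for deterministic tests
) -> bool:
    """Keep every error; keep successes with probability sample_rate."""
    if status == "error":  # assumed status value
        return True
    # random.random() returns a float in [0.0, 1.0), so a rate of 0.0 keeps
    # no successful runs and a rate of 1.0 keeps all of them.
    return rng() < sample_rate
```

At the default rate, roughly one successful run in twenty leaves a row, while a failing strategy leaves one for every run.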
A persisted row carries status, latency, node count, caseName, error code,
and a bounded node-by-node trace (DSL_EXECUTION_LOG_TRACE_LIMIT,
default 200 entries; tail-truncated because the early nodes usually explain
why a rule matched).
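Tail truncation here means the end of the trace is cut and the first entries survive. A minimal sketch, assuming the trace is a list of per-node dictionaries; the function name and the returned `truncated` flag are hypothetical:

```python
from typing import Any


def bound_trace(
    trace: list[dict[str, Any]],
    limit: int = 200,  # DSL_EXECUTION_LOG_TRACE_LIMIT
) -> tuple[list[dict[str, Any]], bool]:
    """Keep the first `limit` entries and report whether anything was cut."""
    truncated = len(trace) > limit
    return trace[:limit], truncated
```

Returning a flag alongside the bounded list lets a reader of the row tell a complete trace from one that stopped at the limit.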
Off the hot path by design
The DB write does not block scoring. The observer enqueues the row onto a
bounded in-process queue (DSL_EXECUTION_LOG_QUEUE_MAXSIZE, default 1000)
drained by a background worker:
- Scoring pays only the enqueue, never the DB round-trip.
- If the database falls behind and the queue fills, rows are dropped (and counted by dsl_execution_log_dropped_total) rather than applying backpressure to scoring.
- On graceful shutdown the lifespan hook flushes the queue (observer.aclose()) so buffered rows are not lost.
So a non-zero drop rate is an alert that the sink is saturated - lower the sample rate, increase the queue size, or speed up the DB - but scoring itself is never the bottleneck.
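The enqueue, drop, and flush behaviour can be sketched with an `asyncio.Queue`. This is an illustration of the pattern, not the DslExecutionObserver itself: the class name, the `write_row` callback, and the plain integer counters (standing in for the Prometheus counter) are assumptions.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class ExecutionLogSink:
    """Bounded, non-blocking buffer in front of a slow writer."""

    def __init__(
        self,
        write_row: Callable[[dict[str, Any]], Awaitable[None]],
        maxsize: int = 1000,  # DSL_EXECUTION_LOG_QUEUE_MAXSIZE
    ) -> None:
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self._write_row = write_row
        self._worker: Optional[asyncio.Task] = None
        self.dropped = 0  # stand-in for dsl_execution_log_dropped_total
        self.failed = 0   # rows whose write raised

    def start(self) -> None:
        """Start the background worker; call from within a running loop."""
        self._worker = asyncio.create_task(self._drain())

    def submit(self, row: dict[str, Any]) -> bool:
        """Hot path: never awaits. Returns False if the row was dropped."""
        try:
            self._queue.put_nowait(row)
            return True
        except asyncio.QueueFull:
            self.dropped += 1
            return False

    async def _drain(self) -> None:
        while True:
            row = await self._queue.get()
            try:
                await self._write_row(row)
            except Exception:
                self.failed += 1  # keep draining after a failed write
            finally:
                self._queue.task_done()

    async def aclose(self) -> None:
        """Flush buffered rows, then stop the worker."""
        if self._worker is None:
            return
        await self._queue.join()
        self._worker.cancel()
        try:
            await self._worker
        except asyncio.CancelledError:
            pass
```

The key design choice is `put_nowait`: the caller either hands the row off immediately or loses it, so a slow database can cost audit rows but never scoring latency.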
Why both metrics and traces? Metrics tell you a strategy got slow or started erroring in aggregate; the sampled traces let the strategy author and the on-call engineer look back weeks later at which rule did what on a specific run, without replaying production traffic.
KPIs & operational telemetry
| Source | Content |
|---|---|
| | Daily rollups: total requests, success/error rate, average latency, active users, retention, average interactions per user. |
| | Per-request records (endpoint, status, response time, type). |
| | Periodic uptime samples. |
Surfaced through the dashboard and KPI endpoints:
GET /api/v1/kpi/health_check
GET /api/v1/dashboard/summary
GET /api/v1/dashboard/summary/logs
GET /api/v1/strategies/custom/{id}/metrics # per-strategy aggregates
GET /api/v1/strategies/custom/compare # A/B comparison
What to alert on
| Signal | Why it matters |
|---|---|
| | A published strategy is failing - users aren’t being scored as intended. |
| | Strategies are approaching the wall-clock limit; some events may be rejected. |
| | The trace sink is saturated; you’re losing audit visibility (scoring is fine). |
| HTTP | Backend errors; correlate with Sentry and the … |