============= Observability ============= .. admonition:: Who is this page for? :class: note Operators running GAME and contributors adding instrumentation. Pairs with :doc:`operations` (how to run the stack) and :doc:`configuration` (every knob). The four signals ================ GAME exposes the usual observability signals plus a domain-specific one (strategy execution traces): .. list-table:: :header-rows: 1 :widths: 24 76 * - Signal - Where * - **Metrics** - Prometheus ``/metrics`` (HTTP + DSL counters). * - **Logs** - Structured stdout (JSON in prod/stage) + persisted ``Logs`` audit trail. * - **Errors** - Sentry (when ``SENTRY_DSN`` is set). * - **Traces (domain)** - Sampled ``StrategyExecutionLog`` rows for custom-strategy runs. Metrics (Prometheus) ==================== When ``METRICS_ENABLED=true`` (the default), the app mounts ``prometheus_fastapi_instrumentator`` at ``/metrics``. It is wired *after* CORS but *before* the routers, so it observes every request middleware yet does not sit behind router-level auth - i.e. ``/metrics`` itself is unguarded at the app level and must be protected at the ingress (or disabled) in production. You get out of the box: * **HTTP metrics** - request counts, durations, and status classes (status codes are grouped; untemplated paths and ``/metrics`` are excluded). * **DSL metrics** - custom counters defined in ``app/engine/dsl_metrics.py``, which live in the default ``prometheus_client`` registry and are therefore exported automatically: .. list-table:: :header-rows: 1 :widths: 46 54 * - Metric - Meaning * - ``dsl_execution_duration_seconds`` - Histogram of custom-strategy execution wall-clock. * - ``dsl_execution_nodes_total`` - AST nodes visited per run (cost signal). * - ``dsl_execution_errors_total`` - Failed strategy executions, by error type. * - ``dsl_execution_log_dropped_total`` - Execution-log rows dropped because the persistence queue was full (see below). **Non-zero = the sink is saturated**, not that scoring is at risk. The bundled Compose stack ships a pre-configured Prometheus that scrapes these without extra wiring. Logging ======= Logging is configured at startup (``app/main.py``): * **Format** - plain text in ``dev``; structured **JSON** in ``prod``/``stage`` (via ``python-json-logger``), with renamed fields (``timestamp``/``level``/``logger``) ready for ingestion. * **Level** - ``LOG_LEVEL`` env var (default ``INFO``). * **Scope** - root plus the uvicorn/gunicorn loggers, all to stdout (so a container platform collects them). On top of stdout logging, the **audit trail** (``AuditLogger`` / ``app/util/add_log.py``) writes structured ``Logs`` rows tagged with the module, level, message, ``api_key``, ``oauth_user_id``, and a correlation id - so a request can be reconstructed from the database, not just the log stream. .. admonition:: The "Network Error" trap :class: warning A dashboard *"Network Error"* with no HTTP status is almost always a backend ``500`` whose body the browser dropped. The middleware ordering ensures the ``500`` *does* carry CORS headers; check the API logs (``docker logs GAME_API_DEV``) for the real traceback. See :doc:`architecture`. Error tracking (Sentry) ======================= Set ``SENTRY_DSN`` to enable Sentry. Configuration (``app/main.py``): * ``SENTRY_ENVIRONMENT`` and ``SENTRY_RELEASE`` tag events. * ``send_default_pii=True`` and ``traces_sample_rate=1.0`` are set; continuous profiling auto-starts. **Review these for your privacy/cost posture** before enabling in production - full-rate tracing and PII capture are convenient in staging but may be too much at scale. Strategy execution traces ========================= The domain-specific signal. Every production run of a **custom** strategy is handled by the singleton ``DslExecutionObserver``: #. It emits the DSL Prometheus metrics above. #. It persists a ``StrategyExecutionLog`` row **on every error**, and on **successful** runs with probability ``DSL_EXECUTION_LOG_SAMPLE_RATE`` (default ``0.05`` = 5%). Errors are always kept regardless of the rate. A persisted row carries status, latency, node count, ``caseName``, error code, and a **bounded** node-by-node trace (``DSL_EXECUTION_LOG_TRACE_LIMIT``, default 200 entries; tail-truncated because the early nodes usually explain *why* a rule matched). Off the hot-path by design -------------------------- The DB write does **not** block scoring. The observer enqueues the row onto a bounded in-process queue (``DSL_EXECUTION_LOG_QUEUE_MAXSIZE``, default 1000) drained by a background worker: * Scoring pays only the **enqueue**, never the DB round-trip. * If the database falls behind and the queue fills, rows are **dropped** (and counted by ``dsl_execution_log_dropped_total``) rather than applying backpressure to scoring. * On graceful shutdown the lifespan hook **flushes** the queue (``observer.aclose()``) so buffered rows are not lost. So a non-zero drop rate is an alert that the *sink* is saturated - increase the sample budget's headroom or the queue size, or speed up the DB - but scoring itself is never the bottleneck. Why both metrics and traces? Metrics tell you a strategy got slow or started erroring *in aggregate*; the sampled traces let the strategy author and the on-call engineer look back weeks later at *which rule did what on a specific run*, without replaying production traffic. KPIs & operational telemetry ============================ .. list-table:: :header-rows: 1 :widths: 26 74 * - Source - Content * - ``KpiMetrics`` - Daily rollups: total requests, success/error rate, average latency, active users, retention, average interactions per user. * - ``ApiRequests`` - Per-request records (endpoint, status, response time, type). * - ``UptimeLogs`` - Periodic uptime samples. Surfaced through the dashboard and KPI endpoints: .. code-block:: bash GET /api/v1/kpi/health_check GET /api/v1/dashboard/summary GET /api/v1/dashboard/summary/logs GET /api/v1/strategies/custom/{id}/metrics # per-strategy aggregates GET /api/v1/strategies/custom/compare # A/B comparison What to alert on ================ .. list-table:: :header-rows: 1 :widths: 40 60 * - Signal - Why it matters * - ``dsl_execution_errors_total`` rising - A published strategy is failing - users aren't being scored as intended. * - ``dsl_execution_duration_seconds`` p99 near 500 ms - Strategies are approaching the wall-clock limit; some events may be rejected. * - ``dsl_execution_log_dropped_total`` > 0 - The trace sink is saturated; you're losing audit visibility (scoring is fine). * - HTTP ``5xx`` rate - Backend errors; correlate with Sentry and the ``Logs`` audit trail.