app.engine.dsl_metrics module
Prometheus metrics for the DSL strategy engine.
Three metrics, scraped by the standard /metrics endpoint exposed by
prometheus_fastapi_instrumentator (already in pyproject.toml):
- dsl_execution_duration_seconds – histogram of wall-clock time per DslStrategy.calculate_points call. The bucket layout zooms in on the SLO at 250ms so the p99 alert rule is computed off a realistic bucket boundary (and not interpolated between the 0.1 and 1.0 buckets, which would mis-fire).
- dsl_execution_nodes_total – counter incremented by the number of AST nodes the interpreter visited. Lets ops correlate cost with rule complexity, separately from time.
- dsl_execution_errors_total – per-realm error counter. Labels carry the error code (DSL_TIMEOUT, DSL_ARITH_DIV_BY_ZERO, etc.) so a noisy realm + code combination jumps out in the dashboard.
Labels are intentionally minimal:
- realmId lets the on-call team page only the affected tenant. Cardinality is bounded by the number of realms (small).
- strategy_type (DSL_FULL/DSL_EXTEND) – two values total, so we can split the latency dashboard.
- status (ok/error/timeout/limit) on the duration histogram, so a healthy p99 isn’t skewed by long error paths.
We deliberately do not label by strategyId. Strategy UUIDs are
high-cardinality (one per realm × name × version) and would explode
Prometheus’ index. The persisted StrategyExecutionLog covers the
per-strategy view; metrics stay aggregate.
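To make the cardinality argument concrete, here is a rough upper-bound estimate of how many time series the duration histogram could expose with and without a per-strategy label. The realm count, bucket count, and strategies-per-realm figure are invented for illustration and are not taken from this module.

```python
def histogram_series(label_combinations: int, finite_buckets: int) -> int:
    """Upper bound on the series one Prometheus histogram exposes.

    Each label combination yields one series per finite bucket, one for the
    +Inf bucket, plus the _sum and _count series.
    """
    return label_combinations * (finite_buckets + 3)


# Illustrative numbers only (assumptions, not values from the module).
REALMS = 50
STRATEGY_TYPES = 2           # DSL_FULL / DSL_EXTEND
STATUSES = 4                 # ok / error / timeout / limit
FINITE_BUCKETS = 12          # hypothetical bucket count
STRATEGIES_PER_REALM = 200   # hypothetical: name x version per realm

# Labels as documented: realm x strategy_type x status.
aggregate = histogram_series(REALMS * STRATEGY_TYPES * STATUSES, FINITE_BUCKETS)

# Hypothetical alternative that adds a per-strategy label. The strategy type
# is implied by the strategy, so it no longer multiplies the count.
per_strategy = histogram_series(
    REALMS * STRATEGIES_PER_REALM * STATUSES, FINITE_BUCKETS
)

print(aggregate)      # 6000
print(per_strategy)   # 600000
```

Under these assumed numbers the per-strategy label multiplies the series count by 100, and the factor grows with every new strategy version.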
The histogram buckets matter for the alert rule. histogram_quantile
linearly interpolates between bucket boundaries, so a bucket at 0.25
is required for an accurate p99 alert at 250ms.
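The effect of the 0.25 boundary can be illustrated with a simplified re-implementation of the linear interpolation that histogram_quantile performs. This is a sketch for intuition, not Prometheus' actual code, and the request counts are invented: 985 of 1000 calls finish within 0.1 s and the remaining 15 finish at about 0.2 s, so every call is inside the SLO.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the +Inf bucket. The lowest bucket is assumed to start at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Quantile lands in +Inf: report the highest finite boundary.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (cumulative - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Same traffic, two bucket layouts (hypothetical layouts for illustration).
coarse = [(0.1, 985), (1.0, 1000), (math.inf, 1000)]
fine = [(0.1, 985), (0.25, 1000), (1.0, 1000), (math.inf, 1000)]

p99_coarse = estimate_quantile(0.99, coarse)
p99_fine = estimate_quantile(0.99, fine)

print(round(p99_coarse, 3))  # about 0.4: looks like an SLO breach
print(round(p99_fine, 3))    # about 0.15: correctly under 0.25
```

With only the 0.1 and 1.0 boundaries, the estimate is spread across a 0.9 s wide bucket and crosses the 250ms threshold even though no call was that slow.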
- app.engine.dsl_metrics.observe(*, realm, strategy_type, status, duration_seconds, nodes_executed, error_code=None)
Single emit point so the observer in DslStrategy doesn’t duplicate the label-coercion logic.
- Parameters:
realm (str | None)
strategy_type (str)
status (str)
duration_seconds (float)
nodes_executed (int)
error_code (str | None)
- Return type:
None
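A call site might look roughly like the sketch below. Only the keyword arguments mirror the documented signature; the DslError class, its status attribute, and the timed_call helper are hypothetical names introduced for illustration. The emit callable is injected so the example runs without the real module; in the application it would presumably be dsl_metrics.observe.

```python
import time
from typing import Callable, Optional


class DslError(Exception):
    """Hypothetical engine error carrying a code such as DSL_TIMEOUT."""

    def __init__(self, code: str, status: str = "error") -> None:
        super().__init__(code)
        self.code = code
        self.status = status  # assumed to be one of: error / timeout / limit


def timed_call(run: Callable[[], int], emit: Callable[..., None], *,
               realm: Optional[str], strategy_type: str) -> None:
    """Run one evaluation and report it through a single emit call."""
    start = time.perf_counter()
    status, error_code, nodes = "ok", None, 0
    try:
        nodes = run()  # hypothetical: returns the number of AST nodes visited
    except DslError as exc:
        status, error_code = exc.status, exc.code
        raise
    finally:
        # Exactly one emit per call, whether it succeeded or failed.
        emit(realm=realm, strategy_type=strategy_type, status=status,
             duration_seconds=time.perf_counter() - start,
             nodes_executed=nodes, error_code=error_code)


# Stand-in for observe() that records what it was given.
calls = []


def record(**kwargs):
    calls.append(kwargs)


def failing():
    raise DslError("DSL_TIMEOUT", status="timeout")


timed_call(lambda: 42, record, realm="acme", strategy_type="DSL_FULL")
try:
    timed_call(failing, record, realm="acme", strategy_type="DSL_EXTEND")
except DslError:
    pass

print([c["status"] for c in calls])  # ['ok', 'timeout']
```

Emitting from a finally block keeps the success and failure paths on the same code, which is the kind of duplication a single emit point is meant to avoid.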
- app.engine.dsl_metrics.observe_log_dropped(*, realm, strategy_type)
Record that one StrategyExecutionLog row was dropped because the background persistence queue was full. Kept separate from observe() so the hot path only touches it on the rare drop.
- Parameters:
realm (str | None)
strategy_type (str)
- Return type:
None
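The drop path could be wired up along these lines. The page does not say what kind of queue backs the persistence worker, so this sketch assumes a bounded standard-library queue.Queue; enqueue_log and the row contents are hypothetical, and on_drop stands in for observe_log_dropped.

```python
import queue
from typing import Callable, Optional


def enqueue_log(log_queue: "queue.Queue[dict]", row: dict,
                on_drop: Callable[..., None], *,
                realm: Optional[str], strategy_type: str) -> bool:
    """Enqueue a log row without blocking; report a drop if the queue is full."""
    try:
        log_queue.put_nowait(row)
        return True
    except queue.Full:
        # Rare path: the metric is only touched when a row is actually lost.
        on_drop(realm=realm, strategy_type=strategy_type)
        return False


# Stand-in for observe_log_dropped() that records its arguments.
drops = []


def record_drop(**kwargs):
    drops.append(kwargs)


q = queue.Queue(maxsize=1)  # tiny bound so the second put overflows
first = enqueue_log(q, {"id": 1}, record_drop,
                    realm="acme", strategy_type="DSL_FULL")
second = enqueue_log(q, {"id": 2}, record_drop,
                     realm="acme", strategy_type="DSL_FULL")

print(first, second, len(drops))  # True False 1
```

Because the evaluation never blocks on a full queue, the drop counter is the only signal that log rows are being lost, which makes it a natural alerting target.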