app.engine.dsl_metrics module

Prometheus metrics for the DSL strategy engine.

Three metrics, scraped by the standard /metrics endpoint exposed by prometheus_fastapi_instrumentator (already in pyproject.toml):

  • dsl_execution_duration_seconds – histogram of wall-clock time per DslStrategy.calculate_points call. The bucket layout is concentrated around the 250ms SLO so the p99 alert rule is evaluated against a real bucket boundary at 0.25 rather than interpolated across a wide 0.1 to 1.0 bucket, which would misfire.

  • dsl_execution_nodes_total – counter incremented by the number of AST nodes the interpreter visited. Lets ops correlate cost with rule complexity, separately from time.

  • dsl_execution_errors_total – per-realm error counter. Labels carry the error code (DSL_TIMEOUT, DSL_ARITH_DIV_BY_ZERO, etc.) so a noisy realm + code combination jumps out in the dashboard.

Labels are intentionally minimal:

  • realmId lets the on-call team page only the affected tenant. Cardinality is bounded by the number of realms (small).

  • strategy_type (DSL_FULL / DSL_EXTEND) – two values total, so we can split the latency dashboard.

  • status (ok / error / timeout / limit) on the duration histogram so the p99 of healthy calls isn’t inflated by long error paths.

We deliberately do not label by strategyId. Strategy UUIDs are high-cardinality (one per realm × name × version) and would explode Prometheus’ index. The persisted StrategyExecutionLog covers the per-strategy view; metrics stay aggregate.

The histogram buckets matter for the alert rule. histogram_quantile linearly interpolates between bucket boundaries, so a bucket at 0.25 is required for an accurate p99 alert at 250ms.
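The effect of the 0.25 boundary can be checked with a small worked example. The sketch below is a simplified re-implementation of the linear interpolation that histogram_quantile performs, not Prometheus' own code, and the observation counts are invented for illustration: 985 calls finish within 0.1s and the remaining 15 finish between 0.2s and 0.24s, so every call meets the SLO.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified model of Prometheus' linear interpolation. `buckets` must be
    sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in +Inf: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical data: 985 calls <= 0.1s, 15 calls between 0.2s and 0.24s.
coarse = [(0.1, 985), (1.0, 1000), (math.inf, 1000)]
fine = [(0.1, 985), (0.25, 1000), (1.0, 1000), (math.inf, 1000)]

print(histogram_quantile(0.99, coarse))  # roughly 0.40: alert fires falsely
print(histogram_quantile(0.99, fine))    # roughly 0.15: stays under the SLO
```

Without the 0.25 boundary, the 15 slower calls are assumed to be spread across the whole 0.1 to 1.0 range, which pushes the estimated p99 above 250ms even though no call exceeded it.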

app.engine.dsl_metrics.observe(*, realm, strategy_type, status, duration_seconds, nodes_executed, error_code=None)[source]

Single emit point so the observer in DslStrategy doesn’t duplicate the label-coercion logic.

Parameters:
  • realm (str | None)

  • strategy_type (str)

  • status (str)

  • duration_seconds (float)

  • nodes_executed (int)

  • error_code (str | None)

Return type:

None
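For orientation, a single emit point of this shape might look roughly like the sketch below. It is an illustration built on prometheus_client, not the module's actual source. The bucket values, the "unknown" fallback for a missing realm, the label sets on the two counters, and the private registry are all assumptions; only the metric names, the label names, and the observe() signature come from the text above.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# Private registry so the sketch can run standalone (assumption; the real
# module presumably registers on the default registry scraped at /metrics).
REGISTRY = CollectorRegistry()

# Assumed layout; the source only states that a 0.25 boundary exists.
BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.5, 1.0, 2.5)

DURATION = Histogram(
    "dsl_execution_duration_seconds",
    "Wall-clock time per DslStrategy.calculate_points call.",
    labelnames=("realmId", "strategy_type", "status"),
    buckets=BUCKETS,
    registry=REGISTRY,
)
NODES = Counter(
    "dsl_execution_nodes_total",
    "AST nodes visited by the interpreter.",
    labelnames=("realmId", "strategy_type"),  # assumed label set
    registry=REGISTRY,
)
ERRORS = Counter(
    "dsl_execution_errors_total",
    "DSL execution errors by realm and error code.",
    labelnames=("realmId", "code"),  # assumed label set
    registry=REGISTRY,
)

def observe(*, realm, strategy_type, status, duration_seconds,
            nodes_executed, error_code=None):
    # Label coercion lives here only: a missing realm becomes a fixed
    # placeholder (hypothetical value) so label values are never None.
    realm_label = realm or "unknown"
    DURATION.labels(
        realmId=realm_label, strategy_type=strategy_type, status=status
    ).observe(duration_seconds)
    NODES.labels(realmId=realm_label, strategy_type=strategy_type).inc(nodes_executed)
    if error_code is not None:
        ERRORS.labels(realmId=realm_label, code=error_code).inc()

observe(realm=None, strategy_type="DSL_FULL", status="timeout",
        duration_seconds=0.3, nodes_executed=42, error_code="DSL_TIMEOUT")
```

Keyword-only arguments keep call sites readable and make it hard to swap status and strategy_type by accident, since both are plain strings.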

app.engine.dsl_metrics.observe_log_dropped(*, realm, strategy_type)[source]

Record that one StrategyExecutionLog row was dropped because the background persistence queue was full. Kept separate from observe() so the hot path only touches it on the rare drop.

Parameters:
  • realm (str | None)

  • strategy_type (str)

Return type:

None
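A caller might use observe_log_dropped along the lines of the sketch below. It assumes a bounded standard-library queue.Queue and a hypothetical helper named enqueue_log_row; the real persistence queue and its call site are not shown in this documentation and may differ (for example, an asyncio queue). The drop callback is passed in so the example runs without the metrics module.

```python
import queue

def enqueue_log_row(log_queue, row, *, realm, strategy_type, on_drop):
    """Try to enqueue a log row without blocking the hot path.

    Returns True if the row was queued, False if it was dropped.
    `on_drop` stands in for observe_log_dropped (hypothetical wiring).
    """
    try:
        log_queue.put_nowait(row)
        return True
    except queue.Full:
        # Only the rare drop path touches the metric.
        on_drop(realm=realm, strategy_type=strategy_type)
        return False

# Usage with a recording stub in place of the real metric function.
drops = []

def record_drop(*, realm, strategy_type):
    drops.append({"realm": realm, "strategy_type": strategy_type})

q = queue.Queue(maxsize=1)
first = enqueue_log_row(q, {"id": 1}, realm="r1",
                        strategy_type="DSL_FULL", on_drop=record_drop)
second = enqueue_log_row(q, {"id": 2}, realm="r1",
                         strategy_type="DSL_FULL", on_drop=record_drop)
```

The first row fits in the queue; the second finds it full, is dropped, and triggers exactly one callback.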