Self-observability¶

Ampora dogfoods OpenTelemetry — the same protocol it manages on the agents is the one it uses to expose its own internal state. Three signal types are exported:

Metrics via OTLP gRPC and a Prometheus scrape endpoint.
Traces via OTLP gRPC, with W3C Trace Context propagated across the dispatch backplane and the live-update bus.
Structured logs via stdout in JSON form, ready for any log shipper.

Configuration¶

OpenTelemetry:
  ServiceName: ampora-server
  OtlpEndpoint: http://otelcol.observability.svc.cluster.local:4317
  Headers:
    Authorization: "Bearer ..."        # optional, e.g. for vendor auth
  SamplingRatio: 0.05                  # 5 % of traces. Bump to 1.0 in dev.

Leaving OtlpEndpoint empty disables export: Ampora keeps its internal counters working, but nothing leaves the process. This is useful for air-gapped deployments or smoke tests.
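
As an illustration, a development profile and an air-gapped profile could look like the sketch below. It reuses only the keys from the sample configuration above. The `localhost` collector address is a hypothetical placeholder, and how Ampora layers per-environment configuration files is not covered here.

```yaml
# Development profile: trace every request and send it to a local collector.
# The localhost address is a placeholder, not an Ampora default.
OpenTelemetry:
  ServiceName: ampora-server
  OtlpEndpoint: http://localhost:4317
  SamplingRatio: 1.0
---
# Air-gapped or smoke-test profile: an empty endpoint turns export off,
# so internal counters keep working but nothing leaves the process.
OpenTelemetry:
  ServiceName: ampora-server
  OtlpEndpoint: ""
```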

Headline metrics¶

These are the metrics most likely to wake someone up; the full list is on Reference → Metrics & traces.

OpAMP¶

| Metric | Type | Use |
| --- | --- | --- |
| `ampora_opamp_active_sessions` | Gauge | Currently connected agents |
| `ampora_opamp_frames_received_total` | Counter | Inbound frame throughput |
| `ampora_opamp_frames_rejected_total` | Counter | Frames refused (oversized, malformed, capability mismatch) |
| `ampora_opamp_apply_duration_seconds` | Histogram | Time from server-push to agent-`APPLIED` |

Dispatch backplane¶

| Metric | Type | Use |
| --- | --- | --- |
| `ampora_dispatch_local_total` | Counter | Same-instance dispatches |
| `ampora_dispatch_forwarded_total` | Counter | Cross-instance dispatches |
| `ampora_dispatch_no_session_total` | Counter | Targeted agent had no live session; alert on growth |

Governance¶

| Metric | Type | Use |
| --- | --- | --- |
| `ampora_policy_evaluations_total{outcome}` | Counter | Allow/deny rate of policies |
| `ampora_policy_denied_total{policy}` | Counter | Per-policy denial rate |
| `ampora_audit_archived_total` / `_purged_total` | Counter | Retention sweep work |

Live update bus¶

| Metric | Type | Use |
| --- | --- | --- |
| `ampora_live_bus_events_total{type,direction}` | Counter | UI event traffic |
| `ampora_live_bus_dropped_total` | Counter | Subscriber overflow; alert if non-zero |
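
For dashboards, it can help to precompute a few of the headline metrics as Prometheus recording rules. The following is a minimal sketch, not something Ampora ships. The rule names are invented for illustration, the `_bucket` suffix assumes the histogram is exposed in the standard Prometheus histogram format, and the forwarded-share rule assumes every successful dispatch is counted as either local or forwarded, never both.

```yaml
groups:
  - name: ampora-headline            # hypothetical group name
    interval: 30s
    rules:
      # p95 of the time from server push to agent APPLIED, over 5 minutes.
      - record: ampora:opamp_apply_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(ampora_opamp_apply_duration_seconds_bucket[5m]))
          )
      # Frames refused per second across all instances.
      - record: ampora:opamp_frames_rejected:rate5m
        expr: |
          sum(rate(ampora_opamp_frames_rejected_total[5m]))
      # Share of dispatches that had to cross to a peer instance.
      - record: ampora:dispatch_forwarded:ratio_rate5m
        expr: |
          sum(rate(ampora_dispatch_forwarded_total[5m]))
          /
          (
            sum(rate(ampora_dispatch_local_total[5m]))
            + sum(rate(ampora_dispatch_forwarded_total[5m]))
          )
      # Denials per second, broken down by policy.
      - record: ampora:policy_denied:rate5m
        expr: |
          sum by (policy) (rate(ampora_policy_denied_total[5m]))
```

The ratio rule evaluates to NaN while no dispatches are happening, which most dashboards render as a gap rather than a zero.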

Tracing¶

Every request carries a W3C Trace Context. The interesting cross-cutting spans:

opamp.frame_received (server) → opamp.dispatch (peer instance) → opamp.frame_sent (peer instance) → opamp.apply_status (back). Connecting these gives you "operator clicked X → agent applied Y" as a single trace.
live_update.publish → live_update.consume (every operator subscriber) lets you measure UI propagation latency.
gitops.sync traces a single repo sweep.

The sampling ratio defaults to 5 % to keep data volume manageable. Raise it to 1.0 when reproducing a hard-to-pin-down issue.

Logging¶

JSON to stdout. Each line includes:

timestamp (UTC, ISO-8601 with millis),
level (Information / Warning / Error / Critical),
category (the .NET log category, e.g. Ampora.Fleet.Rollouts.RolloutService),
message,
trace_id and span_id when within an active trace,
tenant_id when within a tenant scope,
arbitrary structured properties.

Pipe into any log shipper. Common destinations: Loki (with the OpenTelemetry Collector's loki exporter), Elasticsearch, Datadog Logs.

Built-in dashboards¶

The deploy/dev/ folder ships:

prometheus.yml — a working Prometheus scrape config.
otel-collector-config.yaml — receives OTLP, exports to Prometheus and Jaeger.
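
For orientation, a collector configuration of that shape typically looks like the sketch below. This is not the shipped file. The Jaeger host name and the Prometheus exporter port are assumptions, traces are sent to Jaeger over OTLP (which current Jaeger versions accept natively), and the `prometheus` exporter requires the Collector contrib distribution.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317     # Ampora's OtlpEndpoint points here

processors:
  batch: {}                        # batch telemetry before exporting

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889         # assumed port; Prometheus scrapes the collector here
  otlp/jaeger:
    endpoint: jaeger:4317          # hypothetical Jaeger host
    tls:
      insecure: true               # plaintext, acceptable for local dev only

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```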

These are starting points, not production-grade configurations. For production, use your existing observability stack and point Ampora at it.
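
If your existing stack includes Prometheus, wiring Ampora's scrape endpoint into it amounts to adding a scrape job. The sketch below assumes the endpoint is served at the conventional `/metrics` path; the host names and port are placeholders. The job name is chosen to match the `up{job="ampora-web"}` alert listed later on this page.

```yaml
scrape_configs:
  - job_name: ampora-web             # matches the job label used in the up{} alert
    metrics_path: /metrics           # assumption: Prometheus default path
    scrape_interval: 15s
    static_configs:
      - targets:                     # placeholder host:port, one entry per instance
          - ampora-0.example.internal:8080
          - ampora-1.example.internal:8080
```

In a Kubernetes deployment you would more likely use service discovery than a static target list; the static form is shown only because it is the shortest complete example.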

A small Grafana dashboard JSON suitable for import is on Reference → Metrics & traces.

Alerts you should configure¶

At a minimum:

ampora_dispatch_no_session_total rate > 0 for 5 minutes: agents are missing dispatches.
ampora_opamp_apply_duration_seconds p95 > 30 s: agents are applying slowly. Either a large fleet is mid-rollout, or there is a network problem.
ampora_policy_denied_total{policy="default-deny"} rate > 0: your default-deny policy is firing in production, which indicates a config mismatch.
sum(up{job="ampora-web"}) < replicas, where replicas is the number of instances you expect: instances are missing.
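
Expressed as a Prometheus rules file, the list above might look like the following sketch. The alert names, severity labels, and `for` durations are illustrative choices rather than Ampora defaults; the `_bucket` suffix assumes the standard Prometheus histogram exposition; and `3` stands in for your expected replica count. Because `up` is reported per scrape target as 0 or 1, the instance check sums it before comparing. The last rule is taken from the live update bus table, which asks for an alert on any dropped event.

```yaml
groups:
  - name: ampora-minimum-alerts      # hypothetical group name
    rules:
      - alert: AmporaDispatchNoSession
        expr: |
          sum(rate(ampora_dispatch_no_session_total[5m])) > 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Agents are missing dispatches.

      - alert: AmporaApplySlow
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(ampora_opamp_apply_duration_seconds_bucket[5m]))
          ) > 30
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: p95 apply duration is above 30 s.

      - alert: AmporaDefaultDenyFiring
        expr: |
          sum(rate(ampora_policy_denied_total{policy="default-deny"}[5m])) > 0
        for: 5m
        labels:
          severity: warn
        annotations:
          summary: The default-deny policy is firing; check for a config mismatch.

      - alert: AmporaInstancesMissing
        # Replace 3 with the number of replicas you run.
        expr: |
          sum(up{job="ampora-web"}) < 3
        for: 2m
        labels:
          severity: page
        annotations:
          summary: Fewer Ampora instances are up than expected.

      - alert: AmporaLiveBusDropping
        expr: |
          sum(increase(ampora_live_bus_dropped_total[10m])) > 0
        labels:
          severity: warn
        annotations:
          summary: Live update bus subscribers are overflowing.
```

If your targets come from service discovery, a fully absent job produces no `up` series at all, so consider pairing the instance rule with an `absent(up{job="ampora-web"})` check.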

What is not exported¶

Ampora deliberately does not export:

agent telemetry payloads (those are not the management plane's data),
raw configuration content (might contain secrets in environment variables — exposing it in metric labels is wrong),
per-user activity (that is what the audit log is for, with proper retention; exporting to traces would create a parallel uncontrolled audit stream).