Self-observability
Ampora dogfoods OpenTelemetry — the same protocol it manages on the agents is the one it uses to expose its own internal state. Three signal types are exported:
- Metrics via OTLP gRPC and a Prometheus scrape endpoint.
- Traces via OTLP gRPC, with W3C Trace Context propagated across the dispatch backplane and the live-update bus.
- Structured logs via stdout in JSON form, ready for any log shipper.
Configuration
OpenTelemetry:
  ServiceName: ampora-server
  OtlpEndpoint: http://otelcol.observability.svc.cluster.local:4317
  Headers:
    Authorization: "Bearer ..." # optional, e.g. for vendor auth
  SamplingRatio: 0.05 # 5 % of traces. Bump to 1.0 in dev.
Leaving OtlpEndpoint empty disables export: Ampora keeps its internal counters working, but nothing leaves the process. This is useful for air-gapped deployments or smoke tests.
Headline metrics
These are the metrics most likely to wake someone up; the full list is on Reference → Metrics & traces.
OpAMP
| Metric | Type | Use |
|---|---|---|
| ampora_opamp_active_sessions | Gauge | Currently connected agents |
| ampora_opamp_frames_received_total | Counter | Inbound frame throughput |
| ampora_opamp_frames_rejected_total | Counter | Frames refused (oversized, malformed, capability mismatch) |
| ampora_opamp_apply_duration_seconds | Histogram | Time from server-push to agent-APPLIED |
Dispatch backplane
| Metric | Type | Use |
|---|---|---|
| ampora_dispatch_local_total | Counter | Same-instance dispatches |
| ampora_dispatch_forwarded_total | Counter | Cross-instance dispatches |
| ampora_dispatch_no_session_total | Counter | Targeted agent had no live session; alert on growth |
Governance
| Metric | Type | Use |
|---|---|---|
| ampora_policy_evaluations_total{outcome} | Counter | Allow/deny rate of policies |
| ampora_policy_denied_total{policy} | Counter | Per-policy denial rate |
| ampora_audit_archived_total / _purged_total | Counter | Retention sweep work |
Live update bus
| Metric | Type | Use |
|---|---|---|
| ampora_live_bus_events_total{type,direction} | Counter | UI event traffic |
| ampora_live_bus_dropped_total | Counter | Subscriber overflow; alert if non-zero |
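The "alert if non-zero" check can be tried by hand against a scrape of the Prometheus endpoint. The following is a minimal sketch, not Ampora code: the scrape text, HELP string, and label values are invented for illustration, and the parser ignores optional timestamps and label values that contain spaces.

```python
from typing import Dict

# Hypothetical scrape output; the HELP text and label values are made up.
SCRAPE = """\
# HELP ampora_live_bus_dropped_total Events dropped on subscriber overflow
# TYPE ampora_live_bus_dropped_total counter
ampora_live_bus_dropped_total 3
ampora_live_bus_events_total{type="rollout",direction="out"} 120
ampora_opamp_active_sessions 42
"""


def parse_samples(text: str) -> Dict[str, float]:
    """Map each series ('name' or 'name{labels}') to its sample value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes no trailing timestamp: the value is the last field.
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples


def dropped_events(samples: Dict[str, float]) -> float:
    """Sum every series whose metric name is ampora_live_bus_dropped_total."""
    return sum(
        value
        for series, value in samples.items()
        if series.split("{", 1)[0] == "ampora_live_bus_dropped_total"
    )
```

With the sample above, `dropped_events(parse_samples(SCRAPE))` is non-zero, which is the condition the table says to alert on.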
Tracing
Every request carries a W3C Trace Context. The interesting cross-cutting spans:
- opamp.frame_received (server) → opamp.dispatch (peer instance) → opamp.frame_sent (peer instance) → opamp.apply_status (back). Connecting these gives you "operator clicked X → agent applied Y" as a single trace.
- live_update.publish → live_update.consume (every operator subscriber) lets you measure UI propagation latency.
- gitops.sync traces a single repo sweep.
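For readers unfamiliar with W3C Trace Context, the sketch below shows how a `traceparent` value is parsed on one hop and re-emitted on the next, which is what lets spans on different instances join the same trace. It is an illustration of the header format only; the function and class names are hypothetical and say nothing about how Ampora implements propagation internally.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 hex chars, shared by every span in the trace
    span_id: str   # 16 hex chars, identifies the calling span
    sampled: bool  # bit 0 of the trace-flags byte


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's span context, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Build the header for an outgoing hop: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

Because the trace ID survives every hop while the span ID changes, a tracing backend can stitch a chain of spans into one trace.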
Sampling ratio defaults to 5 % to keep the data volume sane. Crank to 1.0 when reproducing a hard-to-pin-down issue.
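To make the ratio concrete, here is a minimal sketch of a trace-ID-based ratio decision, in the spirit of OpenTelemetry's ratio samplers. Whether Ampora samples exactly this way is an assumption; `should_sample` is a hypothetical name. The useful property is that the decision depends only on the trace ID, so every instance that sees the same trace reaches the same verdict.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide deterministically whether a trace is kept at the given ratio."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Treat the low 64 bits of the trace ID as a uniform random number.
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    # Keep the trace when that number falls below ratio * 2^64.
    threshold = int(ratio * (1 << 64))
    return low_64_bits < threshold
```

At 0.05 roughly one trace in twenty is kept; at 1.0 the threshold exceeds every possible value, so all traces are kept.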
Logging
JSON to stdout. Each line includes:
- timestamp (UTC, ISO-8601 with millis),
- level (Information / Warning / Error / Critical),
- category (the .NET log category, e.g. Ampora.Fleet.Rollouts.RolloutService),
- message,
- trace_id and span_id when within an active trace,
- tenant_id when within a tenant scope,
- arbitrary structured properties.
Pipe into any log shipper. Common destinations: Loki (with the OpenTelemetry Collector's loki exporter), Elasticsearch, Datadog Logs.
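Because trace_id is present on log lines written inside a trace, logs can be correlated with traces before they ever reach a shipper. The sketch below groups lines by trace_id. The sample lines are invented (including the Ampora.Startup category); only the field names come from the list above.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical log lines; values are made up for illustration.
SAMPLE_LINES = [
    '{"timestamp": "2024-05-01T12:00:00.123Z", "level": "Information", '
    '"category": "Ampora.Fleet.Rollouts.RolloutService", '
    '"message": "rollout started", '
    '"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"}',
    '{"timestamp": "2024-05-01T12:00:00.456Z", "level": "Warning", '
    '"category": "Ampora.Fleet.Rollouts.RolloutService", '
    '"message": "agent slow to apply", '
    '"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "b7ad6b7169203331"}',
    '{"timestamp": "2024-05-01T12:00:01.000Z", "level": "Information", '
    '"category": "Ampora.Startup", "message": "listening"}',
    "not json at all",
]


def group_by_trace(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Collect parsed log records under the trace_id they belong to."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise on stdout
        if not isinstance(record, dict):
            continue
        trace_id = record.get("trace_id")
        if trace_id:  # lines outside an active trace carry no trace_id
            grouped[trace_id].append(record)
    return dict(grouped)
```

In practice the same function can be fed `sys.stdin` to inspect a live process, for example when chasing one slow rollout by its trace ID.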
Built-in dashboards
The deploy/dev/ folder ships:
- prometheus.yml: a working Prometheus scrape config.
- otel-collector-config.yaml: receives OTLP, exports to Prometheus and Jaeger.
These are starting points, not production-grade. For production, use your existing observability stack and point Ampora at it.
A small Grafana dashboard JSON suitable for import is on Reference → Metrics & traces.
Alerts you should configure
At a minimum:
- ampora_dispatch_no_session_total rate > 0 for 5 minutes: agents are missing dispatches.
- ampora_opamp_apply_duration_seconds p95 > 30 s: agents are applying slowly. Either the fleet is unusually large at the moment, or there is a network problem.
- ampora_policy_denied_total{policy="default-deny"} rate > 0: your default-deny is firing in production, which indicates a config mismatch.
- sum(up{job="ampora-web"}) below the expected replica count: instances are missing.
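The p95 threshold is not stored anywhere; it is estimated from the histogram's cumulative buckets, which Prometheus does with its `histogram_quantile` function. The sketch below mirrors that estimation (linear interpolation inside the bucket that contains the target rank) so the 30 s rule can be reasoned about by hand. The bucket boundaries and counts are invented; Ampora's real bucket layout is not documented here.

```python
import math
from typing import List, Tuple

# Hypothetical (upper bound in seconds, cumulative count) pairs.
HEALTHY = [(1, 50), (5, 80), (10, 90), (30, 96), (60, 99), (math.inf, 100)]
SLOW = [(1, 10), (5, 40), (10, 60), (30, 80), (60, 100), (math.inf, 100)]


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def apply_is_slow(buckets: List[Tuple[float, float]], limit_s: float = 30.0) -> bool:
    """True when the estimated p95 apply duration exceeds the limit."""
    return histogram_quantile(0.95, buckets) > limit_s
```

With these sample buckets, `apply_is_slow(HEALTHY)` is false and `apply_is_slow(SLOW)` is true. Note that the estimate can never be finer than the bucket boundaries, so a 30 s alert threshold works best when 30 is itself a bucket bound.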
What is not exported
Ampora deliberately does not export:
- agent telemetry payloads (those are not the management plane's data),
- raw configuration content (might contain secrets in environment variables — exposing it in metric labels is wrong),
- per-user activity (that is what the audit log is for, with proper retention; exporting to traces would create a parallel uncontrolled audit stream).