Self-observability

Ampora dogfoods OpenTelemetry — the same protocol it manages on the agents is the one it uses to expose its own internal state. Three signal types are exported:

  • Metrics via OTLP gRPC and a Prometheus scrape endpoint.
  • Traces via OTLP gRPC, with W3C Trace Context propagated across the dispatch backplane and the live-update bus.
  • Structured logs via stdout in JSON form, ready for any log shipper.

Configuration

OpenTelemetry:
  ServiceName: ampora-server
  OtlpEndpoint: http://otelcol.observability.svc.cluster.local:4317
  Headers:
    Authorization: "Bearer ..."        # optional, e.g. for vendor auth
  SamplingRatio: 0.05                  # 5 % of traces. Bump to 1.0 in dev.

An empty OtlpEndpoint disables export: Ampora keeps its internal counters working, but nothing leaves the process. This is useful for air-gapped deployments or smoke tests.

Headline metrics

These are the metrics most likely to wake someone up; the full list is on Reference → Metrics & traces.

OpAMP

Metric                               Type       Use
ampora_opamp_active_sessions         Gauge      Currently connected agents
ampora_opamp_frames_received_total   Counter    Inbound frame throughput
ampora_opamp_frames_rejected_total   Counter    Frames refused (oversized, malformed, capability mismatch)
ampora_opamp_apply_duration_seconds  Histogram  Time from server-push to agent-APPLIED

Dispatch backplane

Metric                            Type     Use
ampora_dispatch_local_total       Counter  Same-instance dispatches
ampora_dispatch_forwarded_total   Counter  Cross-instance dispatches
ampora_dispatch_no_session_total  Counter  Targeted agent had no live session; alert on growth
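
Alerting on growth rather than on the absolute value matters because Prometheus-style counters restart from zero when a process restarts. The following is a minimal sketch of a reset-aware increase over a window of scraped samples, similar in spirit to PromQL's increase() but without its extrapolation. The sample values are made up for illustration.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total growth of a monotonic counter across time-ordered samples.

    A drop between two samples is treated as a counter reset (for example a
    process restart), so the post-reset value counts as new growth.
    """
    increase = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            increase += current  # counter restarted from zero
    return increase


# Hypothetical scrapes of ampora_dispatch_no_session_total over five minutes.
steady = [12, 12, 12, 12, 12]
growing = [12, 12, 15, 2, 4]  # grows, restarts, grows again

print(counter_increase(steady))   # 0.0
print(counter_increase(growing))  # 7.0
```

A non-zero result over the window is the "growth" worth alerting on; the absolute counter value on its own says little.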

Governance

Metric                                       Type     Use
ampora_policy_evaluations_total{outcome}     Counter  Allow/deny rate of policies
ampora_policy_denied_total{policy}           Counter  Per-policy denial rate
ampora_audit_archived_total / _purged_total  Counter  Retention sweep work
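
As an illustration of how the outcome label can be used, here is a rough sketch that reads Prometheus text-format output and computes the share of denied evaluations. The label values "allow" and "deny" and the scrape text itself are assumptions made for the example; check the real scrape endpoint for the actual values.

```python
import re
from collections import defaultdict
from typing import Dict, Iterator, Tuple

# Handles simple sample lines such as: name{label="value"} 12
# Label values containing "}" or escaped quotes are out of scope for this sketch.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (metric name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def deny_ratio(text: str) -> float:
    """Fraction of policy evaluations whose outcome label is "deny" (assumed value)."""
    by_outcome: Dict[str, float] = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "ampora_policy_evaluations_total":
            by_outcome[labels.get("outcome", "")] += value
    total = sum(by_outcome.values())
    return by_outcome["deny"] / total if total else 0.0


# Made-up scrape output for illustration only.
SCRAPE = """\
# TYPE ampora_policy_evaluations_total counter
ampora_policy_evaluations_total{outcome="allow"} 950
ampora_policy_evaluations_total{outcome="deny"} 50
ampora_dispatch_local_total 1200
"""

print(f"{deny_ratio(SCRAPE):.2%}")
```

In practice the same ratio is usually computed in the query layer over a time window; the sketch only shows what the labelled counter makes possible.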

Live update bus

Metric                                        Type     Use
ampora_live_bus_events_total{type,direction}  Counter  UI event traffic
ampora_live_bus_dropped_total                 Counter  Subscriber overflow; alert if non-zero

Tracing

Every request carries a W3C Trace Context. The interesting cross-cutting spans:

  • opamp.frame_received (server) → opamp.dispatch (peer instance) → opamp.frame_sent (peer instance) → opamp.apply_status (back). Connecting these gives you "operator clicked X → agent applied Y" as a single trace.
  • live_update.publish → live_update.consume (every operator subscriber) lets you measure UI propagation latency.
  • gitops.sync traces a single repo sweep.
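
The spans above are stitched together by the W3C traceparent header. As a minimal sketch of what that header carries, the parser below handles the version 00 layout (version, trace ID, parent span ID, flags). The function and type names are hypothetical and not part of Ampora; the example header is the one used in the W3C specification.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (version 00 layout).
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header value; return None if it is not valid."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff is forbidden, and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))
```

The trace_id recovered here is the same value that appears in the trace_id field of the JSON logs, which is what makes log-to-trace correlation possible.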

Sampling ratio defaults to 5 % to keep the data volume sane. Crank to 1.0 when reproducing a hard-to-pin-down issue.
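
To make the ratio concrete, here is a sketch of a deterministic trace-ID ratio decision, similar in spirit to the ratio samplers found in OpenTelemetry SDKs. Whether Ampora's sampler works exactly this way is an assumption; the function name and the simulation are illustrative only.

```python
import random

TRACE_ID_LIMIT = 1 << 64  # the decision uses the low 64 bits of the trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace if its low 64 bits fall below ratio * 2**64.

    The decision depends only on the trace ID, so every service that sees the
    same trace reaches the same verdict for the same ratio.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


# Simulate 20,000 random 128-bit trace IDs at the default ratio of 0.05.
rng = random.Random(0)
trace_ids = [rng.getrandbits(128) for _ in range(20_000)]
kept = sum(should_sample(t, 0.05) for t in trace_ids)
fraction = kept / len(trace_ids)
print(f"kept {fraction:.1%} of traces")
```

At 1.0 every trace is kept and at 0.0 none are, which is why raising the ratio temporarily is a cheap way to catch a rare failure.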

Logging

JSON to stdout. Each line includes:

  • timestamp (UTC, ISO-8601 with millis),
  • level (Information / Warning / Error / Critical),
  • category (the .NET log category, e.g. Ampora.Fleet.Rollouts.RolloutService),
  • message,
  • trace_id and span_id when within an active trace,
  • tenant_id when within a tenant scope,
  • arbitrary structured properties.
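
Because every line is a self-contained JSON object, pulling out all log lines that belong to one trace takes only a few lines of code. A minimal sketch using the field names listed above; the sample records (timestamps, messages, tenant) are invented for the example.

```python
import json
from typing import Dict, Iterable, Iterator


def lines_for_trace(lines: Iterable[str], trace_id: str) -> Iterator[Dict]:
    """Yield parsed log records whose trace_id matches; skip non-JSON lines."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output on stdout
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            yield record


# Invented sample output; only the field names come from the list above.
SAMPLE_LINES = [
    json.dumps({
        "timestamp": "2024-01-01T12:00:00.123Z",
        "level": "Information",
        "category": "Ampora.Fleet.Rollouts.RolloutService",
        "message": "Rollout started",
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "span_id": "00f067aa0ba902b7",
        "tenant_id": "tenant-a",
    }),
    json.dumps({
        "timestamp": "2024-01-01T12:00:00.456Z",
        "level": "Warning",
        "category": "Ampora.Fleet.Rollouts.RolloutService",
        "message": "Unrelated work outside any trace",
    }),
    "not json at all",
]

for record in lines_for_trace(SAMPLE_LINES, "4bf92f3577b34da6a3ce929d0e0e4736"):
    print(record["level"], record["message"])
```

Records emitted outside an active trace have no trace_id, so they are skipped by the filter rather than causing an error.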

Pipe into any log shipper. Common destinations: Loki (with the OpenTelemetry Collector's loki exporter), Elasticsearch, Datadog Logs.

Built-in dashboards

The deploy/dev/ folder ships:

  • prometheus.yml — a working Prometheus scrape config.
  • otel-collector-config.yaml — receives OTLP, exports to Prometheus and Jaeger.

These are starting points, not production-grade. For production, use your existing observability stack and point Ampora at it.

A small Grafana dashboard JSON suitable for import is on Reference → Metrics & traces.

Alerts you should configure

At a minimum:

  • ampora_dispatch_no_session_total rate > 0 for 5 minutes: agents are missing dispatches.
  • ampora_opamp_apply_duration_seconds p95 > 30 s: agents are applying slowly. Either the fleet is large at the moment, or there is a network problem.
  • ampora_policy_denied_total{policy="default-deny"} rate > 0: your default-deny policy is firing in production, which indicates a config mismatch.
  • sum(up{job="ampora-web"}) < replicas (your expected instance count): instances are missing.
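
The p95 condition is evaluated on histogram buckets, not on individual samples. The sketch below estimates a quantile from cumulative buckets with linear interpolation inside the matching bucket, which is the approach PromQL's histogram_quantile() takes. The bucket boundaries and counts are hypothetical; Ampora's actual bucket layout is not stated here.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le" in seconds, cumulative count)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if not buckets:
        return math.nan
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lies beyond the last finite bound; report that bound.
                return lower_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical buckets for ampora_opamp_apply_duration_seconds.
apply_buckets = [(1, 10), (5, 40), (10, 70), (30, 90), (60, 100), (math.inf, 100)]

p95 = histogram_quantile(0.95, apply_buckets)
print(f"p95 = {p95:.1f} s, alert = {p95 > 30}")
```

With these made-up counts the 95th observation falls halfway through the 30 to 60 second bucket, so the estimate is above the 30 s threshold. The estimate is only as precise as the bucket boundaries allow.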

What is not exported

Ampora deliberately does not export:

  • agent telemetry payloads (those are not the management plane's data),
  • raw configuration content (might contain secrets in environment variables — exposing it in metric labels is wrong),
  • per-user activity (that is what the audit log is for, with proper retention; exporting to traces would create a parallel uncontrolled audit stream).