Skip to content

Observability

Photon ships with OpenTelemetry instrumentation baked in — traces, metrics, and logs are emitted automatically during every tool call. By default the instrumentation is a no-op. Set one environment variable and install the OTel SDK to route everything to any OTLP-compatible backend (Jaeger, Grafana Tempo, SigNoz, Honeycomb, DataDog, …).

Quick start

bash
# 1. Install the OTel SDK as a peer dep of your deployment
bun add @opentelemetry/sdk-node @opentelemetry/api \
        @opentelemetry/exporter-trace-otlp-http \
        @opentelemetry/exporter-metrics-otlp-http \
        @opentelemetry/exporter-logs-otlp-http \
        @opentelemetry/resources \
        @opentelemetry/semantic-conventions

# 2. Point at a collector
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_SERVICE_NAME=my-photon-service

# 3. Run anything — CLI, MCP server, Beam — they all export now
photon cli kanban stats
photon beam
photon mcp my-photon

That is the entire setup. Photon's CLI entry point calls initOtelSdk() before any command runs, so the SDK is live before the first span is created.

What gets exported

Traces

SpanAttributes
gen_ai.tool.call {photon}.{tool}gen_ai.tool.name, gen_ai.agent.name, gen_ai.operation.name, photon.instance, photon.caller, photon.trace_id, photon.stateful
gen_ai.agent.invoke {photon}gen_ai.agent.name, gen_ai.operation.name, gen_ai.agent.description
  • Error spans auto-set sampling.priority=1 so they survive head-based sampling.
  • recordException captures the full stack trace via Error.cause.
  • _meta.traceparent on a tool call makes the new span a child of the incoming W3C context — distributed traces chain automatically.
  • Nested this.call() propagates trace context via _meta.traceparent, so multi-photon workflows show up as one connected trace.

Metrics

InstrumentTypeUnitAttributes
photon.tool.durationhistogrammsgen_ai.agent.name, gen_ai.tool.name, status, photon.stateful, photon.error_type
photon.tool.callscounter1same
photon.tool.errorscounter1same
photon.circuit_breaker.transitionscounter1gen_ai.agent.name, gen_ai.tool.name, from, to, photon.instance
photon.rate_limit.rejectionscounter1gen_ai.agent.name, gen_ai.tool.name, photon.instance
photon.bulkhead.rejectionscounter1gen_ai.agent.name, gen_ai.tool.name, photon.instance

Structured error responses

When a tool call fails, the MCP response sets isError: true and attaches a machine-readable payload so agents can make typed retry decisions:

json
{
  "content": [{ "type": "text", "text": "Tool Error: add ..." }],
  "isError": true,
  "structuredContent": {
    "error": {
      "type": "circuit_open",
      "retryable": true,
      "message": "Circuit open: add has failed 5 consecutive times. Resets in 12s"
    }
  },
  "_meta": { "photon": { "type": "...", "retryable": true, "message": "..." } }
}

Error type values: validation_error, timeout_error, network_error, permission_error, not_found_error, circuit_open, rate_limited, bulkhead_full, implementation_error, runtime_error. retryable is true for transient failures (circuit_open, timeout, network) and false for deterministic ones (validation, permission, not_found).

Logs

Every record emitted by the photon Logger is forwarded to the OTel logs bridge with severity mapped to the standard scale and ambient context auto-attached: photon.name, photon.tool, photon.trace_id, photon.caller_id. This gives trace-log correlation in any OTLP backend without a pino dependency.

Running a local collector

For development, the easiest stack is Grafana's OTel-LGTM image — it includes Tempo (traces), Mimir (metrics), and Loki (logs) behind a single Grafana pane:

bash
docker run -p 3000:3000 -p 4317:4317 -p 4318:4318 \
  --rm -ti grafana/otel-lgtm

Then OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 and open http://localhost:3000. Every tool call will show up in the "Explore" view under the photon service.

For Jaeger-only traces:

bash
docker run -p 16686:16686 -p 4318:4318 \
  jaegertracing/all-in-one:latest

Health checks

Beam exposes two HTTP endpoints for liveness/readiness probes:

EndpointPurpose
GET /api/healthTop-level status with per-subsystem breakdown. Returns 200 when all subsystems are ok, 503 when any is degraded. Mount as Kubernetes livenessProbe + readinessProbe.
GET /api/health/circuitsPer-circuit state (closed/open/half-open) plus summary counts. Useful for ops dashboards.

The /api/health body shape:

json
{
  "status": "ok",
  "uptime_s": 1234,
  "timestamp": "2026-04-15T07:50:00.000Z",
  "subsystems": {
    "runtime":  { "status": "ok" },
    "photons":  { "status": "ok", "detail": "5 photon(s) loaded" },
    "circuits": { "status": "ok", "detail": "0 open of 3 tracked" }
  }
}

Environment variables

VariablePurpose
OTEL_EXPORTER_OTLP_ENDPOINTMaster switch. Unset = SDK stays disabled.
OTEL_EXPORTER_OTLP_PROTOCOLgrpc, http/protobuf, or http/json. Defaults per SDK.
OTEL_SERVICE_NAMEService name attached to every record.
OTEL_RESOURCE_ATTRIBUTESExtra resource attributes (e.g. deployment.environment=prod).
OTEL_TRACES_SAMPLEROverride the default sampler. Error spans are always kept.
OTEL_LOG_LEVELSDK's own log level.

These are standard OTel variables; Photon doesn't invent any of them.

AG-UI integration

AG-UI clients running against Photon benefit from everything in this guide automatically:

  • Structured errorsRUN_ERROR events include code (error type), retryable (boolean), runId, and threadId, so clients can classify failures and auto-retry transient ones without bubbling to the user.
  • Trace correlation — every AG-UI event carries rawEvent.traceparent, letting clients deep-link a UI event to the backing span in Jaeger/Tempo.
  • Capability handshake — MCP initialize advertises experimental['ag-ui'].features (structured-errors, trace-correlation, proxy-mode, local-mode) so clients can negotiate without probing.

Philosophy

Photon's instrumentation is intentionally zero-dependency: the runtime only imports @opentelemetry/* via dynamic import() inside try/catch, so unused installs stay tiny. The value-add of this guide is that once you do install the SDK and set one env var, every promise in PROMISES.md Intent 10 becomes observable with no photon-specific configuration.

Released under the MIT License.