Observability

Monitor Sockudo connections, subscriptions, publishes, fanout, history, recovery, webhooks, and push notifications.

Observability is part of the runtime contract. A realtime system should tell operators when it is connected, degraded, delayed, retrying, or dropping work.

Metrics endpoint

curl http://127.0.0.1:9601/metrics

Scrape the endpoint with Prometheus and label metrics by environment, region, node, adapter, and app where possible. Sockudo emits metrics through the metrics-rs recorder and exposes them through the Prometheus exporter by default.
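
For ad hoc checks outside Prometheus, the scrape output can be read directly. The following is a minimal sketch of a parser for the Prometheus text exposition format. The function name `parsePrometheusText` is hypothetical, the sample values and the `cancelled` label value are invented for illustration, and the parser is deliberately simplified (no timestamps, no escaped quotes, no `+Inf` handling).

```javascript
// Parse Prometheus text exposition output into { name, labels, value } samples.
// Simplified on purpose: ignores timestamps and escaped quotes in label values.
function parsePrometheusText(text) {
  const samples = [];
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip HELP/TYPE comments
    const match = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/);
    if (!match) continue;
    const [, name, labelText, valueText] = match;
    const labels = {};
    for (const pair of (labelText ?? "").matchAll(/([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g)) {
      labels[pair[1]] = pair[2];
    }
    samples.push({ name, labels, value: Number(valueText) });
  }
  return samples;
}

// Text shaped like a scrape response; the numbers are made up.
const sample = [
  "# TYPE sockudo_ai_active_streams gauge",
  "sockudo_ai_active_streams 3",
  'sockudo_ai_turns_ended_total{reason="complete"} 41',
  'sockudo_ai_turns_ended_total{reason="cancelled"} 2',
].join("\n");

console.log(parsePrometheusText(sample));
```

Against a running node, the input text would come from an HTTP GET of the metrics endpoint shown above, for example with the global `fetch` available in Node 18 and later.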

TCP metrics exporter

For live debugging or sidecar consumers, Sockudo can also fan out metric events over TCP using metrics-exporter-tcp. This exporter streams protobuf-encoded metric events to connected clients; it is useful for local inspection and custom collectors, but Prometheus scraping should remain the primary production monitoring path.

[metrics.tcp_exporter]
enabled = true
host = "127.0.0.1"
port = 5000
buffer_size = 1024

The TCP exporter has bounded buffering by default. When buffers fill, event samples can be dropped to avoid blocking the server, so do not use it as the only source for alerts or SLO dashboards.
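
To make the consequence of bounded buffering concrete, here is a conceptual sketch of a drop-on-full buffer. It is not the exporter's implementation: the `BoundedBuffer` class is hypothetical, and dropping the incoming sample (rather than the oldest one) is an assumption made for illustration.

```javascript
// Conceptual drop-on-full buffer: producers never block, so samples can be lost.
class BoundedBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
    this.dropped = 0; // number of samples rejected because the buffer was full
  }

  // Returns true if the sample was queued, false if it was dropped.
  push(item) {
    if (this.items.length >= this.capacity) {
      this.dropped += 1;
      return false;
    }
    this.items.push(item);
    return true;
  }

  // Hand all queued samples to a consumer and empty the buffer.
  drain() {
    const out = this.items;
    this.items = [];
    return out;
  }
}

const buffer = new BoundedBuffer(3);
for (let seq = 0; seq < 5; seq++) buffer.push({ seq });
console.log(buffer.drain().map((event) => event.seq), "dropped:", buffer.dropped);
```

A consumer that drains too slowly sees gaps in the stream, which is why counters derived from this path can undercount.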

Core signals

Area	Watch
Connections	active sockets, connection attempts, disconnect reasons, heartbeat failures
Subscriptions	subscribe successes, auth failures, presence joins and leaves
Publish	accepted, rejected, idempotency hits, payload-too-large failures
Fanout	adapter publish latency, adapter receive latency, duplicate suppression
Recovery	resume successes, resume failures, replay counts, buffer misses
History	writes, reads, retention purges, cursor errors
Webhooks	queued, delivered, failed, retried, dead-lettered
Push	accepted, scheduled, dispatched, provider errors, publish status outcomes

AI Transport

AI Transport observability is domain-blind. Sockudo reads only the well-known extras.ai.transport headers and never labels metrics by channel, turn ID, message ID, or client ID.

Metrics exposed at /metrics include:

sockudo_ai_turns_started_total
sockudo_ai_turns_ended_total{reason}
sockudo_ai_cancel_signals_total
sockudo_ai_active_streams
sockudo_ai_stream_duration_seconds
sockudo_ai_stream_bytes_total
sockudo_ai_messages_rejected_total{code}
sockudo_ai_messages_unparseable_total
sockudo_appends_received_total
sockudo_appends_delivered_total
sockudo_rollup_ratio
sockudo_flush_latency
sockudo_history_recovery_success_total{source}
sockudo_history_recovery_failures_total{code}
sockudo_versioned_message_mutations_total{action,result}
sockudo_versioned_message_retrieval_total{surface,result}

The four production signals are turn outcomes, stream rate, rejects by code, and rollup efficiency. Use docs/public/grafana/ai-transport-observability.json as the starting Grafana dashboard.

Dashboard starter panels:

Panel	Metric
Active AI streams	`sockudo_ai_active_streams`
Turn outcomes	`sockudo_ai_turns_started_total`, `sockudo_ai_turns_ended_total`
Rejects by code	`sockudo_ai_messages_rejected_total`
Rollup efficiency	`sockudo_appends_received_total` vs `sockudo_appends_delivered_total`, `sockudo_rollup_ratio`
Rollup latency	`sockudo_flush_latency`
Recovery health	`sockudo_history_recovery_success_total`, `sockudo_history_recovery_failures_total`
Durable state	`sockudo_history_degraded_channels`, `sockudo_history_reset_required_channels`
Push backlog	push queue/status/provider metrics from the push dashboard
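
As a rough illustration of the rollup efficiency panel, the sketch below derives a delivered-per-received ratio from two scrapes of `sockudo_appends_received_total` and `sockudo_appends_delivered_total`. The function names and snapshot shape are hypothetical, the numbers are invented, and this ratio is an assumption about how to read the counters, not necessarily how `sockudo_rollup_ratio` is computed.

```javascript
// Increase of a monotonic counter between two scrapes.
// A decrease is treated as a counter reset (for example, a process restart).
function counterIncrease(previous, current) {
  return current >= previous ? current - previous : current;
}

// Assumption: efficiency is read as delivered appends per received append,
// so a lower value means more appends were rolled up before delivery.
function rollupDeliveryRatio(prev, curr) {
  const received = counterIncrease(prev.received, curr.received);
  const delivered = counterIncrease(prev.delivered, curr.delivered);
  return received === 0 ? null : delivered / received;
}

// Two hypothetical scrapes taken one interval apart.
const earlier = { received: 1000, delivered: 250 };
const later = { received: 1600, delivered: 400 };

console.log(rollupDeliveryRatio(earlier, later));
```

Returning `null` when nothing was received keeps an idle interval from showing up as a misleading zero.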

For support escalation, capture the channel, wall-clock time window, verified clientId, and first error code before collecting logs. Do not ask customers for provider payloads unless the codec layer explicitly requires them.

Persist completed AI turns

For Ably-style production persistence, persist completed turns from your backend instead of making Sockudo a domain store:

Enable the ai_turn_ended webhook for the app.
When the webhook reports reason: "complete", query the history endpoint for the channel around the webhook time.
Select messages with the same extras.ai.transport.turn-id.
Store the reduced transcript in your application database.

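// Assumes an Express `app` with JSON body parsing, a configured `sockudo` server
// client, and an application-defined `storeCompletedTurn` function.
app.post("/sockudo/webhooks", async (req, res) => {
  try {
    for (const event of req.body.events ?? []) {
      // Persist only turns that ended normally.
      if (event.name !== "ai_turn_ended" || event.reason !== "complete") continue;

      // Read recent channel history; page further if a turn can exceed the limit.
      const history = await sockudo.channelHistory(event.channel, {
        limit: 1000,
        direction: "newest_first",
      });

      await storeCompletedTurn({
        turnId: event.turn_id,
        channel: event.channel,
        // Keep only the messages that belong to this turn.
        items: history.items.filter((item) => {
          return item.extras?.ai?.transport?.["turn-id"] === event.turn_id;
        }),
      });
    }
    res.sendStatus(200);
  } catch (err) {
    // Respond with an error so the failure is visible instead of hanging the request.
    console.error("failed to persist completed AI turn", err);
    res.sendStatus(500);
  }
});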
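
The last step, storing a reduced transcript, depends on your message format. As one possible sketch, the function below selects a turn's messages, orders them oldest first, and joins their text. Only the `extras.ai.transport["turn-id"]` path comes from the handler above; the `serial` and `data` fields, the function name `reduceTurn`, and the idea that reduction means concatenating text fragments are all assumptions.

```javascript
// Reduce history items for one turn into a single transcript string.
// Hypothetical item shape: { serial: number, data: string, extras: {...} }.
function reduceTurn(items, turnId) {
  return items
    .filter((item) => item.extras?.ai?.transport?.["turn-id"] === turnId)
    .slice() // copy before sorting so the caller's array is untouched
    .sort((a, b) => a.serial - b.serial) // history arrived newest first
    .map((item) => item.data)
    .join("");
}

// Invented history page, newest first, containing two interleaved turns.
const turnHeaders = (turnId) => ({ ai: { transport: { "turn-id": turnId } } });
const historyItems = [
  { serial: 4, data: "world", extras: turnHeaders("turn-a") },
  { serial: 3, data: "ignored", extras: turnHeaders("turn-b") },
  { serial: 2, data: ", ", extras: turnHeaders("turn-a") },
  { serial: 1, data: "Hello", extras: turnHeaders("turn-a") },
];

console.log(reduceTurn(historyItems, "turn-a"));
```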

Logs

Use structured logs for events that operators need to investigate:

{
  "level": "warn",
  "target": "sockudo_push",
  "app_id": "app-id",
  "publish_id": "pub_123",
  "provider": "apns",
  "error": "BadDeviceToken"
}

Avoid logging secrets, raw auth signatures, provider tokens, or encrypted payloads.
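
One way to enforce this in application code is to redact sensitive fields before a record reaches the logger. The sketch below is illustrative only: the `redact` helper and the key names in `SENSITIVE_KEYS` are hypothetical and should be replaced with the field names your own services actually emit.

```javascript
// Hypothetical field names; adjust to match the keys your services log.
const SENSITIVE_KEYS = new Set(["secret", "auth_signature", "provider_token", "payload"]);

// Return a deep copy of a log record with sensitive values masked.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        SENSITIVE_KEYS.has(key) ? [key, "[redacted]"] : [key, redact(inner)]
      )
    );
  }
  return value; // primitives pass through unchanged
}

const record = {
  level: "warn",
  target: "sockudo_push",
  publish_id: "pub_123",
  provider: "apns",
  error: "BadDeviceToken",
  provider_token: "example-token-value",
};

console.log(JSON.stringify(redact(record)));
```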

Grafana dashboards

Recommended panels:

active connections by node
connection churn
publish rate by app
fanout latency histogram
subscription auth failure rate
recovery success ratio
replay buffer pressure
queue depth for webhooks and push
push provider failure rate by provider
APNs, FCM, and Web Push latency by outcome
AI turn outcomes by reason
AI stream duration and active streams
AI reject codes and unparseable headers
AI append rollup efficiency

Alerts

Alert on symptoms operators can act on:

readiness failures
adapter connection loss
rising publish failures
high auth rejection rate after deploy
recovery success ratio dropping
history write failures
webhook retry backlog
push queue backlog
push provider credential failures
push delivery status callback failures
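
To show what "recovery success ratio dropping" could mean in practice, the sketch below computes the ratio per evaluation window and fires only when it stays low for several consecutive windows. The function names, the 0.95 threshold, and the window counts are assumptions chosen for illustration; in production this logic would normally live in Prometheus alerting rules rather than application code.

```javascript
// Success ratio for one window, given per-window increases of the
// recovery success and failure counters. Null when there was no traffic.
function successRatio(successes, failures) {
  const total = successes + failures;
  return total === 0 ? null : successes / total;
}

// True only when each of the last `windows` ratios is below the threshold.
// Windows without traffic (null) never count as a breach.
function sustainedBelow(ratios, threshold, windows) {
  if (ratios.length < windows) return false;
  return ratios.slice(-windows).every((ratio) => ratio !== null && ratio < threshold);
}

// Invented per-window [successes, failures] pairs, oldest first.
const recoveryWindows = [[98, 2], [90, 10], [85, 15], [80, 20]];
const ratios = recoveryWindows.map(([ok, failed]) => successRatio(ok, failed));

console.log(ratios, sustainedBelow(ratios, 0.95, 3));
```

Requiring several consecutive breaches trades a slower alert for fewer pages caused by a single noisy window.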

Push status workflow

Push publishes are asynchronous. Store the publish_id returned by the API when a business workflow needs support visibility.

const response = await sockudo.publishPush({
  recipients: [{ type: "channel", channel: "orders" }],
  payload: { title: "Order updated", body: "Packed" },
  sync: false,
});

console.log(response.publish_id);

Then inspect the status of the publish:

curl "https://realtime.example.com/apps/app-id/push/publish/pub_123/status"

Status records should be retained long enough for customer support and incident review.
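
Because push publishes complete asynchronously, a support tool may need to poll the status endpoint until the publish settles. The sketch below is a generic polling loop with exponential backoff. Everything in it is an assumption: `pollUntil` is a hypothetical helper, the `state` field and its values are invented, and the status fetch is faked rather than calling the real endpoint.

```javascript
// Poll an async status source until `isTerminal` accepts the result
// or the attempt budget is used up. Delay doubles after each attempt.
async function pollUntil(fetchStatus, isTerminal, { attempts = 5, baseDelayMs = 1000 } = {}) {
  let status;
  for (let attempt = 0; attempt < attempts; attempt++) {
    status = await fetchStatus();
    if (isTerminal(status)) return status;
    if (attempt < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  return status; // still not terminal after the last attempt
}

// Fake status source standing in for the status endpoint.
// The `state` field and its values are hypothetical.
const fakeStates = ["queued", "queued", "delivered"];
let calls = 0;
const fetchFakeStatus = async () => ({
  state: fakeStates[Math.min(calls++, fakeStates.length - 1)],
});

pollUntil(fetchFakeStatus, (status) => status.state === "delivered", {
  attempts: 5,
  baseDelayMs: 10,
}).then((status) => console.log(status.state, "after", calls, "calls"));
```

In a real tool, `fetchStatus` would request the status URL shown above for the stored publish_id, and `isTerminal` would check whichever fields the status response actually contains.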
