AI Transport production operations

Runbooks, capacity formulas, alerts, chaos playbooks, and troubleshooting for AI Transport at scale.

This runbook covers the S14 scale, soak, and chaos validation path for AI Transport deployments. Use it with the profiles in test/load/profiles, the runner in test/load/ai-scale-runner.mjs, and the chaos harness in tools/chaos.

Capacity Model

Start with these formulas, then replace the inputs with measured numbers from your hardware:

| Dimension | Formula |
| --- | --- |
| Connections per node | target_connections / node_count, with 20 percent headroom reserved |
| Active streams per node | active_streams / node_count, with shard imbalance budgeted at p95 |
| Append ingress rate | active_streams * tokens_per_second |
| Fanout egress rate | append_ingress_rate * mean_subscribers_per_session |
| Rollup memory ceiling | active_streams_per_node * 4 KiB, excluding payload buffers |
| History bytes per response | mean_response_tokens * mean_token_bytes * retention_multiplier |
| History store write rate | append_ingress_rate + create/update/delete/summary_rate |

The headline target is 1M concurrent connections, 50k active streams at 100 tok/s each, and one to five subscriber devices per session, sustained for a 30-minute run. The 24-hour soak target is 20 percent of that load.
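As a worked example, the formulas above can be evaluated for the headline target. This is a minimal sketch, not part of the runner: the node count (20) and mean subscribers per session (3) are assumed inputs, "20 percent headroom reserved" is read as planned load filling 80 percent of node capacity, and the soak profile is assumed to scale connections and streams while keeping the per-stream token rate. The function and field names are hypothetical.

```javascript
// Worked capacity arithmetic for the headline target.
// Assumed inputs (not from the runbook): 20 nodes, 3 subscribers per session.
const HEADLINE = { targetConnections: 1_000_000, activeStreams: 50_000, tokensPerSecond: 100 };

// Scale a load profile by a whole-number percentage (integer math avoids float drift).
// Assumption: the per-stream token rate stays the same in the scaled profile.
function scaleLoad(load, percent) {
  return {
    targetConnections: (load.targetConnections * percent) / 100,
    activeStreams: (load.activeStreams * percent) / 100,
    tokensPerSecond: load.tokensPerSecond,
  };
}

function capacityPlan(load, nodeCount, meanSubscribers, headroomPercent = 20) {
  const connectionsPerNode = load.targetConnections / nodeCount;
  const activeStreamsPerNode = load.activeStreams / nodeCount;
  const appendIngressRate = load.activeStreams * load.tokensPerSecond;
  return {
    connectionsPerNode,
    // One reading of "headroom reserved": planned load fills (100 - headroom)% of capacity.
    requiredNodeConnectionCapacity: Math.ceil(
      (connectionsPerNode * 100) / (100 - headroomPercent)
    ),
    activeStreamsPerNode,
    appendIngressRate, // tokens per second across the fleet
    fanoutEgressRate: appendIngressRate * meanSubscribers,
    rollupMemoryCeilingBytes: activeStreamsPerNode * 4 * 1024, // excludes payload buffers
  };
}

const headlinePlan = capacityPlan(HEADLINE, 20, 3);
const soakPlan = capacityPlan(scaleLoad(HEADLINE, 20), 20, 3);
console.log({ headlinePlan, soakPlan });
```

Under these assumptions the fleet-wide append ingress rate is 5,000,000 tokens per second and each node carries 2,500 active streams. Replace the node count and subscriber mean with measured values before using the output for sizing.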

Required Runs

Run the five-node local smoke first:

make ai-scale-smoke

Generate a production fleet plan:

node test/load/ai-scale-runner.mjs --profile test/load/profiles/headline-1m.json --plan

Execute only on prepared hardware:

node test/load/ai-scale-runner.mjs \
  --profile test/load/profiles/headline-1m.json \
  --execute \
  --urls "$SOCKUDO_NODE_URLS" \
  --metricsUrls "$SOCKUDO_METRICS_URLS" \
  --output docs/specs/ai-transport-results/headline-1m.json

The required pass criteria are zero sampled transcript mismatches, clean serial monotonicity, p99 append-to-delivery latency under 25 ms within a region, flat memory after warm-up, and no file descriptor or Tokio task leaks.
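The pass criteria can be turned into an automated gate over a run summary. The sketch below assumes a hypothetical summary shape; the field names are illustrative and are not the runner's documented output schema. The 2 percent tolerance for "flat" memory is also an assumption, and comparing end-of-run counts against post-warm-up counts is only a crude leak signal.

```javascript
// Hypothetical summary shape; the runner's real output schema may differ.
const DEFAULT_LIMITS = {
  p99AppendToDeliveryMs: 25, // intra-region criterion from the runbook
  memoryGrowthPercent: 2, // assumed tolerance for "flat" memory after warm-up
};

function percentGrowth(startValue, endValue) {
  return ((endValue - startValue) / startValue) * 100;
}

// Returns { passed, failures } where failures lists each criterion that was not met.
function evaluatePassCriteria(summary, limits = DEFAULT_LIMITS) {
  const failures = [];
  if (summary.transcriptMismatches !== 0) failures.push("sampled transcript mismatches");
  if (summary.serialViolations !== 0) failures.push("serial monotonicity violations");
  if (!(summary.p99AppendToDeliveryMs < limits.p99AppendToDeliveryMs)) {
    failures.push("p99 append-to-delivery at or above limit");
  }
  const growth = percentGrowth(summary.rssAfterWarmupBytes, summary.rssAtEndBytes);
  if (growth > limits.memoryGrowthPercent) failures.push("memory not flat after warm-up");
  if (summary.openFdsAtEnd > summary.openFdsAfterWarmup) failures.push("file descriptor growth");
  if (summary.tokioTasksAtEnd > summary.tokioTasksAfterWarmup) failures.push("Tokio task growth");
  return { passed: failures.length === 0, failures };
}

// Example with made-up numbers, for illustration only.
const exampleSummary = {
  transcriptMismatches: 0,
  serialViolations: 0,
  p99AppendToDeliveryMs: 18.4,
  rssAfterWarmupBytes: 1000,
  rssAtEndBytes: 1010,
  openFdsAfterWarmup: 5200,
  openFdsAtEnd: 5200,
  tokioTasksAfterWarmup: 900,
  tokioTasksAtEnd: 900,
};
const gate = evaluatePassCriteria(exampleSummary);
console.log(gate);
```

Returning the list of failed criteria rather than a bare boolean keeps the gate useful in CI logs, where the operator needs to know which criterion failed.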

Chaos Playbooks

Run all local chaos scenarios:

tools/chaos/ai-chaos-runner.sh all

Node kill mid-stream: confirm post-heal transcript audits pass, sockudo_ai_active_streams returns to the expected level, and janitor cleanup leaves no unbounded growth in active rollup streams.

Redis restart or failover: expect a bounded stall, no versioned-message corruption, and no channels left in the durable-history reset-required state. Watch sockudo_horizontal_transport_reconnections_total, sockudo_horizontal_transport_messages_dropped_total, sockudo_history_degraded_channels, and sockudo_history_reset_required_channels.
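One way to watch these signals during a scenario is to snapshot the metrics endpoint before and after the fault and compare totals. The sketch below assumes the endpoint serves the Prometheus text exposition format, which this runbook does not state explicitly; the helper names and the node label are hypothetical, and the sample text is made up for illustration.

```javascript
// Sum every sample of one metric in a Prometheus-style text exposition.
// Returns null when the metric does not appear at all.
function sumMetric(expositionText, metricName) {
  let total = 0;
  let seen = false;
  for (const rawLine of expositionText.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks, HELP, TYPE
    // Sample lines look like: name{labels} value [timestamp]
    const match = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/);
    if (!match || match[1] !== metricName) continue;
    const value = Number(match[3]);
    if (Number.isFinite(value)) {
      total += value;
      seen = true;
    }
  }
  return seen ? total : null;
}

// Difference between two snapshots; a missing metric is treated as zero.
function counterDelta(beforeText, afterText, metricName) {
  return (sumMetric(afterText, metricName) ?? 0) - (sumMetric(beforeText, metricName) ?? 0);
}

// Illustrative snapshots; the "node" label is an assumption.
const before = `
# TYPE sockudo_horizontal_transport_reconnections_total counter
sockudo_horizontal_transport_reconnections_total{node="a"} 1
sockudo_horizontal_transport_reconnections_total{node="b"} 0
sockudo_horizontal_transport_messages_dropped_total 0
sockudo_history_reset_required_channels 0
`;
const after = `
sockudo_horizontal_transport_reconnections_total{node="a"} 2
sockudo_horizontal_transport_reconnections_total{node="b"} 1
sockudo_horizontal_transport_messages_dropped_total 0
sockudo_history_reset_required_channels 0
`;

const reconnects = counterDelta(before, after, "sockudo_horizontal_transport_reconnections_total");
const dropped = counterDelta(before, after, "sockudo_horizontal_transport_messages_dropped_total");
console.log({ reconnects, dropped });
```

In a Redis restart, some reconnections are expected, while an increase in dropped messages or a nonzero reset-required gauge would warrant investigation.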

Inter-node partition: verify S8 duplicate/delay semantics, then confirm healed nodes converge through sampled latest reads and transcript audits.

Slow subscribers: validate that the existing slow-consumer policy protects hot channels and that sockudo_broadcast_latency_ms recovers after the pressure phase.

Clock skew: the local script verifies that skew in signed client timestamps does not affect serial monotonicity. True node clock mutation must run on privileged production-like infrastructure; serials are u64 values reserved through the store and must not depend on wall-clock ordering.
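A serial monotonicity audit can be sketched independently of any clock. The example below assumes serials arrive as decimal strings in delivery order for a single channel and that a clean run means strictly increasing values; both are assumptions, as is the function name. BigInt is used because u64 values can exceed the range that JavaScript numbers represent exactly.

```javascript
// Report every serial that is not strictly greater than the highest serial seen so far.
// Serials are parsed as BigInt so the full u64 range compares exactly.
function findSerialViolations(serials) {
  const violations = [];
  let highest = null;
  serials.forEach((raw, index) => {
    const serial = BigInt(raw);
    if (highest !== null && serial <= highest) {
      violations.push({ index, serial: serial.toString(), highest: highest.toString() });
    } else {
      highest = serial;
    }
  });
  return violations;
}

// Near the top of the u64 range these two values are distinct as BigInt,
// but would collapse to the same value as ordinary JavaScript numbers.
const nearMax = ["18446744073709551614", "18446744073709551615"];
console.log(findSerialViolations(nearMax).length);

// A duplicate (index 2) and a regression (index 4) are both reported.
const messy = ["1", "2", "2", "5", "4"];
console.log(findSerialViolations(messy));
```

Because the check compares serials only with each other, it gives the same answer regardless of client or node timestamps, which is the property the clock skew scenario is meant to confirm.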

Alert Thresholds

Use these as starting alerts and tune with measured baselines:

| Signal | Alert |
| --- | --- |
| sockudo_connected | above planned per-node budget for 5 minutes |
| sockudo_rate_limit_triggered_total | reconnect storm lane does not trigger shaping, or triggers remain elevated after 10 minutes |
| sockudo_broadcast_latency_ms | p99 above 25 ms during headline run or above 10 ms during single-node fanout benchmark |
| sockudo_ai_active_streams | does not return to baseline after stream TTL plus janitor window |
| sockudo_appends_received_total vs sockudo_appends_delivered_total | delivered ratio drops outside configured rollup expectations |
| sockudo_flush_latency | 99.9 percent exceeds rollup window plus 5 ms |
| sockudo_history_recovery_failures_total | any sustained non-auth failure during reconnect tests |
| sockudo_history_degraded_channels | nonzero for more than one scrape interval |
| sockudo_history_reset_required_channels | any nonzero value pages |
| sockudo_horizontal_transport_messages_dropped_total | any increase outside explicit partition tests |
| sockudo_horizontal_transport_queue_depth | sustained growth across three scrape intervals |
| sockudo_tokio_active_tasks | positive slope after warm-up in soak |
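Two of the conditions above, "sustained growth across three scrape intervals" and "positive slope after warm-up", are trend checks rather than fixed thresholds. The sketch below shows one way to evaluate them over scraped samples. The helper names are hypothetical, and a least-squares slope is one reasonable choice of trend estimate, not something this runbook prescribes.

```javascript
// True when the last (intervals + 1) samples are strictly increasing,
// i.e. the value grew across each of the last `intervals` scrape intervals.
function grewAcrossIntervals(values, intervals = 3) {
  if (values.length < intervals + 1) return false;
  const recent = values.slice(-(intervals + 1));
  for (let i = 1; i < recent.length; i += 1) {
    if (recent[i] <= recent[i - 1]) return false;
  }
  return true;
}

// Ordinary least-squares slope of v over t for samples shaped { t, v }.
// Returns 0 when there are fewer than two samples or no spread in t.
function leastSquaresSlope(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanT = samples.reduce((sum, s) => sum + s.t, 0) / n;
  const meanV = samples.reduce((sum, s) => sum + s.v, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (const { t, v } of samples) {
    numerator += (t - meanT) * (v - meanV);
    denominator += (t - meanT) ** 2;
  }
  return denominator === 0 ? 0 : numerator / denominator;
}

// Illustrative data: queue depth per scrape, and task counts sampled after warm-up.
const queueDepth = [5, 5, 6, 7, 9];
const taskSamples = [
  { t: 0, v: 10 },
  { t: 60, v: 16 },
  { t: 120, v: 22 },
];
console.log(grewAcrossIntervals(queueDepth), leastSquaresSlope(taskSamples));
```

For the soak run, pass only samples taken after warm-up to the slope helper, and compare the result against a small positive tolerance derived from a measured baseline rather than against exactly zero.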

Troubleshooting

| Symptom | Likely cause | Operator action |
| --- | --- | --- |
| Mutations rejected as not permitted | App/channel capability lacks the create/update/delete/append action | Inspect token capability and server app policy; client-supplied identity is not authoritative |
| History retention too short | Rewind or recovery request points before retained durable history | Increase retention count/time or reduce expected rewind window; watch retained message/byte gauges |
| Turn never ends | Missing terminal append/update/summary or rollup orphan TTL too long | Check sockudo_ai_active_streams, terminal event rates, and client turn lifecycle |
| Suspended-state publishes | Producer continued after suspend/continuation boundary | Reject or route through the continuation flow; audit op_id/message_id idempotency |
| Reconnect loop on bad tokens | Clients retrying auth failures without refreshing credentials | Rate-limit, inspect auth error codes, and fix token refresh logic |
| Rollup mis-tuning | Window too high for latency budget or too low for fanout reduction | Compare sockudo_flush_latency and sockudo_rollup_ratio; tune the window per channel class |
