AI Transport troubleshooting

Common AI Transport production issues and operator actions.

| Symptom | Likely cause | Operator action |
| --- | --- | --- |
| Mutations rejected as not permitted | Capability lacks `message_*_own` or `message_*_any`, or actor identity does not match owner | Inspect the verified connection identity and token capability map; do not rely on body `client_id`. |
| History retention too short | Rewind or recovery points before retained durable history | Increase `[history]` retention or lower expected rewind/recovery window; watch retained gauges. |
| Turn never ends | Missing terminal append/update/summary or orphan TTL too high | Check `sockudo_ai_active_streams`, terminal event rates, and `ai_transport.rollup.orphan_ttl_ms`. |
| Suspended-state publishes | Producer continued after suspend/continuation boundary | Reject or route through the continuation flow; audit `message_id` and `op_id`. |
| Reconnect loop on bad tokens | Client retries expired/revoked/bad JWT without refresh | Fix token refresh, rate-limit retry storms, revoke compromised `jti` or `client_id`. |
| Rollup latency too high | Window too high or slow flush under load | Compare `sockudo_flush_latency` to the configured window and lower `default_window_ms`. |
| Rollup savings too low | Window too low or terminal flushes dominate | Compare `sockudo_appends_received_total`, `sockudo_appends_delivered_total`, and `sockudo_rollup_ratio`. |
| Push channel misses user | Device not registered or not subscribed to the notification channel | Inspect push channel subscriptions and provider token state. |
| Presence flaps during transient loss | Ungraceful timeout is zero or too low | Raise `[presence].ungraceful_timeout_seconds` for agent presence channels. |
| Recovery fails after node loss | Durable state degraded/reset-required or shared cache/store unavailable | Inspect `sockudo_history_recovery_failures_total`, degraded/reset-required gauges, and store health. |
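The two rollup rows in the table above can be turned into a quick triage check. The following is a minimal sketch, not part of Sockudo: `RollupSample`, `rollup_savings`, `diagnose`, and both thresholds are hypothetical names and values chosen for illustration. It assumes you have already scraped the counters and a flush latency figure from your metrics backend and converted the latency to milliseconds. The savings figure is computed directly from the two append counters, because how the server derives `sockudo_rollup_ratio` is not specified here.

```python
from dataclasses import dataclass


@dataclass
class RollupSample:
    """Values gathered over one sampling period (hypothetical container)."""
    appends_received: float   # increase of sockudo_appends_received_total
    appends_delivered: float  # increase of sockudo_appends_delivered_total
    flush_latency_ms: float   # observed flush latency, assumed converted to ms
    window_ms: float          # configured default_window_ms


def rollup_savings(sample: RollupSample) -> float:
    """Fraction of received appends that were coalesced before delivery."""
    if sample.appends_received <= 0:
        return 0.0  # idle period: nothing to coalesce
    return 1.0 - sample.appends_delivered / sample.appends_received


def diagnose(sample: RollupSample,
             min_savings: float = 0.5,           # assumed threshold
             max_latency_factor: float = 2.0     # assumed threshold
             ) -> list[str]:
    """Map a sample onto the rollup rows of the troubleshooting table."""
    findings = []
    if sample.flush_latency_ms > max_latency_factor * sample.window_ms:
        # Latency far above the window points at flushing, not the window.
        findings.append("latency: slow flush under load suspected")
    elif sample.flush_latency_ms >= sample.window_ms:
        # Latency tracks the window, so the window itself dominates.
        findings.append("latency: window dominates, consider lowering default_window_ms")
    if rollup_savings(sample) < min_savings:
        findings.append("savings: window too low or terminal flushes dominate")
    return findings


if __name__ == "__main__":
    sample = RollupSample(appends_received=1000, appends_delivered=900,
                          flush_latency_ms=500, window_ms=100)
    for finding in diagnose(sample):
        print(finding)
```

Tune both thresholds against your own baseline; the defaults here are placeholders rather than recommended values.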

For scale and chaos-specific playbooks, see AI Transport production operations.