Description
Observed behavior
We are running MQTT on a single-node NATS deployment. Several times a day we see sudden spikes in JetStream API failures, which cause MQTT connection failures, subscription failures, and message publishing failures. A clean restart of the server resolves the issue. During these incidents there are no anomalies in CPU or memory metrics.
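For context, the server is a stock single-node deployment with JetStream and MQTT enabled; a minimal sketch of that kind of configuration is below (the server name, store path, and limits are placeholders, not our exact values):

```
# Minimal single-node config with MQTT over JetStream (values are placeholders).
server_name: nats-0          # MQTT requires an explicit server name

jetstream {
  store_dir: "/data/jetstream"
  max_file_store: 50GB
}

mqtt {
  port: 1883
}
```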
System Details
Instance Details:
- CPU: 32 cores
- Memory: 128 GB
- Disk storage: 50 GB

Utilization:
- CPU: 2 cores
- Memory: 1 GB
- Disk: 150 MB

Workload:
- MQTT connections: 3,000
- MQTT subscriptions: 6,000 (QoS 1)
- Messages produced: ~30 messages/sec across all topics
- A single NATS queue group subscription consumes the MQTT-published messages on one topic (see the sketch below).
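The consumer side is essentially the pattern below, sketched with the nats.go client (the URL, queue name, and wildcard subject are placeholders, not our exact values):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to the single-node NATS server (URL is a placeholder).
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// One queue group subscription consuming messages that MQTT clients
	// publish on a single topic. MQTT topic levels map to NATS subject
	// tokens, so an MQTT topic like "abcd/user/<id>" arrives on
	// "abcd.user.<id>" ("abcd.user.>" here is an assumed wildcard).
	_, err = nc.QueueSubscribe("abcd.user.>", "workers", func(m *nats.Msg) {
		// Application processing happens here.
		log.Printf("received %d bytes on %s", len(m.Data), m.Subject)
	})
	if err != nil {
		log.Fatal(err)
	}

	// Block forever (simplified for the sketch).
	select {}
}
```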
Associated Logs:
- mid: 102204 - "cae2bc80-7142-11ec-b9b8-33dad110a235" - Unable to persist session "cae2bc80-7142-11ec-b9b8-33dad110a235" (seq=70876): Timeout after 4.000022403s. Request type "SP" on "$MQTT.sess.RT45Zasv" (reply="$MQTT.JSA.S1Nunr6R.SP.RT45Zasv.1iHZZPsxA2EXBvLS043jtn").
- mid: 116735 - "KkuRAJeYH02G8HqecxCiAW" - Unable to add JetStream consumer for subscription on "abcd.user.8a7d3311-4040-40b8-955d-834ce54b8c15": Error - Timeout after 4.000826922s. Request type "CC" on "$JS.API.CONSUMER.DURABLE.CREATE.$MQTT_msgs.51r4DC1W_KkuRAJeYH02G8HqecxLU1k" (reply="$MQTT.JSA.S1Nunr6R.CC.1iHZZPsxA2EXBvLS043jic").
- mid: 84480647 - "mqttjs_a1346563" - Read loop processing time: 5.011585369s.
Another observation: CPU usage never exceeded 2 cores, despite 32 cores being allocated. Could this indicate a potential resource bottleneck?
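If it helps to correlate the incidents with JetStream API activity, below is a minimal sketch of sampling the server's /jsz monitoring endpoint (assumes the default monitor port 8222 is enabled; the api.total/api.errors field names are taken from the /jsz response as we understand it):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// Subset of the /jsz response we care about; all other fields are ignored.
type jsz struct {
	API struct {
		Total  uint64 `json:"total"`
		Errors uint64 `json:"errors"`
	} `json:"api"`
}

func main() {
	// Poll the JetStream monitoring endpoint every few seconds and log the
	// API counters so error spikes can be lined up with the MQTT failures.
	// Host/port assume the default monitor listener (-m 8222).
	for {
		resp, err := http.Get("http://127.0.0.1:8222/jsz")
		if err != nil {
			log.Println("jsz request failed:", err)
			time.Sleep(5 * time.Second)
			continue
		}
		var j jsz
		if err := json.NewDecoder(resp.Body).Decode(&j); err != nil {
			log.Println("decode failed:", err)
		}
		resp.Body.Close()
		fmt.Printf("%s jetstream api total=%d errors=%d\n",
			time.Now().Format(time.RFC3339), j.API.Total, j.API.Errors)
		time.Sleep(5 * time.Second)
	}
}
```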
Expected behavior
No connection, subscription, or publish failures.
Server and client version
NATS Server version 2.10.22
Host environment
Kubernetes v1.25
Steps to reproduce
No response