Symptom
When bringing up the Cassandra-backed docker stack (docker/docker-compose-cassandra-es7.yaml), conductor-server starts but never becomes healthy. During its boot-time initialization, the Cassandra driver session closes mid-query, causing repeated startup failures:
Caused by: com.datastax.driver.core.exceptions.DriverInternalError:
Unexpected exception thrown
at com.netflix.conductor.cassandra.dao.CassandraMetadataDAO.getAllTaskDefsFromDB(CassandraMetadataDAO.java:358)
Caused by: java.lang.IllegalStateException: Could not send request, session is closed
at com.datastax.driver.core.SessionManager.execute(SessionManager.java:701)
at com.netflix.conductor.cassandra.dao.CassandraMetadataDAO.getAllTaskDefsFromDB(CassandraMetadataDAO.java:358)
/health never returns OK. Even with a 600s wait, the service does not recover.
Reproduction
cd conductor
docker compose -f docker/docker-compose-cassandra-es7.yaml build
docker compose -f docker/docker-compose-cassandra-es7.yaml up
# wait — conductor-server logs will show the stack above repeatedly
curl -i http://localhost:8000/health # never returns 200
Reproduced on conductor main and on the branch I was testing (feat/webhooks-from-orkes-split).
Probable root cause (both likely contribute)
-
Cqlsh-passes-but-not-ready startup race. The compose healthcheck on conductor-cassandra is cqlsh -e \"describe keyspaces\". That succeeds the moment the native protocol port accepts connections, well before keyspace creation / schema initialization settles. conductor-server boots on the assumption Cassandra is ready, opens a driver session, and the underlying connection gets torn down before the first real query lands.
-
Tight heap settings. The compose sets MAX_HEAP_SIZE=512M / HEAP_NEWSIZE=128M on cassandra:4. Borderline on modern JVMs. Under boot load the driver may observe intermittent disconnects from GC pauses.
Suggested fixes (cheapest first)
- Bump cassandra heap in the compose to at least
MAX_HEAP_SIZE=1G / HEAP_NEWSIZE=256M.
- Tighten the cassandra healthcheck so it only reports healthy after the application keyspace exists — e.g.
cqlsh -e \"SELECT keyspace_name FROM system_schema.keyspaces WHERE keyspace_name='conductor'\" with generous retries.
- In
CassandraBaseDAO (or wherever the startup session is opened), add bounded retry around the first query so a single transient disconnect doesn't kill the whole process.
Impact
Blocks any docker-based Cassandra deployment of conductor-server. Surfaced while validating a separate PR's persistence matrix.
How I found this
Running a new composite-workflow stress harness against each persistence backend (postgres, mysql, redis, cassandra) for a feature PR. Postgres and Redis passed cleanly; MySQL fails for a separate pre-existing reason (#1104); Cassandra fails as described above. The failing code path (CassandraMetadataDAO.getAllTaskDefsFromDB) is pre-existing core conductor code, not introduced by the PR I was testing.
Symptom
When bringing up the Cassandra-backed docker stack (
docker/docker-compose-cassandra-es7.yaml),conductor-serverstarts but never becomes healthy. During its boot-time initialization, the Cassandra driver session closes mid-query, causing repeated startup failures:/healthnever returns OK. Even with a 600s wait, the service does not recover.Reproduction
Reproduced on conductor
mainand on the branch I was testing (feat/webhooks-from-orkes-split).Probable root cause (both likely contribute)
Cqlsh-passes-but-not-ready startup race. The compose healthcheck on
conductor-cassandraiscqlsh -e \"describe keyspaces\". That succeeds the moment the native protocol port accepts connections, well before keyspace creation / schema initialization settles.conductor-serverboots on the assumption Cassandra is ready, opens a driver session, and the underlying connection gets torn down before the first real query lands.Tight heap settings. The compose sets
MAX_HEAP_SIZE=512M / HEAP_NEWSIZE=128Moncassandra:4. Borderline on modern JVMs. Under boot load the driver may observe intermittent disconnects from GC pauses.Suggested fixes (cheapest first)
MAX_HEAP_SIZE=1G / HEAP_NEWSIZE=256M.cqlsh -e \"SELECT keyspace_name FROM system_schema.keyspaces WHERE keyspace_name='conductor'\"with generousretries.CassandraBaseDAO(or wherever the startup session is opened), add bounded retry around the first query so a single transient disconnect doesn't kill the whole process.Impact
Blocks any docker-based Cassandra deployment of conductor-server. Surfaced while validating a separate PR's persistence matrix.
How I found this
Running a new composite-workflow stress harness against each persistence backend (postgres, mysql, redis, cassandra) for a feature PR. Postgres and Redis passed cleanly; MySQL fails for a separate pre-existing reason (#1104); Cassandra fails as described above. The failing code path (
CassandraMetadataDAO.getAllTaskDefsFromDB) is pre-existing core conductor code, not introduced by the PR I was testing.