Cassandra: conductor-server fails startup with "session is closed" during getAllTaskDefsFromDB #1135

@nthmost-orkes

Symptom

When bringing up the Cassandra-backed docker stack (docker/docker-compose-cassandra-es7.yaml), conductor-server starts but never becomes healthy. During its boot-time initialization, the Cassandra driver session closes mid-query, causing repeated startup failures:

Caused by: com.datastax.driver.core.exceptions.DriverInternalError:
  Unexpected exception thrown
    at com.netflix.conductor.cassandra.dao.CassandraMetadataDAO.getAllTaskDefsFromDB(CassandraMetadataDAO.java:358)
Caused by: java.lang.IllegalStateException: Could not send request, session is closed
    at com.datastax.driver.core.SessionManager.execute(SessionManager.java:701)
    at com.netflix.conductor.cassandra.dao.CassandraMetadataDAO.getAllTaskDefsFromDB(CassandraMetadataDAO.java:358)

/health never returns OK. Even with a 600s wait, the service does not recover.

Reproduction

cd conductor
docker compose -f docker/docker-compose-cassandra-es7.yaml build
docker compose -f docker/docker-compose-cassandra-es7.yaml up
# wait — conductor-server logs will show the stack above repeatedly
curl -i http://localhost:8000/health   # never returns 200

Reproduced on conductor main and on the branch I was testing (feat/webhooks-from-orkes-split).

Probable root cause (both likely contribute)

  1. Cqlsh-passes-but-not-ready startup race. The compose healthcheck on conductor-cassandra is cqlsh -e "describe keyspaces". That succeeds as soon as the native protocol port accepts connections, well before keyspace creation / schema initialization settles. conductor-server boots on the assumption that Cassandra is ready, opens a driver session, and the underlying connection gets torn down before the first real query lands.

  2. Tight heap settings. The compose file sets MAX_HEAP_SIZE=512M / HEAP_NEWSIZE=128M on cassandra:4, which is borderline on modern JVMs. Under boot load, GC pauses may cause the driver to observe intermittent disconnects.

Suggested fixes (cheapest first)

  1. Bump cassandra heap in the compose to at least MAX_HEAP_SIZE=1G / HEAP_NEWSIZE=256M.
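  2. Tighten the cassandra healthcheck so it only reports healthy after the application keyspace exists, e.g. cqlsh -e "SELECT keyspace_name FROM system_schema.keyspaces WHERE keyspace_name='conductor'" | grep -q conductor, with generous retries. The grep matters because cqlsh exits 0 even when the query returns no rows. Note that if conductor-server itself creates the keyspace and also waits on this healthcheck, the two would block each other, so this fix assumes the keyspace is created by a separate init step.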
  3. In CassandraBaseDAO (or wherever the startup session is opened), add a bounded retry around the first query so a single transient disconnect doesn't kill the whole process.
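
A minimal sketch of what the bounded retry in fix 3 could look like. BoundedRetry and withRetry are hypothetical names, not existing Conductor code, and the lambda in main only simulates the "session is closed" failure instead of talking to Cassandra. One caveat: a closed driver session is not expected to reopen on its own, so the retried action would probably need to obtain a fresh session as well as re-run the query.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Conductor: retries an action a bounded
// number of times with exponential backoff between failed attempts.
final class BoundedRetry {
    private BoundedRetry() {}

    static <T> T withRetry(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // back off before the next attempt
                }
            }
        }
        // All attempts failed: surface the last failure to the caller.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for "open session + getAllTaskDefsFromDB": fails twice, then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("Could not send request, session is closed");
            }
            return "task defs loaded";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints: task defs loaded after 3 attempts
    }
}
```

The attempt cap keeps a genuinely broken Cassandra from hanging startup forever: once the attempts are exhausted, the last exception is rethrown and the process still fails loudly.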

Impact

Blocks any docker-based Cassandra deployment of conductor-server. Surfaced while validating a separate PR's persistence matrix.

How I found this

Running a new composite-workflow stress harness against each persistence backend (postgres, mysql, redis, cassandra) for a feature PR. Postgres and Redis passed cleanly; MySQL fails for a separate pre-existing reason (#1104); Cassandra fails as described above. The failing code path (CassandraMetadataDAO.getAllTaskDefsFromDB) is pre-existing core conductor code, not introduced by the PR I was testing.

Labels

bug