Cassandra: conductor-server fails startup with "session is closed" during getAllTaskDefsFromDB

### Symptom

When bringing up the Cassandra-backed docker stack (`docker/docker-compose-cassandra-es7.yaml`), `conductor-server` starts but never becomes healthy. During its boot-time initialization, the Cassandra driver session closes mid-query, causing repeated startup failures:

```
Caused by: com.datastax.driver.core.exceptions.DriverInternalError:
  Unexpected exception thrown
    at com.netflix.conductor.cassandra.dao.CassandraMetadataDAO.getAllTaskDefsFromDB(CassandraMetadataDAO.java:358)
Caused by: java.lang.IllegalStateException: Could not send request, session is closed
    at com.datastax.driver.core.SessionManager.execute(SessionManager.java:701)
    at com.netflix.conductor.cassandra.dao.CassandraMetadataDAO.getAllTaskDefsFromDB(CassandraMetadataDAO.java:358)
```

`/health` never returns OK. Even with a 600s wait, the service does not recover.

### Reproduction

```shell
cd conductor
docker compose -f docker/docker-compose-cassandra-es7.yaml build
docker compose -f docker/docker-compose-cassandra-es7.yaml up
# wait — conductor-server logs will show the stack above repeatedly
curl -i http://localhost:8000/health   # never returns 200
```

Reproduced on conductor `main` and on the branch I was testing (`feat/webhooks-from-orkes-split`).

### Probable root cause (both likely contribute)

1. **Cqlsh-passes-but-not-ready startup race.** The compose healthcheck on `conductor-cassandra` is `cqlsh -e "describe keyspaces"`. That succeeds the moment the native protocol port accepts connections, well before keyspace creation / schema initialization settles. `conductor-server` boots on the assumption Cassandra is ready, opens a driver session, and the underlying connection gets torn down before the first real query lands.

2. **Tight heap settings.** The compose sets `MAX_HEAP_SIZE=512M` and `HEAP_NEWSIZE=128M` on `cassandra:4`, which is borderline on modern JVMs. Under boot load, GC pauses may be long enough for the driver to observe intermittent disconnects.

### Suggested fixes (cheapest first)

1. Bump the Cassandra heap in the compose to at least `MAX_HEAP_SIZE=1G` and `HEAP_NEWSIZE=256M`.
2. Tighten the Cassandra healthcheck so it only reports healthy after the application keyspace exists, e.g. `cqlsh -e "SELECT keyspace_name FROM system_schema.keyspaces WHERE keyspace_name='conductor'"` with generous `retries`.
3. In `CassandraBaseDAO` (or wherever the startup session is opened), add bounded retry around the first query so a single transient disconnect doesn't kill the whole process.
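
For fix 3, a minimal sketch of what a bounded retry could look like. This is not existing Conductor code: `BoundedRetry`, `withRetry`, and the attempt and delay values are hypothetical names and numbers chosen for illustration, and the `main` method simulates the failure with a plain `IllegalStateException` instead of calling the Cassandra driver. One caveat: if the driver session object itself has been closed, retrying against the same session will keep failing, so the supplier passed in would likely need to obtain or reopen a usable session on each attempt.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Conductor: retries a call a bounded number of times.
final class BoundedRetry {

    /**
     * Runs {@code query} up to {@code maxAttempts} times. After each failed attempt
     * except the last, sleeps for the current delay and then doubles it.
     * Rethrows the last failure once all attempts are used up.
     */
    public static <T> T withRetry(Supplier<T> query, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        long delayMs = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts, fall through and rethrow
                }
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw e;
                }
                delayMs *= 2; // simple exponential backoff
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated first query: fails twice the way the stack trace above does, then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("Could not send request, session is closed");
            }
            return "task defs loaded";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "task defs loaded after 3 attempts"
    }
}
```

In the real DAO the supplier would wrap the first metadata query (for example the one issued from `getAllTaskDefsFromDB`). Catching only the driver's transient exception types rather than every `RuntimeException` would probably be the safer choice there.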

### Impact

Blocks any docker-based Cassandra deployment of conductor-server. Surfaced while validating a separate PR's persistence matrix.

### How I found this

Running a new composite-workflow stress harness against each persistence backend (postgres, mysql, redis, cassandra) for a feature PR. Postgres and Redis passed cleanly; MySQL fails for a separate pre-existing reason (#1104); Cassandra fails as described above. The failing code path (`CassandraMetadataDAO.getAllTaskDefsFromDB`) is pre-existing core conductor code, not introduced by the PR I was testing.
