I investigated what happens when I manually restart a CCF node container:
- The node attempts to join as if it never existed
- The CCF network denies it because it believes it already has a node with the same URL
So a restart doesn't just work. I spoke to Gaurav from the az cleanroom team, and he said the extension is capable of detecting dead nodes but doesn't automatically re-provision them. Orchestration is deferred to a higher-level process.
I see a few possible solutions to this:
- We build a simple "orchestrator", which could just be a cron-job-style process that inspects the network health and calls the scaling function whenever a node dies, so that the desired number of nodes is always maintained
- We rework the az-cleanroom containers so that a restarted container does some work to re-identify itself as the dead node
- We rework the az-cleanroom containers so that a restarted container joins as a new node
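To make the first option concrete, here is a rough sketch of a single reconcile pass that a cron job could run. Everything named here is hypothetical: `NodeHealth`, `list_nodes` and `scale_to` are placeholders for whatever the extension actually exposes for health checks and scaling, and I'm assuming the scaling function takes a desired node count.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class NodeHealth:
    """Hypothetical health record for one CCF node."""
    node_id: str
    healthy: bool


def count_missing(nodes: Iterable[NodeHealth], desired: int) -> int:
    """Return how many healthy nodes are missing relative to the desired count."""
    healthy = sum(1 for n in nodes if n.healthy)
    return max(0, desired - healthy)


def reconcile_once(
    list_nodes: Callable[[], List[NodeHealth]],  # assumed: wraps the health check
    scale_to: Callable[[int], None],             # assumed: wraps the scaling function
    desired: int,
) -> int:
    """Run one pass: call the scaling function only if healthy nodes are missing."""
    missing = count_missing(list_nodes(), desired)
    if missing > 0:
        scale_to(desired)
    return missing
```

Keeping the pass stateless means it is safe to run on a schedule: if nothing has died, it does nothing.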