[iris] tolerate Finelog being unreachable during controller startup#5766
ravwojdyla-agent wants to merge 3 commits into
Conversation
LogClient.get_table issues a synchronous RegisterTable RPC; previously a finelog outage at startup crashed the controller, even though every runtime writer already null-checks the resulting Table. Wrap the three startup get_table calls (K8s task_stats + profile, controller-process profile) in a helper that catches the failure and schedules a background ManagedThread to retry registration with bounded exponential backoff. On retry success the helper installs the Table where runtime writers look for it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
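A minimal sketch of the helper described above, under stated assumptions: the names `register_table_with_retry`, `get_table`, and `install` are hypothetical (not the actual iris/Finelog API), and a plain daemon `threading.Thread` stands in for `ManagedThread`.

```python
import threading
import time
from typing import Any, Callable, Optional


def register_table_with_retry(
    get_table: Callable[[], Any],
    install: Callable[[Any], None],
    initial_backoff: float = 1.0,
    max_backoff: float = 60.0,
) -> Optional[Any]:
    """Try the synchronous registration once; on failure, retry in the background.

    Returns the Table on immediate success, else None. In both cases `install`
    is invoked with the Table once registration succeeds, so runtime writers
    (which already null-check) find it where they expect.
    """
    try:
        table = get_table()
        install(table)
        return table
    except Exception:
        def _retry() -> None:
            # Bounded exponential backoff: initial_backoff, doubling up to max_backoff.
            backoff = initial_backoff
            while True:
                time.sleep(backoff)
                try:
                    install(get_table())
                    return
                except Exception:
                    backoff = min(backoff * 2, max_backoff)

        threading.Thread(target=_retry, daemon=True, name="finelog-retry").start()
        return None
```

On the failure path the caller proceeds with a `None` table, matching the existing null-checks, and the background thread fills it in once finelog becomes reachable.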
Could we fix this on the FineLog side? I think if we make the table creation async, that would get rid of the need for the special helpers. So you'd always get the table and be able to write to it, and we'd just move the register into the background thread. @claude can you sketch that out for me in a comment?
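A rough sketch of the async-registration idea proposed here, with assumed names throughout (`AsyncTable`, `register_rpc` are illustrative, not the real FineLog API): the caller always gets a usable table back, writes buffer in memory, and the RegisterTable RPC retries on a background thread.

```python
import queue
import threading
import time


class AsyncTable:
    """A table handle that is usable immediately; registration happens later."""

    def __init__(self, register_rpc, initial_backoff: float = 1.0,
                 max_backoff: float = 60.0):
        self._buffer: queue.Queue = queue.Queue()
        self.registered = threading.Event()
        self._register_rpc = register_rpc
        self._initial_backoff = initial_backoff
        self._max_backoff = max_backoff
        threading.Thread(target=self._register_loop, daemon=True).start()

    def _register_loop(self) -> None:
        # Retry the RegisterTable RPC with bounded exponential backoff.
        backoff = self._initial_backoff
        while not self.registered.is_set():
            try:
                self._register_rpc()
                self.registered.set()
            except Exception:
                time.sleep(backoff)
                backoff = min(backoff * 2, self._max_backoff)

    def write(self, row) -> None:
        # Writes always succeed; a flush thread (elided here) would drain
        # the buffer to finelog once registration lands.
        self._buffer.put(row)
```

This would remove the per-call-site helpers entirely: every `get_table` caller gets a real object, and the outage-tolerance already present in the write path extends to registration.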
Claude finished @rjpower's task in 2m 25s

Sketch: move
LogClient.get_table issues a synchronous RegisterTable RPC; previously a finelog outage at startup crashed the controller even though runtime writers already null-check the resulting Table.[^1] The plan:

- A Controller._register_finelog_table helper that wraps get_table, catches failures, and spawns a background ManagedThread that retries registration with bounded exponential backoff (1s → 60s)
- An on_late_success callback so the caller can install the Table where runtime writers look for it
- The retry thread registers with the ThreadContainer and stops with the controller
- Route all three startup get_table sites through the helper: the K8s provider's task_stats_table and profile_table, and ControllerServiceImpl's controller-process profile_table
- Move ControllerServiceImpl._profile_table registration out of __init__; add set_profile_table so the controller installs it after construction (and re-installs from the retry thread)
- test_controller_starts_when_finelog_table_registration_fails covering both halves: the controller comes up with _profile_table = None, then the background retry installs it

Footnotes

[^1]: Asymmetry confirmed by reading the codebase — Table.write enqueues into an in-memory buffer with bg-thread flush + retry, so once registered the table tolerates finelog outages; only the upfront synchronous register_table was unprotected. ↩
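The set_profile_table change in the sketch above can be illustrated with a hypothetical fragment (the `record_profile` method and fake table are assumptions for illustration, not the actual ControllerServiceImpl interface): registration leaves `__init__`, and the writer null-checks until the table is installed.

```python
from typing import Any, Optional


class ControllerServiceImpl:
    def __init__(self) -> None:
        # No synchronous RegisterTable RPC here, so construction cannot
        # crash on a finelog outage; the controller (or the retry thread)
        # installs the table later via set_profile_table.
        self._profile_table: Optional[Any] = None

    def set_profile_table(self, table: Any) -> None:
        self._profile_table = table

    def record_profile(self, row: Any) -> bool:
        # Runtime writers already null-check: drop the row while the
        # table is not yet registered, write once it is.
        if self._profile_table is None:
            return False
        self._profile_table.write(row)
        return True
```

This is exactly the "both halves" shape the test covers: the service is functional with `_profile_table = None`, and behavior upgrades in place when the late registration succeeds.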