Enhancement Proposal: Distributed Management, Operational Resilience #1060

casibbald · 2025-06-29T15:05:19Z

What this PR does / why we need it:

Elevate Flintlock with Robust Distributed Management, Operational Resilience, and Community-Driven Scalability

Charles Sibbald

Founder & Lead Engineer, Microscaler

Lead Engineer, Tinkerbell Project

26 June 2025

Liquid Metal Governance Board

[Address or Contact Information]

Dear Members of the Liquid Metal Governance Board,

I am pleased to submit this Enhancement Proposal on behalf of Microscaler for your consideration.
At Microscaler, we are committed to advancing open-source technologies that empower scalable and robust
distributed infrastructure.
As the lead engineer for the Tinkerbell Project and founder of Microscaler, I firmly believe in
collaborative innovation that delivers substantial and lasting benefits to our community.

The attached proposal outlines critical enhancements to Flintlock, focusing on comprehensive improvements
to distributed management, operational resilience, and scalability. If implemented successfully, these
enhancements would position Flintlock as a leading system in distributed infrastructure management, significantly
ahead of contemporary solutions in robustness, operational clarity, and community-driven innovation.

In preparing this proposal, we consciously chose not to fork Flintlock or fragment its potential user base.
Instead, we decided to collaborate actively with the existing community, enhancing Flintlock’s capabilities to
create a more compelling, unified solution. This collaborative approach not only preserves but actively
grows Flintlock's user base, fostering broader adoption and deeper community engagement.

We have structured the proposal with an inclusive vision, inviting collaboration and contribution from the
broader community. This strategic partnership will ensure the enhancements meet diverse operational
requirements while maintaining alignment with our shared goals.

Your support and leadership will be instrumental in realising the full potential of Flintlock. Together, we
can set new standards in operational excellence and community collaboration, establishing Flintlock
as the benchmark in its domain.

Thank you for considering this forward-thinking proposal. I look forward to your feedback and to
continued collaboration that advances our collective vision.

Sincerely,

Charles Sibbald
Founder, Microscaler (Tinkerbell Project)

Which issue(s) this PR fixes *
N/A

Special notes for your reviewer:

Checklist:

netlify · 2025-06-29T15:05:23Z

✅ Deploy Preview for flintlock-docs canceled.

Name	Link
🔨 Latest commit	`ce38f31`
🔍 Latest deploy log	https://app.netlify.com/projects/flintlock-docs/deploys/6862a666f746200008581097

Copilot

Pull Request Overview

This PR introduces a series of proposal documents aimed at enhancing Flintlock’s distributed management. The documents cover a wide range of topics including configuration clarity, VM migration, observability, garbage collection, security, scheduling, network partition handling, host provisioning, failure recovery, API unification, distributed scheduling, and Raft consensus integration.

Adds detailed proposal documents (in Markdown) for each enhancement area.
Updates coverletter and README to reflect the holistic enhancement proposal.

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

File	Description
docs/proposals/distributed_architecture/docs/15-Configuration_and_Operational_Clarity.md	New proposal outlining centralized configuration and operational clarity.
docs/proposals/distributed_architecture/docs/14-Graceful_VM_Migration_Support.md	New proposal for introducing graceful migration support for VMs.
docs/proposals/distributed_architecture/docs/13-Observability_Metrics_and_Tracing.md	Details observability enhancements via metrics, logging, and tracing.
docs/proposals/distributed_architecture/docs/12-Garbage_Collection_Policy.md	Introduces explicit garbage collection policies for improved resource management.
docs/proposals/distributed_architecture/docs/11-Security_and_Authorization.md	Proposes security upgrades using mTLS, authentication and RBAC.
docs/proposals/distributed_architecture/docs/10-Raft_Log_Scalability_and_Snapshotting.md	Outlines improvements for log compaction and snapshotting with Raft.
docs/proposals/distributed_architecture/docs/09-Host_Rejoining and_State_Reconciliation.md	Describes procedures for host reintegration and state reconciliation.
docs/proposals/distributed_architecture/docs/08-Leader_Scheduling_Bottleneck.md	Proposes decentralizing scheduling to alleviate leader bottlenecks.
docs/proposals/distributed_architecture/docs/07-Network_Partition_Handling_&_Split-Brain_Scenarios.md	Details strategies to handle network partitions and prevent split-brain issues.
docs/proposals/distributed_architecture/docs/06-Host_Regeneration_and_PXE-Based_Provisioning.md	Presents PXE-based provisioning for automated host regeneration.
docs/proposals/distributed_architecture/docs/05-Detached_Host_Garbage_Collection.md	Defines automated cleanup for detached hosts to prevent resource leakage.
docs/proposals/distributed_architecture/docs/04-Host_Failure_Handling(Leader_and_Follower_Failures).md	Proposes recovery strategies for host failures with automated VM resurrection.
docs/proposals/distributed_architecture/docs/03-Unified_API_Interface_and_Proxy_Routing.md	Outlines a unified API interface with proxy routing for consistent state queries.
docs/proposals/distributed_architecture/docs/02-Distributed_Scheduling_and_Bidding_Mechanism.md	Introduces a resource bidding mechanism for distributed VM scheduling.
docs/proposals/distributed_architecture/docs/01-Raft_Consensus_Integration.md	Details integration of Raft consensus for cluster-wide state synchronization.
docs/proposals/distributed_architecture/coverletter.md	Cover letter summarizing the enhancement proposal.
docs/proposals/distributed_architecture/README.md	README providing a high-level summary and table of contents for the proposal.

Comments suppressed due to low confidence (3)

docs/proposals/distributed_architecture/docs/09-Host_Rejoining and_State_Reconciliation.md:1

[nitpick] The file name contains a space ('Host_Rejoining and_State_Reconciliation.md'), which might cause issues with URL references or tooling. Consider renaming it (e.g., using underscores or hyphens consistently) for improved consistency.

## Host Rejoining and State Reconciliation

docs/proposals/distributed_architecture/README.md:53

The table of contents references a 'VM State Persistence and Recovery' document that does not correspond to any provided file. Please update the README to reflect the correct proposal documents and renumber accordingly.

| VM State Persistence and Recovery | Ensure VM state can be persisted and recovered, allowing restoration of workloads after failures or migrations with minimal data loss. | [Details](./docs/07-VM_State_Persistence_and_Recovery.md) |

docs/proposals/distributed_architecture/README.md:52

[nitpick] The README refers to 'Host Regenesis and PXE-Based Provisioning', while the actual file is named '06-Host_Regeneration_and_PXE-Based_Provisioning.md'. Update the naming in the README for consistency.

| Host Regenesis and PXE-Based Provisioning | Support host re-provisioning using PXE, enabling rapid recovery and scaling by automating bare-metal host setup and configuration. | [Details](./docs/06-Host_Regenesis_and_PXE_Provisioning.md) |

…ion_Handling_&_Split-Brain_Scenarios.md Co-authored-by: Copilot <[email protected]> chore: Enhancement Proposal: Elevate Flintlock with Robust Distributed Management, Operational Resilience, and Community-Driven Scalability

richardcase

Thanks @casibbald. Let's jump on a call to discuss.

I'd like to keep a clear separation of concerns when adding support for a set of flintlocks and scheduling. I'd also be wary of reimplementing etcd, for example.

richardcase · 2025-06-30T16:29:56Z

docs/proposals/distributed_architecture/README.md

+
+## Motivation and Background
+
+Flintlock currently faces scalability, robustness, and operational clarity limitations that hinder effective deployment at scale. Real-world scenarios, such as managing large-scale distributed deployments and recovering from critical host failures, underscore the importance of these enhancements. Addressing these areas ensures Flintlock can effectively support complex, real-world operational needs.


Flintlock was only ever designed for a single host. It was always envisioned that any distributed scheduling/clustering would be done at a layer above.

I'd be keen to maintain this separation and keep flintlock solely focused on interacting with microvms on a single machine.

The layer above is something I've thought of as being called brigade: https://github.com/liquidmetal-dev/brigade

The idea is that it would be API compatible with flintlock, so that consumers, like CAPMVM, could switch to distributed scheduling across a number of flintlock hosts without any changes.
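To illustrate the API-compatibility idea, here is a minimal Go sketch. Everything in it is an assumption made for illustration: `MicroVMService`, `MicroVMSpec`, `singleHost`, and `fleet` are hypothetical names, not flintlock's or brigade's actual API, and the "first host that accepts" placement rule is only a placeholder for real scheduling. The point is that consumer code depends on one interface and does not change when a multi-host layer is swapped in.

```go
package main

import (
	"errors"
	"fmt"
)

// MicroVMSpec is a hypothetical, trimmed-down create request.
type MicroVMSpec struct {
	Name     string
	VCPU     int
	MemoryMB int
}

// MicroVMService is a hypothetical stand-in for the API surface that
// consumers call. It is not the real flintlock API.
type MicroVMService interface {
	CreateMicroVM(spec MicroVMSpec) (hostID string, err error)
}

// singleHost models one instance that manages microVMs on one machine.
type singleHost struct {
	id        string
	freeMemMB int
}

func (h *singleHost) CreateMicroVM(spec MicroVMSpec) (string, error) {
	if spec.MemoryMB > h.freeMemMB {
		return "", errors.New("insufficient memory on " + h.id)
	}
	h.freeMemMB -= spec.MemoryMB
	return h.id, nil
}

// fleet implements the same interface and delegates to the first host
// that accepts the request (placeholder placement logic).
type fleet struct {
	hosts []MicroVMService
}

func (f *fleet) CreateMicroVM(spec MicroVMSpec) (string, error) {
	for _, h := range f.hosts {
		if id, err := h.CreateMicroVM(spec); err == nil {
			return id, nil
		}
	}
	return "", errors.New("no host could place " + spec.Name)
}

// provision is consumer code: it only knows about the interface.
func provision(svc MicroVMService, spec MicroVMSpec) string {
	id, err := svc.CreateMicroVM(spec)
	if err != nil {
		return "error: " + err.Error()
	}
	return spec.Name + " placed on " + id
}

func main() {
	a := &singleHost{id: "host-a", freeMemMB: 512}
	b := &singleHost{id: "host-b", freeMemMB: 4096}
	spec := MicroVMSpec{Name: "vm-1", VCPU: 2, MemoryMB: 1024}

	fmt.Println(provision(a, spec))                                     // talks to one host directly
	fmt.Println(provision(&fleet{hosts: []MicroVMService{a, b}}, spec)) // same call, many hosts
}
```

In this sketch the single-host call fails because `host-a` is too small, while the same `provision` call against the `fleet` succeeds on `host-b`.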

richardcase · 2025-06-30T16:36:21Z

docs/proposals/distributed_architecture/docs/01-Raft_Consensus_Integration.md

+
+### Gap Definition and Improvement Objectives
+
+Currently, Flintlock operates with isolated state per host, lacking a unified, cluster-wide coordination mechanism. Integrating Raft consensus addresses this gap by ensuring reliable leader election, consistent log replication, and synchronized state management across the cluster.


Raft could be used on the layer that does the distributed scheduling, but not in flintlock itself.

One option would be to just use etcd for storage and leader election. Might be easier than using a Raft package (like https://github.com/etcd-io/raft) directly.

richardcase · 2025-06-30T16:37:34Z

docs/proposals/distributed_architecture/docs/02-Distributed_Scheduling_and_Bidding_Mechanism.md

+
+### Gap Definition and Improvement Objectives
+
+Flintlock currently lacks a distributed scheduling system, relying instead on manual workload allocation per host. Implementing a distributed scheduling mechanism using a bidding process will ensure balanced resource utilization and improved VM provisioning speed.


This was never going to be part of Flintlock itself, but rather part of the wider Liquid Metal project.

A reverse bidding process would make the scheduling simpler.
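The comment does not spell out what "reverse bidding" means here. One possible reading is a reverse auction: the scheduler announces a request, each host that can fit it replies with an offer, and the scheduler takes the lowest-cost offer. The Go sketch below shows that reading only. The types `Request`, `Host`, and `Bid`, and the cost formula, are assumptions for illustration and do not come from the proposal or from Liquid Metal.

```go
package main

import "fmt"

// Request describes the resources a new microVM needs (hypothetical type).
type Request struct {
	VCPU     int
	MemoryMB int
}

// Host is a hypothetical view of one machine's free capacity.
type Host struct {
	ID           string
	FreeVCPU     int
	FreeMemoryMB int
}

// Bid is a host's offer to run the request; lower Cost is better.
type Bid struct {
	HostID string
	Cost   float64
}

// bid returns an offer, or ok=false if the request is invalid or does not fit.
func (h Host) bid(r Request) (Bid, bool) {
	if r.VCPU <= 0 || r.MemoryMB <= 0 || r.VCPU > h.FreeVCPU || r.MemoryMB > h.FreeMemoryMB {
		return Bid{}, false
	}
	// Assumed cost: the fraction of free capacity the request would consume,
	// so hosts with more headroom submit lower bids.
	cost := float64(r.VCPU)/float64(h.FreeVCPU) + float64(r.MemoryMB)/float64(h.FreeMemoryMB)
	return Bid{HostID: h.ID, Cost: cost}, true
}

// selectHost collects bids and picks the lowest one.
func selectHost(hosts []Host, r Request) (string, bool) {
	var best Bid
	found := false
	for _, h := range hosts {
		b, ok := h.bid(r)
		if !ok {
			continue
		}
		if !found || b.Cost < best.Cost {
			best, found = b, true
		}
	}
	return best.HostID, found
}

func main() {
	hosts := []Host{
		{ID: "host-a", FreeVCPU: 4, FreeMemoryMB: 2048},
		{ID: "host-b", FreeVCPU: 16, FreeMemoryMB: 16384},
		{ID: "host-c", FreeVCPU: 1, FreeMemoryMB: 512},
	}
	id, ok := selectHost(hosts, Request{VCPU: 2, MemoryMB: 1024})
	fmt.Println(id, ok)
}
```

Under this reading the scheduler needs no global view of cluster state: it only compares the offers it receives, which may be part of why the reviewer considers it simpler.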

richardcase · 2025-06-30T16:38:34Z

docs/proposals/distributed_architecture/docs/03-Unified_API_Interface_and_Proxy_Routing.md

+**Objectives:**
+
+* Preserve compatibility with existing Flintlock APIs
+* Provide enhanced global state querying through new `/api/v2` endpoints


We wouldn't need a v2 if the changes were made outside of flintlock.

richardcase · 2025-06-30T16:40:45Z

docs/proposals/distributed_architecture/docs/06-Host_Regenesis_and_PXE-Based_Provisioning.md

+
+### Gap Definition and Improvement Objectives
+
+Currently, Flintlock lacks automated and streamlined procedures for securely reprovisioning and reintegrating detached hosts back into the cluster. Introducing PXE-based provisioning will enable automated host regeneration and reduce manual operational tasks.


> detached hosts into the cluster

I'm taking that to mean the machines that run flintlock.

I think PXE-based provisioning is out of scope for this.

This is an implementation detail for anyone who uses Liquid Metal / Flintlock.

richardcase · 2025-06-30T16:42:43Z

docs/proposals/distributed_architecture/docs/11-Security_and_Authorization.md

+
+**Objectives:**
+
+* Secure intra-cluster communication using mutual TLS (mTLS)


Flintlock has mTLS support: #464

richardcase · 2025-06-30T16:43:39Z

docs/proposals/distributed_architecture/docs/11-Security_and_Authorization.md

+**Objectives:**
+
+* Secure intra-cluster communication using mutual TLS (mTLS)
+* Robust authentication mechanisms for cluster nodes


Let's talk about auth as it applies to the layer above flintlock.

richardcase · 2025-06-30T16:45:06Z

docs/proposals/distributed_architecture/docs/14-Graceful_VM_Migration_Support.md

+
+### Gap Definition and Improvement Objectives
+
+Currently, Flintlock lacks explicit support for seamless VM migration, which limits flexibility and service continuity during host maintenance or failures. Introducing VM migration capabilities will significantly reduce downtime and improve service resilience.


The only way to do migration is to take a snapshot and boot another VM from the snapshot.
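The snapshot-then-boot flow described above can be sketched in Go as follows. This is an in-memory model under stated assumptions: the `host` type and its `snapshot`, `bootFrom`, and `remove` methods are hypothetical and do not correspond to any flintlock or hypervisor API. It only shows the order of steps: capture state on the source, boot a new VM from that state on the target, then retire the source VM.

```go
package main

import (
	"errors"
	"fmt"
)

// Snapshot is a hypothetical handle to saved VM state.
type Snapshot struct {
	VMID  string
	State string
}

// host is an in-memory stand-in for a machine running microVMs.
type host struct {
	name string
	vms  map[string]string // VM ID -> opaque guest state
}

func newHost(name string) *host {
	return &host{name: name, vms: map[string]string{}}
}

// snapshot captures the state of a VM on this host.
func (h *host) snapshot(vmID string) (Snapshot, error) {
	state, ok := h.vms[vmID]
	if !ok {
		return Snapshot{}, errors.New("unknown VM " + vmID + " on " + h.name)
	}
	return Snapshot{VMID: vmID, State: state}, nil
}

// bootFrom starts a new VM on this host from a snapshot.
func (h *host) bootFrom(s Snapshot) {
	h.vms[s.VMID] = s.State
}

// remove retires a VM from this host.
func (h *host) remove(vmID string) {
	delete(h.vms, vmID)
}

// migrate moves a VM by snapshotting it on src and booting it on dst.
func migrate(src, dst *host, vmID string) error {
	snap, err := src.snapshot(vmID)
	if err != nil {
		return fmt.Errorf("snapshot: %w", err)
	}
	dst.bootFrom(snap)
	src.remove(vmID) // only after the target has the state
	return nil
}

func main() {
	src, dst := newHost("host-a"), newHost("host-b")
	src.vms["vm-1"] = "app-v1"

	if err := migrate(src, dst, "vm-1"); err != nil {
		fmt.Println("migration failed:", err)
		return
	}
	fmt.Println("vm-1 now on", dst.name, "with state", dst.vms["vm-1"])
}
```

Note that a flow like this is not a live migration: the guest is presumably unavailable between the snapshot and the boot on the target host.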

github-actions · 2025-08-30T07:24:57Z

This PR is stale because it has been open 60 days with no activity.

Copilot AI review requested due to automatic review settings June 29, 2025 15:05

casibbald added the kind/proposal label Jun 29, 2025

Copilot AI reviewed Jun 29, 2025

casibbald force-pushed the Microscaler-distributed-enhancement-proposal branch 4 times, most recently from 9b89548 to 8325b9e Compare June 30, 2025 14:58

casibbald force-pushed the Microscaler-distributed-enhancement-proposal branch from 8325b9e to ce38f31 Compare June 30, 2025 14:59

richardcase reviewed Jun 30, 2025

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 30, 2025

