@casibbald

What this PR does / why we need it:

Elevate Flintlock with Robust Distributed Management, Operational Resilience, and Community-Driven Scalability


Charles Sibbald

Founder & Lead Engineer, Microscaler

Lead Engineer, Tinkerbell Project

26 June 2025

Liquid Metal Governance Board

[Address or Contact Information]

Dear Members of the Liquid Metal Governance Board,

I am pleased to submit this Enhancement Proposal on behalf of Microscaler for your consideration.
At Microscaler, we are committed to advancing open-source technologies that empower scalable and robust
distributed infrastructure.
As the lead engineer for the Tinkerbell Project and founder of Microscaler, I firmly believe in
collaborative innovation that delivers substantial and lasting benefits to our community.

The attached proposal outlines critical enhancements to Flintlock, focusing on comprehensive improvements
to distributed management, operational resilience, and scalability. If implemented successfully, these
enhancements position Flintlock as a leading system in distributed infrastructure management, significantly
ahead of contemporary solutions in robustness, operational clarity, and community-driven innovation.

In preparing this proposal, we consciously chose not to fork Flintlock or fragment its potential user base.

Instead, we decided to collaborate actively with the existing community, enhancing Flintlock's capabilities to
create a more compelling, unified solution. This collaborative approach not only preserves but actively
grows Flintlock's user base, fostering broader adoption and deeper community engagement.

We have structured the proposal with an inclusive vision, inviting collaboration and contribution from the
broader community. This strategic partnership will ensure the enhancements meet diverse operational
requirements while maintaining alignment with our shared goals.

Your support and leadership will be instrumental in realising the full potential of Flintlock. Together, we
can set new standards in operational excellence and community collaboration, establishing Flintlock
as the benchmark in its domain.

Thank you for considering this forward-thinking proposal. I look forward to your feedback and to
continued collaboration that advances our collective vision.

Sincerely,

Charles Sibbald
Founder, Microscaler (Tinkerbell Project)

Which issue(s) this PR fixes *
N/A

Special notes for your reviewer:

Checklist:

Copilot AI review requested due to automatic review settings June 29, 2025 15:05
@netlify

netlify bot commented Jun 29, 2025

Deploy Preview for flintlock-docs canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | ce38f31 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/flintlock-docs/deploys/6862a666f746200008581097 |


Copilot AI left a comment

Pull Request Overview

This PR introduces a series of proposal documents aimed at enhancing Flintlock’s distributed management. The documents cover a wide range of topics including configuration clarity, VM migration, observability, garbage collection, security, scheduling, network partition handling, host provisioning, failure recovery, API unification, distributed scheduling, and Raft consensus integration.

  • Adds detailed proposal documents (in Markdown) for each enhancement area.
  • Updates coverletter and README to reflect the holistic enhancement proposal.

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `docs/proposals/distributed_architecture/docs/15-Configuration_and_Operational_Clarity.md` | New proposal outlining centralized configuration and operational clarity. |
| `docs/proposals/distributed_architecture/docs/14-Graceful_VM_Migration_Support.md` | New proposal for introducing graceful migration support for VMs. |
| `docs/proposals/distributed_architecture/docs/13-Observability_Metrics_and_Tracing.md` | Details observability enhancements via metrics, logging, and tracing. |
| `docs/proposals/distributed_architecture/docs/12-Garbage_Collection_Policy.md` | Introduces explicit garbage collection policies for improved resource management. |
| `docs/proposals/distributed_architecture/docs/11-Security_and_Authorization.md` | Proposes security upgrades using mTLS, authentication and RBAC. |
| `docs/proposals/distributed_architecture/docs/10-Raft_Log_Scalability_and_Snapshotting.md` | Outlines improvements for log compaction and snapshotting with Raft. |
| `docs/proposals/distributed_architecture/docs/09-Host_Rejoining and_State_Reconciliation.md` | Describes procedures for host reintegration and state reconciliation. |
| `docs/proposals/distributed_architecture/docs/08-Leader_Scheduling_Bottleneck.md` | Proposes decentralizing scheduling to alleviate leader bottlenecks. |
| `docs/proposals/distributed_architecture/docs/07-Network_Partition_Handling_&_Split-Brain_Scenarios.md` | Details strategies to handle network partitions and prevent split-brain issues. |
| `docs/proposals/distributed_architecture/docs/06-Host_Regeneration_and_PXE-Based_Provisioning.md` | Presents PXE-based provisioning for automated host regeneration. |
| `docs/proposals/distributed_architecture/docs/05-Detached_Host_Garbage_Collection.md` | Defines automated cleanup for detached hosts to prevent resource leakage. |
| `docs/proposals/distributed_architecture/docs/04-Host_Failure_Handling(Leader_and_Follower_Failures).md` | Proposes recovery strategies for host failures with automated VM resurrection. |
| `docs/proposals/distributed_architecture/docs/03-Unified_API_Interface_and_Proxy_Routing.md` | Outlines a unified API interface with proxy routing for consistent state queries. |
| `docs/proposals/distributed_architecture/docs/02-Distributed_Scheduling_and_Bidding_Mechanism.md` | Introduces a resource bidding mechanism for distributed VM scheduling. |
| `docs/proposals/distributed_architecture/docs/01-Raft_Consensus_Integration.md` | Details integration of Raft consensus for cluster-wide state synchronization. |
| `docs/proposals/distributed_architecture/coverletter.md` | Cover letter summarizing the enhancement proposal. |
| `docs/proposals/distributed_architecture/README.md` | README providing a high-level summary and table of contents for the proposal. |
Comments suppressed due to low confidence (3)

docs/proposals/distributed_architecture/docs/09-Host_Rejoining and_State_Reconciliation.md:1

  • [nitpick] The file name contains a space ('Host_Rejoining and_State_Reconciliation.md'), which might cause issues with URL references or tooling. Consider renaming it (e.g., using underscores or hyphens consistently) for improved consistency.
## Host Rejoining and State Reconciliation

docs/proposals/distributed_architecture/README.md:53

  • The table of contents references a 'VM State Persistence and Recovery' document that does not correspond to any provided file. Please update the README to reflect the correct proposal documents and renumber accordingly.
| VM State Persistence and Recovery | Ensure VM state can be persisted and recovered, allowing restoration of workloads after failures or migrations with minimal data loss. | [Details](./docs/07-VM_State_Persistence_and_Recovery.md) |

docs/proposals/distributed_architecture/README.md:52

  • [nitpick] The README refers to 'Host Regenesis and PXE-Based Provisioning', while the actual file is named '06-Host_Regeneration_and_PXE-Based_Provisioning.md'. Update the naming in the README for consistency.
| Host Regenesis and PXE-Based Provisioning | Support host re-provisioning using PXE, enabling rapid recovery and scaling by automating bare-metal host setup and configuration. | [Details](./docs/06-Host_Regenesis_and_PXE_Provisioning.md) |

@casibbald casibbald force-pushed the Microscaler-distributed-enhancement-proposal branch 4 times, most recently from 9b89548 to 8325b9e Compare June 30, 2025 14:58
…ion_Handling_&_Split-Brain_Scenarios.md

Co-authored-by: Copilot <[email protected]>

chore: Enhancement Proposal: Elevate Flintlock with Robust Distributed Management, Operational Resilience, and Community-Driven Scalability
@casibbald casibbald force-pushed the Microscaler-distributed-enhancement-proposal branch from 8325b9e to ce38f31 Compare June 30, 2025 14:59
Member

@richardcase richardcase left a comment

Thanks @casibbald. Let's jump on a call to discuss.

I'd like to keep a clear separation of concerns when adding support for a set of flintlocks and scheduling. I'd also be wary of reimplementing etcd, for example.


## Motivation and Background

Flintlock currently faces scalability, robustness, and operational clarity limitations that hinder effective deployment at scale. Real-world scenarios, such as managing large-scale distributed deployments and recovering from critical host failures, underscore the importance of these enhancements. Addressing these areas ensures Flintlock can effectively support complex, real-world operational needs.
Member

Flintlock was only ever designed for a single host. It was always envisioned that any distributed scheduling/clustering would be done at a layer above.

I'd be keen to maintain this separation and keep flintlock solely focused on interacting with microvms on a single machine.

Member

The layer above is something I've thought of as being called brigade: https://github.com/liquidmetal-dev/brigade

The idea is that it would be API compatible with flintlock, so that consumers, like CAPMVM, could switch to distributed scheduling across a number of flintlock hosts without any changes.
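
To make the API-compatibility idea concrete, here is a rough sketch. Everything in it is hypothetical: `MicroVMService`, `SingleHost`, `Scheduler`, and `provision` are illustrative names, not flintlock's or brigade's real API, and the placement rule is deliberately naive. The point is only that the consumer calls the same methods whether it talks to one host or to a layer that fans out across several.

```python
from typing import Protocol


class MicroVMService(Protocol):
    """Hypothetical stand-in for the API a consumer such as CAPMVM calls."""

    def create_microvm(self, name: str, vcpu: int) -> str: ...
    def list_microvms(self) -> list[str]: ...


class SingleHost:
    """Stand-in for one flintlock host; it only knows its own microVMs."""

    def __init__(self, host_id: str, vcpu_capacity: int) -> None:
        self.host_id = host_id
        self.free_vcpu = vcpu_capacity
        self.vms: list[str] = []

    def create_microvm(self, name: str, vcpu: int) -> str:
        if vcpu > self.free_vcpu:
            raise RuntimeError(f"{self.host_id}: not enough free vCPU")
        self.free_vcpu -= vcpu
        self.vms.append(name)
        return f"{self.host_id}/{name}"

    def list_microvms(self) -> list[str]:
        return list(self.vms)


class Scheduler:
    """Assumed layer above: same methods, but it picks a host per request."""

    def __init__(self, hosts: list[SingleHost]) -> None:
        self.hosts = hosts

    def create_microvm(self, name: str, vcpu: int) -> str:
        # Naive placement for illustration: most free vCPU wins.
        target = max(self.hosts, key=lambda h: h.free_vcpu)
        return target.create_microvm(name, vcpu)

    def list_microvms(self) -> list[str]:
        return [vm for h in self.hosts for vm in h.list_microvms()]


def provision(service: MicroVMService, name: str) -> str:
    """Consumer code: unaware of whether it talks to one host or many."""
    return service.create_microvm(name, vcpu=2)
```

Swapping `SingleHost` for `Scheduler` requires no change to `provision`, which is the property described above.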


### Gap Definition and Improvement Objectives

Currently, Flintlock operates with isolated state per host, lacking a unified, cluster-wide coordination mechanism. Integrating Raft consensus addresses this gap by ensuring reliable leader election, consistent log replication, and synchronized state management across the cluster.
Member

Raft could be used on the layer that does the distributed scheduling, but not in flintlock itself.

One option would be to just use etcd for storage and leader election. Might be easier than using a Raft package (like https://github.com/etcd-io/raft) directly.
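
As a rough illustration of why a key-value store with leases can be enough for leader election, the sketch below uses an in-memory stand-in. `InMemoryKV`, `Lease`, and `try_acquire` are hypothetical names and this is not the etcd client API; it only assumes the store can atomically grant a key to a candidate when the key is free or its lease has expired, which etcd's leases and transactions make possible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class InMemoryKV:
    """Hypothetical stand-in for a store with atomic lease acquisition."""

    def __init__(self) -> None:
        self._leases: dict[str, Lease] = {}

    def try_acquire(self, key: str, candidate: str, now: float, ttl: float) -> bool:
        current = self._leases.get(key)
        # Grant the lease if it is free, expired, or being renewed by its holder.
        if current is None or current.expires_at <= now or current.holder == candidate:
            self._leases[key] = Lease(candidate, now + ttl)
            return True
        return False

    def leader(self, key: str, now: float) -> Optional[str]:
        current = self._leases.get(key)
        if current is None or current.expires_at <= now:
            return None
        return current.holder
```

Each scheduler node would call `try_acquire` periodically; whoever holds the unexpired lease acts as leader, and a crashed leader is replaced once its lease runs out. No Raft code lives in the scheduler itself under this approach.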


### Gap Definition and Improvement Objectives

Flintlock currently lacks a distributed scheduling system, relying instead on manual workload allocation per host. Implementing a distributed scheduling mechanism using a bidding process will ensure balanced resource utilization and improved VM provisioning speed.
Member

This was never going to be part of Flintlock itself, but a part of the wider Liquid Metal.

A reverse bidding process would make the scheduling simpler.
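
The comment does not spell out the mechanism, so the following is only one possible reading of "reverse bidding", modelled on a reverse auction: the scheduler announces a VM request, each host answers with a cost (or declines), and the lowest bid wins. `Host`, `bid`, `place`, and the utilisation-based cost are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    name: str
    total_vcpu: int
    used_vcpu: int = 0

    def bid(self, vcpu: int) -> Optional[float]:
        """Return a cost for hosting the VM, or None to decline."""
        if self.used_vcpu + vcpu > self.total_vcpu:
            return None
        # Cost is the utilisation the host would reach; lower is better.
        return (self.used_vcpu + vcpu) / self.total_vcpu


def place(hosts: list[Host], vcpu: int) -> Optional[Host]:
    """Collect bids and award the VM to the lowest bidder."""
    bids = []
    for host in hosts:
        cost = host.bid(vcpu)
        if cost is not None:
            bids.append((cost, host.name, host))
    if not bids:
        return None
    # Ties on cost are broken by host name to keep placement deterministic.
    _, _, winner = min(bids, key=lambda b: (b[0], b[1]))
    winner.used_vcpu += vcpu
    return winner
```

Under this reading the scheduler needs no global view of host capacity: each host computes its own bid from local state, and the scheduler only compares numbers.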

**Objectives:**

* Preserve compatibility with existing Flintlock APIs
* Provide enhanced global state querying through new `/api/v2` endpoints
Member

We wouldn't need a v2 if the changes were made outside of flintlock.


### Gap Definition and Improvement Objectives

Currently, Flintlock lacks automated and streamlined procedures for securely reprovisioning and reintegrating detached hosts back into the cluster. Introducing PXE-based provisioning will enable automated host regeneration and reduce manual operational tasks.
Member

> detached hosts into the cluster

I'm taking that to mean the machines that run flintlock.

I think PXE-based provisioning is out of scope for this.

Member

This is an implementation detail for anyone who uses Liquid Metal / Flintlock.


**Objectives:**

* Secure intra-cluster communication using mutual TLS (mTLS)
Member

Flintlock has mTLS support: #464

**Objectives:**

* Secure intra-cluster communication using mutual TLS (mTLS)
* Robust authentication mechanisms for cluster nodes
Member

Let's talk about auth as it applies to the layer above flintlock.


### Gap Definition and Improvement Objectives

Currently, Flintlock lacks explicit support for seamless VM migration, which limits flexibility and service continuity during host maintenance or failures. Introducing VM migration capabilities will significantly reduce downtime and improve service resilience.
Member

The only way to do migration is to take a snapshot and boot another VM from the snapshot.
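
A minimal sketch of that snapshot-and-boot flow, using in-memory stand-ins. `Host`, `Snapshot`, `boot_from`, and `migrate` are hypothetical names, and the "state" is just a byte string, so this shows the ordering of steps and nothing about how a real snapshot is taken or transferred.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    vm_name: str
    state: bytes


@dataclass
class Host:
    name: str
    vms: dict[str, bytes] = field(default_factory=dict)

    def snapshot(self, vm_name: str) -> Snapshot:
        return Snapshot(vm_name, self.vms[vm_name])

    def boot_from(self, snap: Snapshot) -> None:
        self.vms[snap.vm_name] = snap.state

    def delete(self, vm_name: str) -> None:
        del self.vms[vm_name]


def migrate(vm_name: str, source: Host, target: Host) -> None:
    snap = source.snapshot(vm_name)  # 1. capture the VM's state on the source
    target.boot_from(snap)           # 2. boot a new VM from it on the target
    source.delete(vm_name)           # 3. remove the original once the new VM exists
```

Deleting the source VM last means a failure in step 2 leaves the original untouched.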

@github-actions
Contributor

This PR is stale because it has been open 60 days with no activity.

@github-actions github-actions bot added the lifecycle/stale label Aug 30, 2025

Labels

kind/proposal, lifecycle/stale

3 participants