Enhancement Proposal: Distributed Management, Operational Resilience (#1060)
# Liquid Metal Governance Enhancement Proposal

## Enhancement Proposal: Flintlock Distributed Management and Operational Improvements

**Proposal by:** Microscaler

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Motivation and Background](#motivation-and-background)
3. [Microscaler's Requirements and Objectives](#microscalers-requirements-and-objectives)
4. [Detailed Technical Proposals](#detailed-technical-proposals)
5. [Technical and Community Value](#technical-and-community-value)
6. [Benefits to the Broader Community](#benefits-to-the-broader-community)
7. [Implementation Plan and Collaboration](#implementation-plan-and-collaboration)
8. [Governance and Community Engagement](#governance-and-community-engagement)
9. [References and Supporting Documentation](#references-and-supporting-documentation)

---

## Executive Summary

Microscaler proposes comprehensive enhancements to Flintlock that significantly advance its distributed management capabilities, operational resilience, and scalability. Key benefits include improved reliability, reduced downtime, enhanced security, and simplified operational practices. Microscaler seeks active collaboration with the Liquid Metal governance board and the broader community to realise these improvements promptly and effectively.

---

## Motivation and Background

Flintlock currently faces limitations in scalability, robustness, and operational clarity that hinder effective deployment at scale. Real-world scenarios, such as managing large-scale distributed deployments and recovering from critical host failures, underscore the importance of these enhancements. Addressing these areas ensures Flintlock can support complex, real-world operational needs.

---

## Microscaler's Requirements and Objectives

Microscaler seeks these enhancements to manage distributed workloads efficiently at scale, minimise operational risk, enhance system resilience, and reduce administrative overhead. Clear objectives include reduced downtime, improved resource utilisation, and robust recovery mechanisms, benefiting both Microscaler and the broader community.

---

## Detailed Technical Proposals

Please follow the links to the detailed documents:

| Enhancement | Description | Link |
| --- | --- | --- |
| Raft Consensus Integration | Integrate Raft for distributed consensus and reliability, ensuring consistent state replication and leader election across nodes. | [Details](./docs/01-Raft_Consensus_Integration.md) |
| Distributed Scheduling and Bidding Mechanism | Enable distributed workload scheduling and resource bidding, allowing dynamic allocation of resources based on demand and availability. | [Details](./docs/02-Distributed_Scheduling_and_Bidding_Mechanism.md) |
| Unified API Interface and Proxy Routing | Provide a unified API and proxy routing for seamless operations, simplifying client interactions and enabling transparent request forwarding. | [Details](./docs/03-Unified_API_Interface_and_Proxy_Routing.md) |
| Host Failure Handling | Improve detection and recovery from host failures, including automated failover and state reconciliation to minimize service disruption. | [Details](./docs/04-Host_Failure_Handling.md) |
| Detached Host Garbage Collection | Automate cleanup of detached or orphaned hosts, reclaiming resources and maintaining cluster hygiene without manual intervention. | [Details](./docs/05-Detached_Host_Garbage_Collection.md) |
| Host Regenesis and PXE-Based Provisioning | Support host re-provisioning using PXE, enabling rapid recovery and scaling by automating bare-metal host setup and configuration. | [Details](./docs/06-Host_Regenesis_and_PXE_Provisioning.md) |
| VM Regeneration, Persistence and Recovery | Ensure VM state can be persisted and recovered, allowing restoration of workloads after failures or migrations with minimal data loss. | [Details](./docs/07-VM_State_Persistence_and_Recovery.md) |
| Network Partition Handling & Split-Brain Scenarios | Address network partitions and split-brain issues, implementing safeguards to maintain data consistency and prevent conflicting operations. | [Details](./docs/08-Network_Partition_Handling_and_Split-Brain_Scenarios.md) |
| Leader Scheduling Bottleneck | Mitigate leader scheduling bottlenecks by distributing scheduling responsibilities and optimizing leader election processes for scalability. | [Details](./docs/09-Leader_Scheduling_Bottleneck.md) |
| Host Rejoining and State Reconciliation | Enable hosts to rejoin and reconcile state, ensuring that returning nodes synchronize with the cluster and recover their workloads safely. | [Details](./docs/10-Host_Rejoining_and_State_Reconciliation.md) |
| Raft Log Scalability and Snapshotting | Improve Raft log scalability and add snapshotting, reducing storage overhead and speeding up recovery by periodically compacting logs. | [Details](./docs/11-Raft_Log_Scalability_and_Snapshotting.md) |
| Security and Authorization | Enhance security and authorization mechanisms, introducing fine-grained access controls and robust authentication for all operations. | [Details](./docs/12-Security_and_Authorization.md) |
| Garbage Collection Policy | Define and enforce garbage collection policies, specifying criteria and schedules for resource cleanup to optimize system performance. | [Details](./docs/13-Garbage_Collection_Policy.md) |
| Observability, Metrics, and Tracing | Add observability, metrics, and tracing support, enabling real-time monitoring, troubleshooting, and performance analysis of distributed components. | [Details](./docs/14-Observability_Metrics_and_Tracing.md) |
| Graceful VM Migration Support | Support graceful migration of VMs, allowing live or planned movement of workloads between hosts with minimal downtime and service impact. | [Details](./docs/15-Graceful_VM_Migration_Support.md) |
| Configuration and Operational Clarity | Improve configuration and operational transparency, providing clear documentation, validation, and tooling for easier management and troubleshooting. | [Details](./docs/16-Configuration_and_Operational_Clarity.md) |

---

## Technical and Community Value

These enhancements address key technical challenges facing Flintlock, providing scalable, resilient, secure, and operationally efficient solutions. By collectively addressing these gaps, the community can significantly accelerate Flintlock's adoption and ensure its long-term innovation and viability.

---

## Benefits to the Broader Community

* **Quantified Operational Benefits:**
  * Potential reduction in downtime by up to 50%.
  * Operational cost savings through reduced administrative overhead.
  * Increased scalability supporting deployments exceeding current capacities.
* **Community Growth Opportunities:**
  * Google Summer of Code mentorships, attracting new contributors.
  * Enhanced onboarding, facilitating community adoption.
  * Robust knowledge-sharing and innovation opportunities across community members.

---

## Implementation Plan and Collaboration

The Liquid Metal governance team will provide leadership and oversight, with Microscaler actively contributing through development, mentorship, and collaboration. Microscaler commits to supporting the governance team's vision and aligning contributions with the broader community's collective goals. Community contributors will be engaged through incentivised programs, mentorship opportunities, and clear pathways for participation.

---

## Governance and Community Engagement

Microscaler commits to transparent and open engagement with the governance board and community stakeholders. This includes regular communication, transparent decision-making processes, clear conflict resolution mechanisms, and continuous integration of community feedback, ensuring alignment with community values and project goals.

---

## References and Supporting Documentation

Comprehensive technical documentation is available through the linked documents, supporting detailed review and validation by the governance board and community.

---

Microscaler respectfully encourages the Liquid Metal governance board to adopt this proposal, fostering active community collaboration to enhance Flintlock for mutual benefit.
---
Charles Sibbald<br>
Founder & Lead Engineer, Microscaler<br>
Lead Engineer, Tinkerbell Project<br>
26 June 2025

Liquid Metal Governance Board<br>
[Address or Contact Information]

Dear Members of the Liquid Metal Governance Board,

I am pleased to submit this Enhancement Proposal on behalf of Microscaler for your consideration. At Microscaler, we are committed to advancing open-source technologies that empower scalable and robust distributed infrastructure. As the lead engineer for the Tinkerbell Project and founder of Microscaler, I firmly believe in collaborative innovation that delivers substantial and lasting benefits to our community.

The attached proposal outlines critical enhancements to Flintlock, focusing on comprehensive improvements to distributed management, operational resilience, and scalability. If implemented successfully, these enhancements would position Flintlock as a leading system in distributed infrastructure management, significantly ahead of contemporary solutions in robustness, operational clarity, and community-driven innovation.

In preparing this proposal, we consciously chose not to fork Flintlock or fragment its potential user base. Instead, we decided to collaborate actively with the existing community, enhancing Flintlock's capabilities to create a more compelling, unified solution. This collaborative approach not only preserves but actively grows Flintlock's user base, fostering broader adoption and deeper community engagement.

We have structured the proposal with an inclusive vision, inviting collaboration and contribution from the broader community. This strategic partnership will ensure the enhancements meet diverse operational requirements while maintaining alignment with our shared goals.

Your support and leadership will be instrumental in realising the full potential of Flintlock. Together, we can set new standards in operational excellence and community collaboration, establishing Flintlock as the benchmark in its domain.

Thank you for considering this forward-thinking proposal. I look forward to your feedback and to continued collaboration that advances our collective vision.

Sincerely,

Charles Sibbald<br>
Founder, Microscaler (Tinkerbell Project)
---
## Raft Consensus Integration

### Gap Definition and Improvement Objectives

Currently, Flintlock operates with isolated state per host, lacking a unified, cluster-wide coordination mechanism. Integrating Raft consensus addresses this gap by ensuring reliable leader election, consistent log replication, and synchronized state management across the cluster.

> **Reviewer comment (maintainer):** Raft could be used on the layer that does the distributed scheduling, but not in flintlock itself. One option would be to just use etcd for storage and leader election. Might be easier than using a Raft package (like https://github.com/etcd-io/raft) directly.
**Objectives:**

* Reliable leader election to ensure continuity
* Consistent log replication across hosts
* Robust global state synchronization for VM management

### Technical Implementation and Detailed Architecture

* **Raft Library:** Leverage a well-established Raft implementation such as HashiCorp Raft or etcd Raft.
* **Leader Election:** Implement leader election protocols ensuring rapid detection of failures and quick election of a new leader.
* **Log Replication:** Define structured logs capturing critical VM lifecycle events (creation, updates, deletion).
* **Cluster-wide State Machine:** Develop a state machine that consistently applies VM lifecycle operations from replicated logs.
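The state-machine idea above can be sketched in a few lines. The sketch below is a hypothetical, language-agnostic illustration (Flintlock itself is written in Go, and the entry shape and class names here are assumptions, not Flintlock APIs): every node applies committed log entries in the same order, so all replicas converge on the same view of VM placement.

```python
# Hypothetical sketch: a deterministic state machine that applies VM
# lifecycle operations from a replicated (already-committed) Raft log.
# Entry shape, field names, and ops are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    op: str        # "create" | "update" | "delete"
    vm_id: str
    host: str = ""
    spec: dict = field(default_factory=dict)

class VMStateMachine:
    """Applies VM lifecycle operations from the replicated log, in order."""
    def __init__(self):
        self.vms = {}  # vm_id -> {"host": ..., "spec": ...}

    def apply(self, entry: LogEntry):
        if entry.op == "create":
            self.vms[entry.vm_id] = {"host": entry.host, "spec": dict(entry.spec)}
        elif entry.op == "update":
            self.vms[entry.vm_id]["spec"].update(entry.spec)
        elif entry.op == "delete":
            self.vms.pop(entry.vm_id, None)
        else:
            raise ValueError(f"unknown op: {entry.op}")

# Two replicas applying the same committed log reach identical state.
log = [
    LogEntry("create", "vm-1", host="host-a", spec={"cpus": 2}),
    LogEntry("update", "vm-1", spec={"cpus": 4}),
    LogEntry("create", "vm-2", host="host-b"),
    LogEntry("delete", "vm-2"),
]
a, b = VMStateMachine(), VMStateMachine()
for entry in log:
    a.apply(entry)
    b.apply(entry)
assert a.vms == b.vms  # consistency follows from determinism + shared log order
```

Determinism of `apply` is the key property: given Raft's guarantee that all nodes see the same log order, it is what makes cluster-wide state consistency follow for free.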

### Trade-offs and Risks

* **Complexity:** Increased system complexity, balanced by significant reliability improvements.
* **Performance Overhead:** Slight overhead from log replication and consensus coordination, which must be monitored and optimized.

### Operational Impacts and User Considerations

* **Transparency:** The integration should remain transparent to end-users, requiring no changes to current workflows.
* **Reliability:** Improved operational reliability and simplified management for system operators.

### Validation and Testing Strategies

* **Leader Election Tests:** Comprehensive tests to validate rapid leader election and failover.
* **Log Replication Tests:** Validate accuracy and performance of log replication across nodes.
* **State Consistency Tests:** Continuously ensure the cluster maintains a consistent view of the global state.

### Visualizations and Diagrams

* **High-Level Design (HLD) Diagram:** Illustrates the integration of Raft within Flintlock.
* **Sequence Diagram:** Demonstrates the leader election, log replication, and state synchronization processes.

### Summary for Enhancement Proposal

Integrating Raft consensus into Flintlock significantly enhances cluster reliability, consistency, and operational resilience. This structured approach ensures minimal operational overhead while providing robust coordination capabilities, preparing Flintlock for highly available, distributed deployments.
---
## Distributed Scheduling and Bidding Mechanism

### Gap Definition and Improvement Objectives

Flintlock currently lacks a distributed scheduling system, relying instead on manual workload allocation per host. Implementing a distributed scheduling mechanism using a bidding process will ensure balanced resource utilization and improved VM provisioning speed.

> **Reviewer comment (maintainer):** This was never going to be part of Flintlock itself, but a part of the wider Liquid Metal. A reverse bidding process would make the scheduling simpler.
**Objectives:**

* Balanced resource allocation across all hosts
* Reduced VM boot latency
* Automated and transparent workload distribution

### Technical Implementation and Detailed Architecture

* **Resource Broadcasting:** Hosts periodically broadcast current resource metrics (CPU, memory, VM count).
* **Leader Coordination:** The leader initiates VM scheduling by broadcasting requests to hosts.
* **Bid Calculation:** Hosts compute utilization scores based on available resources and VM requirements, responding with bids.
* **Scheduling Decision:** The leader selects the host with the lowest utilization score (best bid), updating the global state through consensus.
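The bid-and-select steps above can be sketched as follows. This is a hypothetical illustration: the scoring formula, its weights, and all field names are assumptions for this proposal, not an existing Flintlock API; the essential ideas are that a host declines to bid when the VM would not fit, and that the leader picks the lowest score.

```python
# Hypothetical sketch of bid calculation (host side) and bid selection
# (leader side). Weights and field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class HostMetrics:
    name: str
    cpu_used: float   # fraction of capacity in use, 0..1
    mem_used: float   # fraction of capacity in use, 0..1
    vm_count: int

def bid(m: HostMetrics, vm_cpu: float, vm_mem: float):
    """Return a utilization score (lower is better), or None to decline."""
    if m.cpu_used + vm_cpu > 1.0 or m.mem_used + vm_mem > 1.0:
        return None  # host cannot fit the VM, so it does not bid
    # Weighted blend of projected utilization plus a small VM-count term.
    return (0.5 * (m.cpu_used + vm_cpu)
            + 0.4 * (m.mem_used + vm_mem)
            + 0.1 * m.vm_count / 100)

def select_host(bids: dict) -> str:
    """Leader picks the host with the best (lowest) bid."""
    valid = {h: s for h, s in bids.items() if s is not None}
    if not valid:
        raise RuntimeError("no host can schedule this VM")
    return min(valid, key=valid.get)

hosts = [
    HostMetrics("host-1", cpu_used=0.70, mem_used=0.60, vm_count=12),
    HostMetrics("host-2", cpu_used=0.20, mem_used=0.30, vm_count=3),
    HostMetrics("host-3", cpu_used=0.95, mem_used=0.50, vm_count=20),
]
bids = {h.name: bid(h, vm_cpu=0.10, vm_mem=0.10) for h in hosts}
print(select_host(bids))  # prints "host-2": host-3 declines, host-2 is least loaded
```

A reverse-bidding variant, as suggested in review, would invert the flow: hosts proactively offer capacity and the leader merely accepts the best standing offer, removing one round-trip from the scheduling path.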

### Trade-offs and Risks

* **Complexity:** Additional complexity due to broadcast and bidding logic.
* **Latency:** Slight communication overhead for bid requests and responses.

### Operational Impacts and User Considerations

* **Transparency:** Users experience automated and balanced VM scheduling without manual intervention.
* **Operational Simplicity:** Reduced administrative overhead and improved cluster scalability.

### Validation and Testing Strategies

* **Bid Accuracy Tests:** Ensure host bids accurately reflect resource availability.
* **Scheduling Fairness Tests:** Verify balanced workload distribution across hosts.
* **Performance Benchmarks:** Assess scheduling latency and efficiency under various loads.

### Visualizations and Diagrams

* **High-Level Design (HLD) Diagram:**

```mermaid
graph TD
    Leader["Raft Leader"]
    Host1["Flintlock Host 1"]
    Host2["Flintlock Host 2"]
    HostN["Flintlock Host N"]

    Leader -->|Broadcast VM Request| Host1
    Leader -->|Broadcast VM Request| Host2
    Leader -->|Broadcast VM Request| HostN

    Host1 -->|Bid Response| Leader
    Host2 -->|Bid Response| Leader
    HostN -->|Bid Response| Leader

    Leader -->|Scheduling Decision| Host1
    Leader -->|State Update via Raft| Host2
    Leader -->|State Update via Raft| HostN
```

* **Sequence Diagram:**

```mermaid
sequenceDiagram
    participant Leader
    participant Host1
    participant Host2

    Leader->>Host1: Broadcast VM scheduling request
    Leader->>Host2: Broadcast VM scheduling request
    Host1->>Leader: Bid (utilization score)
    Host2->>Leader: Bid (utilization score)
    Leader->>Leader: Evaluate best bid
    Leader->>Host1: Scheduling decision
    Leader->>Host2: State update via Raft
```

### Summary for Enhancement Proposal

Introducing a distributed scheduling and bidding mechanism significantly enhances Flintlock's ability to distribute workloads evenly and minimize VM provisioning times. This approach improves cluster resource utilization and operational transparency, setting the foundation for robust scalability and responsiveness.
---
## Unified API Interface and Proxy Routing

### Gap Definition and Improvement Objectives

Currently, Flintlock APIs are isolated per host, causing inconsistent and fragmented state queries. Introducing a unified API interface with proxy routing will resolve these inconsistencies and enable accurate VM state reporting from the authoritative host.

**Objectives:**

* Preserve compatibility with existing Flintlock APIs
* Provide enhanced global state querying through new `/api/v2` endpoints
* Implement proxy routing for authoritative, real-time VM status

> **Reviewer comment (maintainer):** We wouldn't need a v2 if the changes were made outside of flintlock.
### Technical Implementation and Detailed Architecture

* **API Versioning:** Retain current `/api/v1` APIs for backward compatibility, adding new `/api/v2` endpoints.
* **Global State Registry:** Maintain minimal global VM metadata (host location, VM ID).
* **Proxy Routing:** Route detailed state queries to the actual host running the VM, providing accurate real-time metrics.
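The routing decision can be sketched in a few lines. This is a hypothetical illustration of the architecture described above, not Flintlock code: the registry shape, function names, and the `/api/v2` handler signature are assumptions, and the real implementation would forward over HTTP/gRPC rather than call a local function.

```python
# Hypothetical sketch of the proxy-routing decision behind
# GET /api/v2/vm/{vm_id}/status: the receiving host consults a minimal
# replicated registry (vm_id -> owning host) and either answers locally
# or forwards to the authoritative host. All names are illustrative.

GLOBAL_REGISTRY = {"vm-42": "host-b", "vm-7": "host-a"}  # replicated metadata

def local_status(host: str, vm_id: str) -> dict:
    # Stand-in for the host's real-time microVM status query.
    return {"vm": vm_id, "host": host, "state": "running"}

def handle_status_request(receiving_host: str, vm_id: str) -> dict:
    """Serve a VM status query arriving at any host in the cluster."""
    owner = GLOBAL_REGISTRY.get(vm_id)
    if owner is None:
        return {"error": f"unknown vm: {vm_id}"}
    if owner == receiving_host:
        return local_status(receiving_host, vm_id)  # authoritative locally
    # Proxy path: in a real deployment this is an HTTP/gRPC call to `owner`.
    resp = local_status(owner, vm_id)
    resp["proxied_via"] = receiving_host
    return resp

print(handle_status_request("host-a", "vm-42"))
# the status comes from host-b, the authoritative host, via host-a
```

Keeping only `vm_id -> host` in the global registry keeps replicated state small; detailed, fast-changing metrics stay on the owning host and are fetched on demand.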

### Trade-offs and Risks

* **Latency:** Slight latency increase from proxy-routing requests to authoritative hosts.
* **Complexity:** Increased API routing logic to handle proxy queries.

### Operational Impacts and User Considerations

* **Transparency:** Users experience transparent and accurate VM state querying without changing existing workflows.
* **Improved Observability:** Enhanced visibility into VM states and metrics.

### Validation and Testing Strategies

* **API Compatibility Tests:** Ensure backward compatibility with existing endpoints.
* **Proxy Routing Accuracy Tests:** Verify accuracy and responsiveness of proxy-routed queries.
* **Real-time Metrics Validation:** Continuous validation of real-time metrics accuracy.

### Visualizations and Diagrams

* **High-Level Design (HLD) Diagram:**

```mermaid
graph TD
    Client["API Client"]
    HostA["Flintlock Host A (Leader)"]
    HostB["Flintlock Host B"]
    HostC["Flintlock Host C"]

    Client -->|Query VM Status| HostA
    HostA -->|Lookup VM location| HostB
    HostB -->|Real-time VM status| HostA
    HostA -->|Response| Client
    HostA --- HostC
    HostB --- HostC
```

* **Sequence Diagram:**

```mermaid
sequenceDiagram
    actor Client
    participant HostA
    participant HostB
    participant HostC

    Client->>HostA: GET /api/v2/vm/{vm_id}/status
    HostA->>HostA: Lookup VM location
    HostA->>HostB: Proxy GET /api/v2/vm/{vm_id}/status
    HostB->>HostA: Real-time VM metrics
    HostA->>Client: Forward VM status
```

### Summary for Enhancement Proposal

Implementing a unified API interface with proxy routing significantly improves the consistency and accuracy of VM state queries in Flintlock. This enhancement provides transparent compatibility, real-time metrics accuracy, and enhanced operational visibility, preparing Flintlock for effective distributed operations.
> **Reviewer comment (maintainer):** Flintlock was only ever designed for a single host. It was always envisioned that any distributed scheduling/clustering would be done at a layer above. I'd be keen to maintain this separation and keep flintlock solely focused on interacting with microvms on a single machine.

> **Reviewer comment (maintainer):** The layer above is something I've thought of as being called brigade: https://github.com/liquidmetal-dev/brigade. The idea is that it would be API compatible with flintlock, so that consumers, like CAPMVM, could switch to distributed scheduling across a number of flintlock hosts without any changes.