@vrutkovs
Contributor

@vrutkovs vrutkovs commented Oct 21, 2025

Add a new CR - VMDistributedCluster - so that multiple VMClusters can be upgraded in an orchestrated fashion, ensuring that reads through VMAuth are disabled for a cluster before it is upgraded and that the VMAgent (if available) has no pending bytes to send.

Fixes #1515

This CR can refer to VMClusters in one of two ways:

  • Existing VMClusters can be referred to using the ref property, with changes applied via overrideSpec
  • Entirely new VMClusters can be created with the name and spec properties

Either way, settings in VMDistributedCluster would be applied to target VMClusters, overriding their existing settings if necessary.
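
As a rough illustration of the two modes, here is a minimal Go sketch of an entry that is either a reference or an inline definition. The type name VMClusterRefOrSpec comes from the review discussion; the field types, the reduced VMClusterSpec stand-in, and the Validate rules are assumptions made for this example, not the operator's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// VMClusterSpec is a stand-in for the operator's real spec type; only one
// illustrative field is kept (assumption).
type VMClusterSpec struct {
	ClusterVersion string
}

// VMClusterRefOrSpec is a hypothetical shape for one target cluster entry.
type VMClusterRefOrSpec struct {
	Ref          string         // name of an existing VMCluster (assumed to be a plain name)
	OverrideSpec *VMClusterSpec // settings applied on top of the referenced cluster
	Name         string         // name of a VMCluster to create
	Spec         *VMClusterSpec // full spec of the VMCluster to create
}

// Validate checks that exactly one of the two modes is used. These rules are
// an assumption for illustration; the real CRD may validate differently.
func (c VMClusterRefOrSpec) Validate() error {
	byRef := c.Ref != ""
	inline := c.Name != "" || c.Spec != nil
	switch {
	case byRef && inline:
		return errors.New("ref cannot be combined with name/spec")
	case byRef:
		return nil
	case c.Name == "" || c.Spec == nil:
		return errors.New("either ref, or both name and spec, must be set")
	case c.OverrideSpec != nil:
		return fmt.Errorf("overrideSpec requires ref, but %q is defined inline", c.Name)
	}
	return nil
}
```

Calling Validate on an entry that sets both ref and name returns an error, while an entry with only ref (with or without overrideSpec) or with both name and spec passes.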

Current implementation scope:

  • VMDistributedCluster will create a VMAgent instance to proxy writes and vmauth LB to proxy reads
  • VMDistributedCluster can create new VMCluster instances when name and spec are specified
  • VMDistributedCluster can update existing VMCluster objects when ref and overrideSpec are set
  • Before a cluster is updated, vmauth LB is updated to disable reads from this cluster
  • VMClusters are updated one by one, waiting for them to change status to "operational" again
  • Time to wait for the cluster to become ready can be configured
  • After the VMCluster update is complete, the controller waits for VMAgent to flush its collected data, which it verifies by checking VMAgent metrics
  • VMAuth LB is updated to enable reads from this cluster
  • Optionally, the controller can wait a configurable amount of time before proceeding to the next cluster
  • Process is repeated for all remaining VMClusters

See #1515 (comment) for the agreed limitations of the v1alpha1 version:

  • All objects must belong to the same namespace as VMDistributedCluster
  • Referenced VMClusters are not actively watched for changes; they are only reconciled periodically
  • All objects must be referred to by name, label selectors are not supported
  • Only VMClusters are supported; VMSingles are deferred to later versions
  • Two delays are tweakable:
    • vmclusterWaitReadyDeadline
    • delay between zone updates
  • No additional metric is exposed to indicate that a cluster is being upgraded, so it cannot be used to silence possible alerts

TODO:

  • Add changelog entry
  • Fix flaking tests
  • Set ownerRefs to managed VMClusters
  • Add high-level description of VMDistributedCluster and problem space
  • Description-less CRD should be applied for development only. Rephrase descriptions in existing parts to make them fit for production
  • Squash commits
    Keeping the original commits for review, as it's useful to show how the feature was developed
  • Update existing documentation to mention VMDistributedCluster and describe its target architecture and existing shortcomings

@f41gh7 f41gh7 self-assigned this Oct 21, 2025
@AndrewChubatiuk
Contributor

Initially I thought the distributed CR was needed for full distributed setup management, but it looks like it only performs version upgrades. In this case just curious why we need different CRs for VM, VT and VL?

@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch from eaeacd4 to c02b24c Compare October 21, 2025 09:01
@vrutkovs
Contributor Author

Yes, so far we're focusing on upgrades - existing CRs provide sufficient flexibility IMO - and we didn't get a request for other actions so far.

In this case just curious why we need different CRs for VM, VT and VL?

VL and VT don't have agents (yet), so their specs would be different. However, we can reuse the same approach and probably even some helper functions.

Member

@Haleygo Haleygo left a comment

Yes, so far we're focusing on upgrades - existing CRs provide sufficient flexibility IMO - and we didn't get a request for other actions so far.

I believe users would expect to modify the vmcluster spec value or apply extra flags to the vmclusters.
And since vmclusterSpec.ClusterVersion is optional, users could specify component versions inside vmclusterSpec, which override vmclusterSpec.ClusterVersion.

And currently, it seems VMDistributedCluster only covers a limited scenario where resources like vmcluster, vmuser, vmauth are defined and configured as needed.
Could you please provide an example of how to configure them to achieve a topology similar to the one described in the victoria-metrics-distributed chart? I expect VMDistributedCluster to be supported there when released.

@vrutkovs
Contributor Author

I believe users would expect to modify the vmcluster spec value or apply extra flags to the vmclusters.

Yup, setting generic overrideParams would be more flexible and, along with upgrades, would cover other maintenance tasks, e.g., adding replicas or setting flags.
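
To make the idea of generic overrides concrete, here is a minimal Go sketch of layering an override on top of an existing spec, with both represented as decoded JSON objects. MergeOverride is a hypothetical helper written for this example, and the recursive-merge semantics are an assumption; the PR may apply overrides differently.

```go
package main

// MergeOverride returns a copy of base with override applied on top.
// Nested objects are merged key by key; any other value in override
// replaces the value in base. Neither input map is modified, although
// nested values that are not overridden are shared with base.
func MergeOverride(base, override map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range override {
		baseMap, baseIsMap := out[k].(map[string]any)
		overrideMap, overrideIsMap := v.(map[string]any)
		if baseIsMap && overrideIsMap {
			// Both sides are objects: merge instead of replacing wholesale.
			out[k] = MergeOverride(baseMap, overrideMap)
			continue
		}
		out[k] = v
	}
	return out
}
```

With this approach an override that only sets a replica count under vmselect leaves the other vmselect settings and the cluster version untouched, which is what makes a single override mechanism usable for upgrades and for other maintenance tasks.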

@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch 2 times, most recently from 280b2e6 to 04b44f9 Compare October 30, 2025 08:56
@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch from 04b44f9 to 1336f73 Compare November 3, 2025 12:43
@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch from 1336f73 to c3b3e24 Compare November 3, 2025 12:49
// +kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.updateStatus",description="current status of update rollout"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
// VMDistributedCluster progressively rolls out updates to multiple VMClusters.
type VMDistributedCluster struct {
Member

The feature isn’t limited to vmcluster, it can also be applied to vmsingle.
Later, we can support vmsingle with the same CRD by replacing vmcluster objects under VMDistributedClusterSpec with vmsingle.
What about calling it VMDistributed?

Contributor Author

Yes, I think it's fair to extend this to VMSingle too, and it would be trivial: add a VMType to the Zones property. We didn't get a request to extend this to VMSingle though, so I'd prefer to focus on VMClusters.

Not quite sure about VMDistributed - it's fair to treat multiple zones as a cluster (so VMCluster would be a fantastic name :) ), while VMDistributed doesn't really pinpoint it.

Member

We didn't get a request to extend this to VMSingle though, so I'd prefer to focus on VMClusters.

I disagree. The VMDistributedCluster audience is the same as the distributed chart’s audience. We’ve received requests to support vmsingle there and added it, see VictoriaMetrics/helm-charts#2090, VictoriaMetrics/helm-charts#2517.
So the need exists, and I believe we should cover it if there are no technical blockers, which I don’t see.
However, the current implementation cannot be extended to support vmsingle in zones without introducing a breaking change.

I think we can either expand the zones, like


Zones []zone `json:"zones,omitempty"`

type zone struct {
	// Exported so that encoding/json serializes the field.
	VMClusterList []VMClusterRefOrSpec `json:"vmclusterList,omitempty"`
	// allow adding vmsingleList now or later
}

or rename the zones field to something like vmclusterList as suggested here, then vmsingleList can be added without affecting vmclusterList.

Contributor Author

Wouldn't it be easier to update VMClusterRefOrSpec instead? We may also want to define VMSingles inline.

I agree that it's best to have a structure for vmsingles ready now.

Contributor Author

Oh, that would force us to rename Spec to ClusterSpec and introduce SingleSpec, which is ugly - so I like your solution better

@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch 4 times, most recently from a5c6693 to 43e7344 Compare November 10, 2025 09:23
@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch 5 times, most recently from 4efee35 to 5a92268 Compare November 13, 2025 13:03
@vrutkovs vrutkovs force-pushed the vmdistributed-cluster branch from 529b9b7 to 030e308 Compare December 11, 2025 22:46
Development

Successfully merging this pull request may close these issues.

Add support for a distributed deployment

5 participants