Skip to content

feat: add gRPC client-side load balancing for trillian-logserver#1752

Draft
kdacosta0 wants to merge 3 commits intomainfrom
kdacosta/load_balancing
Draft

feat: add gRPC client-side load balancing for trillian-logserver#1752
kdacosta0 wants to merge 3 commits intomainfrom
kdacosta/load_balancing

Conversation

@kdacosta0
Copy link
Copy Markdown
Member

@kdacosta0 kdacosta0 commented Apr 27, 2026

Summary

  • Switch trillian-logserver service to headless to enable gRPC client-side
    load balancing (round_robin), fixes severely uneven CPU distribution
    across replicas caused by the default pick_first policy
  • Add grpc_default_service_config flag to rekor-server for round_robin.
  • Handle ClusterIP -> headless migration (ClusterIP is immutable, so existing
    services are deleted and recreated)

Test

  • Verified with rhtas-benchmark: trillian-logserver load balances evenly
    across replicas after the change
  • Upgrade from previous version tested, migration from ClusterIP to
    headless service works correctly

Notes

  • ctlog already load balances by default, no changes needed
  • fulcio -> ctlog uses HTTP/2 via a regular ClusterIP service; improving
    that path is out of scope for this fix

Known limitation

New trillian-logserver replicas will not receive traffic until an existing connection is lost, which triggers DNS re-resolution

ref: SECURESIGN-2935

Signed-off-by: kdacosta0 <kristian.dacosta.menezes@gmail.com>
@kdacosta0 kdacosta0 requested a review from osmman April 27, 2026 14:14
@kdacosta0 kdacosta0 self-assigned this Apr 27, 2026
@qodo-code-review qodo-code-review Bot added the enhancement New feature or request label Apr 27, 2026
@qodo-code-review
Copy link
Copy Markdown

PR Type

Enhancement


Description

  • Enable gRPC client-side load balancing for trillian-logserver by converting service to headless

  • Add migration logic to delete and recreate existing ClusterIP services as headless

  • Configure rekor-server with round_robin load balancing policy via gRPC service config

  • Add comprehensive test coverage for headless service migration and creation


File Walkthrough

Relevant files
Enhancement
deployment.go
Add round_robin load balancing config to rekor-server       

internal/controller/rekor/actions/server/deployment.go

  • Add grpc_default_service_config flag to rekor-server deployment with
    round_robin load balancing configuration
  • Configure gRPC to use round_robin policy for distributing requests
    across trillian-logserver replicas
+1/-0     
service.go
Migrate trillian-logserver service to headless with ClusterIP deletion

internal/controller/trillian/actions/logserver/service.go

  • Implement migrateToHeadless() method to detect and delete existing
    ClusterIP services for migration
  • Switch service creation from EnsureServiceSpec to
    EnsureHeadlessServiceSpec for headless configuration
  • Add imports for error handling and Kubernetes types needed for
    migration logic
  • Include migration check in service creation workflow with requeue on
    deletion
+41/-1   
service.go
Add headless service configuration utility function           

internal/utils/kubernetes/service.go

  • Add new EnsureHeadlessServiceSpec() function to configure services
    with ClusterIP=None
  • Maintain existing EnsureServiceSpec() for backward compatibility with
    regular ClusterIP services
  • Include documentation explaining headless service requirement for gRPC
    client-side load balancing
+13/-0   
Tests
service_test.go
Add tests for headless service migration logic                     

internal/controller/trillian/actions/logserver/service_test.go

  • Add comprehensive test suite for migrateToHeadless() covering three
    scenarios: no existing service, ClusterIP migration, and existing
    headless service
  • Add integration test verifying headless service creation with correct
    ClusterIP=None configuration
  • Test error handling and service deletion during migration process
+153/-0 
service_test.go
Add tests for headless service utility function                   

internal/utils/kubernetes/service_test.go

  • Add test suite for EnsureHeadlessServiceSpec() covering service
    creation and updates
  • Verify ClusterIP is set to None for headless services
  • Test both creation and idempotent update scenarios
+54/-0   

@qodo-code-review
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

SECURESIGN-2935 - Partially compliant

Compliant requirements:

  • Change the client-side load balancing policy from default (pick_first) to round_robin so clients keep subchannels open across resolved addresses.
  • Ensure traffic distributes evenly across replicas for trillian-logserver where applicable.

Non-compliant requirements:

  • Ensure traffic distributes evenly across replicas for ctlog and trillian-logsigner where applicable.

Requires further human verification:

  • Investigate which gRPC clients in the RHTAS stack call Trillian/CTLog services and exhibit uneven load distribution.
  • Provide/validate behavior in HA (multi-replica) deployments and document the fix via release note context.
⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🔒 No security concerns identified
⚡ Recommended focus areas for review

Config Validity

The --trillian_log_server.grpc_default_service_config flag is injected as an inline JSON string. Verify the server’s flag parsing accepts this exact JSON structure/escaping and that it doesn’t require additional wrapping/escaping or different key casing. Also validate this is applied to the correct client (the Rekor -> Trillian gRPC client) and that it takes effect only when DNS returns multiple endpoints.

args := []string{
	"serve",
	"--trillian_log_server.address", instance.Spec.Trillian.Address,
	"--trillian_log_server.port", strconv.Itoa(int(*instance.Spec.Trillian.Port)),
	"--trillian_log_server.sharding_config", "/sharding/sharding-config.yaml",
	"--trillian_log_server.grpc_default_service_config", `{"loadBalancingConfig":[{"round_robin":{}}]}`,

	"--rekor_server.address", "0.0.0.0",
	// boolean flag MUST be without parameter (default value) or use the equal sign (https://github.com/spf13/pflag?tab=readme-ov-file#command-line-flag-syntax)
	"--enable_retrieve_api=true",
Migration Behavior

The migration deletes an existing Service when spec.clusterIP is not None and then immediately requeues. Deleting and recreating the Service can cause temporary loss of connectivity and potential endpoint churn; verify reconcile ordering and readiness behavior are acceptable (e.g., clients retry properly) and consider whether additional safeguards are needed (finalizers, observing deletion completion, or emitting status/conditions) to make upgrades smoother.

	// Migrate existing ClusterIP service to headless: ClusterIP is immutable,
	// so we must delete and recreate if it was previously a regular service.
	if migrated, err := i.migrateToHeadless(ctx, instance); err != nil {
		return i.Error(ctx, fmt.Errorf("could not migrate service to headless: %w", err), instance)
	} else if migrated {
		return i.Requeue()
	}

	if result, err = kubernetes.CreateOrUpdate(ctx, i.Client,
		&v1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: actions.LogserverDeploymentName, Namespace: instance.Namespace},
		},
		kubernetes.EnsureHeadlessServiceSpec(labels, ports...),
		ensure.ControllerReference[*v1.Service](instance, i.Client),
		ensure.Labels[*v1.Service](slices.Collect(maps.Keys(labels)), labels),
		//TLS: Annotate service
		ensure.Optional(kubernetes.IsOpenShift(), ensure.Annotations[*v1.Service]([]string{annotations.TLS}, tlsAnnotations)),
	); err != nil {
		return i.Error(ctx, fmt.Errorf("could not create service: %w", err), instance)
	}

	if result != controllerutil.OperationResultNone {
		meta.SetStatusCondition(&instance.Status.Conditions, metav1.Condition{
			Type:    actions.ServerCondition,
			Status:  metav1.ConditionFalse,
			Reason:  state.Creating.String(),
			Message: "Service created",
		})
		return i.StatusUpdate(ctx, instance)
	} else {
		return i.Continue()
	}
}

// migrateToHeadless checks if the existing trillian-logserver service has a
// ClusterIP assigned (non-headless). Since ClusterIP is an immutable field,
// the service must be deleted so it can be recreated as headless on the next
// reconciliation. Headless services are required for gRPC client-side load
// balancing (round_robin) because DNS must return individual pod IPs.
func (i createServiceAction) migrateToHeadless(ctx context.Context, instance *rhtasv1alpha1.Trillian) (bool, error) {
	existing := &v1.Service{}
	err := i.Client.Get(ctx, types.NamespacedName{
		Name:      actions.LogserverDeploymentName,
		Namespace: instance.Namespace,
	}, existing)
	if apiErrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}

	if existing.Spec.ClusterIP != v1.ClusterIPNone {
		i.Logger.Info("Deleting ClusterIP service to recreate as headless for gRPC load balancing",
			"service", actions.LogserverDeploymentName)
		if err := i.Client.Delete(ctx, existing); err != nil && !apiErrors.IsNotFound(err) {
			return false, fmt.Errorf("failed to delete service: %w", err)
		}
		return true, nil
	}

	return false, nil
Spec Completeness

EnsureHeadlessServiceSpec sets spec.clusterIP = None but does not set/clear related fields like spec.clusterIPs (dual-stack) or other Service spec attributes that may be required/immutable depending on existing cluster configuration. Validate behavior on clusters using dual-stack or where clusterIPs is present, and ensure CreateOrUpdate won’t fail due to immutable field mismatches in those environments.

// EnsureHeadlessServiceSpec sets ClusterIP to None, making the service headless.
// Headless services return individual pod IPs in DNS responses instead of a
// single virtual IP, which is required for gRPC client-side load balancing.
func EnsureHeadlessServiceSpec(selectorLabels map[string]string, ports ...corev1.ServicePort) func(*corev1.Service) error {
	return func(svc *corev1.Service) error {
		spec := &svc.Spec
		spec.Selector = selectorLabels
		spec.Ports = ports
		spec.ClusterIP = corev1.ClusterIPNone
		return nil
	}
📄 References
  1. No matching references available

@qodo-code-review
Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
Prefix gRPC address with dns:///

Prepend the dns:/// scheme to instance.Spec.Trillian.Address in the deployment
arguments. This forces the gRPC client to use the DNS resolver and successfully
load-balance across all pods.

Examples:

internal/controller/rekor/actions/server/deployment.go [139-144]
			"serve",
			"--trillian_log_server.address", instance.Spec.Trillian.Address,
			"--trillian_log_server.port", strconv.Itoa(int(*instance.Spec.Trillian.Port)),
			"--trillian_log_server.sharding_config", "/sharding/sharding-config.yaml",
			"--trillian_log_server.grpc_default_service_config", `{"loadBalancingConfig":[{"round_robin":{}}]}`,

Solution Walkthrough:

Before:

		args := []string{
			"serve",
			"--trillian_log_server.address", instance.Spec.Trillian.Address,
			"--trillian_log_server.port", strconv.Itoa(int(*instance.Spec.Trillian.Port)),
			"--trillian_log_server.sharding_config", "/sharding/sharding-config.yaml",
			"--trillian_log_server.grpc_default_service_config", `{"loadBalancingConfig":[{"round_robin":{}}]}`,
			// ...
		}

After:

		args := []string{
			"serve",
			"--trillian_log_server.address", fmt.Sprintf("dns:///%s", instance.Spec.Trillian.Address),
			"--trillian_log_server.port", strconv.Itoa(int(*instance.Spec.Trillian.Port)),
			"--trillian_log_server.sharding_config", "/sharding/sharding-config.yaml",
			"--trillian_log_server.grpc_default_service_config", `{"loadBalancingConfig":[{"round_robin":{}}]}`,
			// ...
		}
Suggestion importance[1-10]: 9

__

Why: Without the dns:/// prefix, the gRPC client relies on the default passthrough resolver, which fails to fetch all pod IPs from the headless service, rendering the round_robin policy ineffective.

High
General
set headless service type explicitly

Explicitly set the service type to ClusterIP in the service spec.

internal/utils/kubernetes/service.go [42-50]

 func EnsureHeadlessServiceSpec(selectorLabels map[string]string, ports ...corev1.ServicePort) func(*corev1.Service) error {
   return func(svc *corev1.Service) error {
     spec := &svc.Spec
+    spec.Type = corev1.ServiceTypeClusterIP
     spec.Selector = selectorLabels
     spec.Ports = ports
     spec.ClusterIP = corev1.ClusterIPNone
     return nil
   }
 }
  • Apply / Chat
Suggestion importance[1-10]: 2

__

Why: Kubernetes automatically defaults the service Type to ClusterIP, making the explicit setting of Type unnecessary and only a marginal improvement.

Low
  • More

@kdacosta0 kdacosta0 marked this pull request as draft April 27, 2026 15:43
@kdacosta0 kdacosta0 marked this pull request as draft April 27, 2026 15:43
Signed-off-by: kdacosta0 <kristian.dacosta.menezes@gmail.com>
@kdacosta0
Copy link
Copy Markdown
Member Author

Waiting for sigstore/rekor#2812

@kdacosta0 kdacosta0 changed the base branch from release-1.4 to main April 29, 2026 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request Review effort 3/5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant