
Sticky Load Balancing during inference not hitting all the Triton Servers sitting behind the Load Balancer #820

Open
@animikhaich

Description

I have an architecture that looks like this:

[Image: architecture diagram]

The main issue is that the load balancer is not distributing traffic evenly across the server VMs: at any given time only about 40-50% of the VMs are under load. I am closing the gRPC client after each request with client.close(), yet the sessions still appear to be sticky, which is likely the root cause.
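
A minimal sketch of the per-request flow described above, using tritonclient.grpc (the load balancer URL, model name, and tensor names below are placeholders, not my actual values):

```python
import numpy as np
import tritonclient.grpc as grpcclient

LB_URL = "my-load-balancer:8001"  # hypothetical LB frontend address

def infer_once(batch: np.ndarray) -> np.ndarray:
    # Fresh client (and gRPC channel) per request, closed afterwards.
    # Each call should open a new TCP connection, so an L4 balancer
    # in principle gets a chance to pick a different backend each time.
    client = grpcclient.InferenceServerClient(url=LB_URL)
    try:
        # "INPUT__0" / "OUTPUT__0" / "my_model" are placeholder names.
        inp = grpcclient.InferInput("INPUT__0", list(batch.shape), "FP32")
        inp.set_data_from_numpy(batch.astype(np.float32))
        result = client.infer(model_name="my_model", inputs=[inp])
        return result.as_numpy("OUTPUT__0")
    finally:
        client.close()
```

Even with this pattern, the traffic is not spreading across all backends.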

I am using Azure Load Balancer.

Would love to know if anyone has faced this issue and resolved it. If so, how?
