- Follow the standard Go code style and conventions. Use `gofmt` for formatting and adhere to idiomatic Go practices.
- Follow best practices from the Effective Go guide:
- Use MixedCaps or mixedCaps rather than underscores for multi-word names
- Package names should be short, lowercase, single-word names
- Getters don't use "Get" prefix (use `obj.Name()` not `obj.GetName()`)
- Interface names use "-er" suffix for single-method interfaces (e.g., `Reader`, `Writer`)
- Use `gofmt` for consistent formatting (tabs for indentation, spaces for alignment)
- Line length: no strict limit, but keep lines reasonable
- Group related declarations together
- Return errors as the last return value
- Check errors immediately after the call
- Provide context with `fmt.Errorf` and error wrapping
- Use `ctrl.Log` for structured logging
- Keep log fields consistent and meaningful
- Avoid logging sensitive data
- Every exported name should have a doc comment
- Start comments with the name being described
- Use complete sentences
- Share memory by communicating; don't communicate by sharing memory
- Use channels to orchestrate goroutines
- Always handle goroutine cleanup and cancellation properly
- Keep packages focused and cohesive
- Avoid circular dependencies
- Place tests in `*_test.go` files
Prefer placing documentation in the `docs/` directory.

There are 3 main types of documentation targeting different audiences:

- Developer Documentation - For contributors and maintainers of this project
  - Architecture decisions
  - Development setup and workflow
  - Contributing guidelines
  - Usually in the `docs/developer-guide/` subdirectory
- Administrator Documentation - For operators deploying and managing the autoscaler controller
  - Installation and configuration
  - Deployment guidelines
  - Monitoring and troubleshooting
  - Usually located under the `docs/user-guide/` directory (for example, in an admin-focused subdirectory)
- End-User Documentation - For application developers creating applications that use the autoscaler
  - Usage guides and examples
  - API reference
  - Best practices and common patterns
  - Usually located under the `docs/user-guide/` directory (for example, in an end-user-focused subdirectory)

- Use make targets for running e2e tests (e.g., `make test-e2e-smoke` or `make test-e2e-full`) and document the process in `docs/developer-guide/testing.md`
- Use `make test` for unit tests
- Never use images from docker.io in e2e tests. All container images must use fully-qualified registry paths (e.g., `registry.k8s.io/`, `quay.io/`, or a private registry). Do not rely on Docker Hub as a default registry.
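
One way to enforce the image rule above is a small check that a test helper or lint step could call. This is a sketch only: `checkImage` is a hypothetical helper, and its host detection is a heuristic (the first path component counts as a registry host if it contains `.` or `:` or equals `localhost`), not a full image reference parser.

```go
package main

import (
	"fmt"
	"strings"
)

// checkImage returns an error when ref has no explicit registry host or
// points at Docker Hub. The host detection is a heuristic assumption:
// the first path component is treated as a registry host only if it
// contains "." or ":" or is "localhost".
func checkImage(ref string) error {
	host, _, found := strings.Cut(ref, "/")
	if !found || !(strings.ContainsAny(host, ".:") || host == "localhost") {
		return fmt.Errorf("image %q has no explicit registry host", ref)
	}
	switch host {
	case "docker.io", "index.docker.io", "registry-1.docker.io":
		return fmt.Errorf("image %q uses Docker Hub", ref)
	}
	return nil
}

func main() {
	for _, ref := range []string{
		"registry.k8s.io/pause:3.9",
		"quay.io/org/app:v1",
		"nginx:1.25",
		"docker.io/library/nginx",
	} {
		fmt.Printf("%-30s %v\n", ref, checkImage(ref))
	}
}
```
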
This section documents the command-line flags and environment variables supported by the llm-d inference scheduler EPP (Endpoint Picker). The EPP inherits its CLI from gateway-api-inference-extension.
Main branch: uses gateway-api-inference-extension at commit fd30cb97714a (post-v1.3.0).

| Flag | Type | Default | Description |
|---|---|---|---|
| `--grpc-port` | int | `9002` | gRPC port used for communicating with Envoy proxy |
| `--ha-enable-leader-election` | bool | `false` | Enables leader election for high availability. When enabled, readiness probes will only pass on the leader |
| `--pool-group` | string | `inference.networking.k8s.io` | Kubernetes resource group of the InferencePool this Endpoint Picker is associated with |
| `--pool-namespace` | string | `""` | Namespace of the InferencePool this Endpoint Picker is associated with |
| `--pool-name` | string | `""` | Name of the InferencePool this Endpoint Picker is associated with |
| `--endpoint-selector` | string | `""` | Selector to filter model server pods on, only 'key=value' pairs are supported. Format: comma-separated list of key=value pairs (e.g., 'app=vllm-llama3-8b-instruct,env=prod') |
| `--endpoint-target-ports` | []int | `[]` | Target ports of model server pods. Format: comma-separated list of numbers (e.g., '3000,3001,3002') |
| `--disable-endpoint-subset-filter` | bool | `false` | Disables respecting the x-gateway-destination-endpoint-subset metadata for dispatching requests in EPP |
| `--model-server-metrics-scheme` | string | `http` | Protocol scheme used in scraping metrics from endpoints |
| `--model-server-metrics-path` | string | `/metrics` | URL path used in scraping metrics from endpoints |
| `--model-server-metrics-port` | int | `0` | DEPRECATED: Port to scrape metrics from endpoints |
| `--model-server-metrics-https-insecure-skip-verify` | bool | `true` | Disable certificate verification when using 'https' scheme for model-server-metrics-scheme |
| `--refresh-metrics-interval` | duration | `50ms` | Interval to refresh metrics |
| `--refresh-prometheus-metrics-interval` | duration | `5s` | Interval to flush Prometheus metrics |
| `--metrics-staleness-threshold` | duration | `2s` | Duration after which metrics are considered stale |
| `--total-queued-requests-metric` | string | `vllm:num_requests_waiting` | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
| `--total-running-requests-metric` | string | `vllm:num_requests_running` | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
| `--kv-cache-usage-percentage-metric` | string | `vllm:kv_cache_usage_perc` | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
| `--lora-info-metric` | string | `vllm:lora_requests_info` | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
| `--cache-info-metric` | string | `vllm:cache_config_info` | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
| `-v`, `--v` | int | `0` | Number for the log level verbosity |
| `--zap-log-level` | string | | Zap log level (debug, info, warn, error) |
| `--zap-devel` | bool | `true` | Development Mode defaults (encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn) |
| `--zap-encoder` | string | | Zap log encoding ('json' or 'console') |
| `--zap-stacktrace-level` | string | | Zap Level at and above which stacktraces are captured |
| `--tracing` | bool | `true` | Enables emitting traces |
| `--health-checking` | bool | `false` | Enables health checking |
| `--metrics-port` | int | `9090` | The metrics port exposed by EPP |
| `--grpc-health-port` | int | `9003` | The port used for gRPC liveness and readiness probes |
| `--enable-pprof` | bool | `true` | Enables pprof handlers |
| `--cert-path` | string | `""` | The path to the certificate for secure serving. Certificate and private key files are assumed to be named tls.crt and tls.key |
| `--enable-cert-reload` | bool | `false` | Enables certificate reloading of the certificates specified in --cert-path |
| `--secure-serving` | bool | `true` | Enables secure serving |
| `--metrics-endpoint-auth` | bool | `true` | Enables authentication and authorization of the metrics endpoint |
| `--config-file` | string | `""` | The path to the configuration file |
| `--config-text` | string | `""` | The configuration specified as text, in lieu of a file |

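
The `--endpoint-selector` format described above (comma-separated `key=value` pairs) can be parsed as in the following sketch. `parseSelector` is a hypothetical helper written for illustration; the EPP's actual parsing code may behave differently, for example in how it treats whitespace or empty values.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSelector converts "k1=v1,k2=v2" into a label map.
// Trimming whitespace and allowing empty values are assumptions of this
// sketch, not documented EPP behavior.
func parseSelector(s string) (map[string]string, error) {
	labels := map[string]string{}
	if strings.TrimSpace(s) == "" {
		return labels, nil
	}
	for _, pair := range strings.Split(s, ",") {
		key, value, found := strings.Cut(strings.TrimSpace(pair), "=")
		if !found || key == "" {
			return nil, fmt.Errorf("invalid selector entry %q: want key=value", pair)
		}
		labels[key] = value
	}
	return labels, nil
}

func main() {
	labels, err := parseSelector("app=vllm-llama3-8b-instruct,env=prod")
	fmt.Println(labels, err)
}
```
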

| Variable | Description | Deprecation |
|---|---|---|
| `NAMESPACE` | Used to determine pool namespace when --pool-namespace is not set | - |
| `POD_NAME` | Used to determine EPP name when using --endpoint-selector mode | - |
| `ENABLE_EXPERIMENTAL_DATALAYER_V2` | Enables experimental pluggable data layer | DEPRECATED: Use FeatureGates in config file instead |
| `ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER` | Enables experimental pluggable flow control layer | DEPRECATED: Use FeatureGates in config file instead |
| `SD_QUEUE_DEPTH_THRESHOLD` | Saturation detector queue depth threshold | DEPRECATED: Use config file instead |
| `SD_KV_CACHE_UTIL_THRESHOLD` | Saturation detector KV cache utilization threshold | DEPRECATED: Use config file instead |
| `SD_METRICS_STALENESS_THRESHOLD` | Saturation detector metrics staleness threshold | DEPRECATED: Use config file instead |

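
The flag-then-environment precedence described for `NAMESPACE` can be sketched as follows. `resolvePoolNamespace` is a hypothetical function, and returning an empty string when neither source is set is an assumption; the table above does not say what the EPP does in that case.

```go
package main

import (
	"fmt"
	"os"
)

// resolvePoolNamespace prefers the --pool-namespace flag value and falls
// back to the NAMESPACE environment variable. The lookup function is
// injected so the logic can be exercised without touching the real
// environment. Returning "" when both are unset is an assumption.
func resolvePoolNamespace(flagValue string, lookupEnv func(string) (string, bool)) string {
	if flagValue != "" {
		return flagValue
	}
	if ns, ok := lookupEnv("NAMESPACE"); ok {
		return ns
	}
	return ""
}

func main() {
	// With the real environment: prints the value of NAMESPACE, if set.
	fmt.Printf("namespace=%q\n", resolvePoolNamespace("", os.LookupEnv))
}
```
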
v0.5.0: uses gateway-api-inference-extension v1.3.0.

| Flag | Type | Default | Description |
|---|---|---|---|
| `--grpc-port` | int | `9002` | gRPC port used for communicating with Envoy proxy |
| `--ha-enable-leader-election` | bool | `false` | Enables leader election for high availability. When enabled, readiness probes will only pass on the leader |
| `--pool-group` | string | `inference.networking.k8s.io` | Kubernetes resource group of the InferencePool this Endpoint Picker is associated with |
| `--pool-namespace` | string | `""` | Namespace of the InferencePool this Endpoint Picker is associated with |
| `--pool-name` | string | `""` | Name of the InferencePool this Endpoint Picker is associated with |
| `--endpoint-selector` | string | `""` | Selector to filter model server pods on, only 'key=value' pairs are supported. Format: comma-separated list of key=value pairs (e.g., 'app=vllm-llama3-8b-instruct,env=prod') |
| `--endpoint-target-ports` | []int | `[]` | Target ports of model server pods. Format: comma-separated list of numbers (e.g., '3000,3001,3002') |
| `--disable-endpoint-subset-filter` | bool | `false` | Disables respecting the x-gateway-destination-endpoint-subset metadata for dispatching requests in EPP |
| `--model-server-metrics-scheme` | string | `http` | Protocol scheme used in scraping metrics from endpoints |
| `--model-server-metrics-path` | string | `/metrics` | URL path used in scraping metrics from endpoints |
| `--model-server-metrics-port` | int | `0` | DEPRECATED: Port to scrape metrics from endpoints. Set to InferencePool.Spec.TargetPorts[0].Number if not defined |
| `--model-server-metrics-https-insecure-skip-verify` | bool | `true` | Disable certificate verification when using 'https' scheme for model-server-metrics-scheme |
| `--refresh-metrics-interval` | duration | `50ms` | Interval to refresh metrics |
| `--refresh-prometheus-metrics-interval` | duration | `5s` | Interval to flush Prometheus metrics |
| `--metrics-staleness-threshold` | duration | `2s` | Duration after which metrics are considered stale |
| `--total-queued-requests-metric` | string | `vllm:num_requests_waiting` | Prometheus metric for the number of queued requests |
| `--total-running-requests-metric` | string | `vllm:num_requests_running` | Prometheus metric for the number of running requests |
| `--kv-cache-usage-percentage-metric` | string | `vllm:kv_cache_usage_perc` | Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1) |
| `--lora-info-metric` | string | `vllm:lora_requests_info` | Prometheus metric for the LoRA info metrics (must be in vLLM label format) |
| `--cache-info-metric` | string | `vllm:cache_config_info` | Prometheus metric for the cache info metrics |
| `-v`, `--v` | int | `0` | Number for the log level verbosity |
| `--zap-log-level` | string | | Zap log level (debug, info, warn, error) |
| `--zap-devel` | bool | `true` | Development Mode defaults (encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn) |
| `--zap-encoder` | string | | Zap log encoding ('json' or 'console') |
| `--zap-stacktrace-level` | string | | Zap Level at and above which stacktraces are captured |
| `--tracing` | bool | `true` | Enables emitting traces |
| `--health-checking` | bool | `false` | Enables health checking |
| `--metrics-port` | int | `9090` | The metrics port exposed by EPP |
| `--grpc-health-port` | int | `9003` | The port used for gRPC liveness and readiness probes |
| `--enable-pprof` | bool | `true` | Enables pprof handlers |
| `--cert-path` | string | `""` | The path to the certificate for secure serving. Certificate and private key files are assumed to be named tls.crt and tls.key |
| `--enable-cert-reload` | bool | `false` | Enables certificate reloading of the certificates specified in --cert-path |
| `--secure-serving` | bool | `true` | Enables secure serving |
| `--metrics-endpoint-auth` | bool | `true` | Enables authentication and authorization of the metrics endpoint |
| `--config-file` | string | `""` | The path to the configuration file |
| `--config-text` | string | `""` | The configuration specified as text, in lieu of a file |

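
The `--endpoint-target-ports` value is a comma-separated list of numbers. A minimal parsing sketch follows; `parseTargetPorts` is a hypothetical helper, and the 1-65535 range check is an assumption based on the valid TCP port range rather than documented EPP validation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTargetPorts converts "3000,3001,3002" into a slice of ints.
// The range check (1-65535) is an assumption of this sketch.
func parseTargetPorts(s string) ([]int, error) {
	if strings.TrimSpace(s) == "" {
		return nil, nil
	}
	parts := strings.Split(s, ",")
	ports := make([]int, 0, len(parts))
	for _, part := range parts {
		n, err := strconv.Atoi(strings.TrimSpace(part))
		if err != nil {
			return nil, fmt.Errorf("invalid port %q: %w", part, err)
		}
		if n < 1 || n > 65535 {
			return nil, fmt.Errorf("port %d out of range 1-65535", n)
		}
		ports = append(ports, n)
	}
	return ports, nil
}

func main() {
	ports, err := parseTargetPorts("3000,3001,3002")
	fmt.Println(ports, err)
}
```
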

| Variable | Description | Deprecation |
|---|---|---|
| `NAMESPACE` | Used to determine pool namespace when --pool-namespace is not set | - |
| `POD_NAME` | Used to determine EPP name when using --endpoint-selector mode | - |
| `ENABLE_EXPERIMENTAL_DATALAYER_V2` | Enables experimental pluggable data layer | DEPRECATED: Use FeatureGates in config file instead |
| `ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER` | Enables experimental pluggable flow control layer | DEPRECATED: Use FeatureGates in config file instead |
| `SD_QUEUE_DEPTH_THRESHOLD` | Saturation detector queue depth threshold | DEPRECATED: Use config file instead |
| `SD_KV_CACHE_UTIL_THRESHOLD` | Saturation detector KV cache utilization threshold | DEPRECATED: Use config file instead |
| `SD_METRICS_STALENESS_THRESHOLD` | Saturation detector metrics staleness threshold | DEPRECATED: Use config file instead |

- Metric Flags: In main branch, `--total-queued-requests-metric`, `--total-running-requests-metric`, `--kv-cache-usage-percentage-metric`, `--lora-info-metric`, and `--cache-info-metric` are deprecated and will error if explicitly set. In v0.5.0, these flags are functional.
- Configuration: Main branch encourages using `EndpointPickerConfig` with `engineConfigs` for metrics configuration instead of CLI flags.
This section documents the command-line flags and environment variables supported by the llm-d inference simulator (llm-d-inference-sim). The simulator is a vLLM server simulator supporting OpenAI API endpoints.

Main branch:

| Flag | Type | Default | Description |
|---|---|---|---|
| `--config` | string | `""` | Path to a YAML configuration file. Command line values overwrite config file values |
| `--port` | int | `8000` | Port on which the simulator runs |
| `--model` | string | `""` | Currently 'loaded' model name (required) |
| `--served-model-name` | []string | `[]` | Model names exposed by the API (space-separated strings). Falls back to --model if not set |
| `--max-num-seqs` | int | `5` | Maximum number of inference requests that could be processed at the same time |
| `--max-waiting-queue-length` | int | `1000` | Maximum length of inference requests waiting queue |
| `--max-loras` | int | `1` | Maximum number of LoRAs in a single batch |
| `--max-cpu-loras` | int | (same as `--max-loras`) | Maximum number of LoRAs to store in CPU memory |
| `--max-model-len` | int | `1024` | Model's context window, maximum number of tokens in a single request including input and output |
| `--lora-modules` | []string | `[]` | List of LoRA adapters (space-separated JSON strings) |
| `--mode` | string | `random` | Simulator mode: echo returns input text; random returns random pre-defined sentences |
| `--seed` | int64 | (current Unix nano) | Random seed for operations |
| `--time-to-first-token` | duration | `0` | Time to first token (e.g., "100ms"). Integer format (milliseconds) is deprecated |
| `--time-to-first-token-std-dev` | duration | `0` | Standard deviation for time to first token (max 30% of TTFT) |
| `--inter-token-latency` | duration | `0` | Time to generate one token (e.g., "100ms"). Integer format is deprecated |
| `--inter-token-latency-std-dev` | duration | `0` | Standard deviation for inter-token latency (max 30% of ITL) |
| `--prefill-overhead` | duration | `0` | Time to prefill. Ignored if --time-to-first-token is set |
| `--prefill-time-per-token` | duration | `0` | Time to prefill per token |
| `--prefill-time-std-dev` | duration | `0` | Standard deviation for prefill time |
| `--kv-cache-transfer-latency` | duration | `0` | Time for KV-cache transfer from a remote vLLM (P/D mode) |
| `--kv-cache-transfer-latency-std-dev` | duration | `0` | Standard deviation for KV-cache transfer latency |
| `--kv-cache-transfer-time-per-token` | duration | `0` | Time for KV-cache transfer per token from a remote vLLM |
| `--kv-cache-transfer-time-std-dev` | duration | `0` | Standard deviation for KV-cache transfer time per token |
| `--time-factor-under-load` | float64 | `1.0` | Multiplicative factor affecting request time when parallel requests are processed (must be >= 1.0) |
| `--enable-kvcache` | bool | `false` | Enables KV cache feature |
| `--kv-cache-size` | int | `1024` | Maximum number of token blocks in KV cache |
| `--global-cache-hit-threshold` | float64 | `0` | Default cache hit threshold [0, 1] for all requests |
| `--block-size` | int | `16` | Token block size for contiguous chunks (valid: 8, 16, 32, 64, 128) |
| `--tokenizers-cache-dir` | string | `hf_cache` | Directory for caching tokenizers |
| `--hash-seed` | string | `""` | Seed for hash generation (falls back to PYTHONHASHSEED env var) |
| `--zmq-endpoint` | string | `tcp://localhost:5557` | ZMQ address to publish events |
| `--zmq-max-connect-attempts` | int | `0` | Maximum number of times to try ZMQ connect (max 10) |
| `--event-batch-size` | int | `16` | Maximum number of KV-cache events to be sent together |
| `--data-parallel-size` | int | `1` | Number of ranks to run (1-8) |
| `--data-parallel-rank` | int | `-1` | The rank when running each rank in a process |
| `--failure-injection-rate` | int | `0` | Probability (0-100) of injecting failures |
| `--failure-types` | []string | `[]` | Specific failure types to inject: rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found |
| `--fake-metrics` | string | `""` | JSON metrics to report to Prometheus instead of real metrics |
| `--ssl-certfile` | string | `""` | Path to SSL certificate file for HTTPS |
| `--ssl-keyfile` | string | `""` | Path to SSL private key file for HTTPS |
| `--self-signed-certs` | bool | `false` | Enable automatic generation of self-signed certificates for HTTPS |
| `--dataset-path` | string | `""` | Local path to SQLite database file for response generation from a dataset |
| `--dataset-url` | string | `""` | URL to download the SQLite database file for response generation |
| `--dataset-in-memory` | bool | `false` | Load the entire dataset into memory for faster access |
| `--enable-sleep-mode` | bool | `false` | Enable sleep mode |
| `--enable-request-id-headers` | bool | `false` | Enable including X-Request-Id header in responses |
| `--latency-calculator` | string | `""` | Name of the latency calculator: constant or per-token |
| `--max-tool-call-integer-param` | int | `100` | Maximum possible value of integer parameters in a tool call |
| `--min-tool-call-integer-param` | int | `0` | Minimum possible value of integer parameters in a tool call |
| `--max-tool-call-number-param` | float64 | `100` | Maximum possible value of number (float) parameters in a tool call |
| `--min-tool-call-number-param` | float64 | `0` | Minimum possible value of number (float) parameters in a tool call |
| `--max-tool-call-array-param-length` | int | `5` | Maximum possible length of array parameters in a tool call |
| `--min-tool-call-array-param-length` | int | `1` | Minimum possible length of array parameters in a tool call |
| `--tool-call-not-required-param-probability` | int | `50` | Probability (0-100) to add a non-required parameter in a tool call |
| `--object-tool-call-not-required-field-probability` | int | `50` | Probability (0-100) to add a non-required field in an object in a tool call |


| Variable | Description |
|---|---|
| `POD_NAME` | Pod name of simulator |
| `POD_NAMESPACE` | Namespace where simulator is running |
| `POD_IP` | IP address on which simulator runs |
| `PYTHONHASHSEED` | Fallback seed for hash generation if --hash-seed is not set |
| `VLLM_SERVER_DEV_MODE` | Set to 1 to enable development mode |


v0.5.0:

| Flag | Type | Default | Description |
|---|---|---|---|
| `--config` | string | `""` | Path to a YAML configuration file. Command line values overwrite config file values |
| `--port` | int | `8000` | Port on which the simulator runs |
| `--model` | string | `""` | Currently 'loaded' model name (required) |
| `--served-model-name` | []string | `[]` | Model names exposed by the API (space-separated strings). Falls back to --model if not set |
| `--max-num-seqs` | int | `5` | Maximum number of inference requests that could be processed at the same time (parameter to simulate requests waiting queue) |
| `--max-loras` | int | `1` | Maximum number of LoRAs in a single batch |
| `--max-cpu-loras` | int | (same as `--max-loras`) | Maximum number of LoRAs to store in CPU memory |
| `--max-model-len` | int | `1024` | Model's context window, maximum number of tokens in a single request including input and output |
| `--lora-modules` | []string | `[]` | List of LoRA adapters (space-separated JSON strings) |
| `--mode` | string | `random` | Simulator mode: echo returns input text; random returns random pre-defined sentences |
| `--seed` | int64 | (current Unix nano) | Random seed for operations |
| `--time-to-first-token` | int | `0` | Time to first token in milliseconds |
| `--time-to-first-token-std-dev` | int | `0` | Standard deviation for time to first token in milliseconds (max 30% of TTFT) |
| `--inter-token-latency` | int | `0` | Time to generate one token in milliseconds |
| `--inter-token-latency-std-dev` | int | `0` | Standard deviation for inter-token latency in milliseconds (max 30% of ITL) |
| `--prefill-overhead` | int | `0` | Time to prefill in milliseconds. Ignored if --time-to-first-token is not 0 |
| `--prefill-time-per-token` | int | `0` | Time to prefill per token in milliseconds |
| `--prefill-time-std-dev` | int | `0` | Standard deviation for prefill time in milliseconds |
| `--kv-cache-transfer-latency` | int | `0` | Time for KV-cache transfer from a remote vLLM in milliseconds (P/D mode) |
| `--kv-cache-transfer-latency-std-dev` | int | `0` | Standard deviation for KV-cache transfer latency in milliseconds |
| `--kv-cache-transfer-time-per-token` | int | `0` | Time for KV-cache transfer per token from a remote vLLM in milliseconds |
| `--kv-cache-transfer-time-std-dev` | int | `0` | Standard deviation for KV-cache transfer time per token in milliseconds |
| `--time-factor-under-load` | float64 | `1.0` | Multiplicative factor affecting request time when parallel requests are processed (must be >= 1.0) |
| `--enable-kvcache` | bool | `false` | Enables KV cache feature |
| `--kv-cache-size` | int | `1024` | Maximum number of token blocks in KV cache |
| `--block-size` | int | `16` | Token block size for contiguous chunks (valid: 8, 16, 32, 64, 128) |
| `--tokenizers-cache-dir` | string | `""` | Directory for caching tokenizers |
| `--hash-seed` | string | `""` | Seed for hash generation (falls back to PYTHONHASHSEED env var) |
| `--zmq-endpoint` | string | `tcp://localhost:5557` | ZMQ address to publish events |
| `--zmq-max-connect-attempts` | uint | `0` | Maximum number of times to try ZMQ connect (max 10) |
| `--event-batch-size` | int | `16` | Maximum number of KV-cache events to be sent together |
| `--data-parallel-size` | int | `1` | Number of ranks to run (1-8) |
| `--failure-injection-rate` | int | `0` | Probability (0-100) of injecting failures |
| `--failure-types` | []string | `[]` | Specific failure types to inject: rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found |
| `--fake-metrics` | string | `""` | JSON metrics to report to Prometheus instead of real metrics |
| `--max-tool-call-integer-param` | int | `100` | Maximum possible value of integer parameters in a tool call |
| `--min-tool-call-integer-param` | int | `0` | Minimum possible value of integer parameters in a tool call |
| `--max-tool-call-number-param` | float64 | `100` | Maximum possible value of number (float) parameters in a tool call |
| `--min-tool-call-number-param` | float64 | `0` | Minimum possible value of number (float) parameters in a tool call |
| `--max-tool-call-array-param-length` | int | `5` | Maximum possible length of array parameters in a tool call |
| `--min-tool-call-array-param-length` | int | `1` | Minimum possible length of array parameters in a tool call |
| `--tool-call-not-required-param-probability` | int | `50` | Probability (0-100) to add a non-required parameter in a tool call |
| `--object-tool-call-not-required-field-probability` | int | `50` | Probability (0-100) to add a non-required field in an object in a tool call |


| Variable | Description |
|---|---|
| `POD_NAME` | Pod name of simulator |
| `POD_NAMESPACE` | Namespace where simulator is running |
| `PYTHONHASHSEED` | Fallback seed for hash generation if --hash-seed is not set |

- Duration Parameters: In main branch, latency-related parameters (`--time-to-first-token`, `--inter-token-latency`, etc.) use Go duration strings (e.g., "100ms", "1.5s"). In v0.5.0, these are integers representing milliseconds.
- New Flags in Main: `--max-waiting-queue-length`, `--global-cache-hit-threshold`, `--data-parallel-rank`, `--ssl-certfile`, `--ssl-keyfile`, `--self-signed-certs`, `--dataset-path`, `--dataset-url`, `--dataset-in-memory`, `--enable-sleep-mode`, `--enable-request-id-headers`, `--latency-calculator`.
- Environment Variables: Main branch adds `POD_IP` and `VLLM_SERVER_DEV_MODE`.