Claude Code Assistant Guidelines

Go Code Style

  • Follow the standard Go code style and conventions. Use gofmt for formatting and adhere to idiomatic Go practices.
  • Follow best practices from the Effective Go guide:

Naming Conventions

  • Use MixedCaps or mixedCaps rather than underscores for multi-word names
  • Package names should be short, lowercase, single-word names
  • Getters don't use "Get" prefix (use obj.Name() not obj.GetName())
  • Interface names use "-er" suffix for single-method interfaces (e.g., Reader, Writer)
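
The naming rules above can be illustrated with a short sketch. The `modelServer` type and the `Greeter` interface are hypothetical names invented for this example; they do not come from this project.

```go
package main

import "fmt"

// Greeter is a single-method interface, so it is named after its method
// with an "-er" suffix.
type Greeter interface {
	Greet() string
}

// modelServer is a hypothetical type used only for illustration.
type modelServer struct {
	displayName string // multi-word name in mixedCaps, no underscores
}

// Name returns the display name. The getter is Name, not GetName.
func (m *modelServer) Name() string { return m.displayName }

// SetName sets the display name. Setters do keep the "Set" prefix.
func (m *modelServer) SetName(name string) { m.displayName = name }

// Greet makes *modelServer satisfy Greeter.
func (m *modelServer) Greet() string { return "hello from " + m.displayName }

func main() {
	m := &modelServer{}
	m.SetName("llama")
	var g Greeter = m
	fmt.Println(m.Name(), "/", g.Greet())
}
```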

Formatting

  • Use gofmt for consistent formatting (tabs for indentation, spaces for alignment)
  • Line length: no strict limit, but keep lines reasonable
  • Group related declarations together

Error Handling

  • Return errors as the last return value
  • Check errors immediately after the call
  • Provide context with fmt.Errorf and error wrapping

Logging

  • Use ctrl.Log for structured logging
  • Keep log fields consistent and meaningful
  • Avoid logging sensitive data

Documentation

  • Every exported name should have a doc comment
  • Start comments with the name being described
  • Use complete sentences

Concurrency

  • Share memory by communicating; don't communicate by sharing memory
  • Use channels to orchestrate goroutines
  • Always handle goroutine cleanup and cancellation properly

Project Structure

  • Keep packages focused and cohesive
  • Avoid circular dependencies
  • Place tests in *_test.go files

Documentation

Prefer placing documentation in the docs/ directory.

There are 3 main types of documentation targeting different audiences:

  1. Developer Documentation - For contributors and maintainers of this project

    • Architecture decisions
    • Development setup and workflow
    • Contributing guidelines
    • usually located under the docs/developer-guide/ directory
  2. Administrator Documentation - For operators deploying and managing the autoscaler controller

    • Installation and configuration
    • Deployment guidelines
    • Monitoring and troubleshooting
    • usually located under the docs/user-guide/ directory (for example, in an admin-focused subdirectory)
  3. End-User Documentation - For application developers creating applications that use the autoscaler

    • Usage guides and examples
    • API reference
    • Best practices and common patterns
    • usually located under the docs/user-guide/ directory (for example, in an end-user-focused subdirectory)

E2E Testing

  • Use make targets for running e2e tests (e.g., make test-e2e-smoke or make test-e2e-full) and document the process in docs/developer-guide/testing.md
  • Use make test for unit tests
  • Never use images from docker.io in e2e tests. All container images must use fully-qualified registry paths (e.g., registry.k8s.io/, quay.io/, or a private registry). Do not rely on Docker Hub as a default registry.
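
The image rule could be enforced with a small check like the one below. This is a sketch, not existing project tooling: `usesDockerHub` is a hypothetical helper, and it assumes the usual container reference convention that the first path component is a registry host only if it contains a dot or a colon, or is "localhost".

```go
package main

import (
	"fmt"
	"strings"
)

// usesDockerHub reports whether an image reference would resolve to Docker
// Hub, either explicitly (docker.io/...) or implicitly because the reference
// has no registry host.
func usesDockerHub(image string) bool {
	first, _, found := strings.Cut(image, "/")
	if !found {
		return true // e.g. "nginx:1.27" has no registry component
	}
	// Assumption: a leading component is a registry host only if it looks
	// like a hostname, has a port, or is "localhost".
	isHost := strings.ContainsAny(first, ".:") || first == "localhost"
	if !isHost {
		return true // e.g. "library/nginx"
	}
	return first == "docker.io" || first == "index.docker.io"
}

func main() {
	for _, img := range []string{
		"nginx:1.27",
		"docker.io/library/nginx",
		"registry.k8s.io/pause:3.9",
		"quay.io/example/app:v1",
	} {
		fmt.Printf("%-28s dockerHub=%v\n", img, usesDockerHub(img))
	}
}
```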

CLI Tools

llm-d Inference Scheduler EPP CLI Reference

This section documents the command-line flags and environment variables supported by the llm-d inference scheduler EPP (Endpoint Picker). The EPP inherits its CLI from gateway-api-inference-extension.

Main Branch (Latest)

Uses gateway-api-inference-extension at commit fd30cb97714a (post-v1.3.0).

Command-Line Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --grpc-port | int | 9002 | gRPC port used for communicating with Envoy proxy |
| --ha-enable-leader-election | bool | false | Enables leader election for high availability. When enabled, readiness probes will only pass on the leader |
| --pool-group | string | inference.networking.k8s.io | Kubernetes resource group of the InferencePool this Endpoint Picker is associated with |
| --pool-namespace | string | "" | Namespace of the InferencePool this Endpoint Picker is associated with |
| --pool-name | string | "" | Name of the InferencePool this Endpoint Picker is associated with |
| --endpoint-selector | string | "" | Selector to filter model server pods on, only 'key=value' pairs are supported. Format: comma-separated list of key=value pairs (e.g., 'app=vllm-llama3-8b-instruct,env=prod') |
| --endpoint-target-ports | []int | [] | Target ports of model server pods. Format: comma-separated list of numbers (e.g., '3000,3001,3002') |
| --disable-endpoint-subset-filter | bool | false | Disables respecting the x-gateway-destination-endpoint-subset metadata for dispatching requests in EPP |
| --model-server-metrics-scheme | string | http | Protocol scheme used in scraping metrics from endpoints |
| --model-server-metrics-path | string | /metrics | URL path used in scraping metrics from endpoints |
| --model-server-metrics-port | int | 0 | DEPRECATED: Port to scrape metrics from endpoints |
| --model-server-metrics-https-insecure-skip-verify | bool | true | Disable certificate verification when using 'https' scheme for model-server-metrics-scheme |
| --refresh-metrics-interval | duration | 50ms | Interval to refresh metrics |
| --refresh-prometheus-metrics-interval | duration | 5s | Interval to flush Prometheus metrics |
| --metrics-staleness-threshold | duration | 2s | Duration after which metrics are considered stale |
| --total-queued-requests-metric | string | vllm:num_requests_waiting | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
| --total-running-requests-metric | string | vllm:num_requests_running | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
| --kv-cache-usage-percentage-metric | string | vllm:kv_cache_usage_perc | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
| --lora-info-metric | string | vllm:lora_requests_info | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
| --cache-info-metric | string | vllm:cache_config_info | DEPRECATED: Use engineConfigs in EndpointPickerConfig instead |
| -v, --v | int | 0 | Number for the log level verbosity |
| --zap-log-level | string | | Zap log level (debug, info, warn, error) |
| --zap-devel | bool | true | Development Mode defaults (encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn) |
| --zap-encoder | string | | Zap log encoding ('json' or 'console') |
| --zap-stacktrace-level | string | | Zap Level at and above which stacktraces are captured |
| --tracing | bool | true | Enables emitting traces |
| --health-checking | bool | false | Enables health checking |
| --metrics-port | int | 9090 | The metrics port exposed by EPP |
| --grpc-health-port | int | 9003 | The port used for gRPC liveness and readiness probes |
| --enable-pprof | bool | true | Enables pprof handlers |
| --cert-path | string | "" | The path to the certificate for secure serving. Certificate and private key files are assumed to be named tls.crt and tls.key |
| --enable-cert-reload | bool | false | Enables certificate reloading of the certificates specified in --cert-path |
| --secure-serving | bool | true | Enables secure serving |
| --metrics-endpoint-auth | bool | true | Enables authentication and authorization of the metrics endpoint |
| --config-file | string | "" | The path to the configuration file |
| --config-text | string | "" | The configuration specified as text, in lieu of a file |

Environment Variables

| Variable | Description | Deprecation |
| --- | --- | --- |
| NAMESPACE | Used to determine pool namespace when --pool-namespace is not set | - |
| POD_NAME | Used to determine EPP name when using --endpoint-selector mode | - |
| ENABLE_EXPERIMENTAL_DATALAYER_V2 | Enables experimental pluggable data layer | DEPRECATED: Use FeatureGates in config file instead |
| ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER | Enables experimental pluggable flow control layer | DEPRECATED: Use FeatureGates in config file instead |
| SD_QUEUE_DEPTH_THRESHOLD | Saturation detector queue depth threshold | DEPRECATED: Use config file instead |
| SD_KV_CACHE_UTIL_THRESHOLD | Saturation detector KV cache utilization threshold | DEPRECATED: Use config file instead |
| SD_METRICS_STALENESS_THRESHOLD | Saturation detector metrics staleness threshold | DEPRECATED: Use config file instead |

v0.5.0

Uses gateway-api-inference-extension v1.3.0.

Command-Line Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --grpc-port | int | 9002 | gRPC port used for communicating with Envoy proxy |
| --ha-enable-leader-election | bool | false | Enables leader election for high availability. When enabled, readiness probes will only pass on the leader |
| --pool-group | string | inference.networking.k8s.io | Kubernetes resource group of the InferencePool this Endpoint Picker is associated with |
| --pool-namespace | string | "" | Namespace of the InferencePool this Endpoint Picker is associated with |
| --pool-name | string | "" | Name of the InferencePool this Endpoint Picker is associated with |
| --endpoint-selector | string | "" | Selector to filter model server pods on, only 'key=value' pairs are supported. Format: comma-separated list of key=value pairs (e.g., 'app=vllm-llama3-8b-instruct,env=prod') |
| --endpoint-target-ports | []int | [] | Target ports of model server pods. Format: comma-separated list of numbers (e.g., '3000,3001,3002') |
| --disable-endpoint-subset-filter | bool | false | Disables respecting the x-gateway-destination-endpoint-subset metadata for dispatching requests in EPP |
| --model-server-metrics-scheme | string | http | Protocol scheme used in scraping metrics from endpoints |
| --model-server-metrics-path | string | /metrics | URL path used in scraping metrics from endpoints |
| --model-server-metrics-port | int | 0 | DEPRECATED: Port to scrape metrics from endpoints. Set to InferencePool.Spec.TargetPorts[0].Number if not defined |
| --model-server-metrics-https-insecure-skip-verify | bool | true | Disable certificate verification when using 'https' scheme for model-server-metrics-scheme |
| --refresh-metrics-interval | duration | 50ms | Interval to refresh metrics |
| --refresh-prometheus-metrics-interval | duration | 5s | Interval to flush Prometheus metrics |
| --metrics-staleness-threshold | duration | 2s | Duration after which metrics are considered stale |
| --total-queued-requests-metric | string | vllm:num_requests_waiting | Prometheus metric for the number of queued requests |
| --total-running-requests-metric | string | vllm:num_requests_running | Prometheus metric for the number of running requests |
| --kv-cache-usage-percentage-metric | string | vllm:kv_cache_usage_perc | Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1) |
| --lora-info-metric | string | vllm:lora_requests_info | Prometheus metric for the LoRA info metrics (must be in vLLM label format) |
| --cache-info-metric | string | vllm:cache_config_info | Prometheus metric for the cache info metrics |
| -v, --v | int | 0 | Number for the log level verbosity |
| --zap-log-level | string | | Zap log level (debug, info, warn, error) |
| --zap-devel | bool | true | Development Mode defaults (encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn) |
| --zap-encoder | string | | Zap log encoding ('json' or 'console') |
| --zap-stacktrace-level | string | | Zap Level at and above which stacktraces are captured |
| --tracing | bool | true | Enables emitting traces |
| --health-checking | bool | false | Enables health checking |
| --metrics-port | int | 9090 | The metrics port exposed by EPP |
| --grpc-health-port | int | 9003 | The port used for gRPC liveness and readiness probes |
| --enable-pprof | bool | true | Enables pprof handlers |
| --cert-path | string | "" | The path to the certificate for secure serving. Certificate and private key files are assumed to be named tls.crt and tls.key |
| --enable-cert-reload | bool | false | Enables certificate reloading of the certificates specified in --cert-path |
| --secure-serving | bool | true | Enables secure serving |
| --metrics-endpoint-auth | bool | true | Enables authentication and authorization of the metrics endpoint |
| --config-file | string | "" | The path to the configuration file |
| --config-text | string | "" | The configuration specified as text, in lieu of a file |

Environment Variables

| Variable | Description | Deprecation |
| --- | --- | --- |
| NAMESPACE | Used to determine pool namespace when --pool-namespace is not set | - |
| POD_NAME | Used to determine EPP name when using --endpoint-selector mode | - |
| ENABLE_EXPERIMENTAL_DATALAYER_V2 | Enables experimental pluggable data layer | DEPRECATED: Use FeatureGates in config file instead |
| ENABLE_EXPERIMENTAL_FLOW_CONTROL_LAYER | Enables experimental pluggable flow control layer | DEPRECATED: Use FeatureGates in config file instead |
| SD_QUEUE_DEPTH_THRESHOLD | Saturation detector queue depth threshold | DEPRECATED: Use config file instead |
| SD_KV_CACHE_UTIL_THRESHOLD | Saturation detector KV cache utilization threshold | DEPRECATED: Use config file instead |
| SD_METRICS_STALENESS_THRESHOLD | Saturation detector metrics staleness threshold | DEPRECATED: Use config file instead |

Key Differences Between Main and v0.5.0

  1. Metric Flags: On the main branch, --total-queued-requests-metric, --total-running-requests-metric, --kv-cache-usage-percentage-metric, --lora-info-metric, and --cache-info-metric are deprecated and produce an error if explicitly set. In v0.5.0, these flags are functional.

  2. Configuration: The main branch encourages using EndpointPickerConfig with engineConfigs for metrics configuration instead of CLI flags.


llm-d Inference Simulator CLI Reference

This section documents the command-line flags and environment variables supported by the llm-d inference simulator (llm-d-inference-sim). The simulator is a vLLM server simulator supporting OpenAI API endpoints.

Main Branch (Latest)

Command-Line Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --config | string | "" | Path to a YAML configuration file. Command line values overwrite config file values |
| --port | int | 8000 | Port on which the simulator runs |
| --model | string | "" | Currently 'loaded' model name (required) |
| --served-model-name | []string | [] | Model names exposed by the API (space-separated strings). Falls back to --model if not set |
| --max-num-seqs | int | 5 | Maximum number of inference requests that could be processed at the same time |
| --max-waiting-queue-length | int | 1000 | Maximum length of inference requests waiting queue |
| --max-loras | int | 1 | Maximum number of LoRAs in a single batch |
| --max-cpu-loras | int | (same as --max-loras) | Maximum number of LoRAs to store in CPU memory |
| --max-model-len | int | 1024 | Model's context window, maximum number of tokens in a single request including input and output |
| --lora-modules | []string | [] | List of LoRA adapters (space-separated JSON strings) |
| --mode | string | random | Simulator mode: echo returns input text; random returns random pre-defined sentences |
| --seed | int64 | (current Unix nano) | Random seed for operations |
| --time-to-first-token | duration | 0 | Time to first token (e.g., "100ms"). Integer format (milliseconds) is deprecated |
| --time-to-first-token-std-dev | duration | 0 | Standard deviation for time to first token (max 30% of TTFT) |
| --inter-token-latency | duration | 0 | Time to generate one token (e.g., "100ms"). Integer format is deprecated |
| --inter-token-latency-std-dev | duration | 0 | Standard deviation for inter-token latency (max 30% of ITL) |
| --prefill-overhead | duration | 0 | Time to prefill. Ignored if --time-to-first-token is set |
| --prefill-time-per-token | duration | 0 | Time to prefill per token |
| --prefill-time-std-dev | duration | 0 | Standard deviation for prefill time |
| --kv-cache-transfer-latency | duration | 0 | Time for KV-cache transfer from a remote vLLM (P/D mode) |
| --kv-cache-transfer-latency-std-dev | duration | 0 | Standard deviation for KV-cache transfer latency |
| --kv-cache-transfer-time-per-token | duration | 0 | Time for KV-cache transfer per token from a remote vLLM |
| --kv-cache-transfer-time-std-dev | duration | 0 | Standard deviation for KV-cache transfer time per token |
| --time-factor-under-load | float64 | 1.0 | Multiplicative factor affecting request time when parallel requests are processed (must be >= 1.0) |
| --enable-kvcache | bool | false | Enables KV cache feature |
| --kv-cache-size | int | 1024 | Maximum number of token blocks in KV cache |
| --global-cache-hit-threshold | float64 | 0 | Default cache hit threshold [0, 1] for all requests |
| --block-size | int | 16 | Token block size for contiguous chunks (valid: 8, 16, 32, 64, 128) |
| --tokenizers-cache-dir | string | hf_cache | Directory for caching tokenizers |
| --hash-seed | string | "" | Seed for hash generation (falls back to PYTHONHASHSEED env var) |
| --zmq-endpoint | string | tcp://localhost:5557 | ZMQ address to publish events |
| --zmq-max-connect-attempts | int | 0 | Maximum number of times to try ZMQ connect (max 10) |
| --event-batch-size | int | 16 | Maximum number of KV-cache events to be sent together |
| --data-parallel-size | int | 1 | Number of ranks to run (1-8) |
| --data-parallel-rank | int | -1 | The rank when running each rank in a process |
| --failure-injection-rate | int | 0 | Probability (0-100) of injecting failures |
| --failure-types | []string | [] | Specific failure types to inject: rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found |
| --fake-metrics | string | "" | JSON metrics to report to Prometheus instead of real metrics |
| --ssl-certfile | string | "" | Path to SSL certificate file for HTTPS |
| --ssl-keyfile | string | "" | Path to SSL private key file for HTTPS |
| --self-signed-certs | bool | false | Enable automatic generation of self-signed certificates for HTTPS |
| --dataset-path | string | "" | Local path to SQLite database file for response generation from a dataset |
| --dataset-url | string | "" | URL to download the SQLite database file for response generation |
| --dataset-in-memory | bool | false | Load the entire dataset into memory for faster access |
| --enable-sleep-mode | bool | false | Enable sleep mode |
| --enable-request-id-headers | bool | false | Enable including X-Request-Id header in responses |
| --latency-calculator | string | "" | Name of the latency calculator: constant or per-token |
| --max-tool-call-integer-param | int | 100 | Maximum possible value of integer parameters in a tool call |
| --min-tool-call-integer-param | int | 0 | Minimum possible value of integer parameters in a tool call |
| --max-tool-call-number-param | float64 | 100 | Maximum possible value of number (float) parameters in a tool call |
| --min-tool-call-number-param | float64 | 0 | Minimum possible value of number (float) parameters in a tool call |
| --max-tool-call-array-param-length | int | 5 | Maximum possible length of array parameters in a tool call |
| --min-tool-call-array-param-length | int | 1 | Minimum possible length of array parameters in a tool call |
| --tool-call-not-required-param-probability | int | 50 | Probability (0-100) to add a non-required parameter in a tool call |
| --object-tool-call-not-required-field-probability | int | 50 | Probability (0-100) to add a non-required field in an object in a tool call |

Environment Variables

| Variable | Description |
| --- | --- |
| POD_NAME | Pod name of simulator |
| POD_NAMESPACE | Namespace where simulator is running |
| POD_IP | IP address on which simulator runs |
| PYTHONHASHSEED | Fallback seed for hash generation if --hash-seed is not set |
| VLLM_SERVER_DEV_MODE | Set to 1 to enable development mode |

v0.5.0

Command-Line Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --config | string | "" | Path to a YAML configuration file. Command line values overwrite config file values |
| --port | int | 8000 | Port on which the simulator runs |
| --model | string | "" | Currently 'loaded' model name (required) |
| --served-model-name | []string | [] | Model names exposed by the API (space-separated strings). Falls back to --model if not set |
| --max-num-seqs | int | 5 | Maximum number of inference requests that could be processed at the same time (parameter to simulate requests waiting queue) |
| --max-loras | int | 1 | Maximum number of LoRAs in a single batch |
| --max-cpu-loras | int | (same as --max-loras) | Maximum number of LoRAs to store in CPU memory |
| --max-model-len | int | 1024 | Model's context window, maximum number of tokens in a single request including input and output |
| --lora-modules | []string | [] | List of LoRA adapters (space-separated JSON strings) |
| --mode | string | random | Simulator mode: echo returns input text; random returns random pre-defined sentences |
| --seed | int64 | (current Unix nano) | Random seed for operations |
| --time-to-first-token | int | 0 | Time to first token in milliseconds |
| --time-to-first-token-std-dev | int | 0 | Standard deviation for time to first token in milliseconds (max 30% of TTFT) |
| --inter-token-latency | int | 0 | Time to generate one token in milliseconds |
| --inter-token-latency-std-dev | int | 0 | Standard deviation for inter-token latency in milliseconds (max 30% of ITL) |
| --prefill-overhead | int | 0 | Time to prefill in milliseconds. Ignored if --time-to-first-token is not 0 |
| --prefill-time-per-token | int | 0 | Time to prefill per token in milliseconds |
| --prefill-time-std-dev | int | 0 | Standard deviation for prefill time in milliseconds |
| --kv-cache-transfer-latency | int | 0 | Time for KV-cache transfer from a remote vLLM in milliseconds (P/D mode) |
| --kv-cache-transfer-latency-std-dev | int | 0 | Standard deviation for KV-cache transfer latency in milliseconds |
| --kv-cache-transfer-time-per-token | int | 0 | Time for KV-cache transfer per token from a remote vLLM in milliseconds |
| --kv-cache-transfer-time-std-dev | int | 0 | Standard deviation for KV-cache transfer time per token in milliseconds |
| --time-factor-under-load | float64 | 1.0 | Multiplicative factor affecting request time when parallel requests are processed (must be >= 1.0) |
| --enable-kvcache | bool | false | Enables KV cache feature |
| --kv-cache-size | int | 1024 | Maximum number of token blocks in KV cache |
| --block-size | int | 16 | Token block size for contiguous chunks (valid: 8, 16, 32, 64, 128) |
| --tokenizers-cache-dir | string | "" | Directory for caching tokenizers |
| --hash-seed | string | "" | Seed for hash generation (falls back to PYTHONHASHSEED env var) |
| --zmq-endpoint | string | tcp://localhost:5557 | ZMQ address to publish events |
| --zmq-max-connect-attempts | uint | 0 | Maximum number of times to try ZMQ connect (max 10) |
| --event-batch-size | int | 16 | Maximum number of KV-cache events to be sent together |
| --data-parallel-size | int | 1 | Number of ranks to run (1-8) |
| --failure-injection-rate | int | 0 | Probability (0-100) of injecting failures |
| --failure-types | []string | [] | Specific failure types to inject: rate_limit, invalid_api_key, context_length, server_error, invalid_request, model_not_found |
| --fake-metrics | string | "" | JSON metrics to report to Prometheus instead of real metrics |
| --max-tool-call-integer-param | int | 100 | Maximum possible value of integer parameters in a tool call |
| --min-tool-call-integer-param | int | 0 | Minimum possible value of integer parameters in a tool call |
| --max-tool-call-number-param | float64 | 100 | Maximum possible value of number (float) parameters in a tool call |
| --min-tool-call-number-param | float64 | 0 | Minimum possible value of number (float) parameters in a tool call |
| --max-tool-call-array-param-length | int | 5 | Maximum possible length of array parameters in a tool call |
| --min-tool-call-array-param-length | int | 1 | Minimum possible length of array parameters in a tool call |
| --tool-call-not-required-param-probability | int | 50 | Probability (0-100) to add a non-required parameter in a tool call |
| --object-tool-call-not-required-field-probability | int | 50 | Probability (0-100) to add a non-required field in an object in a tool call |

Environment Variables

| Variable | Description |
| --- | --- |
| POD_NAME | Pod name of simulator |
| POD_NAMESPACE | Namespace where simulator is running |
| PYTHONHASHSEED | Fallback seed for hash generation if --hash-seed is not set |

Key Differences Between Main and v0.5.0

  1. Duration Parameters: On the main branch, latency-related parameters (--time-to-first-token, --inter-token-latency, etc.) use Go duration strings (e.g., "100ms", "1.5s"). In v0.5.0, these are integers representing milliseconds.

  2. New Flags in Main: --max-waiting-queue-length, --global-cache-hit-threshold, --data-parallel-rank, --ssl-certfile, --ssl-keyfile, --self-signed-certs, --dataset-path, --dataset-url, --dataset-in-memory, --enable-sleep-mode, --enable-request-id-headers, --latency-calculator.

  3. Environment Variables: The main branch adds POD_IP and VLLM_SERVER_DEV_MODE.