PyLet Examples

Configuration Files

PyLet can read instance configurations from TOML files. See configs/ for examples.

Using Config Files

# Submit using config file
pylet submit --config configs/inference.toml

# Override config values with CLI args
pylet submit --config configs/inference.toml --gpu-units 0  # CPU-only for testing

# Config file with additional command
pylet submit --config configs/simple.toml echo "extra args"

Config File Format

# job.toml - Example config
name = "my-instance"              # Optional, defaults to filename
command = ["python", "train.py"]  # Array format (recommended)
# command = "python train.py"     # String format also works

[resources]
gpus = 1                          # GPU count (auto-allocated)
cpus = 4                          # CPU cores
memory = "16Gi"                   # Memory (Gi, Mi, Ki units)

[env]
HF_TOKEN = "${HF_TOKEN}"          # Interpolate from shell environment
STATIC_VAR = "fixed_value"        # Static value

[labels]
type = "inference"                # Custom metadata

Precedence

When the same setting is specified in multiple places, the highest priority wins:

Priority      Source          Example
1 (highest)   CLI arguments   --gpu-units 0
2             Config file     gpus = 1 in job.toml
3 (lowest)    Defaults        gpus = 0
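One subtlety of this ordering is that an explicit CLI value of 0 (as in `--gpu-units 0`) must still beat the config file, while an *unset* flag must not. A common way to express that is a `None` sentinel for unset flags; a sketch (the `resolve` name and dict shapes are illustrative, not PyLet internals):

```python
def resolve(defaults: dict, config: dict, cli: dict) -> dict:
    """Merge settings so later sources win; CLI flags left unset (None) do not override."""
    merged = {**defaults, **config}           # config file overrides defaults
    merged.update({k: v for k, v in cli.items() if v is not None})  # explicit CLI wins
    return merged

# An explicit --gpu-units 0 overrides gpus = 1 from the config file:
resolve({"gpus": 0}, {"gpus": 1}, {"gpus": 0})     # -> {"gpus": 0}
# An unset flag falls through to the config file value:
resolve({"gpus": 0}, {"gpus": 1}, {"gpus": None})  # -> {"gpus": 1}
```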

Python API

See start_vllm.py for a complete example of using PyLet from Python:

import asyncio

from pylet.client import PyletClient

async def main():
    client = PyletClient("http://localhost:8000")

    # Submit instance
    instance_id = await client.submit_instance(
        command="vllm serve Qwen/Qwen2.5-1.5B-Instruct --port $PORT",
        resource_requirements={"cpu_cores": 1, "gpu_units": 1, "memory_mb": 4096},
        name="my-vllm",
    )

    # Wait for the instance to reach RUNNING, then get its endpoint
    endpoint = await client.get_instance_endpoint(instance_id)

    # Send requests to http://{endpoint}/v1/completions
    # ...

    # Cleanup
    await client.cancel_instance(instance_id)
    await client.close()

asyncio.run(main())

Running vLLM on PyLet

Prerequisites

  • vLLM installed: pip install vllm
  • A machine with GPU(s)

Step-by-step

# Terminal 1: Start head node
pylet start

# Terminal 2: Start worker with GPU(s)
pylet start --head localhost:8000 --gpu-units 1

# Terminal 3: Submit vLLM instance
# Use $PORT so vLLM binds to the worker-allocated port
pylet submit 'vllm serve Qwen/Qwen2.5-1.5B-Instruct --port $PORT' \
    --gpu-units 1 --name vllm-test

# Check instance status
pylet get-instance --name vllm-test

# Get endpoint (wait for RUNNING status)
pylet get-endpoint --name vllm-test
# Output: 192.168.1.10:15600

# Test inference
curl http://<endpoint>/v1/models
curl http://<endpoint>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","prompt":"Hello","max_tokens":50}'

# View logs (streams in real-time)
pylet logs <instance-id>
pylet logs <instance-id> --follow

# Cancel instance (graceful shutdown)
pylet cancel <instance-id>

Key behaviors

Feature    How it works
Port       Worker sets the PORT env var (15600-15700). Use --port $PORT in your command.
GPU        Worker sets CUDA_VISIBLE_DEVICES based on the allocated GPUs.
Endpoint   pylet get-endpoint returns worker_ip:port for client access.
Logs       Captured via a sidecar; available in real time via pylet logs.
Cancel     Sends SIGTERM, waits a 30 s grace period, then sends SIGKILL if needed.
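The worker's port handling isn't shown in this repo's docs beyond the table above, but as an illustration, scanning the documented 15600-15700 range for a free TCP port and handing it to the child process via the environment might look like this (a sketch; `allocate_port` is a hypothetical helper, not PyLet's API):

```python
import socket

PORT_RANGE = range(15600, 15701)  # the range the worker draws from, per the table above

def allocate_port() -> int:
    """Return the first TCP port in PORT_RANGE that we can bind to."""
    for port in PORT_RANGE:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                s.bind(("", port))  # raises OSError if the port is taken
            except OSError:
                continue
            return port
    raise RuntimeError("no free port in 15600-15700")

# The worker would then launch the command with PORT set, e.g.:
#   env = {**os.environ, "PORT": str(allocate_port()), "CUDA_VISIBLE_DEVICES": "0"}
#   subprocess.Popen("vllm serve ... --port $PORT", shell=True, env=env)
```

Note there is an inherent race between probing a port and the child binding it, which is why the worker, not the client, owns the allocation.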

Running SGLang

pylet submit 'python -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --port $PORT' \
    --gpu-units 1 --name sglang-test

Multi-GPU instance

# Worker with 4 GPUs
pylet start --head localhost:8000 --gpu-units 4

# Request 2 GPUs for tensor parallelism
pylet submit 'vllm serve meta-llama/Llama-3.1-70B-Instruct --port $PORT --tensor-parallel-size 2' \
    --gpu-units 2 --name llama70b