PyLet supports TOML configuration files for defining instances. See `configs/` for examples.
```bash
# Submit using config file
pylet submit --config configs/inference.toml

# Override config values with CLI args
pylet submit --config configs/inference.toml --gpu-units 0  # CPU-only for testing

# Config file with additional command
pylet submit --config configs/simple.toml echo "extra args"
```

```toml
# job.toml - Example config
name = "my-instance"                # Optional, defaults to filename
command = ["python", "train.py"]    # Array format (recommended)
# command = "python train.py"       # String format also works

[resources]
gpus = 1            # GPU count (auto-allocated)
cpus = 4            # CPU cores
memory = "16Gi"     # Memory (Gi, Mi, Ki units)

[env]
HF_TOKEN = "${HF_TOKEN}"        # Interpolated from the shell environment
STATIC_VAR = "fixed_value"      # Static value

[labels]
type = "inference"  # Custom metadata
```

When the same setting is specified in multiple places, the highest priority wins:
| Priority | Source | Example |
|---|---|---|
| 1 (highest) | CLI arguments | `--gpu-units 0` |
| 2 | Config file | `gpus = 1` in `job.toml` |
| 3 (lowest) | Defaults | `gpus = 0` |
See `start_vllm.py` for a complete example of using PyLet from Python:

```python
import asyncio

from pylet.client import PyletClient

async def main():
    client = PyletClient("http://localhost:8000")

    # Submit instance
    instance_id = await client.submit_instance(
        command="vllm serve Qwen/Qwen2.5-1.5B-Instruct --port $PORT",
        resource_requirements={"cpu_cores": 1, "gpu_units": 1, "memory_mb": 4096},
        name="my-vllm",
    )

    # Wait for RUNNING, then get endpoint
    endpoint = await client.get_instance_endpoint(instance_id)

    # Send requests to http://{endpoint}/v1/completions
    # ...

    # Cleanup
    await client.cancel_instance(instance_id)
    await client.close()

if __name__ == "__main__":
    asyncio.run(main())
```

Prerequisites:

- vLLM installed: `pip install vllm`
- A machine with GPU(s)
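In the Python example above, waiting for RUNNING appears only as a comment. One way to write that wait is a polling loop like the sketch below; the `get_instance` method, its `state` field, and the state names are assumptions rather than confirmed PyLet API, and a stub client is included so the snippet runs on its own:

```python
import asyncio

async def wait_for_running(client, instance_id, timeout=60.0, interval=2.0):
    # Hypothetical helper: get_instance and the state names are assumptions.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        info = await client.get_instance(instance_id)
        if info["state"] == "RUNNING":
            return info
        if info["state"] in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"instance ended in state {info['state']}")
        await asyncio.sleep(interval)
    raise TimeoutError(f"{instance_id} not RUNNING after {timeout}s")

class StubClient:
    """Stands in for PyletClient so the sketch is self-contained."""
    def __init__(self):
        self.polls = 0
    async def get_instance(self, instance_id):
        self.polls += 1
        return {"state": "RUNNING" if self.polls >= 3 else "PENDING"}

info = asyncio.run(wait_for_running(StubClient(), "inst-1", interval=0.01))
print(info["state"])  # RUNNING
```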
```bash
# Terminal 1: Start head node
pylet start

# Terminal 2: Start worker with GPU(s)
pylet start --head localhost:8000 --gpu-units 1

# Terminal 3: Submit vLLM instance
# Use $PORT so vLLM binds to the worker-allocated port
pylet submit 'vllm serve Qwen/Qwen2.5-1.5B-Instruct --port $PORT' \
  --gpu-units 1 --name vllm-test

# Check instance status
pylet get-instance --name vllm-test

# Get endpoint (wait for RUNNING status)
pylet get-endpoint --name vllm-test
# Output: 192.168.1.10:15600

# Test inference
curl http://<endpoint>/v1/models
curl http://<endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","prompt":"Hello","max_tokens":50}'

# View logs (streams in real-time)
pylet logs <instance-id>
pylet logs <instance-id> --follow

# Cancel instance (graceful shutdown)
pylet cancel <instance-id>
```

| Feature | How it works |
|---|---|
| Port | Worker sets the `PORT` env var (15600-15700). Use `--port $PORT` in your command. |
| GPU | Worker sets `CUDA_VISIBLE_DEVICES` based on allocated GPUs. |
| Endpoint | `pylet get-endpoint` returns `worker_ip:port` for client access. |
| Logs | Captured via a sidecar, available in real-time via `pylet logs`. |
| Cancel | Sends SIGTERM, waits a 30s grace period, then SIGKILL if needed. |
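The Port and Cancel rows can be illustrated with a plain `subprocess` sketch. This mirrors the described behavior but is not PyLet's actual worker code; the port value is a made-up example from the documented range:

```python
import os
import signal
import subprocess

# Illustrative sketch, not PyLet's worker implementation.
ALLOCATED_PORT = 15600  # worker picks a port in 15600-15700

env = {**os.environ, "PORT": str(ALLOCATED_PORT)}
# The submitted command runs through a shell, so $PORT expands:
out = subprocess.run(
    "echo listening on $PORT", shell=True, env=env,
    capture_output=True, text=True,
).stdout.strip()
print(out)  # listening on 15600

# Cancel: SIGTERM first, SIGKILL only if the grace period expires.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGTERM)
try:
    proc.wait(timeout=30)  # 30s grace period
except subprocess.TimeoutExpired:
    proc.kill()
print(proc.returncode)  # -15: terminated by SIGTERM
```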
```bash
pylet submit 'python -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --port $PORT' \
  --gpu-units 1 --name sglang-test
```

```bash
# Worker with 4 GPUs
pylet start --head localhost:8000 --gpu-units 4

# Request 2 GPUs for tensor parallelism
pylet submit 'vllm serve meta-llama/Llama-3.1-70B-Instruct --port $PORT --tensor-parallel-size 2' \
  --gpu-units 2 --name llama70b
```
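The worker-side bookkeeping implied here, mapping a 2-GPU request onto a 4-GPU worker and exporting `CUDA_VISIBLE_DEVICES`, might look like the free-list sketch below. `GpuPool` is a hypothetical illustration, not PyLet's actual scheduler:

```python
# Hypothetical free-list GPU allocator, sketching the
# CUDA_VISIBLE_DEVICES behavior described above.
class GpuPool:
    def __init__(self, total: int):
        self.free = list(range(total))

    def allocate(self, n: int) -> list[int]:
        if n > len(self.free):
            raise RuntimeError(f"requested {n} GPUs, only {len(self.free)} free")
        taken, self.free = self.free[:n], self.free[n:]
        return taken

    def release(self, gpus: list[int]) -> None:
        self.free = sorted(self.free + gpus)

pool = GpuPool(total=4)        # pylet start ... --gpu-units 4
allocated = pool.allocate(2)   # pylet submit ... --gpu-units 2
env = {"CUDA_VISIBLE_DEVICES": ",".join(map(str, allocated))}
print(env["CUDA_VISIBLE_DEVICES"])  # 0,1
```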