This guide shows how to run and test the CLI locally using the built-in echo server and the included dummy dataset, with no real inference endpoint required.
Dataset: The repo includes tests/datasets/dummy_1k.jsonl (1000 samples)
Format: Automatically inferred from the file extension. Common local formats include jsonl, json, csv, parquet, and HuggingFace datasets.
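For a quick local experiment you can also generate a tiny dataset by hand. A minimal sketch, assuming each line is a JSON object with a prompt field (verify the field names your dataset loader actually expects):

```shell
# Sketch: create a minimal 3-sample JSONL dataset for local runs.
# The "prompt" field name is an assumption; check the loader's schema.
printf '%s\n' \
  '{"prompt": "What is the capital of France?"}' \
  '{"prompt": "Summarize photosynthesis in one sentence."}' \
  '{"prompt": "Write a haiku about the ocean."}' \
  > mini_3.jsonl
grep -c . mini_3.jsonl   # prints 3
```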
The echo server is included for local testing and mirrors requests back as responses.
# Terminal 1: Start echo server on port 8765
python3 -m inference_endpoint.testing.echo_server --port 8765
# Or use default port 12345
python3 -m inference_endpoint.testing.echo_server

The server will log:
Server ready on port 8765
Server is running. Press Ctrl+C to stop...
# Terminal 2: Test probe command
inference-endpoint -v probe \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--requests 5
# With custom prompt and model
inference-endpoint -v probe \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--requests 10 \
--prompt "Tell me a joke in 20 words"

Expected Output (shown for the first probe command, which sends 5 requests):
Probing: http://localhost:8765
Sending 5 requests...
Issued 1/5 requests
...
Issued 5/5 requests
Waiting for 5 responses...
Processed 5/5 responses
✓ Completed: 5/5 successful
✓ Avg latency: 184ms
✓ Range: 184ms - 184ms
✓ Sample responses (5 collected):
[probe-0] Please write me a joke in 30 words.
[probe-1] Please write me a joke in 30 words.
...
✓ Probe successful
# Quick test (model is required)
inference-endpoint -v benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--duration 0
# Production test with custom params and report generation
inference-endpoint -v benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--num-samples 5000 \
--workers 4 \
--report-dir benchmark_report
# Note: Set HF_TOKEN environment variable if using non-public models
# export HF_TOKEN=your_huggingface_token

Expected Output:
Loading: dummy_1k.jsonl
Loaded 1000 samples
Mode: TestMode.PERF, QPS: 10.0, Responses: False
Streaming: disabled (auto, offline mode)
Min Duration: 0.0s, Expected samples: 1000
Scheduler: MaxThroughputScheduler (pattern: max_throughput)
Connecting: http://localhost:8765
Running...
Completed in 0.5s
Results: 1000/1000 successful
Estimated QPS: 2000.0
Cleaning up...
# Test sustained QPS with latency focus
inference-endpoint -v benchmark online \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--duration 0 \
--load-pattern poisson \
--target-qps 100 \
--report-dir online_benchmark_report

Expected Output:
Loading: dummy_1k.jsonl
Loaded 1000 samples
Mode: TestMode.PERF, QPS: 100.0, Responses: False
Streaming: enabled (auto, online mode)
Min Duration: 0.0s, Expected samples: 1000
Scheduler: PoissonDistributionScheduler (pattern: poisson)
Connecting: http://localhost:8765
Running...
Completed in 10.0s
Results: 1000/1000 successful
Estimated QPS: 100.0
Cleaning up...
# Show info
inference-endpoint -v info
# Generate template
inference-endpoint init offline
# Validate config
inference-endpoint validate-yaml --config offline_template.yaml
# Test with existing dataset
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/ds_samples.jsonl \
-v

A report directory is always created (at --report-dir if specified, or at a default path otherwise), containing benchmark artifacts: result_summary.json, runtime_settings.json, sample_idx_map.json, report.txt, and events.jsonl.
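To spot-check these artifacts after a run, a couple of shell one-liners suffice (the directory name assumes --report-dir benchmark_report; adjust to your run):

```shell
# List the generated artifacts, or note that no run has produced them yet
ls -1 benchmark_report/ 2>/dev/null || echo "no report dir yet"
# Once present, pretty-print the summary:
# python3 -m json.tool benchmark_report/result_summary.json
```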
Press Ctrl+C in the terminal running the echo server, or:
pkill -f echo_server

# Custom host and port
python3 -m inference_endpoint.testing.echo_server --host 0.0.0.0 --port 9000
# Check help
python3 -m inference_endpoint.testing.echo_server --help

The echo server expects OpenAI-compatible format but simplifies it:
What workers send (internal):
{
"prompt": "Your query text",
"model": "model-name",
"max_completion_tokens": 50,
"stream": false
}

The HTTP client's OpenAI adapter converts this to proper OpenAI format with a messages array internally.
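For reference, a standard OpenAI chat-completions request built from the fields above would look roughly like this (a sketch of the public OpenAI shape, not necessarily this client's exact internal output):

```json
{
  "model": "model-name",
  "messages": [
    {"role": "user", "content": "Your query text"}
  ],
  "max_completion_tokens": 50,
  "stream": false
}
```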
Error: Connection failed
Solution: Ensure echo server is running and port is correct
Error: prompt not found in query.data
Solution: Use "prompt" format in Query data, not "messages" (client converts it)
Error: Timeout (>60s)
Solution: Echo server might not be running, check logs at /tmp/echo_server.log
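Before digging into logs, a quick socket check tells you whether anything is listening at all. A minimal sketch, assuming the default local setup on port 8765 (match whatever --port you used):

```shell
# Quick check: is anything listening on the echo server's port?
python3 - <<'EOF'
import socket
s = socket.socket()
s.settimeout(2)
try:
    s.connect(("localhost", 8765))
    print("port 8765: reachable")
except OSError:
    print("port 8765: not reachable")
finally:
    s.close()
EOF
```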
# 1. Start echo server
python3 -m inference_endpoint.testing.echo_server --port 8000 &
# 2. Generate fresh dataset if needed
python scripts/create_dummy_dataset.py
# 3. Set HF_TOKEN if using non-public models (optional)
export HF_TOKEN=your_huggingface_token
# 4. Test probe first
inference-endpoint probe --endpoints http://localhost:8000 --model Qwen/Qwen3-8B --requests 10
# 5. Run benchmark with report generation
inference-endpoint -v benchmark offline \
--endpoints http://localhost:8000 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--workers 4 \
--report-dir benchmark_report
# 6. Stop server
pkill -f echo_server

# Offline (max throughput)
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--report-dir offline_report
# Online (Poisson distribution)
inference-endpoint benchmark online \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern poisson \
--target-qps 500 \
--report-dir online_report
# With explicit sample count
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--num-samples 500
# Force streaming on for offline mode (to test TTFT metrics)
inference-endpoint benchmark offline \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--streaming on
# Concurrency mode (fixed concurrent requests)
inference-endpoint benchmark online \
--endpoints http://localhost:8765 \
--model Qwen/Qwen3-8B \
--dataset tests/datasets/dummy_1k.jsonl \
--load-pattern concurrency \
--concurrency 32

Key Requirements:
- Model name is required for all benchmark and probe commands
- Online mode requires --load-pattern to specify the scheduler type (poisson or concurrency); --load-pattern poisson requires --target-qps, and --load-pattern concurrency requires --concurrency
- Set the HF_TOKEN environment variable for non-public models (public models like Qwen/Qwen3-8B don't need it)
Sample Count Control:
- Use --duration 0 when you want a local test to stop after exhausting the dataset instead of running for the default timed duration
- Sample priority: --num-samples > dataset size (when --duration 0) > calculated (target_qps × duration)
- Default duration: 600000 ms (10 minutes)
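The calculated fallback in the priority rules above can be sanity-checked with plain shell arithmetic, using the default duration:

```shell
# Expected sample count when neither --num-samples nor --duration 0 is set:
# target_qps x duration (default duration is 600000 ms, i.e. 600 s)
target_qps=100
duration_ms=600000
echo $(( target_qps * duration_ms / 1000 ))   # prints 60000
```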
Testing & Debugging:
- Use -v for INFO logging, -vv for DEBUG
- Echo server mirrors prompts back - perfect for quick testing without real inference
- Press Ctrl+C to gracefully interrupt benchmarks
- Default test dataset: tests/datasets/dummy_1k.jsonl (1000 samples)
Advanced:
- Streaming: auto (default), on, or off; auto enables streaming for online mode and disables it for offline mode
- Use --report-dir for detailed metrics reports with TTFT, TPOT, and token analysis
- Dataset format is auto-inferred from the file extension