113 changes: 112 additions & 1 deletion examples/llm-edge-benchmark-suite/single_task_bench/README.md
@@ -1,2 +1,113 @@
Large Language Model Edge Benchmark Suite: Implementation on KubeEdge-lanvs
# llm-edge-benchmark-suite single_task_bench

This guide covers the complete setup, configuration, and execution process for running the Large Language Model (LLM) benchmarking suite on [Ianvs](https://github.com/kubeedge/ianvs), KubeEdge's distributed synergy AI benchmarking framework.

This specific environment is configured to run the **Single Task Learning** paradigm, evaluating LLM inference performance (latency, throughput, and Time-To-First-Token) using `llama-cpp-python` with quantized models (e.g., Qwen 1.5 0.5B GGUF).

---

## 📋 Prerequisites

Before running the benchmark, ensure you have the following ready:
1. **Ianvs Framework**: Installed and configured.
2. **Virtual Environment**: Your active Ianvs virtual environment (e.g., `ianvs_env`).
3. **C++ Build Tools**: Required for compiling `llama.cpp` bindings (e.g., `build-essential` on Ubuntu).

---

## 🛠️ Step 1: Environment Setup

First, activate your Ianvs virtual environment:
```bash
source /path/to/your/ianvs_env/bin/activate
```

Navigate to the benchmark directory:
```bash
cd /home/nishant/LOCAL_DISK_D/ianvs/examples/llm-edge-benchmark-suite/single_task_bench
```

Install the required dependencies using the provided `requirements.txt`:
```bash
pip install -r requirements.txt
```
*(Note: If `requirements.txt` is missing, ensure you install `llama-cpp-python>=0.2.20`, `torch`, `transformers`, `pyyaml`, and `psutil`).*
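
To confirm the environment before running anything, a quick import smoke test (package names as listed in the note above) can save a failed run later:

```python
# Verify that the core dependencies import cleanly in the active venv.
import llama_cpp
import torch
import transformers
import yaml    # provided by pyyaml
import psutil

print("llama-cpp-python:", llama_cpp.__version__)
print("torch:", torch.__version__)
```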

---

## 📥 Step 2: Download the Model

The benchmark requires a local `.gguf` model file. By default, this suite uses the `Qwen1.5-0.5B-Chat` model.

Create the target directory and download with resume support (`wget -c`), so an interrupted transfer can be resumed instead of leaving a truncated file:

```bash
mkdir -p /home/nishant/LOCAL_DISK_D/ianvs/models/qwen
wget -c -O /home/nishant/LOCAL_DISK_D/ianvs/models/qwen/qwen_1_5_0_5b.gguf https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF/resolve/main/qwen1_5-0_5b-chat-q4_k_m.gguf
```
**Verification:** Ensure the downloaded file is approximately `398MB` to confirm it is not corrupted.
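
To script that check, a minimal sketch (same path and expected size as above):

```python
import os

# Path from the wget command in Step 2; adjust if you downloaded elsewhere.
MODEL_PATH = "/home/nishant/LOCAL_DISK_D/ianvs/models/qwen/qwen_1_5_0_5b.gguf"

size_mb = os.path.getsize(MODEL_PATH) / (1024 ** 2)
print(f"Model size: {size_mb:.1f} MB")  # a complete q4_k_m download is ~398 MB
if size_mb < 390:
    raise SystemExit("File looks truncated -- re-run 'wget -c' to resume the download.")
```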

---

## ⚙️ Step 3: Configuration Alignment

Ianvs requires strict configuration alignment. Double-check the following YAML files to ensure paths and paradigms are correct:

### 1. Test Environment (`testenv/testenv.yaml`)
Ensure all dataset paths are set to **absolute paths**. Relative paths will cause parsing errors.
```yaml
dataset:
  train_data: "/home/nishant/LOCAL_DISK_D/ianvs/dataset/data.jsonl"  # Must be absolute
```

### 2. Algorithm Configuration (`testalgorithms/algorithm.yaml`)
Ensure the `paradigm_type` is correctly set for standard single-task inference, and the model path is absolute:
```yaml
paradigm_type: singletasklearning  # Do NOT use 'singletasklearningwithcompression' here
modules:
  basemodel:
    model_path: "/home/nishant/LOCAL_DISK_D/ianvs/models/qwen/qwen_1_5_0_5b.gguf"
```
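
Because both files must agree on absolute paths, a small sanity check can catch mistakes before a run. This is an illustrative script, not part of the suite; it assumes the two YAML file names shown above and flags string values that look like relative paths (URLs may trigger false positives):

```python
import yaml  # provided by pyyaml

def warn_relative_paths(node, trail):
    """Recursively flag string values that contain '/' but are not absolute."""
    if isinstance(node, dict):
        for key, value in node.items():
            warn_relative_paths(value, f"{trail}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            warn_relative_paths(value, f"{trail}[{i}]")
    elif isinstance(node, str) and "/" in node and not node.startswith("/"):
        print(f"WARNING: possibly relative path at {trail}: {node}")

for cfg in ("testenv/testenv.yaml", "testalgorithms/algorithm.yaml"):
    with open(cfg) as f:
        warn_relative_paths(yaml.safe_load(f), cfg)
```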

---

## 🧩 Step 4: Algorithm Script (`basemodel.py`)

The Ianvs `SingleTaskLearning` paradigm requires the model class to strictly adhere to the machine learning lifecycle contract. Your `LlamaCppModel` class in `basemodel.py` must include:

1. **Pipeline Methods:** `preprocess` and `postprocess` with optional (defaulted) arguments, so Ianvs can invoke them with varying signatures without raising `TypeError`.
2. **Training Bypass:** A safe no-op `train` method, since inference runs on pre-trained weights.
3. **TTFT Measurement:** A `predict` method that calls the model with `stream=True` so that `prefill_latency` (Time-To-First-Token) can be measured accurately; see the condensed sketch after the snippet below.

*Example Snippet of required methods:*
```python
def preprocess(self, data=None, **kwargs):
    # Pass-through: prompts are already plain text
    return data

def postprocess(self, predict_output=None, **kwargs):
    # Pass-through: metric computation happens downstream in Ianvs
    return predict_output

def train(self, train_data, valid_data=None, **kwargs):
    # No-op: inference uses pre-trained GGUF weights; return the model
    # path so the Ianvs pipeline has an artifact to hand to inference
    return kwargs.get("model_path", "")
```
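
For requirement 3, here is a condensed sketch of the streaming measurement. It mirrors the `predict` logic shown in the diff further below; `measure_ttft` is an illustrative helper name, not part of the suite:

```python
import time
from llama_cpp import Llama

def measure_ttft(model: Llama, prompt: str, max_tokens: int = 32):
    """Stream tokens so the first chunk timestamps the end of the prefill phase."""
    start = time.time()
    text, prefill_ms, first = "", 0.0, True
    for chunk in model(prompt=prompt, max_tokens=max_tokens, stream=True):
        if first:
            prefill_ms = (time.time() - start) * 1000  # Time-To-First-Token, in ms
            first = False
        text += chunk["choices"][0].get("text", "")
    total_ms = (time.time() - start) * 1000
    return text, prefill_ms, total_ms
```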

---

## 🚀 Step 5: Execution

Once the setup and configurations are validated, run the benchmarking job from your terminal:

```bash
ianvs -f /home/nishant/LOCAL_DISK_D/ianvs/examples/llm-edge-benchmark-suite/single_task_bench/benchmarkingjob.yaml
```

### Expected Output
The Ianvs core will parse the configurations, load the `LlamaCppModel`, and execute the inference loop. Upon completion, a `workspace` directory will be generated containing the logs and a final leaderboard table (`rank.csv`).

You should see an output table similar to this:
```text
+------+-----------+---------+------------+-----------------+--------------------+---------------+
| rank | algorithm | latency | throughput | prefill_latency | paradigm           | basemodel     |
+------+-----------+---------+------------+-----------------+--------------------+---------------+
| 1    | llama-cpp | 171.29  | 0.0058     | 171.27          | singletasklearning | LlamaCppModel |
+------+-----------+---------+------------+-----------------+--------------------+---------------+
```
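
Since `rank.csv` is plain CSV, you can also inspect it programmatically. A sketch (the exact workspace layout depends on your `benchmarkingjob.yaml` settings, so the glob pattern below is an assumption):

```python
import glob
import pandas as pd

# Search the generated workspace for the leaderboard file.
matches = glob.glob("workspace/**/rank.csv", recursive=True)
if not matches:
    raise SystemExit("No rank.csv found -- check the workspace path in benchmarkingjob.yaml.")

df = pd.read_csv(matches[0])
print(df.sort_values("latency").to_string(index=False))
```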
12 changes: 12 additions & 0 deletions examples/llm-edge-benchmark-suite/single_task_bench/requirements.txt
@@ -0,0 +1,12 @@
# LLM Core Execution
llama-cpp-python>=0.2.20

# Machine Learning & Neural Network Basics
torch>=2.0.0
transformers>=4.35.0
numpy>=1.24.0

# Ianvs Utilities & Data Handling
pyyaml>=6.0
pandas>=2.0.0
requests>=2.31.0
basemodel.py
@@ -22,6 +22,7 @@ def __init__(self, **kwargs):
quantization_type = kwargs.get("quantization_type", None)
if quantization_type:
logging.info(f"Using quantization type: {quantization_type}")

# Init LLM model
self.model = Llama(
model_path=model_path,
@@ -35,22 +36,24 @@ def __init__(self, **kwargs):
embedding=kwargs.get("embedding", False),
)

# 1. FIXED: Optional arguments for Ianvs pipeline
def preprocess(self, data=None, **kwargs):
"""
Pass-through for text data.
"""
return data

def predict(self, data, input_shape=None, **kwargs):
data = data[:10]
process = psutil.Process(os.getpid())
start_time = time.time()

results = []
total_times = []
prefill_latencies = []
mem_usages = []

for prompt in data:
prompt_start_time = time.time()

f = io.StringIO()
with redirect_stderr(f):
output = self.model(
data = data[:10]
process = psutil.Process(os.getpid())

results = []

for prompt in data:
prompt_start_time = time.time()

# Run model with stream=True to measure exact TTFT
output_stream = self.model(
prompt=prompt,
max_tokens=kwargs.get("max_tokens", 32),
stop=kwargs.get("stop", ["Q:", "\n"]),
@@ -59,31 +62,44 @@ def predict(self, data, input_shape=None, **kwargs):
top_p=kwargs.get("top_p", 0.95),
top_k=kwargs.get("top_k", 40),
repeat_penalty=kwargs.get("repeat_penalty", 1.1),
stream=True  # Stream tokens so TTFT can be measured directly
)
stdout_output = f.getvalue()

# parse timing info
timings = self._parse_timings(stdout_output)
prefill_latency = timings.get('prompt_eval_time', 0.0) # ms
generated_text = output['choices'][0]['text']

prompt_end_time = time.time()
prompt_total_time = (prompt_end_time - prompt_start_time) * 1000 # convert to ms

result_with_time = {
"generated_text": generated_text,
"total_time": prompt_total_time,
"prefill_latency": prefill_latency,
"mem_usage":process.memory_info().rss,
}

results.append(result_with_time)

predict_dict = {
"results": results,
}

return predict_dict

generated_text = ""
prefill_latency = 0.0
first_token = True

# Iterate through the stream as the model generates it
for chunk in output_stream:
if first_token:
# Elapsed time to the first streamed chunk is the prefill latency (TTFT)
prefill_latency = (time.time() - prompt_start_time) * 1000
first_token = False

# Piece the text back together
if "text" in chunk["choices"][0]:
generated_text += chunk["choices"][0]["text"]

prompt_end_time = time.time()
prompt_total_time = (prompt_end_time - prompt_start_time) * 1000 # convert to ms

result_with_time = {
"generated_text": generated_text,
"total_time": prompt_total_time,
"prefill_latency": prefill_latency,
"mem_usage": process.memory_info().rss,
}

results.append(result_with_time)

return {"results": results}

# 2. FIXED: Optional arguments for Ianvs pipeline
def postprocess(self, predict_output=None, **kwargs):
"""
Pass-through for prediction output.
"""
return predict_output

def _parse_timings(self, stdout_output):
import re
@@ -131,5 +147,11 @@ def save(self, model_path):
def load(self, model_url):
pass

# 3. FIXED: Safe no-op for training pre-trained models
def train(self, train_data, valid_data=None, **kwargs):
return
"""
Dummy train method.
Returns the model path to satisfy Ianvs pipeline requirements.
"""
logging.info("Training step bypassed: Using pre-trained weights for LLM inference.")
return kwargs.get("model_path", "")