Commit cf59767

Merge pull request #210 from WasmEdge/hydai/fix_llm

2 parents 709c0a5 + 37aed90 commit cf59767

1 file changed: docs/develop/rust/wasinn/llm_inference.md (+60 −25)
@@ -4,26 +4,61 @@ sidebar_position: 1
# Llama 2 inference

-WasmEdge now supports running llama2 series of models in Rust. We will use [this example project](https://github.com/second-state/llama-utils/tree/main/chat) to show how to make AI inferences with the llama2 model in WasmEdge and Rust.
-
-WasmEdge now supports Llama2, Codellama-instruct, BELLE-Llama, Mistral-7b-instruct, Wizard-vicuna, OpenChat 3.5B and raguile-chatml.
+WasmEdge now supports running the llama2 series of models in Rust. We will use [this example project](https://github.com/second-state/LlamaEdge/tree/main/chat) to show how to make AI inferences with the llama2 model in WasmEdge and Rust.
+
+WasmEdge now supports the following models:
+
+1. Llama-2-7B-Chat
+1. Llama-2-13B-Chat
+1. CodeLlama-13B-Instruct
+1. Mistral-7B-Instruct-v0.1
+1. Mistral-7B-Instruct-v0.2
+1. MistralLite-7B
+1. OpenChat-3.5-0106
+1. OpenChat-3.5-1210
+1. OpenChat-3.5
+1. Wizard-Vicuna-13B-Uncensored-GGUF
+1. TinyLlama-1.1B-Chat-v1.0
+1. Baichuan2-13B-Chat
+1. OpenHermes-2.5-Mistral-7B
+1. Dolphin-2.2-Yi-34B
+1. Dolphin-2.6-Mistral-7B
+1. Samantha-1.2-Mistral-7B
+1. Samantha-1.11-CodeLlama-34B
+1. WizardCoder-Python-7B-V1.0
+1. Zephyr-7B-Alpha
+1. WizardLM-13B-V1.0-Uncensored
+1. Orca-2-13B
+1. Neural-Chat-7B-v3-1
+1. Yi-34B-Chat
+1. Starling-LM-7B-alpha
+1. DeepSeek-Coder-6.7B
+1. DeepSeek-LLM-7B-Chat
+1. SOLAR-10.7B-Instruct-v1.0
+1. Mixtral-8x7B-Instruct-v0.1
+1. Nous-Hermes-2-Mixtral-8x7B-DPO
+1. Nous-Hermes-2-Mixtral-8x7B-SFT
+
+And more. Please check [the supported models](https://github.com/second-state/LlamaEdge/blob/main/models.md) for details.

## Prerequisite

Besides the [regular WasmEdge and Rust requirements](../../rust/setup.md), please make sure that you have the [Wasi-NN plugin with ggml installed](../../../start/install.md#wasi-nn-plug-in-with-ggml-backend).
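
If the plug-in is not installed yet, the WasmEdge installer can fetch it in one line; a minimal sketch, assuming the `wasi_nn-ggml` plug-in name used by the linked install guide:

```bash
# Install WasmEdge together with the WASI-NN plug-in with ggml backend
# (plug-in name assumed from the install guide linked above)
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasi_nn-ggml
```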

## Quick start

-Because the example already includes a compiled WASM file from the Rust code, we could use WasmEdge CLI to execute the example directly. First, git clone the `llama-utils` repo.
+Because the example already includes a compiled WASM file from the Rust code, we can use the WasmEdge CLI to execute the example directly.
+
+First, get the latest llama-chat wasm application:

```bash
-curl -LO https://github.com/second-state/llama-utils/raw/main/chat/llama-chat.wasm
+curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-chat.wasm
```

-Next, let's get the model. In this example, we are going to use the llama2 7b chat model in GGUF format. You can also use other kinds of llama2 models, check out [here](https://github.com/second-state/llama-utils/blob/main/chat/README.md#get-model).
+Next, let's get the model. In this example, we are going to use the llama2 7b chat model in GGUF format. You can also use other kinds of llama2 models; check out [here](https://github.com/second-state/llamaedge/blob/main/chat/README.md#get-model).

```bash
-curl -LO https://huggingface.co/wasmedge/llama2/blob/main/llama-2-7b-chat-q5_k_m.gguf
+curl -LO https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf
```

Run the inference application in WasmEdge.
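
As a sketch, assuming `llama-chat.wasm` and the downloaded GGUF file are both in the current directory, the invocation follows the `--nn-preload default:GGML:AUTO:<model-file>` pattern used throughout this page:

```bash
# Map the current directory into the wasm sandbox, preload the model
# under the alias `default`, and start the chat application
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
```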
@@ -47,10 +82,10 @@ The total cost of four apples is 20 dollars.

## Build and run

-Let's build the wasm file from the rust source code. First, git clone the `llama-utils` repo.
+Let's build the wasm file from the Rust source code. First, git clone the `llamaedge` repo.

```bash
-git clone https://github.com/second-state/llama-utils.git
+git clone https://github.com/second-state/llamaedge.git
cd chat
```
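
Then build it; a sketch of the usual Rust-to-wasm compile, assuming the `wasm32-wasi` target has been added with `rustup target add wasm32-wasi`:

```bash
# Compile the chat application to WebAssembly;
# the output lands in target/wasm32-wasi/release/
cargo build --target wasm32-wasi --release
```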
@@ -108,18 +143,18 @@ You can configure the chat inference application through CLI options.
Print help
```

-The `--prompt-template` option is perhaps the most interesting. It allows the application to support different open source LLM models beyond llama2.
+The `--prompt-template` option is perhaps the most interesting. It allows the application to support different open source LLM models beyond llama2.

-| Template name | Model | Download |
-| ------------ | ------------------------------ | --- |
-| llama-2-chat | [The standard llama2 chat model](https://ai.meta.com/llama/) | [7b](https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf) |
-| codellama-instruct | [CodeLlama](https://about.fb.com/news/2023/08/code-llama-ai-for-coding/) | [7b](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGUF/resolve/main/codellama-7b-instruct.Q5_K_M.gguf) |
-| mistral-instruct-v0.1 | [Mistral](https://mistral.ai/) | [7b](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf) |
-| mistrallite | [Mistral Lite](https://huggingface.co/amazon/MistralLite) | [7b](https://huggingface.co/TheBloke/MistralLite-7B-GGUF/resolve/main/mistrallite.Q5_K_M.gguf) |
-| openchat | [OpenChat](https://github.com/imoneoi/openchat) | [7b](https://huggingface.co/TheBloke/openchat_3.5-GGUF/resolve/main/openchat_3.5.Q5_K_M.gguf) |
-| belle-llama-2-chat | [BELLE](https://github.com/LianjiaTech/BELLE) | [13b](https://huggingface.co/second-state/BELLE-Llama2-13B-Chat-0.4M-GGUF/resolve/main/BELLE-Llama2-13B-Chat-0.4M-ggml-model-q4_0.gguf) |
-| vicuna-chat | [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) | [7b](https://huggingface.co/TheBloke/vicuna-7B-v1.5-GGUF/resolve/main/vicuna-7b-v1.5.Q5_K_M.gguf) |
-| chatml | [ChatML](https://huggingface.co/chargoddard/rpguild-chatml-13b) | [13b](https://huggingface.co/TheBloke/rpguild-chatml-13B-GGUF/resolve/main/rpguild-chatml-13b.Q5_K_M.gguf) |
+| Template name | Model | Download |
+| ------------ | ------------------------------ | --- |
+| llama-2-chat | [The standard llama2 chat model](https://ai.meta.com/llama/) | [7b](https://huggingface.co/wasmedge/llama2/resolve/main/llama-2-7b-chat-q5_k_m.gguf) |
+| codellama-instruct | [CodeLlama](https://about.fb.com/news/2023/08/code-llama-ai-for-coding/) | [7b](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGUF/resolve/main/codellama-7b-instruct.Q5_K_M.gguf) |
+| mistral-instruct-v0.1 | [Mistral](https://mistral.ai/) | [7b](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf) |
+| mistrallite | [Mistral Lite](https://huggingface.co/amazon/MistralLite) | [7b](https://huggingface.co/TheBloke/MistralLite-7B-GGUF/resolve/main/mistrallite.Q5_K_M.gguf) |
+| openchat | [OpenChat](https://github.com/imoneoi/openchat) | [7b](https://huggingface.co/TheBloke/openchat_3.5-GGUF/resolve/main/openchat_3.5.Q5_K_M.gguf) |
+| belle-llama-2-chat | [BELLE](https://github.com/LianjiaTech/BELLE) | [13b](https://huggingface.co/second-state/BELLE-Llama2-13B-Chat-0.4M-GGUF/resolve/main/BELLE-Llama2-13B-Chat-0.4M-ggml-model-q4_0.gguf) |
+| vicuna-chat | [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/) | [7b](https://huggingface.co/TheBloke/vicuna-7B-v1.5-GGUF/resolve/main/vicuna-7b-v1.5.Q5_K_M.gguf) |
+| chatml | [ChatML](https://huggingface.co/chargoddard/rpguild-chatml-13b) | [13b](https://huggingface.co/TheBloke/rpguild-chatml-13B-GGUF/resolve/main/rpguild-chatml-13b.Q5_K_M.gguf) |
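
For example, to chat with one of the non-llama2 models in the table above, preload its GGUF file and pass the matching template name. A sketch, assuming the Mistral file from the table has already been downloaded:

```bash
# Switch from the llama2 default to the Mistral instruct template
wasmedge --dir .:. --nn-preload default:GGML:AUTO:mistral-7b-instruct-v0.1.Q5_K_M.gguf llama-chat.wasm \
  --prompt-template mistral-instruct-v0.1
```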
Furthermore, the following command tells WasmEdge to print out logs and statistics of the model at runtime.
@@ -155,11 +190,11 @@ wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf l

## Understand the code

-The [main.rs](https://github.com/second-state/llama-utils/blob/main/chat/src/main.rs) is the full Rust code to create an interactive chatbot using a LLM. The Rust program manages the user input, tracks the conversation history, transforms the text into the llama2 and other model’s chat templates, and runs the inference operations using the WASI NN standard API. The code logic for the chat interaction is somewhat complex. In this section, we will use the [simple example](https://github.com/second-state/llama-utils/tree/main/simple) to explain how to set up and perform one inference round trip. Here is how you use the simple example.
+The [main.rs](https://github.com/second-state/llamaedge/blob/main/chat/src/main.rs) contains the full Rust code to create an interactive chatbot using an LLM. The Rust program manages the user input, tracks the conversation history, transforms the text into the chat templates of llama2 and other models, and runs the inference operations using the WASI-NN standard API. The code logic for the chat interaction is somewhat complex. In this section, we will use the [simple example](https://github.com/second-state/llamaedge/tree/main/simple) to explain how to set up and perform one inference round trip. Here is how you use the simple example.

```bash
# Download the compiled simple inference wasm
-curl -LO https://github.com/second-state/llama-utils/raw/main/simple/llama-simple.wasm
+curl -LO https://github.com/second-state/llamaedge/releases/latest/download/llama-simple.wasm

# Give it a prompt and ask it to use the model to complete it.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-simple.wasm \
@@ -168,7 +203,7 @@ wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf ll
output: in 1942, when he led the team that developed the first atomic bomb, which was dropped on Hiroshima, Japan in 1945.
```

-First, let's parse command line arguments to customize the chatbot's behavior using `Command` struct. It extracts the following parameters: `prompt` (a prompt that guides the conversation), `model_alias` (a list for the loaded model), and `ctx_size` (the size of the chat context).
+First, let's parse the command line arguments to customize the chatbot's behavior using the `Command` struct. It extracts the following parameters: `prompt` (a prompt that guides the conversation), `model_alias` (an alias for the loaded model), and `ctx_size` (the size of the chat context).
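
Conceptually, that parsing step looks like the rough sketch below (assuming the `clap` crate; the exact flag names and defaults in `main.rs` may differ):

```rust
// A sketch of the argument parsing, not the verbatim main.rs code.
use clap::{Arg, Command};

fn main() {
    let matches = Command::new("llama-simple")
        // The prompt that guides the conversation
        .arg(Arg::new("prompt").short('p').long("prompt").required(true))
        // Alias of the model registered via --nn-preload (illustrative default)
        .arg(Arg::new("model_alias").short('m').long("model-alias").default_value("default"))
        // Size of the chat context (illustrative default)
        .arg(Arg::new("ctx_size").short('c').long("ctx-size").default_value("512"))
        .get_matches();

    let prompt = matches.get_one::<String>("prompt").unwrap();
    let model_alias = matches.get_one::<String>("model_alias").unwrap();
    let ctx_size: usize = matches.get_one::<String>("ctx_size").unwrap().parse().unwrap();
    println!("prompt={prompt} model={model_alias} ctx_size={ctx_size}");
}
```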

```rust
fn main() -> Result<(), String> {
@@ -272,7 +307,7 @@ println!("\noutput: {}", output);

## Resources

-* If you're looking for multi-turn conversations with llama 2 models, please check out the above mentioned chat example source code [here](https://github.com/second-state/llama-utils/tree/main/chat).
-* If you want to construct OpenAI-compatible APIs specifically for your llama2 model, or the Llama2 model itself, please check out the source code [for the API server](https://github.com/second-state/llama-utils/tree/main/api-server).
+* If you're looking for multi-turn conversations with llama 2 models, please check out the above-mentioned chat example source code [here](https://github.com/second-state/llamaedge/tree/main/chat).
+* If you want to construct OpenAI-compatible APIs for your own llama2 model, or for the Llama2 model itself, please check out the source code [for the API server](https://github.com/second-state/llamaedge/tree/main/api-server).
* To learn more, please check out [this article](https://medium.com/stackademic/fast-and-portable-llama2-inference-on-the-heterogeneous-edge-a62508e82359).