
Commit 4b32284

mikekgfb authored and malfet committed
Update README.md (#112)
Update readme
Update README.md (#113)
update README.md
Update README.md (#114)
Update README.md (#115)
Update Readme.md
Update README.md (#116)
Update README.md
Update README.md (#118)
Update README.md
Update README.md (#121)
Update README based on #107
Update README.md (#123)
Add information about params-path to README, update spelling of torchat
1 parent a0bb484 commit 4b32284

File tree: 1 file changed (+125, -21 lines)

README.md

+125-21
@@ -4,10 +4,10 @@
 items that are not factual. If you find an item that is incorrect, please tag as an issue, so we can triage and determine whether to fix,
 or drop from our initial release.*

-# TorchAt *NORTHSTAR*
+# torchat *NORTHSTAR*
 A repo for building and using llama on servers, desktops and mobile.

-The TorchAt repo enables model inference of llama models (and other LLMs) on servers, desktop and mobile devices.
+The torchat repo enables model inference of llama models (and other LLMs) on servers, desktop and mobile devices.
 For a list of devices, see below, under *SUPPORTED SYSTEMS*.

 A goal of this repo, and the design of the PT2 components was to offer seamless integration and consistent workflows.
@@ -29,12 +29,12 @@ Featuring:
 and backend-specific mobile runtimes ("delegates", such as CoreML and Hexagon).

 The model definition (and much more!) is adopted from gpt-fast, so we support the same models. As new models are supported by gpt-fast,
-bringing them into TorchAt should be straight forward. In addition, we invite community contributions
+bringing them into torchat should be straightforward. In addition, we invite community contributions.

 # Getting started

 Follow the `gpt-fast` [installation instructions](https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#installation).
-Because TorchAt was designed to showcase the latest and greatest PyTorch 2 features for Llama (and related llama-style) models, many of the features used in TorchAt are hot off the press. [Download PyTorch nightly](https://pytorch.org/get-started/locally/) with the latest steaming hot PyTorch 2 features.
+Because torchat was designed to showcase the latest and greatest PyTorch 2 features for Llama (and related llama-style) models, many of the features used in torchat are hot off the press. [Download PyTorch nightly](https://pytorch.org/get-started/locally/) with the latest steaming hot PyTorch 2 features.


 Install sentencepiece and huggingface_hub
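The install command itself falls outside this hunk; a minimal sketch of the step named on the last line above, assuming pip as the installer, would be:

```
pip install sentencepiece huggingface_hub
```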
@@ -67,6 +67,10 @@ export MODEL_DOWNLOAD=meta-llama/Llama-2-7b-chat-hf
 While we strive to support a broad range of models, we can't test all models. Consequently, we classify supported models as tested ✅,
 work in progress 🚧 and not tested. We invite community contributions of both new models, as well as test reports.

+Some common models are recognized by torchat based on their filename (via `Transformer.from_name()`). For models not recognized by
+filename, you can construct a model by initializing the `ModelArgs` dataclass that controls model construction from a parameter JSON file,
+specified using `params-path ${PARAMS_PATH}` and containing the appropriate model parameters.
+
 | Model | tested | eager | torch.compile | AOT Inductor | ET Runtime | Fits on Mobile |
 |-----|--------|-------|-----|-----|-----|-----|
 tinyllamas/stories15M | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
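To make the params-path mechanism described in the hunk above concrete, here is a minimal sketch. The JSON field names are hypothetical placeholders (check the `ModelArgs` dataclass in model.py for the actual schema), the `--params-path` spelling of the flag is inferred from the text rather than confirmed, and it assumes `generate.py` accepts `--checkpoint-path` the same way `export.py` does.

```
# Hypothetical sketch: the field names below are placeholders, not the authoritative ModelArgs schema.
cat > ${MODEL_DIR}/params.json <<'EOF'
{
  "dim": 4096,
  "n_layers": 32,
  "n_heads": 32,
  "vocab_size": 32000
}
EOF

# Pass the parameter file when the checkpoint is not recognized by Transformer.from_name().
python generate.py --checkpoint-path ${MODEL_PATH} --params-path ${MODEL_DIR}/params.json --prompt "Hello my name is"
```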
@@ -81,6 +85,7 @@ codellama/CodeLlama-34b-Python-hf | -| ✅ | ✅ | ✅ | ✅ | ❌ |
 mistralai/Mistral-7B-v0.1 | 🚧 | ✅ | ✅ | ✅ | ✅ | ❹ |
 mistralai/Mistral-7B-Instruct-v0.1 | - | ✅ | ✅ | ✅ | ✅ | ❹ |
 mistralai/Mistral-7B-Instruct-v0.2 | - | ✅ | ✅ | ✅ | ✅ | ❹ |
+Llama3 | 🚧 | ✅ | ✅ | ✅ | ✅ | ❹ |

 *Key:* ✅ works correctly; 🚧 work in progress; ❌ not supported; ❹ requires 4bit groupwise quantization; 📵 not on mobile phone (may fit some high-end devices such as tablets);

@@ -89,10 +94,10 @@ mistralai/Mistral-7B-Instruct-v0.2 | - | ✅ | ✅ | ✅ | ✅ | ❹ |
 ### More downloading


-First cd into TorchAt. We first create a directory for stories15M and download the model and tokenizers.
+First cd into torchat. We first create a directory for stories15M and download the model and tokenizers.
 We show how to download @Andrej Karpathy's stories15M tiny llama-style model that were used in llama2.c. Advantageously,
 stories15M is both a great example and quick to download and run across a range of platforms, ideal for introductions like this
-README and for [testing](https://github.com/pytorch-labs/TorchAt/blob/main/.github/workflows). We will be using it throughout
+README and for [testing](https://github.com/pytorch-labs/torchat/blob/main/.github/workflows). We will be using it throughout
 this introduction as our running example.

 ```
@@ -122,11 +127,11 @@ We use several variables in this example, which may be set as a preparatory step
 name of the directory holding the files for the corresponding model. You *must* follow this convention to
 ensure correct operation.

-* `MODEL_OUT` is the location where we store model and tokenizer information for a particular model. We recommend `checkpoints/${MODEL_NAME}`
+* `MODEL_DIR` is the location where we store model and tokenizer information for a particular model. We recommend `checkpoints/${MODEL_NAME}`
 or any other directory you already use to store model information.

 * `MODEL_PATH` describes the location of the model. Throughout the description
-herein, we will assume that MODEL_PATH starts with a subdirectory of the TorchAt repo
+herein, we will assume that MODEL_PATH starts with a subdirectory of the torchat repo
 named checkpoints, and that it will contain the actual model. In this case, the MODEL_PATH will thus
 be of the form ${MODEL_OUT}/model.{pt,pth}. (Both the extensions `pt` and `pth`
 are used to describe checkpoints. In addition, model may be replaced with the name of the model.)
@@ -143,7 +148,7 @@ You can set these variables as follows for the exemplary model15M model from And
 MODEL_NAME=stories15M
 MODEL_DIR=checkpoints/${MODEL_NAME}
 MODEL_PATH=${MODEL_OUT}/stories15M.pt
-MODEL_OUT=~/TorchAt-exports
+MODEL_OUT=~/torchat-exports
 ```

 When we export models with AOT Inductor for servers and desktops, and Executorch for mobile and edge devices,
@@ -179,13 +184,20 @@ environment:
 ./run ${MODEL_OUT}/model.{so,pte} -z ${MODEL_OUT}/tokenizer.bin
 ```

+### llama3 tokenizer
+
+Add the option to load the tiktoken tokenizer:
+```
+--tiktoken
+```
+
 # Generate Text

 ## Eager Execution

 Model definition in model.py, generation code in generate.py. The
 model checkpoint may have extensions `pth` (checkpoint and model definition) or `pt` (model checkpoint).
-At present, we always use the TorchAt model for export and import the checkpoint into this model definition
+At present, we always use the torchat model for export and import the checkpoint into this model definition
 because we have tested that model with the export descriptions described herein.

 ```
@@ -223,7 +235,7 @@ quantization to achieve this, as described below.

 We export the model with the export.py script. Running this script requires you first install executorch with pybindings, see [here](#setting-up-executorch-and-runner-et).
 At present, when exporting a model, the export command always uses the
-xnnpack delegate to export. (Future versions of TorchAt will support additional
+xnnpack delegate to export. (Future versions of torchat will support additional
 delegates such as Vulkan, CoreML, MPS, HTP in addition to Xnnpack as they are released for Executorch.)


@@ -250,8 +262,32 @@ device supported by Executorch, most models need to be compressed to
 fit in the target device's memory. We use quantization to achieve this.


+# llama3 support
+
+How to obtain a snapshot: to be filled in when published by Meta (we use an internal snapshot).
+
+Enable the llama3 tokenizer with the option `--tiktoken` (see also the discussion under tokenizer).
+
+Enable all export options for llama3 as described below.
+
+Identify and enable a runner/run.cpp with a binary tiktoken tokenizer. (May already be available in OSS.)
+We cannot presently run runner/run.cpp with llama3 until we have a C/C++ tokenizer implementation
+(the initial tiktoken tokenizer is Python).
+
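A usage sketch for the `--tiktoken` option described above, reusing only flags that appear elsewhere in this README; it assumes `generate.py` accepts `--checkpoint-path` as `export.py` does, and the checkpoint path stands in for whichever llama3 snapshot you obtained:

```
# Sketch: eager generation with the tiktoken tokenizer enabled.
python generate.py --checkpoint-path ${MODEL_PATH} --tiktoken --prompt "Hello my name is"
```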
 # Optimizing your model for server, desktop and mobile devices

+## Model precision (dtype precision setting)
+
+You can specify the precision of the model for both generate and export, with eager, torch.compile, AOTI, and ET, for all backends (mobile at present will primarily support fp32):
+```
+python generate.py --dtype [bf16 | fp16 | fp32] ...
+python export.py --dtype [bf16 | fp16 | fp32] ...
+```
+
+Unlike gpt-fast, which uses bfloat16 as its default, torchat uses float32 as the default. As a consequence, you will have to set `--dtype bf16` or `--dtype fp16` on server / desktop for best performance.
+
+
 ## Making your models fit and execute fast!

 Next, we'll show you how to optimize your model for mobile execution
@@ -260,7 +296,7 @@ AOTI). The basic model build for mobile surfaces two issues: Models
 quickly run out of memory and execution can be slow. In this section,
 we show you how to fit your models in the limited memory of a mobile
 device, and optimize execution speed -- both using quantization. This
-is the `TorchAt` repo after all!
+is the `torchat` repo after all!

 For high-performance devices such as GPUs, quantization provides a way
 to reduce the memory bandwidth required to and take advantage of the
@@ -274,6 +310,9 @@ We can specify quantization parameters with the --quantize option. The
 quantize option takes a JSON/dictionary with quantizers and
 quantization options.

+Both generate and export (for both ET and AOTI) accept quantization options. We show only a subset of the combinations
+to avoid combinatorial explosion.
+
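To illustrate that generate accepts the same quantization options as export, a sketch (assuming `generate.py` takes the same `--checkpoint-path`, `-d`, and `--quant` flags shown for `export.py` in the sections that follow):

```
# Sketch: quantize on the fly during eager generation with int4 groupwise quantization.
python generate.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:int4': {'group_size' : 32} }" --prompt "Hello my name is"
```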
 #### Embedding quantization (8 bit integer, channelwise & groupwise)

 *Channelwise quantization*:
@@ -390,27 +429,58 @@ not been optimized for CUDA and CPU targets where the best
 performance requires a group-wise quantized mixed dtype linear
 operator.

+#### 4-bit integer quantization (int4)
+To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use
+of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale.
+```
+python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:int4': {'group_size' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso ]
+```
+
+Now you can run your model with the same command as before:
+```
+python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso ] --prompt "Hello my name is"
+```

 #### 4-bit integer quantization (8da4w)
 To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use
 of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale. We also quantize activations to 8-bit, giving
 this scheme its name (8da4w = 8b dynamically quantized activations with 4b weights), and boost performance.
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:8da4w': {'group_size' : 7} }" --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte
+python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:8da4w': {'group_size' : 7} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ]
 ```

 Now you can run your model with the same command as before:
 ```
-python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte --prompt "Hello my name is"
+python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ] --prompt "Hello my name is"
+```
+
+#### Quantization with GPTQ (gptq)
+
+```
+python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:gptq': {'group_size' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ] # may require additional options, check with AO team
 ```

-#### Quantization with GPTQ (8da4w-gptq)
-TBD.
+Now you can run your model with the same command as before:
+```
+python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ] --prompt "Hello my name is"
+```

-#### Adding additional quantization schemes
+#### Adding additional quantization schemes (hqq)
 We invite contributors to submit established quantization schemes, with accuracy and performance results demonstrating soundness.


+# Loading GGUF models
+
+GGUF is a nascent industry standard format, and we will read fp32, fp16 and some quantized formats (q4_0, and whatever is necessary to read llama2_7b_q4_0.gguf).
+
+```
+--load_gguf <gguf_filename> # all other options as described elsewhere, works for generate and export, for all backends, but cannot be used with --quantize
+```
+
+```
+--dequantize_gguf <gguf_filename> # all other options as described elsewhere, works for generate and export, for all backends, and can be used with --quantize
+```
+
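A usage sketch for the GGUF flags described above (the file name is illustrative, and the other flags are only those shown elsewhere in this README):

```
# Sketch: generate directly from a GGUF checkpoint; --load_gguf cannot be combined with --quantize.
python generate.py --load_gguf ${MODEL_DIR}/llama2_7b_q4_0.gguf --prompt "Hello my name is"
```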
 # Standalone Execution

 In addition to running the exported and compiled models for server, desktop/laptop and mobile/edge devices by loading them in a PyTorch environment under the Python interpreter,
@@ -468,10 +538,13 @@ To run your pte model, use the following command (assuming you already generated

 ### Android

-Check out the [tutorial on how to build an Android app running your PyTorch models with Executorch](https://pytorch.org/executorch/main/llm/llama-demo-android.html), and give your TorchAt models a spin.
+Check out the [tutorial on how to build an Android app running your PyTorch models with Executorch](https://pytorch.org/executorch/main/llm/llama-demo-android.html), and give your torchat models a spin.

 ![Screenshot](https://pytorch.org/executorch/main/_static/img/android_llama_app.png "Android app running Llama model")

+Detailed step-by-step instructions, in conjunction with the ET Android build, for running on an Android simulator: use `scripts/android_example.sh` to run a model on an Android simulator (on a Mac).
+
+
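A sketch of invoking the helper script named above, on a Mac with an Android emulator configured (any environment variables the script may require are not shown here):

```
./scripts/android_example.sh
```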
 ### iOS

 Open the iOS Llama Xcode project at https://github.com/pytorch/executorch/tree/main/examples/demo-apps/apple_ios/LLaMA/LLaMA.xcodeproj in Xcode and click Run.
@@ -482,6 +555,9 @@ Once you can run the app on you device,
 2 - copy the model and tokenizer.bin to the iOS Llama app
 3 - select the tokenizer and model with the `(...)` control (bottom left of screen, to the left of the text entrybox)

+
+Detailed step-by-step instructions, in conjunction with the ET iOS build, for running on the iOS simulator.
+
 # Supported Systems

 PyTorch and the mobile Executorch backend support a broad range of devices for running PyTorch with Python (using either eager or eager + `torch.compile`) or using a Python-free environment with AOT Inductor, as well as runtimes for executing exported models.
@@ -527,6 +603,25 @@ PyTorch and the mobile Executorch backend support a broad range of devices for r
 | Raspberry Pi 4/5 | Android | ? | ? | ? | ? |
 | ARM 32b (up to v7) | any | | ? | ? | ? | ? |

+## Runtime performance with Llama3, in tokens per second (4b quantization)
+
+| Hardware | OS | eager | eager + compile | AOT compile | ET Runtime |
+|-----|------|-----|-----|-----|-----|
+| x86 | Linux | ? | ? | ? | ? |
+| x86 | macOS | ? | ? | ? | ? |
+| aarch64 | Linux | ? | ? | ? | ? |
+| aarch64 | macOS | ? | ? | ? | ? |
+| AMD GPU | Linux | ? | ? | ? | ? |
+| Nvidia GPU | Linux | ? | ? | ? | ? |
+| MPS | macOS | ? | ? | ? | ? |
+| MPS | iOS | ? | ? | ? | ? |
+| aarch64 | Android | ? | ? | ? | ? |
+| Mobile GPU (Vulkan) | Android | ? | ? | ? | ? |
+| CoreML | iOS | ? | ? | ? | ? |
+| Hexagon DSP | Android | ? | ? | ? | ? |
+| Raspberry Pi 4/5 | Raspbian | ? | ? | ? | ? |
+| Raspberry Pi 4/5 | Android | ? | ? | ? | ? |
+| ARM 32b (up to v7) | any | ? | ? | ? | ? |

 ## Installation Instructions

@@ -544,23 +639,23 @@ Alternatively, you can also find libraries here: https://mac.r-project.org/openm
 macOS running on x86 is reaching end-of-life. To use PyTorch on x86 running macOS, you can download prebuilt binaries up to PyTorch 2.2. You can download recent PyTorch releases and
 install them from source.

-### iOS CoreML and MPS
+### iOS CoreML, Vulkan, MPS

 List dependencies for these backends

 ### Setting up ExecuTorch and runner-et
 Set up ExecuTorch by following the instructions [here](https://pytorch.org/executorch/stable/getting-started-setup.html#setting-up-executorch).
 For convenience, we provide a script that does this for you.

-From the TorchAt root directory, run the following
+From the torchat root directory, run the following
 ```
 export LLAMA_FAST_ROOT=${PWD}
 ./scripts/install_et.sh
 ```

 This will create a build directory, git clone ExecuTorch to ./build/src, apply some patches to the ExecuTorch source code, install the ExecuTorch python libraries with pip, and install the required ExecuTorch C++ libraries to ./build/install. This will take a while to complete.

-After ExecuTorch is installed, you can build runner-et from the TorchAt root directory with the following
+After ExecuTorch is installed, you can build runner-et from the torchat root directory with the following

 ```
 export LLAMA_FAST_ROOT=${PWD}
@@ -570,6 +665,15 @@ cmake --build ./build/cmake-out

 The built executable is located at ./build/cmake-out/runner-et.
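A sketch of invoking the built runner, assuming it takes the same arguments as the `./run` invocation shown earlier in this README (an exported model program plus `-z` tokenizer):

```
# Sketch: run an exported .pte model with the standalone runner.
./build/cmake-out/runner-et ${MODEL_OUT}/model.pte -z ${MODEL_OUT}/tokenizer.bin
```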

+### Tiktoken instructions & instructions for running llama3 without a python environment
+
+For mobile and the runner, if we can get a C/C++ tokenizer.
+
+
+### Raspberry Pi 5 instructions
+
+Expanded version of digant's note.
+
 # Acknowledgements

 A big thank you to
