README.md
items that are not factual. If you find an item that is incorrect, please file it as an issue so we can triage and determine whether to fix it or drop it from our initial release.*
# torchat *NORTHSTAR*
A repo for building and using llama models on servers, desktops and mobile devices.
The torchat repo enables model inference of llama models (and other LLMs) on servers, desktops and mobile devices.
For a list of devices, see below, under *SUPPORTED SYSTEMS*.
A goal of this repo, and of the design of the PT2 components, is to offer seamless integration and consistent workflows.
@@ -29,12 +29,12 @@ Featuring:
and backend-specific mobile runtimes ("delegates", such as CoreML and Hexagon).
The model definition (and much more!) is adopted from gpt-fast, so we support the same models. As new models are supported by gpt-fast,
bringing them into torchat should be straightforward. In addition, we invite community contributions.
# Getting started
Follow the `gpt-fast` [installation instructions](https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#installation).
Because torchat was designed to showcase the latest and greatest PyTorch 2 features for Llama (and related llama-style) models, many of the features used in torchat are hot off the press. [Download PyTorch nightly](https://pytorch.org/get-started/locally/) to get the latest, steaming-hot PyTorch 2 features.
*Key:* ✅ works correctly; 🚧 work in progress; ❌ not supported; ❹ requires 4-bit groupwise quantization; 📵 not on mobile phone (may fit some high-end devices such as tablets);
The model definition is in model.py and the generation code in generate.py. The
model checkpoint may have extensions `.pth` (checkpoint and model definition) or `.pt` (model checkpoint).
At present, we always use the torchat model for export and import the checkpoint into this model definition,
because we have tested that model with the export flows described herein.
@@ -223,7 +235,7 @@ quantization to achieve this, as described below.
We export the model with the export.py script. Running this script requires that you first install executorch with pybindings; see [here](#setting-up-executorch-and-runner-et).
At present, when exporting a model, the export command always uses the
xnnpack delegate to export. (Future versions of torchat will support additional
delegates such as Vulkan, CoreML, MPS, HTP in addition to Xnnpack as they are released for Executorch.)
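
For orientation, a minimal export invocation might look like the following sketch (paths are placeholders; quantization and other options are described later in this README):

```
# sketch: export a model checkpoint to an ExecuTorch .pte using the default xnnpack delegate
# MODEL_PATH, MODEL_OUT and MODEL_NAME are placeholder variables
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --output-pte-path ${MODEL_OUT}/${MODEL_NAME}.pte
```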
@@ -250,8 +262,32 @@ device supported by Executorch, most models need to be compressed to
fit in the target device's memory. We use quantization to achieve this.
# llama3 support

How to obtain a snapshot: to be filled in when published by Meta (we use an internal snapshot).

Enable the llama3 tokenizer with the option `--tiktoken` (see also the discussion under tokenizer, and the usage sketch below).

Enable all export options for llama3 as described below.

Identify and enable a runner/run.cpp with a binary tiktoken tokenizer (may already be available in OSS).
We cannot presently run runner/run.cpp with llama3 until we have a C/C++ tokenizer implementation (the initial tiktoken implementation is Python).
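As a sketch, generation with the llama3 tokenizer enabled might look as follows (the checkpoint flag and paths are placeholders mirroring the export examples in this README):

```
# sketch: run generation with the tiktoken-based llama3 tokenizer
python generate.py --checkpoint-path ${MODEL_PATH} --tiktoken --prompt "Hello my name is"
```
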
# Optimizing your model for server, desktop and mobile devices
## Model precision (dtype precision setting)
You can set the model precision for both export and generate, with eager, torch.compile, AOTI and ET, for all backends; mobile at present will primarily support fp32, with all options.
Unlike gpt-fast, which uses bfloat16 as the default, torchat uses float32 as the default. As a consequence, you will have to set `--dtype bf16` or `--dtype fp16` on server/desktop for best performance.
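
For instance, a sketch of selecting bfloat16 for generation (the checkpoint flag for generate.py is assumed here, mirroring the export examples):

```
# sketch: run generation in bfloat16 on server/desktop for best performance
python generate.py --checkpoint-path ${MODEL_PATH} --dtype bf16 --prompt "Hello my name is"
```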
## Making your models fit and execute fast!
Next, we'll show you how to optimize your model for mobile execution
@@ -260,7 +296,7 @@ AOTI). The basic model build for mobile surfaces two issues: Models
quickly run out of memory and execution can be slow. In this section,
we show you how to fit your models in the limited memory of a mobile
device, and optimize execution speed -- both using quantization. This
is the `torchat` repo after all!
For high-performance devices such as GPUs, quantization provides a way
to reduce the memory bandwidth required and take advantage of the
@@ -274,6 +310,9 @@ We can specify quantization parameters with the --quantize option. The
quantize option takes a JSON/dictionary with quantizers and
quantization options.
Both generate and export (for both ET and AOTI) accept quantization options. We only show a subset of the combinations to avoid combinatorial explosion.
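
For instance, a sketch of passing a quantization spec to generate (the quantizer string mirrors the GPTQ export example later in this README; the exact flags accepted by generate.py are an assumption):

```
# sketch: generation with a quantization spec supplied as a JSON/dictionary
python generate.py --checkpoint-path ${MODEL_PATH} --quant "{'linear:gptq': {'group_size' : 32} }" --prompt "Hello my name is"
```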
#### Embedding quantization (8-bit integer, channelwise & groupwise)
*Channelwise quantization*:
@@ -390,27 +429,58 @@ not been optimized for CUDA and CPU targets where the best
performance requires a group-wise quantized mixed dtype linear
operator.
#### 4-bit integer quantization (int4)
To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use
of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale.
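
As a rough illustration of the idea (a minimal sketch, not the repo's implementation), groupwise int4 quantization stores one floating-point scale per group of weights:

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32):
    # w: [out_features, in_features]; in_features must be divisible by group_size.
    # Each contiguous group of `group_size` weights shares a single fp scale.
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    scales = (groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)  # int4 range is [-8, 7]
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q.reshape(out_f, in_f), scales.squeeze(-1)

def dequantize_int4_groupwise(q: torch.Tensor, scales: torch.Tensor, group_size: int = 32):
    # Reverse the mapping: multiply each group back by its shared scale.
    out_f, in_f = q.shape
    groups = q.reshape(out_f, in_f // group_size, group_size).to(torch.float32)
    return (groups * scales.unsqueeze(-1)).reshape(out_f, in_f)
```

Smaller groups track the weight distribution more closely (better accuracy) at the cost of storing more scales.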
Now you can run your model with the same command as before:
```
python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso...] --prompt "Hello my name is"
```

#### Quantization with GPTQ (gptq)

```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:gptq': {'group_size' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ] # may require additional options, check with AO team
```

Now you can run your model with the same command as before:
```
python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso...] --prompt "Hello my name is"
```
#### Adding additional quantization schemes (hqq)
We invite contributors to submit established quantization schemes, with accuracy and performance results demonstrating soundness.
# Loading GGUF models
GGUF is a nascent industry standard format and we will read fp32, fp16 and some quantized formats (q4_0 and whatever is necessary to read llama2_7b_q4_0.gguf).

```
--load_gguf <gguf_filename> # all other options as described elsewhere, works for generate and export, for all backends, but cannot be used with --quantize
```

```
--dequantize_gguf <gguf_filename> # all other options as described elsewhere, works for generate and export, for all backends, and can be used with --quantize
```
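
For instance, a sketch of combining GGUF loading with generation (the GGUF file name is a placeholder):

```
# sketch: generate directly from a GGUF file
python generate.py --load_gguf <gguf_filename> --prompt "Hello my name is"
```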
# Standalone Execution
In addition to running the exported and compiled models for server, desktop/laptop and mobile/edge devices by loading them in a PyTorch environment under the Python interpreter,
@@ -468,10 +538,13 @@ To run your pte model, use the following command (assuming you already generated
### Android
Check out the [tutorial on how to build an Android app running your PyTorch models with Executorch](https://pytorch.org/executorch/main/llm/llama-demo-android.html), and give your torchat models a spin.
Detailed step-by-step instructions, in conjunction with the ET Android build, for running on the Android simulator: use `scripts/android_example.sh` to run a model on an Android simulator (on a Mac).
### iOS
Open the iOS Llama Xcode project at https://github.com/pytorch/executorch/tree/main/examples/demo-apps/apple_ios/LLaMA/LLaMA.xcodeproj in Xcode and click Run.
@@ -482,6 +555,9 @@ Once you can run the app on you device,
2 - copy the model and tokenizer.bin to the iOS Llama app
3 - select the tokenizer and model with the `(...)` control (bottom left of the screen, to the left of the text entry box)
Detailed step-by-step instructions, in conjunction with the ET iOS build, for running on the iOS simulator.
# Supported Systems
PyTorch and the mobile Executorch backend support a broad range of devices for running PyTorch with Python (using either eager or eager + `torch.compile`) or using a Python-free environment with AOT Inductor, as well as runtimes for executing exported models.
@@ -527,6 +603,25 @@ PyTorch and the mobile Executorch backend support a broad range of devices for r
| Raspberry Pi 4/5 | Android | ? | ? | ? | ? |
| ARM 32b (up to v7) | any | ? | ? | ? | ? |

## Runtime performance with Llama3, in tokens per second (4-bit quantization)

| Hardware | OS | eager | eager + compile | AOT compile | ET Runtime |
|-----|------|-----|-----|-----|-----|
| x86 | Linux | ? | ? | ? | ? |
| x86 | macOS | ? | ? | ? | ? |
| aarch64 | Linux | ? | ? | ? | ? |
| aarch64 | macOS | ? | ? | ? | ? |
| AMD GPU | Linux | ? | ? | ? | ? |
| Nvidia GPU | Linux | ? | ? | ? | ? |
| MPS | macOS | ? | ? | ? | ? |
| MPS | iOS | ? | ? | ? | ? |
| aarch64 | Android | ? | ? | ? | ? |
| Mobile GPU (Vulkan) | Android | ? | ? | ? | ? |
| CoreML | iOS | ? | ? | ? | ? |
| Hexagon DSP | Android | ? | ? | ? | ? |
| Raspberry Pi 4/5 | Raspbian | ? | ? | ? | ? |
| Raspberry Pi 4/5 | Android | ? | ? | ? | ? |
| ARM 32b (up to v7) | any | ? | ? | ? | ? |
## Installation Instructions
@@ -544,23 +639,23 @@ Alternatively, you can also find libraries here: https://mac.r-project.org/openm
macOS running on x86 is reaching end-of-life. To use PyTorch on x86 macOS, you can download prebuilt binaries up to PyTorch 2.2. You can also download recent PyTorch releases and install them from source.
### iOS CoreML, Vulkan, MPS
List dependencies for these backends.
### Setting up ExecuTorch and runner-et
Set up ExecuTorch by following the instructions [here](https://pytorch.org/executorch/stable/getting-started-setup.html#setting-up-executorch).
For convenience, we provide a script that does this for you.
From the torchat root directory, run the following:
```
export LLAMA_FAST_ROOT=${PWD}
./scripts/install_et.sh
```
This will create a build directory, git clone ExecuTorch to ./build/src, apply some patches to the ExecuTorch source code, install the ExecuTorch Python libraries with pip, and install the required ExecuTorch C++ libraries to ./build/install. This will take a while to complete.
After ExecuTorch is installed, you can build runner-et from the torchat root directory with the following