Replace WeightOnlyInt8Linear with TorchAO int8_weight_only quantization #1328

Closed. Wants to merge 84 commits; the changes shown below are from all commits.
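Summary of the change, per the title and the head commits: torchchat's hand-rolled WeightOnlyInt8Linear module swap is replaced by TorchAO's int8_weight_only quantization, with a fallback to the original path for float16 (commit 8b1af3f). A minimal sketch of the TorchAO call involved, assuming the torchao quantize_ API; this is not torchchat's exact wiring:

# Sketch only: TorchAO int8 weight-only quantization on a toy model.
# Assumes `pip install torchao`; quantize_/int8_weight_only come from torchao.
import torch
import torch.nn as nn
from torchao.quantization import int8_weight_only, quantize_

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))

# Mutates the model in place: eligible nn.Linear weights become int8
# weight-only quantized tensors, so no custom linear module is needed.
quantize_(model, int8_weight_only())

print(model(torch.randn(1, 64)).shape)

TorchAO applies this through a tensor subclass on the weights rather than a module swap, which is what allows the custom WeightOnlyInt8Linear class to be deleted.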
Commits (84):

92e0a9d  Replace WeightOnlyInt8Linear with TorchAO int8_weight_only quantization (Oct 24, 2024)
1a42fb6  Merge branch 'main' into torchao_int8_weight_only (vmpuri, Nov 12, 2024)
93d9876  fix: enforce python version install requirements (#1345) (leseb, Nov 12, 2024)
d1d6aa1  Remove last references to use_distributed argument (#1353) (mreso, Nov 13, 2024)
a286e58  Add cstdint to tokenizer (missing include) (#1339) (byjlw, Nov 13, 2024)
a655d58  Setup a SIGINT handler to gracefully exit the program once the user p… (leseb, Nov 13, 2024)
8811c7e  Update cli.py to make --device/--dtype pre-empt quantize dict-specifi… (mikekgfb, Nov 13, 2024)
483928b  Update Caching logic to only trigger on the first inference sample (#… (Jack-Khuu, Nov 13, 2024)
add35e8  Minor typo + Update install_requirements.sh to support python 3.10 >=… (Jack-Khuu, Nov 13, 2024)
008fea0  fix: Remove dup gguf dependency (#1371) (leseb, Nov 14, 2024)
d2e4995  Bug Fix: Check for explicit cli device (fast) (#1374) (Jack-Khuu, Nov 14, 2024)
bc2c2d0  fix: do not print perf stat when NaN (#1375) (leseb, Nov 15, 2024)
4eb7fbb  fix: Fail gracefully when "model" arg is missing when downloading (#1… (leseb, Nov 16, 2024)
d62680c  Ignore tokens per sec from jit_compile iteration (#1378) (yanbing-j, Nov 19, 2024)
c0630a6  Download fix (#1366) (gabe-l-hart, Nov 19, 2024)
fe76c85  Update builder.py (#1387) (mikekgfb, Nov 19, 2024)
8478e5d  Add multimodal to possible tests (#1382) (mikekgfb, Nov 19, 2024)
5e18de7  Fix typo in RuntimeException in builder.py (#1386) (mikekgfb, Nov 20, 2024)
8475c79  Bug fix: Enable fast to override quantize json (#1377) (Jack-Khuu, Nov 20, 2024)
731936d  Changing the referenced AAR so that it uses the AAR from the docs (#1… (infil00p, Nov 23, 2024)
554cf86  Typo fixes in native-execution.md (#1394) (mikekgfb, Nov 26, 2024)
dadaade  Improvements for readability in ADVANCED-USERS.md (#1393) (mikekgfb, Nov 26, 2024)
c7bb8b9  Update multimodal.md to exercise server as part of test (#1391) (mikekgfb, Nov 26, 2024)
b0abf27  Update quantization.md link to quantize.py (#1392) (Jack-Khuu, Dec 3, 2024)
b870f7e  Bump torch pin to 20241010 (#1400) (larryliu0820, Dec 6, 2024)
4e621ce  Use pytorch-labs/tokenizers and remove tokenizer/ (#1401) (larryliu0820, Dec 7, 2024)
6e40ec0  Update PT Pin to 1013 (#1407) (Jack-Khuu, Dec 9, 2024)
d979da1  Update docs for max-autotune usage (#1405) (yanbing-j, Dec 9, 2024)
6d6f2b9  Update run-docs to include `run-docs native` (#1403) (mikekgfb, Dec 9, 2024)
46b784e  Update README.md to run and query server during test (#1384) (mikekgfb, Dec 9, 2024)
2c03a2a  Update run-docs to enable `run-docs evaluation` (#1383) (mikekgfb, Dec 9, 2024)
e1fefc0  Revert "Use pytorch-labs/tokenizers and remove tokenizer/ (#1401)" (#… (Jack-Khuu, Dec 10, 2024)
bc0c1dc  Update README.md (whitespace) (#1412) (mikekgfb, Dec 10, 2024)
dfbd865  Update evaluation.md to include AOTI (#1411) (mikekgfb, Dec 10, 2024)
19ecd95  Update ADVANCED-USERS.md (#1396) (mikekgfb, Dec 11, 2024)
1315275  Bump PT pin to 20241028 (#1419) (Jack-Khuu, Dec 12, 2024)
1d7e71f  Avoid curl fails due to server startup time in CI (#1418) (mikekgfb, Dec 12, 2024)
36d0712  Add torchao mps ops (#1415) (manuelcandales, Dec 13, 2024)
5bc5552  Multi Pin Bumps across PT/AO/tune/ET: pt dev20241213 (#1367) (Jack-Khuu, Dec 14, 2024)
902542d  Update int4pack related in torchchat gguf (#1404) (yanbing-j, Dec 17, 2024)
6de1a01  update torchao pin: optimized shaders (#1428) (manuelcandales, Dec 18, 2024)
ff2d53c  Update install_requirements.sh to tune + pt/pt dev20241218 (#1426) (Jack-Khuu, Dec 19, 2024)
5e16167  Add Granite code support (#1336) (gabe-l-hart, Dec 19, 2024)
582e558  Fix 3.2 11B inference, by updating padded_collate_tiled_images_and_ma… (Jack-Khuu, Dec 19, 2024)
7dad56f  Integrate distributed inference with chat/server (#1381) (mreso, Dec 19, 2024)
155bd4b  Granite 3.0 / 3.1 dense support (#1432) (gabe-l-hart, Dec 20, 2024)
a325191  Fix typo in quantize.py (#1434) (mikekgfb, Dec 23, 2024)
86efcd3  Update sh -> bash in quantization.md (#1437) (mikekgfb, Dec 23, 2024)
a1ba6a1  Output explicit selection of /bin/bash as interpreter for test script… (mikekgfb, Dec 23, 2024)
490ad39  Fix how stream flag is read from request (#1441) (mreso, Dec 25, 2024)
b95074b  [retry] Use pytorch-labs/tokenizers and remove tokenizer/ (#1401) (#1… (larryliu0820, Jan 3, 2025)
3f0fec3  Update README.md to include granite (#1445) (mikekgfb, Jan 5, 2025)
c121ed2  Create local-model.md (#1448) (mikekgfb, Jan 6, 2025)
e60680b  Update evaluation.md (#1442) (mikekgfb, Jan 6, 2025)
1ba40d7  Create distributed.md (#1438) (mikekgfb, Jan 6, 2025)
06e78ce  [aoti] Remove need for -l in cmake (#1159) (angelayi, Jan 15, 2025)
6bfc5c8  Bumping ET Pin to Jan16 2025 (#1459) (Jack-Khuu, Jan 17, 2025)
d625f72  Fix typo in quantize.py (#1461) (mikekgfb, Jan 17, 2025)
e5543e2  Update run-readme-pr-mps.yml for typo (#1460) (mikekgfb, Jan 17, 2025)
2d96e48  Add Intel XPU device support to generate and serve (#1361) (jenniew, Jan 18, 2025)
defc225  Create run-readme-pr-linuxaarch64 (#1350) (mikekgfb, Jan 21, 2025)
2227014  Bump test-readme-mps-macos timeout (#1451) (mikekgfb, Jan 21, 2025)
bc0f93a  Update torch/tune/vision pins to 1/19/25 (#1467) (Jack-Khuu, Jan 22, 2025)
cd10377  Add warning in PTEModel when not defined (#1468) (Jack-Khuu, Jan 22, 2025)
ef58fce  Add attention_backend as a configurable option (#1456) (yanbing-j, Jan 22, 2025)
601f2d1  Update import of sdpa_with_kv_cache to custom_ops (#1470) (Jack-Khuu, Jan 22, 2025)
083960b  Typo: Fix generate signature type hint for attention_backend (#1471) (Jack-Khuu, Jan 22, 2025)
a942c16  chat: Change role to user for user prompts (#1447) (vladoovtcharov, Jan 22, 2025)
f514b35  Update run-readme-pr-linuxaarch64.yml to use correct runner (#1469) (Jack-Khuu, Jan 23, 2025)
c536da4  Increment start_pos by encoded size in generate (#1462) (nlpfollower, Jan 23, 2025)
8662471  Explicitly turning off pybindings for ExecuTorch unless requested (#1… (Jack-Khuu, Jan 24, 2025)
a64b9e3  Replace RMSNorm by nn.RMSNorm (#1464) (manuelcandales, Jan 24, 2025)
84d2232  Update aoti calls to utilize new export and packaging APIs (#1455) (angelayi, Jan 24, 2025)
1c2f5aa  Update numpy requirements to no longer upper bound on 2.0 (#1479) (Jack-Khuu, Jan 24, 2025)
59e168e  Add evaluation, multimodal, native tests to run-readme-pr-macos.yml (… (mikekgfb, Jan 24, 2025)
7b3a5fd  Add evaluation, multimodal, native tests to run-readme-pr-mps.yml (#1… (mikekgfb, Jan 24, 2025)
4e2c384  Force run-readme-pr-macos.yml to use CPU instead of incorrectly loadi… (mikekgfb, Jan 24, 2025)
8bae547  Add distributed tests to run-readme-pr.yml (#1466) (mikekgfb, Jan 27, 2025)
eba2b07  Update run-docs to avoid code duplication (#1439) (mikekgfb, Jan 30, 2025)
2f34fee  Add `export --output-snapshot-path snap.tc`, and `--snapshot-path sna… (mikekgfb, Jan 31, 2025)
ad7f85a  Update check_gibberish to check for aspell availability (#1487) (mikekgfb, Jan 31, 2025)
31ecb18  Add DeepSeek R1 Distill 8B (#1488) (Jack-Khuu, Feb 3, 2025)
5f9b347  Replace WeightOnlyInt8Linear with TorchAO int8_weight_only quantization (Oct 24, 2024)
8b1af3f  Fallback to original quantization if float16 (Feb 4, 2025)
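The head commits also define the dtype behavior: fall back to the original quantization if the model is float16 (8b1af3f). A hedged sketch of that dispatch follows; WeightOnlyInt8Linear here is a simplified illustrative stand-in, not torchchat's actual module, and the function names are hypothetical:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchao.quantization import int8_weight_only, quantize_

class WeightOnlyInt8Linear(nn.Module):
    # Simplified stand-in for the original module (illustration only).
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()
        # Symmetric per-output-channel scales mapping weights into [-127, 127].
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        self.register_buffer("weight", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight.to(x.dtype) * self.scale.to(x.dtype)
        return F.linear(x, w, self.bias)

def swap_linears(module: nn.Module) -> None:
    # Recursively replace nn.Linear children, legacy-module-swap style.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, WeightOnlyInt8Linear(child))
        else:
            swap_linears(child)

def int8_weight_only_quantize(model: nn.Module, dtype: torch.dtype) -> nn.Module:
    if dtype == torch.float16:
        swap_linears(model)  # assumed fallback, mirroring commit 8b1af3f
    else:
        quantize_(model, int8_weight_only())
    return model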
12 changes: 12 additions & 0 deletions .ci/scripts/check_gibberish
@@ -24,6 +24,18 @@ else
   fi
 fi
 
+#######################################################################
+#
+# check whether the aspell spell checker is available
+
+if command -v aspell &> /dev/null; then
+  echo "Checking $TMPFILE for gibberish"
+else
+  echo "Aspell is not installed or not in PATH."
+  echo "Gibberish unchecked in $TMPFILE"
+  exit 0
+fi
+
 #######################################################################
 #
 # run spell check on the extracted sequence
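The added guard skips the gibberish check, rather than failing the job, when aspell is absent. For illustration only (the CI script itself stays in bash), the same guard expressed in Python:

# Illustrative Python port of the aspell guard; not part of the PR.
import shutil
import sys

def guard_aspell(tmpfile: str) -> None:
    if shutil.which("aspell") is None:
        print("Aspell is not installed or not in PATH.")
        print(f"Gibberish unchecked in {tmpfile}")
        sys.exit(0)  # missing spell checker means skip, not failure
    print(f"Checking {tmpfile} for gibberish")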
150 changes: 62 additions & 88 deletions .ci/scripts/run-docs
@@ -1,93 +1,67 @@
-# /bin/bash -x
+#!/bin/bash -x
 
-if [ "X$1" == "X" ]; then
+# Check if an argument was provided
+if [ -z "$1" ]; then
   echo "Must specify document to run"
   exit 1
 fi
 
-if [ "$1" == "readme" ]; then
-  echo "::group::Create script to run README"
-  python3 torchchat/utils/scripts/updown.py --create-sections --file README.md --replace 'llama3.1:stories15M,-l 3:-l 2' --suppress huggingface-cli,HF_TOKEN > ./run-readme.sh
-  # for good measure, if something happened to updown processor,
-  # and it did not error out, fail with an exit 1
-  echo "exit 1" >> ./run-readme.sh
-  echo "::endgroup::"
-
-  echo "::group::Run README"
-  echo "*******************************************"
-  cat ./run-readme.sh
-  echo "*******************************************"
-  bash -x ./run-readme.sh
-  echo "::endgroup::"
-
-  exit 0
-fi
-
-if [ "$1" == "quantization" ]; then
-  echo "::group::Create script to run quantization"
-  python3 torchchat/utils/scripts/updown.py --create-sections --file docs/quantization.md --replace llama3:stories15M --suppress huggingface-cli,HF_TOKEN > ./run-quantization.sh
-  # for good measure, if something happened to updown processor,
-  # and it did not error out, fail with an exit 1
-  echo "exit 1" >> ./run-quantization.sh
-  echo "::endgroup::"
-
-  echo "::group::Run quantization"
-  echo "*******************************************"
-  cat ./run-quantization.sh
-  echo "*******************************************"
-  bash -x ./run-quantization.sh
-  echo "::endgroup::"
-
-  exit 0
-fi
-
-if [ "$1" == "gguf" ]; then
-  echo "::group::Create script to run gguf"
-  python3 torchchat/utils/scripts/updown.py --file docs/GGUF.md --replace 'llama3:stories15M,-l 3:-l 2' --suppress huggingface-cli,HF_TOKEN > ./run-gguf.sh
-  # for good measure, if something happened to updown processor,
-  # and it did not error out, fail with an exit 1
-  echo "exit 1" >> ./run-gguf.sh
-  echo "::endgroup::"
-
-  echo "::group::Run gguf"
-  echo "*******************************************"
-  cat ./run-gguf.sh
-  echo "*******************************************"
-  bash -x ./run-gguf.sh
-  echo "::endgroup::"
-fi
-
-
-if [ "$1" == "advanced" ]; then
-  echo "::group::Create script to run advanced"
-  python3 torchchat/utils/scripts/updown.py --file docs/ADVANCED-USERS.md --replace 'llama3:stories15M,-l 3:-l 2' --suppress huggingface-cli,HF_TOKEN > ./run-advanced.sh
-  # for good measure, if something happened to updown processor,
-  # and it did not error out, fail with an exit 1
-  echo "exit 1" >> ./run-advanced.sh
-  echo "::endgroup::"
-
-  echo "::group::Run advanced"
-  echo "*******************************************"
-  cat ./run-advanced.sh
-  echo "*******************************************"
-  bash -x ./run-advanced.sh
-  echo "::endgroup::"
-fi
-
-if [ "$1" == "evaluation" ]; then
-
-  exit 0
-
-  echo "::group::Create script to run evaluation"
-  python3 torchchat/utils/scripts/updown.py --file torchchat/utils/docs/evaluation.md --replace 'llama3:stories15M,-l 3:-l 2' --suppress huggingface-cli,HF_TOKEN > ./run-evaluation.sh
-  # for good measure, if something happened to updown processor,
-  # and it did not error out, fail with an exit 1
-  echo "exit 1" >> ./run-evaluation.sh
-  echo "::endgroup::"
-
-  echo "::group::Run evaluation"
-  echo "*******************************************"
-  cat ./run-evaluation.sh
-  echo "*******************************************"
-  bash -x ./run-evaluation.sh
-fi
+# Pre-initialize variables
+filepath=""
+parameters="--replace 'llama3:stories15M,-l3:-l2' --suppress huggingface-cli,HF_TOKEN"
+script_name="./run-${1}.sh" # Dynamically initialize script name
+
+# Use a case statement to handle the $1 argument
+case "$1" in
+  "readme")
+    filepath="README.md"
+    ;;
+  "quantization")
+    filepath="docs/quantization.md"
+    ;;
+  "gguf")
+    filepath="docs/GGUF.md"
+    ;;
+  "advanced")
+    filepath="docs/ADVANCED-USERS.md"
+    ;;
+  "evaluation")
+    filepath="torchchat/utils/docs/evaluation.md"
+    ;;
+  "multimodal")
+    filepath="docs/multimodal.md"
+    parameters="" # Clear parameters
+    ;;
+  "native")
+    filepath="docs/native-execution.md"
+    parameters="" # Clear parameters
+    ;;
+  "distributed")
+    filepath="docs/distributed.md"
+    parameters="--replace 'llama3.1:stories110M,-l3:-l2' --suppress huggingface-cli,HF_TOKEN" # Use stories110M to avoid need for authentication
+    ;;
+  "local")
+    filepath="docs/local-model.md"
+    parameters="" # Clear parameters
+    ;;
+
+  *)
+    echo "Unknown option: $1"
+    exit 1
+    ;;
+esac
+
+# Generate the script
+echo "::group::Create script to run $1"
+python3 torchchat/utils/scripts/updown.py --file "$filepath" $parameters > "$script_name"
+# if something happened to updown processor, and it did not error out, fail with an exit 1
+echo "exit 1" >> "$script_name"
+echo "::endgroup::"
+
+# Run the script
+echo "::group::Run $1"
+echo "*******************************************"
+cat "$script_name"
+echo "*******************************************"
+bash -x "$script_name"
+echo "::endgroup::"
13 changes: 5 additions & 8 deletions .github/workflows/more-tests.yml
@@ -9,23 +9,20 @@ on:
 
 jobs:
   test-cuda:
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    permissions:
+      id-token: write
+      contents: read
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     with:
       runner: linux.g5.4xlarge.nvidia.gpu
       gpu-arch-type: cuda
-      gpu-arch-version: "12.1"
+      gpu-arch-version: "12.4"
       timeout: 60
       script: |
         echo "::group::Print machine info"
         uname -a
         echo "::endgroup::"
-
-        echo "::group::Install newer objcopy that supports --set-section-alignment"
-        yum install -y devtoolset-10-binutils
-        export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
-        echo "::endgroup::"
-
 
         echo "::group::Download checkpoints"
         # Install requirements
         ./install/install_requirements.sh cuda
7 changes: 5 additions & 2 deletions .github/workflows/periodic.yml
@@ -108,7 +108,10 @@ jobs:
         set -eux
         PYTHONPATH="${PWD}" python .ci/scripts/gather_test_models.py --event "periodic" --backend "gpu"
   test-gpu:
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    permissions:
+      id-token: write
+      contents: read
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     name: test-gpu (${{ matrix.platform }}, ${{ matrix.model_name }})
     needs: gather-models-gpu
     secrets: inherit
@@ -119,7 +122,7 @@
       secrets-env: "HF_TOKEN_PERIODIC"
       runner: ${{ matrix.runner }}
       gpu-arch-type: cuda
-      gpu-arch-version: "12.1"
+      gpu-arch-version: "12.4"
       script: |
         echo "::group::Print machine info"
         nvidia-smi