
source instead of exec in run-readme-pr-macos.yml #1476

Open · wants to merge 36 commits into base: main
Conversation

mikekgfb
Copy link
Contributor

source test commands instead of executing them.
(Possible fix for #1315 )

Copy link

pytorch-bot bot commented Jan 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1476

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 18 New Failures

As of commit 7786b84 with merge base 162a38b:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Jan 24, 2025
mikekgfb and others added 3 commits January 23, 2025 18:38
source instead of exec
somebody pushed all the model exports into exportedModels, but... we never create the directory.

We should do that, and also do it in the user instructions, because storing into a directory that doesn't exist is not good :)
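A minimal sketch of the fix described above, using the `exportedModels` directory name from the docs examples:

```shell
# Create the output directory before exporting; writing into a
# nonexistent directory is what was failing for users following the docs.
mkdir -p exportedModels
ls -d exportedModels
```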
@mikekgfb
Copy link
Contributor Author

mikekgfb commented Jan 24, 2025

@Jack-Khuu when it rains it pours, that showed more false positives per #1315 than anybody could anticipate!

I added the directory that all the examples use (but don't create!), or we can just remove the directory...

@mikekgfb
Copy link
Contributor Author

@Jack-Khuu Ideally we start a run for 1476, and in parallel commit 1409, 1410, 1417, 1439, 1455, 1466.
1476 may lead to little changes in test setup etc, and it’ll avoid difficult merges with cleanup and alignment that’s 1409-1466. Wdyt?

PS: In a nutshell, failures for the doc-based runs haven't bubbled up because a failure inside a shell script executed with bash did not pass failure information upstream. Using source to run the multiple layers rectifies this, and may be a pragmatic answer to restoring full test coverage. (I think right now we've caught some of the failures that have not bubbled up to hud.pytorch.org because of the exec/bash dichotomy by eyeballing, which is not a healthy long-term solution.)
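The exec-vs-source behavior described above can be sketched with a hypothetical script (not the actual CI scripts): a script run as a child process reports only its last command's status, while sourcing it under `set -e` aborts the caller at the first failure.

```shell
# Hypothetical demo script: the first command fails, the last succeeds.
cat > /tmp/step.sh <<'EOF'
false
echo "later command masks the failure"
EOF

# Executed as a child: exit status is that of the script's last command.
bash /tmp/step.sh
echo "exec exit status: $?"      # 0: the failure is swallowed

# Sourced under `set -e`: the first failure aborts immediately,
# so the wrapper (and the CI job) observes a non-zero status.
bash -c 'set -e; source /tmp/step.sh; echo unreached'
echo "sourced exit status: $?"   # 1: the failure surfaces
```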

@Jack-Khuu
Copy link
Contributor

Yup, making my way through those CI PRs, then we'll rebase this one.

Our current coverage has plenty of gaps and has honestly been ad hoc, so revamping the CI and creating a comprehensive unittest system is a P0 KR for us this half (working on it with @Gasoonjia).

Thanks again for grinding through these!!

@mikekgfb
Copy link
Contributor Author

mikekgfb commented Jan 24, 2025

> Yup, making my way through those CI PRs, then we'll rebase this one.
>
> Our current coverage has plenty of gaps and has honestly been ad hoc, so revamping the CI and creating a comprehensive unittest system is a P0 KR for us this half (working on it with @Gasoonjia).
>
> Thanks again for grinding through these!!

pip3 not found. I guess we use conda for this environment. That's interesting. How do we deal with conda like that? Or is it just pip vs. pip3 (alias pip3=pip)?

https://github.com/pytorch/torchchat/actions/runs/12942557291/job/36137959147?pr=1476

multimodal doc needed an end-of-tests comment.
Need to download files before using them, lol. We expect users to do this, but we should spell it out. Plus, if we extract for testing, then it obviously fails.
`(` triggers an unexpected-token error in macOS zsh
          # metadata does not install properly on macos
          # .ci/scripts/run-docs multimodal
          # metadata does not install properly on macos
          # .ci/scripts/run-docs multimodal
@mikekgfb
Copy link
Contributor Author

https://hud.pytorch.org/pr/pytorch/torchchat/1476#36153855033

conda: command not found

          echo ".ci/scripts/run-docs native DISABLED"
          # .ci/scripts/run-docs native
          echo ".ci/scripts/run-docs native DISABLED"
          # .ci/scripts/run-docs native
@mikekgfb
Copy link
Contributor Author

pip3 command not found. This is called from ./install/install_requirements.sh.

  Using python executable: python3
  Using pip executable: pip3
  + pip3 install -r install/requirements.txt --extra-index-url https://download.pytorch.org/whl/nightly/cpu
  ./install/install_requirements.sh: line 101: pip3: command not found
  ++ handle_error
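One hedged sketch of a fallback for the missing-pip3 case (variable names are hypothetical, not the actual install_requirements.sh logic): prefer `pip3`, fall back to `pip`, and finally invoke pip as a module of the chosen interpreter.

```shell
# Hypothetical pip-executable selection with fallbacks; the log above
# shows the script choosing pip3, which is not on PATH in this image.
PYTHON_EXECUTABLE="${PYTHON_EXECUTABLE:-python3}"
if command -v pip3 >/dev/null 2>&1; then
  PIP_CMD="pip3"
elif command -v pip >/dev/null 2>&1; then
  PIP_CMD="pip"
else
  # `python3 -m pip` works whenever the interpreter bundles pip,
  # even if no pip wrapper script is on PATH.
  PIP_CMD="$PYTHON_EXECUTABLE -m pip"
fi
echo "Using pip command: $PIP_CMD"
```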

@Jack-Khuu

@Jack-Khuu
Copy link
Contributor

Looks like #1362 for the mismatched group size is finally marked as failing properly

cc: @Gasoonjia

@mikekgfb
Copy link
Contributor Author

mikekgfb commented Jan 28, 2025

Some issues: we can't find pip3 and/or conda.

https://github.com/pytorch/torchchat/actions/runs/12996559809/job/36252363658
test-gguf-cpu / linux job:

  Using pip executable: pip3
  + pip3 install -r install/requirements.txt --extra-index-url https://download.pytorch.org/whl/nightly/cpu
  ./install/install_requirements.sh: line 101: pip3: command not found

https://ossci-raw-job-status.s3.amazonaws.com/log/pytorch/torchchat/36252362180
test-evaluation-cpu / linux-job:

2025-01-27T21:31:18.6200400Z adae5085138d826cf1da08057e639b566f6a5a9fa338be7e4d6a1b5e7bac14fc
2025-01-27T21:31:18.6201409Z Running command: docker exec -t adae5085138d826cf1da08057e639b566f6a5a9fa338be7e4d6a1b5e7bac14fc /exec
2025-01-27T21:31:18.6202293Z /exec: line 3: conda: command not found

https://ossci-raw-job-status.s3.amazonaws.com/log/pytorch/torchchat/36252360312
test-readme-cpu / linux-job

2025-01-27T21:31:18.6200400Z adae5085138d826cf1da08057e639b566f6a5a9fa338be7e4d6a1b5e7bac14fc
2025-01-27T21:31:18.6201409Z Running command: docker exec -t adae5085138d826cf1da08057e639b566f6a5a9fa338be7e4d6a1b5e7bac14fc /exec
2025-01-27T21:31:18.6202293Z /exec: line 3: conda: command not found

@Jack-Khuu

@mikekgfb
Copy link
Contributor Author

mikekgfb commented Jan 28, 2025

The following is an issue specific to the use of stories... because the feature sizes aren't a multiple of the 256 group size. Originally, I had included padding or other support for handling this (embeddings quantization just handles a partial group, for example). Since moving to torchao, we're insisting that the feature size be a multiple of the group size.

    File "/opt/conda/lib/python3.11/site-packages/torchao/quantization/GPTQ.py", line 1142, in _create_quantized_state_dict
      in_features % self.groupsize == 0
  AssertionError: require in_features:288 % self.groupsize:256 == 0

Options:
1 - Find another small model we might use in lieu of stories15M that meets the divisible-by-256 threshold.
2 - We can rewrite to groupsize 32 for the tests. But of course that won't test the groupsize 256 that we point users to for the larger models.
3 - Revisit support for non-multiple groupwise quantization in torchao.

I'll assume that (3) might take a while for discussion and implementation, so going with (1) or (2) is probably the pragmatic solution. (With the caveat that (2) won't test gs=256, but it may be the quickest to implement, and I'm not sure what the smallest model for (1) is. I haven't looked at Stories 110M, which may be an acceptable stand-in re: feature sizes being a multiple of 256, although it will drive up the runtime of our tests.)

Resolved via (2) for now
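The arithmetic behind the assertion above (`in_features % groupsize == 0`) can be sketched directly; the numbers are the ones from the traceback:

```shell
# Mirror the divisibility check torchao asserts: every weight row must
# split into whole groups of `groupsize` columns.
in_features=288   # stories15M's feature size, per the AssertionError
for groupsize in 256 32; do
  if [ $((in_features % groupsize)) -eq 0 ]; then
    echo "gs=$groupsize ok"          # gs=32: 288 = 9 * 32
  else
    echo "gs=$groupsize fails"       # gs=256: 288 leaves a partial group
  fi
done
```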

switch to gs=32 quantization
(requires consolidated run-docs of pytorch#1439)
add gs=32 cuda quantization for use w/ stories15M
add gs=32 for stories15M
@mikekgfb
Copy link
Contributor Author

https://github.com/pytorch/torchchat/actions/runs/13021784616/job/36327025188?pr=1476

test-advanced-any

  W0129 01:52:50.181000 13569 site-packages/torch/_export/__init__.py:67] +============================+
  W0129 01:52:50.181000 13569 site-packages/torch/_export/__init__.py:68] |     !!!   WARNING   !!!    |
  W0129 01:52:50.181000 13569 site-packages/torch/_export/__init__.py:69] +============================+
  W0129 01:52:50.181000 13569 site-packages/torch/_export/__init__.py:70] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
  Traceback (most recent call last):
    File "/pytorch/torchchat/torchchat/cli/builder.py", line 564, in _initialize_model
      model.forward = torch._export.aot_load(
                      ^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.11/site-packages/torch/_export/__init__.py", line 163, in aot_load
  Traceback (most recent call last):
      runner = torch._C._aoti.AOTIModelContainerRunnerCpu(so_path, 1)  # type: ignore[call-arg]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  RuntimeError: Error in dlopen: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /pytorch/torchchat/stories15M.so)
  
  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "/pytorch/torchchat/torchchat.py", line 96, in <module>
      generate_main(args)
    File "/pytorch/torchchat/torchchat/generate.py", line 1651, in main
      run_generator(args)
    File "/pytorch/torchchat/torchchat/generate.py", line 1619, in run_generator
      gen = Generator(
            ^^^^^^^^^^
    File "/pytorch/torchchat/torchchat/generate.py", line 381, in __init__
      self.model = _initialize_model(self.builder_args, self.quantize, self.tokenizer)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/pytorch/torchchat/torchchat/cli/builder.py", line 568, in _initialize_model
      raise RuntimeError(f"Failed to load AOTI compiled {builder_args.dso_path}")
  RuntimeError: Failed to load AOTI compiled stories15M.so
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
      main()
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
      run_cmd_or_die(f"docker exec -t {container_name} /exec")
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
      raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
  RuntimeError: Command docker exec -t 893d7a50c1dac64601612a05ccaaf09925e9c5ff194d487c18889bfd84f813f5 /exec failed with exit code 1
  Error: Process completed with exit code 1.

@Jack-Khuu @angelayi

@Jack-Khuu added labels Jan 30, 2025: Compile / AOTI (issues related to AOT Inductor and torch compile), CI Infra (issues related to CI infrastructure and setup), triaged (looked at by a team member, and triaged and prioritized into an appropriate module)
Comment out tests that currently fail, as per summary in PR comments
Dump location of executable to understand these errors:
https://hud.pytorch.org/pr/pytorch/torchchat/1476#36452260294

2025-01-31T00:18:57.1405698Z + pip3 install -r install/requirements.txt --extra-index-url https://download.pytorch.org/whl/nightly/cpu
2025-01-31T00:18:57.1406689Z ./install/install_requirements.sh: line 101: pip3: command not found
dump candidate locations for pip
Some of the updown commands were getting rendered. Not sure why/when that happens.
readme switched from llama3 to llama3.1, so replace llama3.1 with stories15M
remove failing gguf test
Remove failing gguf test
@mikekgfb
Copy link
Contributor Author

Do we need to add elements to PATH? Or do we need to install some flavor of pip? (We also seem to be missing conda in some places, but pip appears to be the dominant failure mode)

From https://hud.pytorch.org/pr/pytorch/torchchat/1476#36495684060
X test-evaluation-cpu / linux-job

2025-01-31T18:07:47.9704998Z Using python executable: python3
2025-01-31T18:07:47.9705436Z located at /usr/bin/python3
2025-01-31T18:07:47.9705863Z Using pip executable: pip3
2025-01-31T18:07:47.9706883Z which: no pip3 in (/opt/rh/gcc-toolset-11/root/usr/bin:/opt/rh/gcc-toolset-14/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
2025-01-31T18:07:47.9707935Z located at not found
2025-01-31T18:07:47.9708171Z 
2025-01-31T18:07:47.9708341Z possible pip candidates are:
2025-01-31T18:07:47.9709360Z which: no pip in (/opt/rh/gcc-toolset-11/root/usr/bin:/opt/rh/gcc-toolset-14/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
2025-01-31T18:07:47.9710426Z pip is located at not found
2025-01-31T18:07:47.9711432Z which: no pip3 in (/opt/rh/gcc-toolset-11/root/usr/bin:/opt/rh/gcc-toolset-14/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
2025-01-31T18:07:47.9712490Z pip3 is located at not found
2025-01-31T18:07:47.9713593Z which: no pip{PYTHON_SYS_VERSION} in (/opt/rh/gcc-toolset-11/root/usr/bin:/opt/rh/gcc-toolset-14/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
2025-01-31T18:07:47.9714803Z pip{PYTHON_SYS_VERSION} is located at not found

@Jack-Khuu

Can we mix `steps:` with `script: |` in GitHub workflows?

Testing 123 testing!
@mikekgfb
Copy link
Contributor Author

mikekgfb commented Jan 31, 2025

The latest commit ^ adds some "setup steps", including the "Python setup" that we're already using in pull.yml. Added this in the hope that it will materialize a better Python installation with some flavor of pip.

remove quotes around replace, as the nested quotes are not interpreted by the shell but seem to be passed to updown.py.

We don't have spaces in replace, so no need for escapes.
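The quoting pitfall mentioned above can be sketched like this (the argument value is hypothetical): the shell consumes one level of quotes, so any nested quotes pass through literally to the program.

```shell
# The outer quotes are stripped by the shell; the inner quotes survive
# as literal characters in the argument the program receives.
printf '<%s>\n' "'llama3:stories15M'"   # inner quotes reach the program
printf '<%s>\n' llama3:stories15M       # unquoted: clean argument
```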
1 - Remove steps experiment.
2 - add apt-get install pip3

Maybe releng needs to look at what's happening with pip?
remove quotes that mess up parameter identification.
try to install pip & pip3
debug

        which pip || true
        which pip3 || true
        which conda || true
debug info

```
        which pip || true
        which pip3 || true
        which conda || true
```