
source instead of exec in run-readme-pr-macos.yml #1476

Open · wants to merge 36 commits into base: main
Conversation

mikekgfb
Copy link
Contributor

source test commands instead of executing them.
(Possible fix for #1315 )

Copy link

pytorch-bot bot commented Jan 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1476

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 18 New Failures

As of commit 7786b84 with merge base 162a38b:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Jan 24, 2025
mikekgfb and others added 3 commits January 23, 2025 18:38
source instead of exec
somebody pushed all the model exports into exportedModels, but... we never create the directory.

We should do that, and also do it in the user instructions, because storing into a directory that doesn't exist is not good :)
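A minimal sketch of the fix described above, using the `exportedModels` directory name from the docs examples:

```shell
# Create the output directory before exporting; writing into a
# nonexistent directory is what was failing for users following the docs.
mkdir -p exportedModels
ls -d exportedModels
```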
@mikekgfb
Copy link
Contributor Author

mikekgfb commented Jan 24, 2025

@Jack-Khuu when it rains it pours, that showed more false positives per #1315 than anybody could anticipate!

I added the directory that all the examples use (but don't create!), or we can just remove the directory...

@mikekgfb
Copy link
Contributor Author

@Jack-Khuu Ideally we start a run for 1476, and in parallel commit 1409, 1410, 1417, 1439, 1455, 1466.
1476 may lead to little changes in test setup etc, and it’ll avoid difficult merges with cleanup and alignment that’s 1409-1466. Wdyt?

PS: In a nutshell, failures for the doc-based runs haven't bubbled up because a failure inside a shell script executed with bash did not pass failure information upstream. Using source to run the multiple layers rectifies this, and may be a pragmatic answer to restoring full test coverage. (I think right now we've caught some of the failures that have not bubbled up to hud.pytorch.org because of the exec/bash dichotomy by eyeballing, which is not a healthy long-term solution.)
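The exec-vs-source behavior described above can be sketched with a hypothetical script (not the actual CI scripts): a script run as a child process reports only its last command's status, while sourcing it under `set -e` aborts the caller at the first failure.

```shell
# Hypothetical demo script: the first command fails, the last succeeds.
cat > /tmp/step.sh <<'EOF'
false
echo "later command masks the failure"
EOF

# Executed as a child: exit status is that of the script's last command.
bash /tmp/step.sh
echo "exec exit status: $?"      # 0: the failure is swallowed

# Sourced under `set -e`: the first failure aborts immediately,
# so the wrapper (and the CI job) observes a non-zero status.
bash -c 'set -e; source /tmp/step.sh; echo unreached'
echo "sourced exit status: $?"   # 1: the failure surfaces
```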

@Jack-Khuu
Copy link
Contributor

Yup, making my way through those CI PRs, then we'll rebase this one.

Our current coverage has plenty of gaps and has honestly been ad hoc, so revamping the CI and creating a comprehensive unittest system is a P0 KR for us this half (working on it with @Gasoonjia).

Thanks again for grinding through these!!

@mikekgfb
Copy link
Contributor Author

mikekgfb commented Jan 24, 2025

> Yup, making my way through those CI PRs, then we'll rebase this one.
>
> Our current coverage has plenty of gaps and has honestly been ad hoc, so revamping the CI and creating a comprehensive unittest system is a P0 KR for us this half (working on it with @Gasoonjia).
>
> Thanks again for grinding through these!!

pip3 not found. I guess we use conda for this environment. That's interesting. How do we deal with conda like that? Or is it just pip vs. pip3 (alias pip3=pip)?

https://github.com/pytorch/torchchat/actions/runs/12942557291/job/36137959147?pr=1476

multimodal doc needed an end-of-tests comment.
Need to download files before using them, lol. We expect users to do this, but we should spell it out. Plus, if we extract for testing, then it obviously fails.
`(` triggers an unexpected-token error in macOS zsh
          # metadata does not install properly on macos
          # .ci/scripts/run-docs multimodal
          # metadata does not install properly on macos
          # .ci/scripts/run-docs multimodal
@mikekgfb
Copy link
Contributor Author

https://hud.pytorch.org/pr/pytorch/torchchat/1476#36153855033

conda: command not found

          echo ".ci/scripts/run-docs native DISABLED"
          # .ci/scripts/run-docs native
          echo ".ci/scripts/run-docs native DISABLED"
          # .ci/scripts/run-docs native
@mikekgfb
Copy link
Contributor Author

pip3 command not found. This is called from ./install/install_requirements.sh.

  Using python executable: python3
  Using pip executable: pip3
  + pip3 install -r install/requirements.txt --extra-index-url https://download.pytorch.org/whl/nightly/cpu
  ./install/install_requirements.sh: line 101: pip3: command not found
  ++ handle_error
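One hedged sketch of a fallback for the missing-pip3 case (variable names are hypothetical, not the actual install_requirements.sh logic): prefer `pip3`, fall back to `pip`, and finally invoke pip as a module of the chosen interpreter.

```shell
# Hypothetical pip-executable selection with fallbacks; the log above
# shows the script choosing pip3, which is not on PATH in this image.
PYTHON_EXECUTABLE="${PYTHON_EXECUTABLE:-python3}"
if command -v pip3 >/dev/null 2>&1; then
  PIP_CMD="pip3"
elif command -v pip >/dev/null 2>&1; then
  PIP_CMD="pip"
else
  # `python3 -m pip` works whenever the interpreter bundles pip,
  # even if no pip wrapper script is on PATH.
  PIP_CMD="$PYTHON_EXECUTABLE -m pip"
fi
echo "Using pip command: $PIP_CMD"
```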

@Jack-Khuu

@Jack-Khuu
Copy link
Contributor

Looks like #1362 for the mismatched group size is finally marked as failing properly

cc: @Gasoonjia

@mikekgfb
Copy link
Contributor Author

mikekgfb commented Jan 28, 2025

Some issues: we can't find pip3 and/or conda.

https://github.com/pytorch/torchchat/actions/runs/12996559809/job/36252363658
test-gguf-cpu / linux job:

  Using pip executable: pip3
  + pip3 install -r install/requirements.txt --extra-index-url https://download.pytorch.org/whl/nightly/cpu
  ./install/install_requirements.sh: line 101: pip3: command not found

https://ossci-raw-job-status.s3.amazonaws.com/log/pytorch/torchchat/36252362180
test-evaluation-cpu / linux-job:

2025-01-27T21:31:18.6200400Z adae5085138d826cf1da08057e639b566f6a5a9fa338be7e4d6a1b5e7bac14fc
2025-01-27T21:31:18.6201409Z Running command: docker exec -t adae5085138d826cf1da08057e639b566f6a5a9fa338be7e4d6a1b5e7bac14fc /exec
2025-01-27T21:31:18.6202293Z /exec: line 3: conda: command not found

https://ossci-raw-job-status.s3.amazonaws.com/log/pytorch/torchchat/36252360312
test-readme-cpu / linux-job

2025-01-27T21:31:18.6200400Z adae5085138d826cf1da08057e639b566f6a5a9fa338be7e4d6a1b5e7bac14fc
2025-01-27T21:31:18.6201409Z Running command: docker exec -t adae5085138d826cf1da08057e639b566f6a5a9fa338be7e4d6a1b5e7bac14fc /exec
2025-01-27T21:31:18.6202293Z /exec: line 3: conda: command not found

@Jack-Khuu

@mikekgfb
Copy link
Contributor Author

mikekgfb commented Jan 28, 2025

The following is an issue specific to the use of stories... because the feature sizes aren't a multiple of the 256 group size. Originally, I had included padding or other support for handling this (embeddings quantization just handles a partial group, for example). Since moving to torchao, we're insisting that the feature size be a multiple of the group size.

    File "/opt/conda/lib/python3.11/site-packages/torchao/quantization/GPTQ.py", line 1142, in _create_quantized_state_dict
      in_features % self.groupsize == 0
  AssertionError: require in_features:288 % self.groupsize:256 == 0

Options:
1 - Find another small model we might use in lieu of stories15M that meets the divisible-by-256 threshold.
2 - We can rewrite to groupsize 32 for the tests. But of course that won't test the groupsize 256 that we point users to for the larger models.
3 - Revisit support for non-multiple groupwise quantization in torchao.

I'll assume that (3) might take a while for discussion and implementation, so going with (1) or (2) is probably the pragmatic solution. (With the caveat that (2) won't test gs=256, but it may be the quickest to implement, and I'm not sure what the smallest model for (1) is. I haven't looked at Stories 110M, which may be an acceptable stand-in re: feature sizes being a multiple of 256, although it will drive up the runtime of our tests.)

Resolved via (2) for now
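The arithmetic behind the assertion above (`in_features % groupsize == 0`) can be sketched directly; the numbers are the ones from the traceback:

```shell
# Mirror the divisibility check torchao asserts: every weight row must
# split into whole groups of `groupsize` columns.
in_features=288   # stories15M's feature size, per the AssertionError
for groupsize in 256 32; do
  if [ $((in_features % groupsize)) -eq 0 ]; then
    echo "gs=$groupsize ok"          # gs=32: 288 = 9 * 32
  else
    echo "gs=$groupsize fails"       # gs=256: 288 leaves a partial group
  fi
done
```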

switch to gs=32 quantization
(requires consolidated run-docs of pytorch#1439)
add gs=32 cuda quantization for use w/ stories15M
add gs=32 for stories15M
@mikekgfb
Copy link
Contributor Author

https://github.com/pytorch/torchchat/actions/runs/13021784616/job/36327025188?pr=1476

test-advanced-any

  W0129 01:52:50.181000 13569 site-packages/torch/_export/__init__.py:67] +============================+
  W0129 01:52:50.181000 13569 site-packages/torch/_export/__init__.py:68] |     !!!   WARNING   !!!    |
  W0129 01:52:50.181000 13569 site-packages/torch/_export/__init__.py:69] +============================+
  W0129 01:52:50.181000 13569 site-packages/torch/_export/__init__.py:70] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
  Traceback (most recent call last):
    File "/pytorch/torchchat/torchchat/cli/builder.py", line 564, in _initialize_model
      model.forward = torch._export.aot_load(
                      ^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.11/site-packages/torch/_export/__init__.py", line 163, in aot_load
  Traceback (most recent call last):
      runner = torch._C._aoti.AOTIModelContainerRunnerCpu(so_path, 1)  # type: ignore[call-arg]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  RuntimeError: Error in dlopen: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /pytorch/torchchat/stories15M.so)
  
  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "/pytorch/torchchat/torchchat.py", line 96, in <module>
      generate_main(args)
    File "/pytorch/torchchat/torchchat/generate.py", line 1651, in main
      run_generator(args)
    File "/pytorch/torchchat/torchchat/generate.py", line 1619, in run_generator
      gen = Generator(
            ^^^^^^^^^^
    File "/pytorch/torchchat/torchchat/generate.py", line 381, in __init__
      self.model = _initialize_model(self.builder_args, self.quantize, self.tokenizer)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/pytorch/torchchat/torchchat/cli/builder.py", line 568, in _initialize_model
      raise RuntimeError(f"Failed to load AOTI compiled {builder_args.dso_path}")
  RuntimeError: Failed to load AOTI compiled stories15M.so
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
      main()
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
      run_cmd_or_die(f"docker exec -t {container_name} /exec")
    File "/home/ec2-user/actions-runner/_work/torchchat/torchchat/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
      raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
  RuntimeError: Command docker exec -t 893d7a50c1dac64601612a05ccaaf09925e9c5ff194d487c18889bfd84f813f5 /exec failed with exit code 1
  Error: Process completed with exit code 1.

@Jack-Khuu @angelayi

@Jack-Khuu added labels Jan 30, 2025: Compile / AOTI (issues related to AOT Inductor and torch compile), CI Infra (issues related to CI infrastructure and setup), triaged (looked at by a team member, and triaged and prioritized into an appropriate module)
Comment out tests that currently fail, as per summary in PR comments
Dump location of executable to understand these errors:
https://hud.pytorch.org/pr/pytorch/torchchat/1476#36452260294

2025-01-31T00:18:57.1405698Z + pip3 install -r install/requirements.txt --extra-index-url https://download.pytorch.org/whl/nightly/cpu
2025-01-31T00:18:57.1406689Z ./install/install_requirements.sh: line 101: pip3: command not found
dump candidate locations for pip
Some of the updown commands were getting rendered. Not sure why/when that happens.
readme switched from llama3 to llama3.1, so replace llama3.1 with stories15M
remove failing gguf test
Remove failing gguf test
@mikekgfb
Copy link
Contributor Author

Do we need to add elements to PATH? Or do we need to install some flavor of pip? (We also seem to be missing conda in some places, but pip appears to be the dominant failure mode)

From https://hud.pytorch.org/pr/pytorch/torchchat/1476#36495684060
X test-evaluation-cpu / linux-job

2025-01-31T18:07:47.9704998Z Using python executable: python3
2025-01-31T18:07:47.9705436Z located at /usr/bin/python3
2025-01-31T18:07:47.9705863Z Using pip executable: pip3
2025-01-31T18:07:47.9706883Z which: no pip3 in (/opt/rh/gcc-toolset-11/root/usr/bin:/opt/rh/gcc-toolset-14/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
2025-01-31T18:07:47.9707935Z located at not found
2025-01-31T18:07:47.9708171Z 
2025-01-31T18:07:47.9708341Z possible pip candidates are:
2025-01-31T18:07:47.9709360Z which: no pip in (/opt/rh/gcc-toolset-11/root/usr/bin:/opt/rh/gcc-toolset-14/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
2025-01-31T18:07:47.9710426Z pip is located at not found
2025-01-31T18:07:47.9711432Z which: no pip3 in (/opt/rh/gcc-toolset-11/root/usr/bin:/opt/rh/gcc-toolset-14/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
2025-01-31T18:07:47.9712490Z pip3 is located at not found
2025-01-31T18:07:47.9713593Z which: no pip{PYTHON_SYS_VERSION} in (/opt/rh/gcc-toolset-11/root/usr/bin:/opt/rh/gcc-toolset-14/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
2025-01-31T18:07:47.9714803Z pip{PYTHON_SYS_VERSION} is located at not found

@Jack-Khuu

Can we mix `steps:` with `script: |` in GitHub workflows?

Testing 123 testing!
@mikekgfb
Copy link
Contributor Author

mikekgfb commented Jan 31, 2025

The latest commit ^ adds some "setup steps", including the "Python setup" that we're already using in pull.yml. Added this in the hope that it will materialize a better Python installation with some flavor of pip.

remove quotes around replace, as the nested quotes are not interpreted by the shell but seem to be passed to updown.py.

We don't have spaces in replace, so no need for escapes.
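The quoting pitfall mentioned above can be sketched like this (the argument value is hypothetical): the shell consumes one level of quotes, so any nested quotes pass through literally to the program.

```shell
# The outer quotes are stripped by the shell; the inner quotes survive
# as literal characters in the argument the program receives.
printf '<%s>\n' "'llama3:stories15M'"   # inner quotes reach the program
printf '<%s>\n' llama3:stories15M       # unquoted: clean argument
```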
1 - Remove steps experiment.
2 - add apt-get install pip3

Maybe releng needs to look at what's happening with pip?
remove quotes that mess up parameter identification.
try to install pip & pip3
debug

        which pip || true
        which pip3 || true
        which conda || true
debug info

```
        which pip || true
        which pip3 || true
        which conda || true
```