
Tecorigin SDAA accelerator #6903

Merged: 12 commits from the Tecorigin-SDAA-accelerator branch into deepspeedai:master on Feb 21, 2025

Conversation

siqi654321 (Contributor)

Description
This PR adds Tecorigin SDAA accelerator support. With it, DeepSpeed supports SDAA as a backend for training tasks.
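For context, an accelerator integrates with DeepSpeed by subclassing the abstraction layer's DeepSpeedAccelerator interface. The sketch below shows the general shape of such a backend; it is not the PR's actual code, and the torch_sdaa plugin module, the "sdaa" device string, and the "tccl" backend name are illustrative assumptions.

```python
# Minimal sketch of an accelerator backend for DeepSpeed's abstraction layer.
# Not the PR's actual code: torch_sdaa, "sdaa", and "tccl" are assumptions,
# and only a few of the required abstract methods are shown.
from deepspeed.accelerator.abstract_accelerator import DeepSpeedAccelerator


class SDAA_Accelerator(DeepSpeedAccelerator):
    def __init__(self):
        self._name = 'sdaa'                        # device type string
        self._communication_backend_name = 'tccl'  # assumed collective library

    def device_name(self, device_index=None):
        # DeepSpeed addresses devices as "<name>" or "<name>:<index>".
        if device_index is None:
            return self._name
        return '{}:{}'.format(self._name, device_index)

    def communication_backend_name(self):
        return self._communication_backend_name

    def is_available(self):
        # Defer to the vendor's PyTorch plugin, mirroring torch.cuda.is_available().
        import torch_sdaa  # hypothetical vendor extension
        return torch_sdaa.is_available()
```

Once the detection logic in deepspeed.accelerator recognizes the hardware, get_accelerator() returns an instance of this class to the rest of the runtime.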

tjruwase (Contributor)

@siqi654321, is this ready for review?

siqi654321 (Contributor, Author)

@tjruwase Yes, it's ready for review. Tecorigin SDAA is an AI processor that supports AI frameworks such as PyTorch; it is possible to run Transformers, Accelerate, and DeepSpeed on SDAA to train foundation models. Website: http://www.tecorigin.com/
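To illustrate what this enables in user code: scripts written against DeepSpeed's accelerator abstraction pick up whichever backend is detected, so the same code can target CUDA, XPU, or SDAA. A minimal sketch, assuming the backend registers under the device name "sdaa":

```python
# Device-agnostic user code: get_accelerator() returns the active backend.
# Whether "sdaa" appears here depends on the detection logic this PR adds.
import torch
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()
print(acc.device_name())  # e.g. "cuda", "xpu", or "sdaa"

# Allocate a tensor on the first device of the active accelerator.
x = torch.ones(4, device=acc.device_name(0))
```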

tjruwase (Contributor) commented Feb 6, 2025

@siqi654321, please address the DCO requirement (the DCO check requires a Signed-off-by line on every commit).

siqi654321 force-pushed the Tecorigin-SDAA-accelerator branch from 82ca0ea to 38e4d50 on February 7, 2025 at 02:55
siqi654321 (Contributor, Author)

@tjruwase The DCO issue has been resolved.


tjruwase (Contributor) commented Feb 7, 2025

@siqi654321, the following can help fix the formatting CI check:
https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

siqi654321 (Contributor, Author)

@tjruwase Thanks for the help; the formatting error appears to be fixed. I would also like to ask whether the additional documentation you mentioned is required. I see that some accelerators, such as mlu, do not provide these documents.

tjruwase (Contributor) commented Feb 8, 2025

> Thanks for the help; the formatting error appears to be fixed. I would also like to ask whether the additional documentation you mentioned is required.

@siqi654321, the additional documentation is completely optional. This PR will merge once CI completes.

siqi654321 (Contributor, Author)

@tjruwase I found that the xpu-max1100 test has failed, but it doesn't seem to be related to my changes. Could you help take a look at this issue?

tjruwase (Contributor) commented Feb 11, 2025

@delock, this CI failure occurs sporadically but seems unrelated to the PRs. Can you help investigate? Thanks!

https://github.com/deepspeedai/DeepSpeed/actions/runs/13260306651/job/37017257486?pr=6903

loadams (Collaborator) commented Feb 12, 2025

> @siqi654321, please consider doing the following as appropriate:
>
> 1. Reviewing https://www.deepspeed.ai/tutorials/accelerator-abstraction-interface/
> 2. Updating https://www.deepspeed.ai/tutorials/accelerator-setup-guide/
> 3. Updating https://github.com/deepspeedai/DeepSpeed?tab=readme-ov-file#contributed-hw-support
> 4. Updating https://github.com/deepspeedai/DeepSpeed?tab=readme-ov-file#build-pipeline-status
>
> @loadams, did I miss anything?

I think this covers everything; the README and the accelerator setup guide are the most important.

delock (Collaborator) commented Feb 12, 2025

> @delock, this CI failure occurs sporadically but seems unrelated to the PRs. Can you help investigate? Thanks!
> https://github.com/deepspeedai/DeepSpeed/actions/runs/13260306651/job/37017257486?pr=6903

@tjruwase we are updating the firmware of this server to see whether the random failure goes away; please ignore this error for now. Thanks!

siqi654321 (Contributor, Author)

@tjruwase The "xpu-max1100" and "nv-torch-latest-v100" tests appear to run on the "xpu" and "cuda" accelerators respectively. However, I have confirmed that my changes do not affect these two accelerators, which is puzzling. Is it possible to run the tests on a different CI machine? Is there anything else I can do to help troubleshoot the issue?

loadams (Collaborator) commented Feb 12, 2025

> The "xpu-max1100" and "nv-torch-latest-v100" tests appear to run on the "xpu" and "cuda" accelerators respectively. However, I have confirmed that my changes do not affect these two accelerators, which is puzzling. Is it possible to run the tests on a different CI machine?

@siqi654321 - there is nothing you need to do to troubleshoot this; we will check on the tests and either ensure they pass or mark them as not required. You could just make the changes to the README as discussed above.

loadams (Collaborator) commented Feb 19, 2025

@siqi654321 - did you want to update the README to indicate that SDAA is a supported accelerator?

siqi654321 (Contributor, Author)

@loadams Is it possible to merge the code first in order to quickly support our ecosystem users? We will consider adding documentation related to this accelerator in subsequent pull requests.

tjruwase added this pull request to the merge queue on Feb 20, 2025
Merged via the queue into deepspeedai:master with commit cb20d44 on Feb 21, 2025
11 checks passed
Yejing-Lai pushed a commit to Yejing-Lai/DeepSpeed that referenced this pull request Feb 24, 2025
deepcharm pushed a commit to deepcharm/DeepSpeed that referenced this pull request Feb 26, 2025
deepcharm pushed a commit to deepcharm/DeepSpeed that referenced this pull request Feb 27, 2025
gyou2021 pushed a commit to gyou2021/DeepSpeed that referenced this pull request Feb 28, 2025
tohtana pushed a commit that referenced this pull request Feb 28, 2025
shenzheyu pushed a commit to shenzheyu/DeepSpeed that referenced this pull request Mar 5, 2025
ys950902 pushed a commit to ys950902/DeepSpeed that referenced this pull request Mar 6, 2025