
Tecorigin SDAA accelerator #6903

Merged: 12 commits from the Tecorigin-SDAA-accelerator branch into deepspeedai:master on Feb 21, 2025

Conversation

siqi654321 (Contributor)

Description
This PR adds Tecorigin SDAA accelerator support. With it, DeepSpeed supports SDAA as a backend for training tasks.
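For context, an accelerator integrates with DeepSpeed by subclassing the abstraction layer's DeepSpeedAccelerator interface. The sketch below shows the general shape of such a backend; it is not the PR's actual code, and the torch_sdaa plugin module, the "sdaa" device string, and the "tccl" backend name are illustrative assumptions.

```python
# Minimal sketch of an accelerator backend for DeepSpeed's abstraction layer.
# Not the PR's actual code: torch_sdaa, "sdaa", and "tccl" are assumptions,
# and only a few of the required abstract methods are shown.
from deepspeed.accelerator.abstract_accelerator import DeepSpeedAccelerator


class SDAA_Accelerator(DeepSpeedAccelerator):
    def __init__(self):
        self._name = 'sdaa'                        # device type string
        self._communication_backend_name = 'tccl'  # assumed collective library

    def device_name(self, device_index=None):
        # DeepSpeed addresses devices as "<name>" or "<name>:<index>".
        if device_index is None:
            return self._name
        return '{}:{}'.format(self._name, device_index)

    def communication_backend_name(self):
        return self._communication_backend_name

    def is_available(self):
        # Defer to the vendor's PyTorch plugin, mirroring torch.cuda.is_available().
        import torch_sdaa  # hypothetical vendor extension
        return torch_sdaa.is_available()
```

Once the detection logic in deepspeed.accelerator recognizes the hardware, get_accelerator() returns an instance of this class to the rest of the runtime.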

tjruwase (Contributor)

@siqi654321, is this ready for review?

siqi654321 (Contributor, Author)

@tjruwase Yes, it's ready for review. Tecorigin SDAA is an AI processor that supports AI frameworks such as PyTorch; it is possible to run Transformers, Accelerate, and DeepSpeed on SDAA to train foundation models. Website: http://www.tecorigin.com/
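To illustrate what this enables in user code: scripts written against DeepSpeed's accelerator abstraction pick up whichever backend is detected, so the same code can target CUDA, XPU, or SDAA. A minimal sketch, assuming the backend registers under the device name "sdaa":

```python
# Device-agnostic user code: get_accelerator() returns the active backend.
# Whether "sdaa" appears here depends on the detection logic this PR adds.
import torch
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()
print(acc.device_name())  # e.g. "cuda", "xpu", or "sdaa"

# Allocate a tensor on the first device of the active accelerator.
x = torch.ones(4, device=acc.device_name(0))
```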

tjruwase (Contributor) commented Feb 6, 2025

@siqi654321, please address the DCO requirement (the DCO check requires a Signed-off-by line on every commit).

siqi654321 force-pushed the Tecorigin-SDAA-accelerator branch from 82ca0ea to 38e4d50 on February 7, 2025 at 02:55
siqi654321 (Contributor, Author)

@tjruwase The DCO issue has been resolved.


tjruwase (Contributor) commented Feb 7, 2025

@siqi654321, the following can help fix the formatting CI check:
https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

siqi654321 (Contributor, Author)

@tjruwase Thanks for the help; the formatting error appears to be fixed. I would also like to ask whether the additional documentation you mentioned is required. I see that some accelerators, such as mlu, do not provide these documents.

tjruwase (Contributor) commented Feb 8, 2025

> Thanks for the help; the formatting error appears to be fixed. I would also like to ask whether the additional documentation you mentioned is required.

@siqi654321, the additional documentation is completely optional. This PR will merge once CI completes.

siqi654321 (Contributor, Author)

@tjruwase I found that the xpu-max1100 test has failed, but it doesn't seem to be related to my changes. Could you help take a look at this issue?

tjruwase (Contributor) commented Feb 11, 2025

@delock, this CI failure occurs sporadically but seems unrelated to the PRs. Can you help investigate? Thanks!

https://github.com/deepspeedai/DeepSpeed/actions/runs/13260306651/job/37017257486?pr=6903

loadams (Collaborator) commented Feb 12, 2025

> @siqi654321, please consider doing the following as appropriate:
>
> 1. Reviewing https://www.deepspeed.ai/tutorials/accelerator-abstraction-interface/
> 2. Updating https://www.deepspeed.ai/tutorials/accelerator-setup-guide/
> 3. Updating https://github.com/deepspeedai/DeepSpeed?tab=readme-ov-file#contributed-hw-support
> 4. Updating https://github.com/deepspeedai/DeepSpeed?tab=readme-ov-file#build-pipeline-status
>
> @loadams, did I miss anything?

I think this covers everything; the README and the accelerator setup guide are the most important.

delock (Collaborator) commented Feb 12, 2025

> @delock, this CI failure occurs sporadically but seems unrelated to the PRs. Can you help investigate? Thanks!
> https://github.com/deepspeedai/DeepSpeed/actions/runs/13260306651/job/37017257486?pr=6903

@tjruwase we are updating the firmware of this server to see whether the random failure goes away; please ignore this error for now. Thanks!

siqi654321 (Contributor, Author)

@tjruwase The "xpu-max1100" and "nv-torch-latest-v100" tests appear to run on the "xpu" and "cuda" accelerators respectively. However, I have confirmed that my changes do not affect these two accelerators, which is puzzling. Is it possible to run the tests on a different CI machine? Is there anything else I can do to help troubleshoot the issue?

loadams (Collaborator) commented Feb 12, 2025

> The "xpu-max1100" and "nv-torch-latest-v100" tests appear to run on the "xpu" and "cuda" accelerators respectively. However, I have confirmed that my changes do not affect these two accelerators, which is puzzling. Is it possible to run the tests on a different CI machine?

@siqi654321 - there is nothing you need to do to troubleshoot this; we will check on the tests and either ensure they pass or mark them as not required. You could just make the changes to the README as discussed above.

loadams (Collaborator) commented Feb 19, 2025

@siqi654321 - did you want to update the README to indicate that SDAA is a supported accelerator?

siqi654321 (Contributor, Author)

@loadams Is it possible to merge the code first in order to quickly support our ecosystem users? We will consider adding documentation related to this accelerator in subsequent pull requests.

tjruwase added this pull request to the merge queue on Feb 20, 2025
Merged via the queue into deepspeedai:master with commit cb20d44 on Feb 21, 2025
11 checks passed
Yejing-Lai pushed a commit to Yejing-Lai/DeepSpeed that referenced this pull request Feb 24, 2025
deepcharm pushed a commit to deepcharm/DeepSpeed that referenced this pull request Feb 26, 2025
deepcharm pushed a commit to deepcharm/DeepSpeed that referenced this pull request Feb 27, 2025
gyou2021 pushed a commit to gyou2021/DeepSpeed that referenced this pull request Feb 28, 2025
tohtana pushed a commit that referenced this pull request Feb 28, 2025
shenzheyu pushed a commit to shenzheyu/DeepSpeed that referenced this pull request Mar 5, 2025
ys950902 pushed a commit to ys950902/DeepSpeed that referenced this pull request Mar 6, 2025