Commit fb68c07
Merge branch 'main' into dmoe_integration
2 parents 35c7225 + f532580

103 files changed: +6367 −748 lines

Diff for: .github/workflows/cpu_ci_on_pr.yml renamed to .github/workflows/.cpu_ci_on_pr.yml (+3 −1)

@@ -1,3 +1,5 @@
+# This file is hidden (.cpu_cpi_on_pr.yml) to minimize the number of runner minutes consumed.
+
 name: "Pull Request CPU Tests"

 on:
@@ -7,7 +9,7 @@ on:

 jobs:
   run-tests:
-    runs-on: [ 'test', 'self-hosted' ]
+    runs-on: ubuntu-22.04 # ubuntu-latest currently points to ubuntu-22.04 but 24.04 is in beta - recommend testing on 24.04 and then changing instead of using ubuntu-latest
     steps:
       - name: Checkout Repository
         uses: actions/checkout@v4

Diff for: .github/workflows/coverity_scan.yml (+3 −2)

@@ -17,9 +17,10 @@ jobs:
     runs-on: ubuntu-latest

     env:
-      COV_USER: ${{ secrets.COV_USER }}
+      COV_USER: ${{ secrets.COV_USER }} # needs to be an email with access to the Coverity stream - add to secrets/actions
       COVERITY_PROJECT: ${{ secrets.COVERITY_PROJECT }}
-      COVERITY_TOKEN: ${{ secrets.COVERITY_TOKEN }}
+      COVERITY_TOKEN: ${{ secrets.COVERITY_TOKEN }} # you can get this token from Coverity stream dashboard:
+      # https://scan.coverity.com/projects/<project>?tab=project_settings

     steps:
       - uses: actions/checkout@v2

Diff for: .github/workflows/cpu_ci.yml (+1 −1)

@@ -5,7 +5,7 @@ on: "push"
 jobs:
   run-tests:
     #runs-on: ubuntu-latest
-    runs-on: [ 'test', 'self-hosted' ]
+    runs-on: ubuntu-22.04
     steps:
       - uses: actions/checkout@v3

Diff for: .github/workflows/cpu_ci_dispatch.yml (+1 −1)

@@ -10,7 +10,7 @@ on:

 jobs:
   run-tests:
-    runs-on: [ 'test', 'self-hosted' ]
+    runs-on: ubuntu-22.04
     steps:
       - name: Checkout Repository
         uses: actions/checkout@v4

Diff for: .github/workflows/pull_request.yml (+15 −4)

@@ -1,6 +1,7 @@
 name: Pull Request

-on: [pull_request]
+#on: [pull_request, workflow_dispatch]
+on: workflow_dispatch

 jobs:
   pre-commit:
@@ -9,7 +10,7 @@ jobs:
       - uses: actions/checkout@v2
       - uses: actions/setup-python@v4
         with:
-          python-version: 3.10
+          python-version: "3.10.14"
           cache: "pip"
           cache-dependency-path: "**/requirements*.txt"
       # Need the right version of clang-format
@@ -40,10 +41,20 @@ jobs:
           git commit -m "Update NeoXArgs docs automatically"
           git push
   run-tests:
-    runs-on: self-hosted
+    runs-on: ubuntu-22.04
     steps:
       - uses: actions/checkout@v2
+      - uses: actions/setup-python@v4
+        with:
+          python-version: "3.10.13"
+          cache-dependency-path: "**/requirements*.txt"
       - name: prepare data
-        run: python prepare_data.py
+        run: python3 prepare_data.py
+      - name: install pytest
+        run: python3 -m pip install pytest pytest-forked pyyaml requests wandb
+      - name: install torch
+        run: python3 -m pip install torch
+      - name: install requirements
+        run: pip install -r requirements/requirements.txt
       - name: Run Tests
         run: pytest --forked tests

Diff for: .pre-commit-config.yaml (+1 −1)

@@ -33,7 +33,7 @@ repos:
     hooks:
       - id: codespell
         args: [
-          '--ignore-words-list=reord,dout', # Word used in error messages that need rewording
+          '--ignore-words-list=reord,dout,te', # Word used in error messages that need rewording. te --> transformerengine
          --check-filenames,
          --check-hidden,
        ]

Diff for: README.md (+48 −16)

@@ -15,9 +15,21 @@ GPT-NeoX leverages many of the same features and technologies as the popular Meg
 * Cutting edge architectural innovations including rotary and alibi positional embeddings, parallel feedforward attention layers, and flash attention.
 * Predefined configurations for popular architectures including Pythia, PaLM, Falcon, and LLaMA 1 \& 2
 * Curriculum Learning
-* Easy connections with the open source ecosystem, including Hugging Face's [tokenizers](https://github.com/huggingface/tokenizers) and [transformers](https://github.com/huggingface/transformers/) libraries, logging via [WandB](https://wandb.ai/site), and evaluation via our [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).
+* Easy connections with the open source ecosystem, including Hugging Face's [tokenizers](https://github.com/huggingface/tokenizers) and [transformers](https://github.com/huggingface/transformers/) libraries, monitor experiments via [WandB](https://wandb.ai/site)/[Comet](https://www.comet.com/site/)/TensorBoard, and evaluation via our [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).

 ## News
+**[9/9/2024]** We now support preference learning via [DPO](https://arxiv.org/abs/2305.18290), [KTO](https://arxiv.org/abs/2402.01306), and reward modeling
+
+**[9/9/2024]** We now support integration with [Comet ML](https://www.comet.com/site/), a machine learning monitoring platform
+
+**[5/21/2024]** We now support [RWKV](https://www.rwkv.com/) with pipeline parallelism!. See the PRs for [RWKV](https://github.com/EleutherAI/gpt-neox/pull/1198) and [RWKV+pipeline](https://github.com/EleutherAI/gpt-neox/pull/1221)
+
+**[3/21/2024]** We now support Mixture-of-Experts (MoE)
+
+**[3/17/2024]** We now support AMD MI250X GPUs
+
+**[3/15/2024]** We now support [Mamba](https://github.com/state-spaces/mamba) with tensor parallelism! See [the PR](https://github.com/EleutherAI/gpt-neox/pull/1184)
+
 **[8/10/2023]** We now support checkpointing with AWS S3! Activate with the `s3_path` config option (for more detail, see [the PR](https://github.com/EleutherAI/gpt-neox/pull/1010))

 **[9/20/2023]** As of https://github.com/EleutherAI/gpt-neox/pull/1035, we have deprecated Flash Attention 0.x and 1.x, and migrated support to Flash Attention 2.x. We don't believe this will cause problems, but if you have a specific use-case that requires old flash support using the latest GPT-NeoX, please raise an issue.
@@ -88,14 +100,15 @@ Prior to 3/9/2023, GPT-NeoX relied on [DeeperSpeed](https://github.com/EleutherA

 ### Host Setup

-First make sure you are in an environment with Python 3.8 with an appropriate version of PyTorch 1.8 or later installed. **Note:** Some of the libraries that GPT-NeoX depends on have not been updated to be compatible with Python 3.10+. Python 3.9 appears to work, but this codebase has been developed and tested for Python 3.8.
+This codebase has primarily developed and tested for Python 3.8-3.10, and PyTorch 1.8-2.0. This is not a strict requirement, and other versions and combinations of libraries may work.

 To install the remaining basic dependencies, run:

 ```bash
 pip install -r requirements/requirements.txt
 pip install -r requirements/requirements-wandb.txt # optional, if logging using WandB
 pip install -r requirements/requirements-tensorboard.txt # optional, if logging via tensorboard
+pip install -r requirements/requirements-comet.txt # optional, if logging via Comet
 ```

 from the repository root.
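The Python 3.8-3.10 range stated in this hunk can be sanity-checked at runtime. A minimal sketch, assuming a hypothetical `python_supported` helper that is not part of the repo:

```python
import sys

# Hypothetical helper: check the interpreter against the Python 3.8-3.10
# range the README hunk says the codebase is developed/tested on.
def python_supported(major, minor):
    return (3, 8) <= (major, minor) <= (3, 10)

print(python_supported(*sys.version_info[:2]))
```

Note that per the hunk this is not a strict requirement; other versions may work.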
@@ -294,7 +307,7 @@ You can then run any job you want from inside the container.
 Concerns when running for a long time or in detached mode include
 - You will have to terminate the container manually when you are no longer using it
 - If you want processes to continue running when your shell session ends, you will need to background them.
-- If you then want logging, you will have to make sure to pipe logs to disk or set up wandb.
+- If you then want logging, you will have to make sure to pipe logs to disk, and set up wandb and/or Comet logging.

 If you prefer to run the prebuilt container image from dockerhub, you can run the docker compose commands with ```-f docker-compose-dockerhub.yml``` instead, e.g.,
@@ -457,7 +470,7 @@ You can pass in an arbitrary number of configs which will all be merged at runtime

 You can also optionally pass in a config prefix, which will assume all your configs are in the same folder and append that prefix to their path.

-E.G:
+For example:

 ```bash
 python ./deepy.py train.py -d configs 125M.yml local_setup.yml
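The "merged at runtime" behavior referenced in this hunk's context can be sketched roughly as follows. `merge_configs` is a hypothetical simplification for illustration, not NeoX's actual argument loader:

```python
# Rough sketch of merging an arbitrary number of config files at runtime:
# later files override keys from earlier ones (hypothetical simplification,
# not the actual NeoX argument-handling code).
def merge_configs(*configs):
    merged = {}
    for cfg in configs:
        merged.update(cfg)
    return merged

base = {"train_micro_batch_size_per_gpu": 4, "lr": 6.0e-4}
local_setup = {"lr": 3.0e-4, "data_path": "data/enwik8"}
merged = merge_configs(base, local_setup)
print(merged["lr"])  # the later config wins for duplicated keys
```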
@@ -574,15 +587,28 @@ To convert from a Hugging Face model into a NeoX-loadable, run `tools/ckpts/conv

 # Monitoring

-In addition to storing logs locally, we provide built-in support for two popular experiment monitoring frameworks: [Weights & Biases](https://wandb.ai/site) and [TensorBoard](https://www.tensorflow.org/tensorboard/)
+In addition to storing logs locally, we provide built-in support for two popular experiment monitoring frameworks: [Weights & Biases](https://wandb.ai/site), [TensorBoard](https://www.tensorflow.org/tensorboard/), and [Comet](https://www.comet.com/site)

 ## Weights and Biases

-EleutherAI is currently using [Weights & Biases to record our experiments](https://wandb.ai/eleutherai/neox). If you are logged into Weights & Biases on your machine&mdash;you can do this by executing `wandb login`&mdash;your runs will automatically be recorded. There are two optional fields associated with Weights & Biases: <code><var>wandb_group</var></code> allows you to name the run group and <code><var>wandb_team</var></code> allows you to assign your runs to an organization or team account.
+[Weights & Biases to record our experiments](https://wandb.ai/eleutherai/neox) is a machine learning monitoring platform. To use wandb to monitor your gpt-neox experiments:
+1. Create an account at https://wandb.ai/site to generate your API key
+2. Log into Weights & Biases on your machine&mdash;you can do this by executing `wandb login`&mdash;your runs will automatically be recorded.
+3. Dependencies required for wandb monitoring can be found in and installed from `./requirements/requirements-wandb.txt`. An example config is provided in `./configs/local_setup_wandb.yml`.
+4. There are two optional fields associated with Weights & Biases: <code><var>wandb_group</var></code> allows you to name the run group and <code><var>wandb_team</var></code> allows you to assign your runs to an organization or team account. An example config is provided in `./configs/local_setup_wandb.yml`.

 ## TensorBoard

-We also support using TensorBoard via the <code><var>tensorboard-dir</var></code> field. Dependencies required for TensorBoard monitoring can be found in and installed from `./requirements/requirements-tensorboard.txt`.
+We support using TensorBoard via the <code><var>tensorboard-dir</var></code> field. Dependencies required for TensorBoard monitoring can be found in and installed from `./requirements/requirements-tensorboard.txt`.
+
+## Comet
+
+[Comet](https://www.comet.com/site) is a machine learning monitoring platform. To use comet to monitor your gpt-neox experiments:
+1. Create an account at https://www.comet.com/login to generate your API key.
+2. Once generated, link your API key at runtime by running `comet login` or passing `export COMET_API_KEY=<your-key-here>`
+3. Install `comet_ml` and any dependency libraries via `pip install -r requirements/requirements-comet.txt`
+4. Enable Comet with `use_comet: True`. You can also customize where data is being logged with `comet_workspace` and `comet_project`. A full example config with comet enabled is provided in `configs/local_setup_comet.yml`.
+5. Run your experiment, and monitor metrics in the Comet workspace that you passed!
# Running on multi-node
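Wiring the Comet settings from the Monitoring hunk together can be sketched as below. The config keys (`use_comet`, `comet_workspace`, `comet_project`) and the `COMET_API_KEY` variable come from the hunk; the dict shape and the example workspace/project names are illustrative assumptions, not NeoX's actual config object:

```python
import os

# The README hunk suggests `export COMET_API_KEY=<your-key-here>`;
# setdefault mirrors that without clobbering an existing key.
os.environ.setdefault("COMET_API_KEY", "<your-key-here>")

# Illustrative config fragment; keys come from the hunk, values are
# hypothetical placeholders.
comet_config = {
    "use_comet": True,
    "comet_workspace": "my-workspace",
    "comet_project": "neox-experiments",
}
print(comet_config["use_comet"])
```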
@@ -594,7 +620,9 @@ We support profiling with Nsight Systems, the PyTorch Profiler, and PyTorch Memo

 ## Nsight Systems Profiling

-To use the Nsight Systems profiling, set config options `profile`, `profile_step_start`, and `profile_step_stop`. Launch training with:
+To use the Nsight Systems profiling, set config options `profile`, `profile_step_start`, and `profile_step_stop` (see [here](https://github.com/EleutherAI/gpt-neox/blob/main/configs/neox_arguments.md) for argument usage, and [here](https://github.com/EleutherAI/gpt-neox/blob/main/configs/prof.yml) for a sample config).
+
+To populate nsys metrics, launch training with:

 ```
 nsys profile -s none -t nvtx,cuda -o <path/to/profiling/output> --force-overwrite true \
@@ -604,22 +632,22 @@ $TRAIN_PATH/train.py --conf_dir configs <config files>

 The generated output file can then by viewed with the Nsight Systems GUI:

-![Alt text](images/nsight_profiling.png)
+![nsight-prof](images/nsight_profiling.png)

 ## PyTorch Profiling

-To use the built-in PyTorch profiler, set config options `profile`, `profile_step_start`, and `profile_step_stop`.
+To use the built-in PyTorch profiler, set config options `profile`, `profile_step_start`, and `profile_step_stop` (see [here](https://github.com/EleutherAI/gpt-neox/blob/main/configs/neox_arguments.md) for argument usage, and [here](https://github.com/EleutherAI/gpt-neox/blob/main/configs/prof.yml) for a sample config).

 The PyTorch profiler will save traces to your `tensorboard` log directory. You can view these traces within
 TensorBoard by following the steps [here](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html).

-![Alt text](images/pytorch_profiling.png)
+![torch-prof](images/pytorch_profiling.png)

 ## PyTorch Memory Profiling

-To use PyTorch Memory Profiling, set config options `memory_profiling` and `memory_profiling_path`.
+To use PyTorch Memory Profiling, set config options `memory_profiling` and `memory_profiling_path` (see [here](https://github.com/EleutherAI/gpt-neox/blob/main/configs/neox_arguments.md) for argument usage, and [here](https://github.com/EleutherAI/gpt-neox/blob/main/configs/prof.yml) for a sample config).

-![Alt text](images/memory_profiling.png)
+![mem-prof](images/memory_profiling.png)

 View the generated profile with the [memory_viz.py](https://github.com/pytorch/pytorch/blob/main/torch/cuda/_memory_viz.py) script. Run with:
@@ -677,7 +705,7 @@ The following publications by other research groups use this library:
 The following models were trained using this library:

 ### English LLMs
-- EleutherAI's [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b), [Pythia (70M through 13B)](https://github.com/EleutherAI/pythia), and [LLeMMA (34B)](https://arxiv.org/abs/2310.10631)
+- EleutherAI's [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b) and [Pythia (70M through 13B)](https://github.com/EleutherAI/pythia)
 - CarperAI's [FIM-NeoX-1.3B](https://huggingface.co/CarperAI/FIM-NeoX-1.3B)
 - StabilityAI's [StableLM (3B and 7B)](https://github.com/Stability-AI/StableLM)
 - Together.ai's [RedPajama-INCITE (3B and 7B)](https://together.ai/blog/redpajama-models-v1)
@@ -688,25 +716,29 @@ The following models were trained using this library:
 ### Non-English LLMs
 - EleutherAI's [Polyglot-Ko (1.3B through 12.8B)](https://github.com/EleutherAI/polyglot) (Korean)
 - Korea University's [KULLM-Polyglot (5.8B and 12.8B)](https://github.com/nlpai-lab/KULLM) (Korean)
-- Stability AI's [Japanese Stable LM (7B)](https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b)
+- Stability AI's [Japanese Stable LM (7B)](https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b) (Japanese)
 - LearnItAnyway's [LLaVA-Polyglot-Ko (1.3B)](https://huggingface.co/LearnItAnyway/llava-polyglot-ko-1.3b-hf) (Korean)
 - Rinna Co.'s [japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b) (Japanese) and [bilingual-gpt-neox-4b](https://huggingface.co/rinna/bilingual-gpt-neox-4b) (English / Japanese)
 - CyberAgent's [Open-CLM (125M through 7B)](https://huggingface.co/cyberagent/open-calm-7b) (Japanese)
 - The Hungarian Research Centre for Linguistics's [PULI GPTrio (6.7B)](https://huggingface.co/NYTK/PULI-GPTrio) (Hungarian / English / Chinese)
 - The University of Tokyo's [weblab-10b](https://huggingface.co/Kojima777/weblab-10b) and [weblab-10b-instruct](https://huggingface.co/Kojima777/weblab-10b-instruction-sft) (Japanese)
 - nolando.ai's [Hi-NOLIN (9B)](https://blog.nolano.ai/Hi-NOLIN/) (English, Hindi)
+- Renmin University of China's [YuLan (12B)](https://huggingface.co/yulan-team/YuLan-Base-12b) (English, Chinese)
+- The Basque Center for Language Technology's [Latixna (70B)](https://huggingface.co/HiTZ/latxa-70b-v1.2) (Basque)

 ### Code Models
 - Carnegie Mellon University's [PolyCoder (160M through 2.7B)](https://github.com/VHellendoorn/Code-LMs) and [CAT-LM (2.7B)](https://huggingface.co/nikitharao/catlm)
 - StabilityAI's [StableCode (1.3B)](https://stability.ai/blog/stablecode-llm-generative-ai-coding) and [StableCode-Completion-Alpha (3B)](https://stability.ai/blog/stablecode-llm-generative-ai-coding)
 - CodeFuse AI's [CodeFuse (13B)](https://huggingface.co/codefuse-ai/CodeFuse-13B)

 ### AI for Science
+- EleutherAI's [LLeMMA (34B)](https://arxiv.org/abs/2310.10631)
 - Oak Ridge National Lab's [FORGE (26B)](https://github.com/at-aaims/forge)
-- Oak Ridge National Lab and EleutherAI's [Unnamed Material Science Domain Models (7B)](https://github.com/at-aaims/forge)
+- Oak Ridge National Lab's [Unnamed Material Science Domain Models (7B)](https://arxiv.org/abs/2402.00691)
 - Pacific Northwest National Lab's [MolJet (undisclosed size)](https://openreview.net/pdf?id=7UudBVsIrr)

 ### Other Modalities
+- Rinna Co.'s [PSLM (7B)](https://arxiv.org/abs/2406.12428) (speech / text)
 - University College London's [ChessGPT-3B](https://huggingface.co/Waterhorse/chessgpt-base-v1)
 - Gretel's [Text-to-Table (3B)](https://huggingface.co/gretelai/text2table)
