# TinyDialogues

This repository contains the code and data for the paper:

**[Is Child-Directed Speech Effective Training Data for Language Models?](https://aclanthology.org/2024.emnlp-main.1231/)**

**Authors:** [Steven Y. Feng](https://styfeng.github.io/), [Noah D. Goodman](https://cocolab.stanford.edu/ndg), and [Michael C. Frank](https://web.stanford.edu/~mcfrank/) (Stanford University).

> Please contact syfeng@stanford.edu if you have any questions or concerns.

---
## 📦 Data

- **TinyDialogues Dataset** is hosted on HuggingFace: [styfeng/TinyDialogues](https://huggingface.co/datasets/styfeng/TinyDialogues)
- Other datasets can be found under the `data/` folder (organized into `.zip` files).
- Each `.txt` file (for train/val) contains one example per line.
- Each line must end with the token `<|endoftext|>` (a quick way to check this is shown below).
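For example, to check that a split is formatted correctly (using `train.txt` as a stand-in for whichever file you are preparing):

```bash
# print the number of lines that do NOT end with <|endoftext|> (should be 0)
grep -cv '<|endoftext|>$' train.txt
```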
### Repeated Buckets Setup for Curriculum Experiments

#### CHILDES:

```bash
python scripts/repeated_buckets/childes_repeated_buckets_setup.py \
```

This splits the given [CHILDES data](https://github.com/styfeng/TinyDialogues/blob/main/data/CHILDES_data.zip) file into the specified number of buckets, and repeats each one the specified number of times before moving on to the next bucket.

#### TinyDialogues:

```bash
python scripts/repeated_buckets/TD_repeated_buckets_setup.py \
```

This uses the TD individual age data files (found [here](https://huggingface.co/datasets/styfeng/TinyDialogues/blob/main/individual_age_data.zip)) as buckets (one bucket per age) and repeats each one the specified number of times before moving on to the next bucket.
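The two scripts above handle this construction for you. As a rough sketch of the idea only (this is not the repository's code; `data.txt`, `curriculum.txt`, and the bucket/repeat counts are placeholders), the procedure is essentially:

```bash
# rough sketch of the repeated-buckets construction (GNU coreutils):
# split the training file into N line-based buckets in order,
# then emit bucket 1 R times, bucket 2 R times, and so on.
N=5; R=5
split -n l/"$N" -d data.txt bucket_
for b in bucket_*; do
  for _ in $(seq "$R"); do cat "$b"; done
done > curriculum.txt
```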
---
## 🧠 Model Configs & Tokenizers

- Pretrained tokenizers (for each dataset) can be found under the `tokenizers/` folder.
- Default GPT-2-small and RoBERTa-base model configs can also be found there.

---
## ⚙️ Environment Setup

### Step 1: Install Miniconda (if needed)

```bash
# standard Miniconda install for Linux x86_64; see the official docs for other platforms
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```

```bash
sudo apt-get install libnccl2=2.18.3-1+cuda12.1 libnccl-dev=2.18.3-1+cuda12.1
```

#### Optional: Other fixes

If you encounter this error when training GPT-2: `TypeError: TextConfig.__init__() got an unexpected keyword argument 'use_auth_token'`, comment out all lines that include `use_auth_token` in `run_clm_no_shuffling.py`.
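A quick optional check that the environment ended up with the expected CUDA and NCCL versions before launching training:

```bash
# prints the PyTorch version, the CUDA version it was built against, and the NCCL version
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"
```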
---
## 🧪 Model Training

### First Step: Tokenizer Training

```bash
python scripts/language_model_training/train_roberta_directly_seed.py \
30000 512 no 5e-05 50 32 42
```
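To launch several runs, assuming the final positional argument is the random seed (an assumption based on the script name and the value 42; the remaining arguments are copied from the command above), a simple loop works:

```bash
# hypothetical seed sweep; only the last argument changes between runs
for seed in 42 43 44; do
  python scripts/language_model_training/train_roberta_directly_seed.py \
  30000 512 no 5e-05 50 32 "$seed"
done
```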
---
## 🧪 Evaluation

### Zorro Evaluation

Modify this script to iterate over lists of models and run evaluation automatically.

> Note: To re-run evaluation on the same model, delete the corresponding `.pkl` files in `LexiContrastiveGrd/src/llm_devo/word_sim/llm_devo_word_sim_results/human_sim/miniBERTa` or rename the folder to avoid `RuntimeWarning: Mean of empty slice.` errors.
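For example (the path is the one from the note above; pick either option):

```bash
RESULTS=LexiContrastiveGrd/src/llm_devo/word_sim/llm_devo_word_sim_results/human_sim/miniBERTa
# option 1: clear the cached results
rm "$RESULTS"/*.pkl
# option 2: rename the folder instead (uncomment to use)
# mv "$RESULTS" "${RESULTS}_old"
```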
---

## 📑 Citation

If you use this codebase or dataset, please cite: