This repository provides a template and guidelines for the Emerging Talent 6 Collaborative Data Science Project (CDSP). It is designed to support a reproducible workflow so that anyone can clone, run, and verify your analysis from start to finish.
The CDSP is a six‐milestone initiative where teams work together to identify a research question, collect and analyze data, and communicate findings. This repository’s structure follows those milestones and encourages best practices in collaboration, version control, and documentation.
Use this template to:
- Organize your files and code in a clear, consistent folder structure
- Maintain reproducibility so others can run your pipeline without issues
- Document team norms, communication plans, and milestone retrospectives
- Track tasks, issues, and pull requests through GitHub’s project boards
/
├── README.md # Project overview and main instructions
├── guide.md # Detailed guide on using this template
├── /collaboration/ # Team norms, strategies, and retrospectives
├── /notes/ # Shared resources and learning materials
├── /0_domain_study/ # Domain research and background
├── /1_datasets/ # Raw and processed datasets
├── /2_data_preparation/ # Scripts for cleaning and processing data
├── /3_data_exploration/ # Scripts for initial data understanding
├── /4_data_analysis/ # Scripts for in-depth analysis
├── /5_communication_strategy/ # Materials for communicating findings
└── /6_final_presentation/ # Final presentation materials
Below are some suggestions on how to use the folders/files in this repository, but they're just suggestions! Your group should find a system that works for you.
- Use `/collaboration` to document team norms (in `README.md`), collaboration strategies, project goals, and your milestone retrospectives. This folder is a living document: you should update it through the whole project.
- Use `/notes` to share useful resources: tools, tutorials, examples, or anything else that helps your team learn and work better.
- In the main `README.md`, write a short intro to your team and what you're hoping to study.
- Create a retrospective for this milestone in `/collaboration` using the template.
- Use `/0_domain_study` to build a reference folder for your research domain. A new teammate should be able to catch up using this folder.
- Use `/0_domain_study/README.md` to organize your notes in `0_domain_study` so people don't have to read every single file to find what they need.
- In the main `README.md`, write a summary of your research question, relevant background, and why you think this is a meaningful problem to work on.
- Create a retrospective for this milestone in `/collaboration` using the template.
- Store all datasets (raw or processed) in `/1_datasets`. This folder is for data only, not code. (Unless you happen to have a dataset that's inside a `.py` file…)
- In `/1_datasets/README.md`, document each dataset: where it's from, how it was collected, how it connects to your research question, and any limitations or caveats.
- Use `/2_data_preparation` to keep all your cleaning, transformation, and prep scripts. These scripts should read data from `/1_datasets` and write new datasets back to that same folder.
- In `/2_data_preparation/README.md`, explain what each script does: which datasets it reads, what it does to them, and what outputs it creates.
- Use `/3_data_exploration` to explore, visualize, and get a feel for your datasets. This isn't the place for answering research questions; it's just to understand your data.
- In `/3_data_exploration/README.md`, summarize what each script/notebook explores and which datasets it uses.
- In the main `README.md`, describe how you're modeling your research question with data, what datasets you're using, and how you prepared them.
- Create a retrospective for this milestone in `/collaboration` using the template.
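
As a rough illustration of the `/2_data_preparation` convention above (read from `/1_datasets`, write a new dataset back to the same folder), a prep script might look something like the sketch below. Everything specific here is an assumption made for the example: the function names `clean_rows` and `prepare`, the CSV format, and the cleaning rule (trim whitespace, drop rows with an empty field). Your own scripts will depend on your data.

```python
import csv
from pathlib import Path


def clean_rows(rows):
    """Trim whitespace in every field and drop rows that have an empty field.

    This cleaning rule is only an example; replace it with your own.
    """
    cleaned = []
    for row in rows:
        stripped = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore surplus columns that have no header
        }
        if all(stripped.values()):
            cleaned.append(stripped)
    return cleaned


def prepare(raw_path, out_path):
    """Read a raw CSV, clean it, and write the result to a NEW file.

    The raw file is never overwritten, so the step can be re-run safely.
    Returns the number of rows written.
    """
    raw_path, out_path = Path(raw_path), Path(out_path)
    with raw_path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = [name.strip() for name in (reader.fieldnames or [])]
        rows = clean_rows(reader)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Hypothetical usage, run from the repository root (file names are made up):
# prepare("1_datasets/survey_raw.csv", "1_datasets/survey_clean.csv")
```

Writing the cleaned data to a separate file, rather than editing the raw file in place, keeps the original dataset intact so that anyone can re-run the whole preparation step from scratch.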
- Use `/4_data_analysis` for scripts and notebooks that actually analyze your data to answer your research question. Don’t try to cram everything into one file: you can have many scripts/notebooks in here as long as they are clearly named. Your research findings and conclusions are expected to be the result of many smaller analyses, so trying to fit everything into a single notebook will be unhelpful. You can always cite different scripts/notebooks to support different parts of your conclusions.
- In `/4_data_analysis/README.md`, outline your analysis strategy and summarize what each script or notebook does.
- In the main `README.md`, include:
  - A short summary of your analysis approach
  - A clear statement of your research conclusions (right at the top)
  - How confident you are in your results
  - What limitations your work has
  - Any ideas for future research
- Create a retrospective for this milestone in `/collaboration` using the template.
- Use `/5_communication_strategy` for planning and drafting your communication artefact. That includes audience research, message development, and assets like images or scripts.
- You don’t need to store the final artefact here if that doesn’t make sense. For example, in cohort 6 a group created an Instagram account and meme campaign as their communication strategy! You can't push that to a folder on GitHub.
- In `/5_communication_strategy/README.md`, summarize your strategy: who you’re reaching, what you’re saying, and why.
- In the main `README.md`, include a summary of your communication strategy and a link (if possible) to your final artefact.
- Create a retrospective for this milestone in `/collaboration` using the template.
- Use `/6_final_presentation` to store slides, scripts, or notes from preparing your final presentation.
- In `/6_final_presentation/README.md`, list what’s in the folder and link to your actual presentation.
- Create a retrospective for this milestone in `/collaboration` using the template.
- Keep README files updated as you go. They’re for humans. Future-you is a human.
- Reproducibility is key. Someone else should be able to run your pipeline without tweaking your code or guessing what goes where.
- Use clear, consistent file names: you don’t want to waste time figuring out what `final_final_revised3.ipynb` was supposed to do.
- Document your work as you’re doing it. Waiting until the end = pain.
- Cross-reference when needed. ("This analysis uses the cleaned data from `/2_data_preparation/clean_survey_data.py`.")
- Commit early and often. Write commit messages that your teammates (and your future self) will understand.
- Do regular repo reviews as a team. Is everything findable? Understandable?
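
Part of a repo review can be automated. The sketch below is one possible helper, not something shipped with the template: it lists the folders that are missing or have no `README.md`. The function name `find_missing_readmes` is hypothetical, and the folder list is limited to the folders for which a `README.md` is mentioned above.

```python
from pathlib import Path

# Folders for which this guide mentions a README.md.
FOLDERS_WITH_README = [
    "collaboration",
    "0_domain_study",
    "1_datasets",
    "2_data_preparation",
    "3_data_exploration",
    "4_data_analysis",
    "5_communication_strategy",
    "6_final_presentation",
]


def find_missing_readmes(repo_root):
    """Return the folders that are absent or that contain no README.md."""
    root = Path(repo_root)
    return [
        name
        for name in FOLDERS_WITH_README
        if not (root / name / "README.md").is_file()
    ]


# Hypothetical usage from the repository root:
# for name in find_missing_readmes("."):
#     print(f"Missing README.md in /{name}")
```

A check like this only confirms that the files exist. Whether each README is actually understandable is still a question for the team review.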
Happy studies!