“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”
This retrospective is meant for looking back at how Milestone 2: Data Collection went and learning what to do differently next time.
Focus on what your group can do that will make the next sprint better. Keep your retrospectives positive and general. You should NEVER mention people by name!!!
What parts of your plan went as expected?
- The structured approach to data selection and justification worked very well, allowing us to systematically evaluate candidate datasets against our research question and project constraints.
- The division of tasks for initial data exploration and documentation was effective.
- The collaboration on identifying and acquiring the primary dataset (SED) was smooth.
What parts of your plan did not work out?
- Underestimating the computational resources required for processing very large raw data files (like SED_Student_log.csv) within the sandbox environment led to delays in automated data dictionary generation and initial cleaning execution.
- Initial assumptions about direct download link reliability for all external datasets proved incorrect (e.g., a Kaggle 403 error).
Did you need to add things that weren't in your strategy?
Stop Doing
- Assuming sandbox capabilities for very large datasets: For future milestones involving large data, we should proactively assess computational requirements and plan for local execution or alternative cloud resources from the outset.
- Relying solely on single-source download links: For critical external data, we should always have backup acquisition strategies or verify link persistence early.
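The backup-acquisition idea above can be sketched as a small fallback loop. This is a minimal illustration, not our actual pipeline: the mirror URLs, the expected checksum, and the injectable `fetch` hook are all hypothetical.

```python
import hashlib
import urllib.request

def fetch_with_fallback(urls, expected_sha256, fetch=None):
    """Try each URL in order; accept the first payload whose SHA-256
    digest matches the expected value."""
    # Default fetcher uses the stdlib; a real script might use any HTTP client.
    fetch = fetch or (lambda url: urllib.request.urlopen(url).read())
    errors = {}
    for url in urls:
        try:
            data = fetch(url)
        except Exception as exc:  # e.g. an HTTP 403 from one mirror
            errors[url] = repr(exc)
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        errors[url] = "checksum mismatch"
    raise RuntimeError(f"all sources failed: {errors}")
```

Verifying a checksum also catches silently truncated or corrupted downloads, not just dead links.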
Continue Doing
- Structured evaluation: The systematic approach to evaluating datasets (using criteria like relevance, quality, granularity, and practical considerations) was highly effective and should be continued for any future data acquisition.
- Detailed documentation: Creating dedicated reports (like data_processing_report.md) and comprehensive README.md files for each phase significantly improves clarity and reproducibility.
- Collaborative problem-solving: When encountering technical blockers (e.g., sandbox limitations), leveraging team members' local environments for processing was an effective workaround.
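Some of that documentation can be generated rather than hand-written. A minimal sketch of rendering a markdown data dictionary from tabular rows; the column names and descriptions are hypothetical, and `csv.DictReader` stands in for our actual tooling:

```python
import csv
import io

def data_dictionary_md(rows, descriptions=None):
    """Render a minimal markdown data dictionary for tabular data.

    `rows` is a list of dicts (e.g. from csv.DictReader); `descriptions`
    optionally maps column names to human-written notes."""
    descriptions = descriptions or {}
    columns = list(rows[0].keys()) if rows else []
    lines = ["| Column | Example | Description |", "|---|---|---|"]
    for col in columns:
        # Use the first non-empty value in the column as the example.
        example = next((r[col] for r in rows if r.get(col)), "")
        lines.append(f"| {col} | {example} | {descriptions.get(col, '')} |")
    return "\n".join(lines)

# Tiny in-memory CSV with illustrative columns:
sample = "student_id,logins\ns001,42\ns002,17\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print(data_dictionary_md(rows, {"student_id": "Anonymized student key"}))
```

Generating the table from the data itself keeps the dictionary from drifting out of sync with the files it describes.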
Start Doing
- Early resource assessment: Before starting data processing, explicitly assess the computational resources needed for the expected data volume and complexity.
- Version control for cleaned data: Implement a clear strategy for versioning and storing cleaned datasets, especially if they are generated through complex pipelines.
- Automated data quality checks: Integrate basic automated data quality checks into our cleaning scripts to catch common issues early.
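One way such checks could look inside a cleaning script; this is a sketch, and the column and key names are illustrative:

```python
import csv
import io

def quality_report(rows, key):
    """Basic automated checks: missing values per column and duplicate keys.

    `rows` is a list of dicts; `key` names the column expected to be unique."""
    missing = {}
    seen, duplicates = set(), set()
    for row in rows:
        for col, value in row.items():
            if value is None or value == "":
                missing[col] = missing.get(col, 0) + 1
        k = row.get(key)
        if k in seen:
            duplicates.add(k)
        seen.add(k)
    return {"missing": missing, "duplicate_keys": sorted(duplicates)}

# Tiny illustrative CSV with one blank score and one repeated key:
sample = "student_id,score\ns001,10\ns001,\ns002,8\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print(quality_report(rows, key="student_id"))
# {'missing': {'score': 1}, 'duplicate_keys': ['s001']}
```

Running a report like this at the top of each cleaning script makes common issues visible before they propagate downstream.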
Lessons Learned
- Data is messy, and that's okay: Even with well-defined problems, real-world data presents unexpected challenges (size, format, missingness), and adaptability is key.
- The importance of the data pipeline: Understanding the full flow from raw files to cleaned, analysis-ready data is essential for planning work and catching issues early.
- We successfully identified and selected a suitable dataset that aligned with our project focus on student engagement.
- Our scripted exploration tools worked well to inspect the structure and health of each dataset.
- Creating data dictionaries allowed us to document features systematically and ensure transparency.
- We underestimated the complexity of the raw data, especially merging inconsistently structured CSVs.
- Some file paths and documentation standards needed adjustment after initial confusion due to directory changes.
- We created a dedicated exploratory script (explore_sed_data.py) to inspect and summarize datasets before cleaning.
- We included a data processing report to communicate preprocessing decisions more clearly.
- We simplified the initial cleaning process by focusing on essential columns and omitting overly detailed transformations that could wait until analysis.
Stop Doing
- Assuming dataset readiness: We sometimes assumed data would be immediately usable. We should always allocate time for deep inspection.
- Manual patching: Avoid ad hoc fixes; instead, build reproducible scripts from the start.
- Late naming conventions: Inconsistent filenames and paths caused small setbacks; naming should be standardized earlier.
Continue Doing
- Exploratory scripting: Generating .info() and .head() views for each CSV helped us quickly grasp structure and potential issues.
- Data dictionary generation: Automatically documenting datasets using markdown tables was helpful for both transparency and collaboration.
- Collaborative debugging: Quick syncs when data issues arose allowed us to align and move forward efficiently.
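The exploratory loop can be sketched in a few lines. We used pandas' .info() and .head() in explore_sed_data.py; here the stdlib csv module stands in so the sketch is self-contained, and the directory layout is assumed:

```python
import csv
from pathlib import Path

def summarize_csvs(directory):
    """Print a quick structural summary of every CSV under `directory`:
    row count, column names, and a small head of each file."""
    summaries = []
    for path in sorted(Path(directory).glob("*.csv")):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        if not rows:
            continue  # skip empty files rather than crash on them
        header, body = rows[0], rows[1:]
        print(f"=== {path.name}: {len(body)} rows, columns: {header} ===")
        for row in body[:5]:  # a head()-style peek at the first few rows
            print(row)
        summaries.append((path.name, header, len(body)))
    return summaries
```

Running this once per milestone gives a cheap baseline view of every file before any cleaning decisions are made.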
Start Doing
- Pre-flight file structure check: Verify and agree on directory structures and script paths before coding begins.
- Data health checklist: Build a checklist for data readiness that includes null value inspection, consistent key formats, and expected joins.
- Leverage version control for raw files: Even with large files, it’s useful to track metadata or hashes to know when datasets change.
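Tracking hashes of large raw files can be as simple as committing a small manifest. A sketch under assumed file names, using chunked SHA-256 so memory use stays flat even for very large files:

```python
import hashlib
from pathlib import Path

def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a (possibly large) file,
    read in 1 MiB chunks instead of loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def manifest(directory):
    """Map each raw CSV to its fingerprint; committing this small manifest
    lets us detect dataset changes without versioning the files themselves."""
    return {p.name: file_fingerprint(p)
            for p in sorted(Path(directory).glob("*.csv"))}
```

Diffing two manifests immediately shows which raw files changed between milestones.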
Lessons Learned
- Documentation is part of the data: Clear data dictionaries and structured outputs are not just helpful; they are essential for downstream tasks.
- Preparation scripts are reusable assets: Our explore_sed_data.py is a strong tool that can support future iterations or similar projects.
- Data issues can cascade: Small data quality issues left unaddressed early can lead to confusion and delays later.
- A clean dataset enables creativity: By resolving inconsistencies and documenting features, we set ourselves up for efficient exploration and hypothesis generation in Milestone 3.
As we move into the next phase of data exploration, we carry forward both working tools and hard-won insight. This milestone clarified the value of thorough data review and repeatable preprocessing. We are now better prepared to generate insights with confidence.