Commit 46a39a2

updating the main README
1 parent 7ed0c61 commit 46a39a2

File tree

1 file changed

+140
-5
lines changed

README.md

Lines changed: 140 additions & 5 deletions
@@ -6,9 +6,9 @@ Welcome to our MIT Emerging Talent Collaborative Data Science Project repository
66

77
This project explores how data science, collaboration, and domain expertise
88
intersect to solve real-world problems. We are currently
9-
in **Milestone 1: Problem Identification**, focused on making an initial domain study
10-
and framing an actionable research question in our project domain,
11-
and within our groups’ constraints.
9+
in **Milestone 3: Data Exploration and Analysis**, focused on exploring our
10+
cleaned dataset, performing predictive modeling, and evaluating how student
11+
engagement patterns relate to academic performance.
1212

1313
---
1414

@@ -212,6 +212,141 @@ Below are the data dictionaries for the files in this dataset, outlining column
212212
Create a comprehensive data dictionary for the final, cleaned, and integrated
213213
dataset.
214214

215+
## Milestone 3: Data Exploration and Analysis
216+
217+
We explored and modeled the cleaned Student Engagement Dataset (SED) to address
218+
our research question about how student interaction patterns predict academic
219+
performance.
220+
221+
### Non-Technical Explanation of Our Findings
222+
223+
We analyzed data on online student activity to understand whether patterns of
224+
engagement can help predict academic performance.
225+
226+
#### Key Findings
227+
228+
- Students who log in more frequently, participate in forums, and complete
229+
assignments tend to achieve higher marks.
230+
231+
- This suggests that consistent, active engagement is linked to better
232+
academic outcomes in online learning environments.
233+
234+
#### Visual Evidence
235+
236+
#### Scatter Plots of Engagement Features vs. Average Marks
237+
238+
![Scatterplots of Engagement Features](https://github.com/user-attachments/assets/2019a625-9e07-46e8-b9d6-5ce5251be465)
239+
240+
> The scatter plots show positive trends between average marks and:
241+
>
242+
> - Days active
243+
> - Total events
244+
> - Forum posts
245+
> - Number of assignments
246+
247+
These relationships suggest that students with more consistent activity tend to
248+
score higher.
249+
250+
#### Correlation Heatmap
251+
252+
![Correlation Heatmap](https://github.com/user-attachments/assets/389f0f9d-0315-4e53-b8df-77385a6db3b4)
253+
254+
> The heatmap shows strong positive correlations among key engagement features,
255+
> supporting their predictive value for academic performance.
256+
257+
#### Prediction Accuracy
258+
259+
- Our linear regression model explains about **69%** of the variation
260+
in student marks.
261+
- This means engagement patterns provide meaningful predictive power.
262+
263+
#### Sources of Error or Uncertainty
264+
265+
- Unmeasured factors like motivation or prior knowledge.
266+
- Limits of our linear model (it may not capture all complexity).
267+
- Data quality and accuracy of online logs.
268+
269+
#### What Does This Mean?
270+
271+
- Online learning platforms could use engagement data to identify students at
272+
risk of underperforming.
273+
- Instructors can intervene early, offering support or feedback to improve outcomes.
274+
275+
### Technical Description of Our Analysis and Results
276+
277+
We aimed to answer: **Can online engagement patterns predict academic performance?**
278+
279+
**Analysis steps:**
280+
281+
#### Exploratory Data Analysis (EDA)
282+
283+
- Inspected missing values, distributions, and outliers.
284+
- Visualized relationships using:
285+
  - **Scatter plots** between average marks and key engagement metrics
286+
    (e.g., days active, forum posts).
287+
  - **Correlation heatmaps** to detect linear relationships among features.
288+
289+
- EDA helps understand data structure, find potential predictors, and check for
290+
multicollinearity.
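
The multicollinearity check mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, not our notebook code: the arrays are generated, the column names `days_active` and `forum_posts` are assumed spellings, and the 0.8 threshold is a hypothetical cut-off.

```python
import numpy as np

# Synthetic stand-in data: these values are NOT from the SED dataset.
rng = np.random.default_rng(0)
n_students = 200
days_active = rng.integers(1, 60, size=n_students).astype(float)
# total_events is built from days_active on purpose, so the two are collinear.
total_events = days_active * 12 + rng.normal(0, 20, size=n_students)
forum_posts = rng.poisson(3, size=n_students).astype(float)

feature_names = ["days_active", "total_events", "forum_posts"]
X = np.column_stack([days_active, total_events, forum_posts])

# Pearson correlation matrix; rowvar=False treats each column as a variable.
corr = np.corrcoef(X, rowvar=False)

# Flag feature pairs whose absolute correlation exceeds a hypothetical threshold.
THRESHOLD = 0.8
high_pairs = [
    (feature_names[i], feature_names[j], round(float(corr[i, j]), 2))
    for i in range(len(feature_names))
    for j in range(i + 1, len(feature_names))
    if abs(corr[i, j]) > THRESHOLD
]
print(high_pairs)
```

Pairs flagged this way carry overlapping information, which makes individual regression coefficients harder to interpret.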
291+
292+
#### Feature Selection & Engineering
293+
294+
- Chose variables capturing:
295+
  - Logins by time of day/week
296+
  - Forum activity
297+
  - Assignments submitted
298+
  - Overall events
299+
- Dropped redundant or index columns.
300+
301+
- We focused on measurable, meaningful engagement metrics relevant to instructors
302+
and LMS systems.
303+
304+
#### Modeling Approach
305+
306+
- Applied **Linear Regression**:
307+
  - Split data (80% train / 20% test).
308+
  - Fitted model on training set.
309+
  - Evaluated using **Mean Squared Error (106.68)** and **R-squared (0.69)**.
310+
311+
**Why Linear Regression?**
312+
313+
- Interpretable coefficients.
314+
- Good first step to assess linear relationships.
315+
- Easy to communicate findings.
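
The split, fit, and evaluate steps above can be sketched with plain NumPy. This is an illustrative sketch under stated assumptions, not the code from our notebooks: the data is synthetic, the three features are hypothetical, and it assumes the metrics are computed on the held-out 20%.

```python
import numpy as np

# Synthetic stand-in data: NOT the SED dataset, so the printed metrics
# will not match the reported MSE of 106.68 or R-squared of 0.69.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 3))  # three hypothetical engagement features
y = 60 + X @ np.array([5.0, 3.0, 1.5]) + rng.normal(0, 4, size=n)

# Shuffle, then hold out the last 20% of rows as the test set.
idx = rng.permutation(n)
split = int(0.8 * n)
train_idx, test_idx = idx[:split], idx[split:]


def add_intercept(features):
    """Prepend a column of ones so the model learns an intercept."""
    return np.column_stack([np.ones(len(features)), features])


# Ordinary least squares fit on the training rows only.
beta, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)

# Evaluate on the held-out rows.
y_test = y[test_idx]
y_pred = add_intercept(X[test_idx]) @ beta
mse = float(np.mean((y_test - y_pred) ** 2))
ss_res = float(np.sum((y_test - y_pred) ** 2))
ss_tot = float(np.sum((y_test - y_test.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.2f}  RMSE={mse ** 0.5:.2f}  R2={r2:.2f}")
```

Taking the square root of the MSE gives an error in the same units as the marks. For the reported MSE of 106.68, that is an RMSE of about 10.33.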
316+
317+
#### Evaluation
318+
319+
- **Residual Plots**:
320+
  - Showed some spread, suggesting imperfect fit and possible heteroscedasticity.
321+
- **Coefficients Analysis**:
322+
  - Identified important features (e.g., `total_events`, `no_of_assignments`, forum posts).
323+
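
A rough way to quantify the residual spread noted above is to compare residual variability for low and high fitted values. The sketch below uses synthetic data that is heteroscedastic by construction; it is a crude split-sample comparison, not a formal test and not the procedure from our notebooks.

```python
import numpy as np

# Synthetic stand-in data with noise that grows with x (heteroscedastic by design).
rng = np.random.default_rng(7)
n = 400
x = rng.uniform(1, 10, size=n)
y = 20 + 6 * x + rng.normal(0, 1, size=n) * x  # noise std proportional to x

# Fit a straight line by least squares and compute residuals.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
residuals = y - fitted

# Crude check: compare residual spread for low vs. high fitted values.
order = np.argsort(fitted)
low, high = residuals[order[: n // 2]], residuals[order[n // 2:]]
spread_ratio = float(high.std() / low.std())
print(f"residual std ratio (high/low fitted): {spread_ratio:.2f}")
```

A ratio well above 1 means the errors widen as predictions grow, which is the pattern a fanning residual plot shows.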
324+
**Possible flaws in our analysis:**
325+
326+
- **Linearity assumption** may not hold for all relationships.
327+
- **No interaction terms** or non-linear effects captured.
328+
- **Potential overfitting**: a single 80/20 split gives only one estimate of test performance.
329+
- **Feature engineering scope** limited to simple counts.
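
One common way to probe the overfitting concern is k-fold cross-validation, which reports R-squared on several different held-out folds. This is a suggestion rather than something the analysis above used. The helper `kfold_r2` is hypothetical and the data is synthetic.

```python
import numpy as np


def kfold_r2(X, y, k=5, seed=0):
    """Return the held-out R-squared of an OLS fit for each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(test)), X[test]])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        resid = y[test] - A_test @ beta
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - float(np.sum(resid ** 2) / ss_tot))
    return scores


# Synthetic stand-in data: NOT the SED dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 60 + X @ np.array([5.0, 3.0, 1.5]) + rng.normal(0, 4, size=300)

scores = kfold_r2(X, y)
print([round(s, 2) for s in scores], "mean:", round(float(np.mean(scores)), 2))
```

A large gap between fold scores, or between training and held-out scores, would suggest the single-split figure is optimistic.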
330+
331+
**Conclusion:**
332+
Our linear regression model provides a solid baseline, explaining 69% of
333+
variance in average marks using straightforward engagement features. Future
334+
iterations could improve predictive power by adopting more advanced
335+
techniques, such as interaction terms or non-linear models.
336+
337+
### Reproducibility
338+
339+
All data, notebooks, and scripts necessary to replicate this analysis are
340+
included in:
341+
342+
- /1_datasets/
343+
- /3_data_exploration/
344+
- /4_data_analysis/
345+
346+
Please see these folders to run the full analysis pipeline using our cleaned dataset.
347+
348+
---
349+
215350
## 📁 Repository Structure
216351

217352
```bash
@@ -237,7 +372,7 @@ Below are the data dictionaries for the files in this dataset, outlining column
237372
- 🔹 [Communication Plan](collaboration/communication.md)
238373
- 🔹 [Constraints](collaboration/constraints.md)
239374
- 🔹 [Learning Goals](collaboration/learning_goals.md)
240-
- 🔹 [Retrospective (Milestone 1)](collaboration/retrospectives)
375+
- 🔹 [Retrospective](collaboration/retrospectives)
241376

242377
---
243378

@@ -248,7 +383,7 @@ Below are the data dictionaries for the files in this dataset, outlining column
248383
| 0 | Cross-Cultural Collaboration | 🟢 Done | June 2 |
249384
| 1 | Problem Identification | 🟢 Done | June 16 |
250385
| 2 | Data Collection | 🟢 Done | June 30 |
251-
| 3 | Data Analysis | ⏳ Upcoming | July 21 |
386+
| 3 | Data Analysis | 🟢 Done | July 21 |
252387
| 4 | Communicating Results | ⏳ Upcoming | August 11 |
253388
| 5 | Final Presentation | ⏳ Upcoming | August 25 |
254389
