@@ -6,9 +6,9 @@ Welcome to our MIT Emerging Talent Collaborative Data Science Project repository
66
77This project explores how data science, collaboration, and domain expertise
88intersect to solve real-world problems. We are currently
9- in **Milestone 1: Problem Identification**, focused on making an initial domain study
10- and framing an actionable research question in our project domain,
11- and within our groups’ constraints.
9+ in **Milestone 3: Data Exploration and Analysis**, focused on exploring our
10+ cleaned dataset, performing predictive modeling, and evaluating how student
11+ engagement patterns relate to academic performance.
1212
1313---
1414
@@ -212,6 +212,141 @@ Below are the data dictionaries for the files in this dataset, outlining column
212212 Create a comprehensive data dictionary for the final, cleaned, and integrated
213213 dataset.
214214
215+ ## Milestone 3: Data Exploration and Analysis
216+
217+ We explored and modeled the cleaned Student Engagement Dataset (SED) to address
218+ our research question about how student interaction patterns predict academic
219+ performance.
220+
221+ ### Non-Technical Explanation of Our Findings
222+
223+ We analyzed data on online student activity to understand whether patterns of
224+ engagement can help predict academic performance.
225+
226+ #### Key Findings
227+
228+ - Students who log in more frequently, participate in forums, and complete
229+ assignments tend to achieve higher marks.
230+
231+ - This suggests that consistent, active engagement is linked to better
232+ academic outcomes in online learning environments.
233+
234+ #### Visual Evidence
235+
236+ #### Scatter Plots of Engagement Features vs. Average Marks
237+
238+ ![Scatterplots of Engagement Features](https://github.com/user-attachments/assets/2019a625-9e07-46e8-b9d6-5ce5251be465)
239+
240+ > The scatter plots show positive trends between average marks and:
241+ >
242+ > - Days active
243+ > - Total events
244+ > - Forum posts
245+ > - Number of assignments
246+
247+ These relationships suggest that students with more consistent activity tend to
248+ score higher.
249+
250+ #### Correlation Heatmap
251+
252+ ![Correlation Heatmap](https://github.com/user-attachments/assets/389f0f9d-0315-4e53-b8df-77385a6db3b4)
253+
254+ > The heatmap shows strong positive correlations among key engagement features,
255+ > supporting their predictive value for academic performance.
256+
257+ #### Prediction Accuracy
258+
259+ - Our simple regression model explains about **69%** of the variation
260+ in student marks.
261+ - This means engagement patterns provide meaningful predictive power.
262+
263+ #### Sources of Error or Uncertainty
264+
265+ - Unmeasured factors like motivation or prior knowledge.
266+ - Limits of our linear model (it may not capture all complexity).
267+ - Data quality and accuracy of online logs.
268+
269+ #### What Does This Mean?
270+
271+ - Online learning platforms could use engagement data to identify students at
272+ risk of underperforming.
273+ - Instructors can intervene early, offering support or feedback to improve outcomes.
274+
275+ ### Technical Description of Our Analysis and Results
276+
277+ We aimed to answer: **Can online engagement patterns predict academic performance?**
278+
279+ **Analysis steps:**
280+
281+ #### Exploratory Data Analysis (EDA)
282+
283+ - Inspected missing values, distributions, and outliers.
284+ - Visualized relationships using:
285+   - **Scatter plots** between average marks and key engagement metrics
286+     (e.g., days active, forum posts).
287+   - **Correlation heatmaps** to detect linear relationships among features.
288+
289+ - EDA helped us understand the data structure, identify potential predictors,
290+ and check for multicollinearity.
291+
292+ #### Feature Selection & Engineering
293+
294+ - Chose variables capturing:
295+   - Logins by time of day/week
296+   - Forum activity
297+   - Assignments submitted
298+   - Overall events
299+ - Dropped redundant or index columns.
300+
301+ - We focused on measurable, meaningful engagement metrics relevant to instructors
302+ and learning management systems (LMS).
303+
304+ #### Modeling Approach
305+
306+ - Applied **Linear Regression**:
307+   - Split data (80% train / 20% test).
308+   - Fitted model on training set.
309+   - Evaluated using **Mean Squared Error (106.68)** and **R-squared (0.69)**.
310+
311+ **Why Linear Regression?**
312+
313+ - Interpretable coefficients.
314+ - Good first step to assess linear relationships.
315+ - Easy to communicate findings.
316+
317+ #### Evaluation
318+
319+ - **Residual Plots**:
320+   - Showed some spread, suggesting imperfect fit and possible heteroscedasticity.
321+ - **Coefficients Analysis**:
322+   - Identified important features (e.g., `total_events`, `no_of_assignments`, forum posts).
323+
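One informal way to quantify the residual spread noted above is to compare residual variance between low and high fitted values. The sketch below is a crude check, not a formal heteroscedasticity test and not part of our notebooks; the function name and the sample values are hypothetical.

```python
from statistics import pvariance


def residual_spread_by_half(y_true, y_pred):
    """Order observations by fitted value, split them in half, and
    return the residual variance of the (lower, upper) halves."""
    ordered = sorted(zip(y_pred, y_true))
    residuals = [t - p for p, t in ordered]
    mid = len(residuals) // 2
    return pvariance(residuals[:mid]), pvariance(residuals[mid:])


# Illustrative values only: errors grow as the fitted value grows.
y_pred = [10, 20, 30, 40, 50, 60]
y_true = [11, 19, 30.5, 46, 43, 68]

low_var, high_var = residual_spread_by_half(y_true, y_pred)
print(f"residual variance, lower half: {low_var:.2f}")
print(f"residual variance, upper half: {high_var:.2f}")
```

A large gap between the two variances is consistent with heteroscedasticity, although a residual plot or a formal test gives a fuller picture.
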
324+ **Possible flaws in our analysis:**
325+
326+ - **Linearity assumption** may not hold for all relationships.
327+ - **No interaction terms** or non-linear effects captured.
328+ - **Potential overfitting** despite test evaluation.
329+ - **Feature engineering scope** limited to simple counts.
330+
331+ **Conclusion:**
332+ Our linear regression model provides a solid baseline, explaining 69% of
333+ variance in average marks using straightforward engagement features. Future
334+ iterations could improve predictive power by modeling the interaction terms
335+ and non-linear effects that this baseline omits.
336+
337+ ### Reproducibility
338+
339+ All data, notebooks, and scripts necessary to replicate this analysis are
340+ included in:
341+
342+ - /1_datasets/
343+ - /3_data_exploration/
344+ - /4_data_analysis/
345+
346+ Please see these folders to run the full analysis pipeline using our cleaned dataset.
347+
348+ ---
349+
215350## 📁 Repository Structure
216351
217352``` bash
@@ -237,7 +372,7 @@ Below are the data dictionaries for the files in this dataset, outlining column
237372- 🔹 [Communication Plan](collaboration/communication.md)
238373- 🔹 [Constraints](collaboration/constraints.md)
239374- 🔹 [Learning Goals](collaboration/learning_goals.md)
240- - 🔹 [Retrospective (Milestone 1)](collaboration/retrospectives)
375+ - 🔹 [Retrospective](collaboration/retrospectives)
241376
242377---
243378
@@ -248,7 +383,7 @@ Below are the data dictionaries for the files in this dataset, outlining column
248383| 0 | Cross-Cultural Collaboration | 🟢 Done | June 2 |
249384| 1 | Problem Identification | 🟢 Done | June 16 |
250385| 2 | Data Collection | 🟢 Done | June 30 |
251- | 3 | Data Analysis | ⏳ Upcoming | July 21 |
386+ | 3 | Data Analysis | 🟢 Done | July 21 |
252387| 4 | Communicating Results | ⏳ Upcoming | August 11 |
253388| 5 | Final Presentation | ⏳ Upcoming | August 25 |
254389