A portfolio-grade survey audit that evaluates:
- Coverage (collection window and timing signals)
- Representativeness (sample composition + imbalance indicators)
- Measurement risks (scale design pitfalls)
- Data validity checks (ranges, allowed values, duplicates)
- Open-text normalization (turn messy “stress causes” into a clean taxonomy)
This project is intentionally designed to answer a higher-level question than typical EDA:
“How much can we trust insights from this survey, and what would make them more trustworthy?”
This project uses the Kaggle dataset:
- Student Mental Health and Academic Pressure by alhamdulliah123
- Dataset URL: https://www.kaggle.com/datasets/alhamdulliah123/student-mental-health-and-academic-pressure
Thanks to the dataset author and Kaggle for sharing the data for learning and analysis.
- Why this project matters
- What you get
- Quickstart
- How the audit works
- Figures & interpretation
- Decision safety rules
- Outputs
- Customize / extend
- Notes & limitations
Survey datasets often look “clean”:
- no missing values,
- simple distributions,
- easy charts.
But surveys can still be analytically dangerous because:
- the sample may be skewed (selection bias),
- the collection window may reflect a narrow moment (exams),
- the question design may inflate responses (measurement bias),
- open-text answers create noise that blocks usable insights.
This project makes those risks visible and actionable.
A markdown report generated at:
reports/audit_report.md
It includes:
- dataset overview
- validity checks
- representativeness checks
- bias risk matrix
- taxonomy summary
- recommendations for improving the next survey run
Saved under:
reports/figures/
These figures are designed to be:
- stakeholder-friendly
- easy to interpret
- honest about uncertainty & representativeness
A simple dashboard at:
app/app.py
It loads your generated outputs and visuals, and displays the report inside the app.
python -m venv .venv
# Windows:
# .venv\Scripts\activate
source .venv/bin/activate
pip install -r requirements.txt
python -m src.pipeline --input data/raw/student_survey.csv
streamlit run app/app.py
The pipeline performs these steps:
- Schema standardization
- Survey exports often contain messy column names (spaces/newlines).
- The pipeline maps them into canonical names like age_group, gender, education_level, academic_pressure, etc.
- Type parsing
- Parses timestamps and numeric pressure values safely (errors="coerce").
- Validity checks
- Academic pressure is expected in [1..5].
- Stress frequency is validated against expected categories.
- Timestamp parsing success rate is recorded.
- Coverage signals
- If timestamps exist: the pipeline estimates the collection window and response burst patterns.
- Representativeness / imbalance
- For demographics, computes max_share (how dominant the top category is), entropy (diversity), and the effective number of categories.
- Text normalization
- Open-text stress causes are cleaned and mapped into a taxonomy: Exams & grades, Financial, Workload & time, etc.
- Also exports a transparent mapping file for review.
- Reporting
- Writes reports/audit_report.md, key CSV outputs, and the figures folder.
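The first three steps above (schema standardization, type parsing, validity checks) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/: the raw column headers, the sample rows, and the allowed stress-frequency categories are all hypothetical.

```python
import re

import pandas as pd

# Hypothetical raw export with messy headers; real column names may differ.
raw = pd.DataFrame({
    " Timestamp ": ["2024-01-05 10:00:00", "not a date", "2024-01-06 11:30:00"],
    "Academic\nPressure": ["3", "5", "7"],
    "Stress Frequency": ["Often", "Sometimes", "Daily"],
})


def canonical(name: str) -> str:
    """Trim the header, collapse whitespace/newlines to '_', and lowercase it."""
    return re.sub(r"\s+", "_", name.strip()).lower()


df = raw.rename(columns=canonical)

# Type parsing: unparseable values become NaT/NaN instead of raising.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df["academic_pressure"] = pd.to_numeric(df["academic_pressure"], errors="coerce")

# Assumed category set; the real survey scale may use different labels.
ALLOWED_FREQ = {"Sometimes", "Often", "Always"}

checks = {
    # Share of rows whose timestamp parsed successfully.
    "timestamp_parse_rate": float(df["timestamp"].notna().mean()),
    # Values outside [1, 5]; values that failed numeric parsing also count here.
    "pressure_out_of_range": int((~df["academic_pressure"].between(1, 5)).sum()),
    "unexpected_stress_frequency": int((~df["stress_frequency"].isin(ALLOWED_FREQ)).sum()),
    "duplicate_rows": int(df.duplicated().sum()),
}
print(checks)
```

Recording the parse rate rather than silently dropping bad rows keeps the audit honest about how much of the data the later coverage checks actually rest on.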
Important note: These charts describe the survey respondents, not the full student population. Use language like “in this sample” and avoid population claims unless sampling is representative.
Counts of respondents by age group.
If one age group dominates the sample, you may accidentally conclude that:
- “students” behave a certain way,
when you really mean:
- “students in the dominant age group in this sample.”
- Treat this as a coverage map of who responded.
- When comparing groups, avoid strong claims for groups with very low counts.
If you want population-level insights, the next survey run should:
- recruit across age groups intentionally (quotas or stratified sampling)
- publish subgroup counts with every chart
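To put a number on how strongly one group dominates, the imbalance indicators named in the pipeline description (max_share, entropy, effective number of categories) could be computed along these lines. The exact definitions used by the pipeline are not shown in this README, so this sketch assumes Shannon entropy in nats and takes the effective number of categories as exp(entropy). The age-group labels are made up.

```python
import math
from collections import Counter


def imbalance_metrics(values):
    """Summarize how concentrated a categorical column is."""
    counts = Counter(values)
    n = sum(counts.values())
    shares = [c / n for c in counts.values()]
    # Shannon entropy (natural log); 0 when a single category holds everything.
    entropy = -sum(p * math.log(p) for p in shares)
    return {
        "n": n,
        "max_share": max(shares),
        "entropy": entropy,
        # Assumed definition: exp(entropy), which equals the number of
        # categories when they are perfectly balanced.
        "effective_categories": math.exp(entropy),
    }


# Hypothetical sample: one age group holds 8 of 10 responses.
age_groups = ["18-22"] * 8 + ["23-27"] * 1 + ["28+"] * 1
metrics = imbalance_metrics(age_groups)
print(metrics)
```

An effective number of categories well below the nominal count is a signal that "three age groups" in the chart behaves more like one or two in practice.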
Counts of respondents by education level.
If “College” dominates, then:
- pressure/stress distributions reflect college students more than others.
People interpret this as:
- “College students are the majority in reality”
But it might only mean:
- “College students were more likely to see/respond to the survey.”
- Track recruitment channel(s).
- Report response rate per group if possible.
Counts of respondents by gender.
Even if the dataset is balanced, gender composition affects interpretation, especially for:
- stress reporting tendencies
- social desirability effects
- willingness to respond
- Use this chart to decide which subgroup comparisons are meaningful.
- Always show counts next to percentages.
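One simple way to keep counts next to percentages is to build both from the same tally, so a chart label can never show a share without its n. A minimal sketch, with invented gender values:

```python
from collections import Counter


def counts_with_shares(values):
    """Return (category, count, percent) tuples, largest group first."""
    counts = Counter(values)
    n = sum(counts.values())
    return [(cat, c, round(100 * c / n, 1)) for cat, c in counts.most_common()]


# Hypothetical responses; the real categories may differ.
gender = ["Female"] * 6 + ["Male"] * 3 + ["Prefer not to say"] * 1
for cat, c, pct in counts_with_shares(gender):
    print(f"{cat}: n={c} ({pct}%)")
```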
How many respondents selected each pressure level (1–5).
- Pressure is ordinal (1 < 2 < 3 < 4 < 5), not a continuous measurement.
- For small samples, medians and category shares are often more honest than complicated models.
- Avoid causal language like “X causes high pressure.”
- Avoid population prevalence claims unless the sample is representative.
- Describe where the sample concentrates (more 4–5 than 1–2).
- Use it to form hypotheses for larger data collection.
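Because pressure is ordinal, a summary built from the median and category shares stays within what the scale supports. The sketch below uses `statistics.median_low` so the reported median is always an actual scale value. The sample responses are invented.

```python
from collections import Counter
from statistics import median_low


def ordinal_summary(levels, scale=(1, 2, 3, 4, 5)):
    """Median and category shares for a 1-5 ordinal item."""
    counts = Counter(levels)
    n = len(levels)
    shares = {k: counts.get(k, 0) / n for k in scale}
    return {
        "n": n,
        # median_low returns an observed value, never an in-between number.
        "median": median_low(levels),
        "shares": shares,
        "share_high_4_5": shares[4] + shares[5],
        "share_low_1_2": shares[1] + shares[2],
    }


# Hypothetical pressure responses.
pressure = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
summary = ordinal_summary(pressure)
print(summary)
```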
How many responses were submitted each day (based on timestamps).
If responses cluster heavily in one day, the survey may capture:
- a specific moment in time (exam week, deadlines, holidays), not a stable trend.
A short, bursty collection window increases the risk of time-window bias and makes it harder to generalize.
- Collect over multiple weeks.
- Add an “exam period” or “major deadlines this week” question.
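The collection window and burstiness described above can be estimated from parsed timestamps with a few lines. This is an illustrative sketch, not the pipeline's implementation: the function name, the returned keys, and the sample timestamps are assumptions.

```python
from collections import Counter
from datetime import datetime


def coverage_signals(timestamps):
    """Estimate the collection window and how concentrated responses are."""
    per_day = Counter(ts.date() for ts in timestamps)
    first, last = min(per_day), max(per_day)
    peak_day, peak_count = per_day.most_common(1)[0]
    return {
        "first_day": first,
        "last_day": last,
        "window_days": (last - first).days + 1,  # inclusive calendar span
        "active_days": len(per_day),             # days with at least one response
        "peak_day": peak_day,
        "peak_day_share": peak_count / len(timestamps),
    }


# Hypothetical timestamps: most responses arrive on the first day.
stamps = [
    datetime(2024, 1, 5, 9, 0),
    datetime(2024, 1, 5, 9, 30),
    datetime(2024, 1, 5, 14, 0),
    datetime(2024, 1, 5, 20, 15),
    datetime(2024, 1, 6, 11, 0),
    datetime(2024, 1, 9, 8, 45),
]
signals = coverage_signals(stamps)
print(signals)
```

A high peak_day_share combined with a short window is the pattern to flag as a time-window risk.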
Self-reported sleep category counts.
Sleep is often used as an explanatory factor, but here it is:
- categorical and broad, which reduces precision.
- Use it to describe the sample.
- Avoid overinterpreting differences unless subgroup sizes are reasonable.
If you want stronger analysis:
- collect sleep as a numeric value (hours)
- optionally collect “sleep quality” too
Counts of stress frequency responses.
If the survey scale lacks “Never” / “Rarely”, responses can be forced upward. That can inflate perceived stress frequency.
- Explicitly mention scale design limitations.
- Use phrases like “on this response scale” rather than “in reality”.
Use a balanced scale:
- Never / Rarely / Sometimes / Often / Always
Open-text “main cause of stress” answers normalized into a small set of categories.
Raw open-text answers often look like:
- many unique strings, even when they describe the same idea (“exams”, “tests”, “quiz pressure”).
A taxonomy makes the signal readable and reduces noise.
The raw-to-category mapping is exported to:
outputs/taxonomy_mapping.csv
So you can:
- review decisions,
- adjust rules,
- and rerun the pipeline.
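A rule-based mapping of this kind might look like the sketch below. The category names come from this README, but the keyword lists, the "Other" fallback, and the function names are assumptions for illustration. The actual rules live in src/text_taxonomy.py and may differ.

```python
import re

# Hypothetical keyword rules; first matching category wins.
TAXONOMY_RULES = [
    ("Exams & grades", ("exam", "test", "quiz", "grade")),
    ("Financial", ("money", "fee", "financ", "tuition")),
    ("Workload & time", ("workload", "assignment", "deadline", "time")),
]


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def categorize(text: str) -> str:
    """Map one free-text answer to a taxonomy category via substring match."""
    cleaned = normalize(text)
    for category, keywords in TAXONOMY_RULES:
        if any(keyword in cleaned for keyword in keywords):
            return category
    return "Other"  # assumed fallback bucket for unmatched answers


def build_mapping(raw_answers):
    """Return sorted, de-duplicated (raw, category) pairs for manual review."""
    return sorted({(raw, categorize(raw)) for raw in raw_answers})


answers = ["exams", "Tests", "  Quiz   pressure ", "Tuition fees", "family issues"]
for raw, category in build_mapping(answers):
    print(f"{raw!r} -> {category}")
```

Keeping the raw string next to its assigned category is what makes the mapping reviewable: a wrong assignment is visible on a single row.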
Use a hybrid approach:
- multi-select structured options + optional free text
This preserves interpretability while still capturing new themes.
These are rules you can apply to keep analysis honest:
- Always include subgroup counts in subgroup charts.
- Avoid interpreting subgroup differences when n < 5.
- Use wording like “In this sample…” rather than “Students are…”.
- Treat the dataset as hypothesis-generating, not population-definitive.
- When in doubt, prefer counts + ranges and clear limitations over confident claims.
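The subgroup-size rule can be enforced mechanically by attaching an interpretability flag to every subgroup before it reaches a chart or table. A small sketch, where the n < 5 threshold comes from the rule above and the function name, field names, and data are hypothetical:

```python
from collections import Counter

MIN_N = 5  # threshold from the rule above: do not interpret groups with n < 5


def subgroup_table(groups, min_n=MIN_N):
    """Count each subgroup and flag whether it is large enough to interpret."""
    counts = Counter(groups)
    return [
        {"group": group, "n": n, "interpretable": n >= min_n}
        for group, n in sorted(counts.items())
    ]


# Hypothetical education levels.
education = ["College"] * 12 + ["High school"] * 4 + ["Postgraduate"] * 5
for row in subgroup_table(education):
    note = "" if row["interpretable"] else " (too few responses to compare)"
    print(f"{row['group']}: n={row['n']}{note}")
```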
After running:
python -m src.pipeline --input data/raw/student_survey.csv
You get:
- reports/audit_report.md: generated audit report
- reports/figures/: all charts
- outputs/audit_summary.json: summary metadata + checks
- outputs/bias_risk_matrix.csv: structured bias table
- outputs/grouped_stats.csv: safe grouped summaries (means/medians)
- outputs/taxonomy_mapping.csv: transparent text mapping
Edit rules in:
src/text_taxonomy.py
Then rerun:
python -m src.pipeline --input data/raw/student_survey.csv
If you want next-level reporting:
- bootstrap confidence intervals for shares
- Bayesian credible intervals for proportions
Both suit small N because they show how uncertain each reported share is.
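As one possible starting point, a percentile bootstrap interval for a single share can be written with the standard library alone. This is a sketch of the general technique, not something the pipeline currently does: the function name, defaults, and sample data are assumptions. With very small samples the percentile method can be too narrow, so treat the result as a rough guide.

```python
import random


def bootstrap_share_ci(values, target, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the share of `target` in `values`."""
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(values)
    hits = [1 if v == target else 0 for v in values]
    # Resample n responses with replacement and recompute the share each time.
    shares = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_boot))
    lower = shares[int((alpha / 2) * n_boot)]
    upper = shares[int((1 - alpha / 2) * n_boot) - 1]
    return sum(hits) / n, lower, upper


# Hypothetical stress-frequency responses.
responses = ["Often"] * 14 + ["Sometimes"] * 6
point, lower, upper = bootstrap_share_ci(responses, "Often")
print(f"Share 'Often': {point:.2f} (95% bootstrap interval {lower:.2f} to {upper:.2f})")
```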
If you see a warning about infer_datetime_format:
- remove that argument from pd.to_datetime(...) in src/clean_schema.py. New pandas versions don’t need it.
- Small dataset → subgroup comparisons can be unstable.
- Self-report → subject to social desirability and reporting bias.
- Recruitment channel unknown → selection bias likely.
- Missing scale options can inflate reported frequencies.
These limitations are not “bad news”. They are exactly what makes this project mature: stating them shows that you can produce analytics that are safe to trust.