
AmirhosseinHonardoust/Student-Survey-Quality-Bias-Audit


Student Survey Quality & Bias Audit


A portfolio-grade survey audit that evaluates:

  • Coverage (collection window and timing signals)
  • Representativeness (sample composition + imbalance indicators)
  • Measurement risks (scale design pitfalls)
  • Data validity checks (ranges, allowed values, duplicates)
  • Open-text normalization (turn messy “stress causes” into a clean taxonomy)

This project is intentionally designed to answer a higher-level question than typical EDA:

“How much can we trust insights from this survey, and what would make them more trustworthy?”


Dataset (Source + Thanks)

This project uses the Kaggle dataset:

Thanks to the dataset author and Kaggle for sharing the data for learning and analysis.



Why this project matters

Survey datasets often look “clean”:

  • no missing values,
  • simple distributions,
  • easy charts.

But surveys can still be analytically dangerous because:

  • the sample may be skewed (selection bias),
  • the collection window may reflect a narrow moment (exams),
  • the question design may inflate responses (measurement bias),
  • open-text answers create noise that blocks usable insights.

This project makes those risks visible and actionable.


What you get

Audit Report

A markdown report generated at:

  • reports/audit_report.md

It includes:

  • dataset overview
  • validity checks
  • representativeness checks
  • bias risk matrix
  • taxonomy summary
  • recommendations for improving the next survey run

Figures

Saved under:

  • reports/figures/

These figures are designed to be:

  • stakeholder-friendly
  • easy to interpret
  • honest about uncertainty & representativeness

Streamlit Dashboard

A simple dashboard at:

  • app/app.py

It loads your generated outputs and visuals, and displays the report inside the app.


Quickstart

1) Install dependencies

python -m venv .venv
# Windows:
# .venv\Scripts\activate
source .venv/bin/activate

pip install -r requirements.txt

2) Run the pipeline

python -m src.pipeline --input data/raw/student_survey.csv

3) Run the dashboard

streamlit run app/app.py

How the audit works

The pipeline performs these steps:

  1. Schema standardization
    • Survey exports often contain messy column names (spaces/newlines).
    • The pipeline maps them into canonical names like age_group, gender, education_level, academic_pressure, etc.
  2. Type parsing
    • Parses timestamps and numeric pressure values safely (errors="coerce").
  3. Validity checks
    • Academic pressure is expected in [1..5].
    • Stress frequency is validated against expected categories.
    • Timestamp parsing success rate is recorded.
  4. Coverage signals
    • If timestamps exist, the pipeline estimates the collection window and response burst patterns.
  5. Representativeness / imbalance
    • For demographics, computes:
      • max_share (how dominant the top category is)
      • entropy (diversity)
      • effective number of categories
  6. Text normalization
    • Open-text stress causes are cleaned and mapped into a taxonomy (Exams & grades, Financial, Workload & time, etc.).
    • Also exports a transparent mapping file for review.
  7. Reporting
    • Writes:
      • reports/audit_report.md
      • key CSV outputs
      • figures folder
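
The first three steps (schema standardization, type parsing, validity checks) can be sketched roughly as follows. This is a minimal illustration, not the code in src/: the helper names `normalize_column` and `standardize_and_check`, the raw column headers, and the `ALLOWED_STRESS` categories are all assumptions made for the example.

```python
import re

import pandas as pd

# Assumed set of valid answers; the real survey scale may differ.
ALLOWED_STRESS = {"Sometimes", "Often", "Always"}


def normalize_column(name):
    """Lowercase, trim, and collapse runs of non-alphanumerics into '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def standardize_and_check(raw):
    """Return a cleaned copy of the frame plus a dict of validity checks."""
    df = raw.rename(columns=normalize_column)

    # Unparseable values become NaT / NaN instead of raising.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    df["academic_pressure"] = pd.to_numeric(df["academic_pressure"], errors="coerce")

    checks = {
        "timestamp_parse_rate": float(df["timestamp"].notna().mean()),
        # Counts values outside 1..5 and values that failed to parse.
        "pressure_invalid": int((~df["academic_pressure"].between(1, 5)).sum()),
        "stress_frequency_unexpected": int(
            (~df["stress_frequency"].isin(ALLOWED_STRESS)).sum()
        ),
        "duplicate_rows": int(df.duplicated().sum()),
    }
    return df, checks


# Toy rows for illustration only (not taken from the dataset).
raw = pd.DataFrame(
    {
        "Timestamp": ["2024-01-05 10:00:00", "not a date", "2024-01-06 09:30:00"],
        "Academic \nPressure": ["4", "7", "abc"],
        " Stress Frequency ": ["Often", "Sometimes", "Daily"],
    }
)
clean, checks = standardize_and_check(raw)
print(list(clean.columns))
print(checks)
```

Recording the checks as a plain dict keeps them easy to serialize into a summary file such as outputs/audit_summary.json.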

Figures & interpretation

Important note: These charts describe the survey respondents, not the full student population. Use language like “in this sample” and avoid population claims unless sampling is representative.


1) Sample composition: Age Group

composition_age

What this shows

Counts of respondents by age group.

Why it matters (bias risk)

If one age group dominates the sample, you may accidentally conclude that “students” behave a certain way when the data only supports a claim about “students in the dominant age group in this sample.”

How to interpret safely

  • Treat this as a coverage map of who responded.
  • When comparing groups, avoid strong claims for groups with very low counts.

Actionable recommendation

If you want population-level insights, the next survey run should:

  • recruit across age groups intentionally (quotas or stratified sampling)
  • publish subgroup counts with every chart
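
To put a number on how dominated a demographic column is, the three imbalance indicators named in the pipeline (max_share, entropy, effective number of categories) could be computed along these lines. This is a sketch under assumptions: the function name `imbalance_metrics` is hypothetical, entropy uses the natural log, and "effective number of categories" is taken to be exp(entropy), which may not match the definition used in the repository.

```python
import math
from collections import Counter


def imbalance_metrics(values):
    """Summarize how concentrated a categorical column is."""
    counts = Counter(values)
    n = sum(counts.values())
    shares = [c / n for c in counts.values()]
    # Shannon entropy in nats; 0 means a single category holds everything.
    entropy = -sum(p * math.log(p) for p in shares)
    return {
        "n": n,
        "max_share": max(shares),
        "entropy": entropy,
        # Assumed definition: exp(entropy), which equals the number of
        # categories when all shares are equal.
        "effective_categories": math.exp(entropy),
    }


# Toy age groups for illustration only (not taken from the dataset).
age_groups = ["18-22"] * 8 + ["23-27"] * 1 + ["28+"] * 1
skewed = imbalance_metrics(age_groups)
balanced = imbalance_metrics(["a", "b", "c"] * 4)
print(skewed)
print(balanced)
```

In the toy data three age groups are present, but the effective number of categories comes out below two, which signals that one group carries most of the sample.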

2) Sample composition: Education Level

composition_education

What this shows

Counts of respondents by education level.

Why it matters

If “College” dominates, then:

  • pressure/stress distributions reflect college students more than others.

Common analytics trap

People interpret this as:

  • “College students are the majority in reality”

But it might only mean:

  • “College students were more likely to see/respond to the survey.”

Actionable recommendation

  • Track recruitment channel(s).
  • Report response rate per group if possible.

3) Sample composition: Gender

composition_gender

What this shows

Counts of respondents by gender.

Why it matters

Even if the dataset is balanced, gender composition affects interpretation, especially for:

  • stress reporting tendencies
  • social desirability effects
  • willingness to respond

Safe interpretation

  • Use this chart to decide which subgroup comparisons are meaningful.
  • Always show counts next to percentages.
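
A small helper can make "counts next to percentages" the default for every composition chart label or table. The sketch below is illustrative; `composition_table` is a hypothetical name and the sample values are made up.

```python
from collections import Counter


def composition_table(values):
    """Return (category, count, percent) rows, largest category first."""
    counts = Counter(values)
    n = sum(counts.values())
    return [(cat, c, round(100 * c / n, 1)) for cat, c in counts.most_common()]


# Toy responses for illustration only (not taken from the dataset).
rows = composition_table(["Female"] * 3 + ["Male"] * 1)
for category, count, percent in rows:
    print(f"{category}: n={count} ({percent}%)")
```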

4) Academic pressure distribution (1–5)

pressure_distribution

What this shows

How many respondents selected each pressure level (1–5).

Key interpretation points

  • Pressure is ordinal (1 < 2 < 3 < 4 < 5), not a continuous measurement.
  • For small samples, medians and category shares are often more honest than complicated models.

What not to claim

  • Avoid causal language like “X causes high pressure.”
  • Avoid population prevalence claims unless the sample is representative.

What you can do

  • Describe where the sample concentrates (more 4–5 than 1–2).
  • Use it to form hypotheses for larger data collection.
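
Because pressure is ordinal, a summary built from the median and category shares is usually enough. The following is a minimal sketch; `ordinal_summary` is a hypothetical helper and the scores are toy values. It uses `statistics.median_low` so the reported median is always an actual scale point rather than an average such as 3.5.

```python
from collections import Counter
from statistics import median_low


def ordinal_summary(scores, levels=(1, 2, 3, 4, 5)):
    """Describe an ordinal 1-5 item with a median and category shares."""
    counts = Counter(scores)
    n = len(scores)
    shares = {level: counts.get(level, 0) / n for level in levels}
    return {
        "n": n,
        "median": median_low(scores),
        "shares": shares,
        "share_high_4_5": shares[4] + shares[5],
        "share_low_1_2": shares[1] + shares[2],
    }


# Toy scores for illustration only (not taken from the dataset).
pressure = [3, 4, 5, 4, 2, 5, 4, 3, 1, 4]
summary = ordinal_summary(pressure)
print(summary)
```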

5) Responses per day (collection window)

responses_per_day

What this shows

How many responses were submitted each day (based on timestamps).

Why it matters (coverage risk)

If responses cluster heavily in one day, the survey may capture a specific moment in time (exam week, deadlines, holidays), not a stable trend.

Interpretation rule

A short, bursty collection window increases the risk of time-window bias and makes it harder to generalize.

Recommendation for future survey runs

  • Collect over multiple weeks.
  • Add an “exam period” or “major deadlines this week” question.
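
The collection window and burstiness can be summarized with a few simple signals. This sketch is an assumption about what such signals might look like, not the pipeline's actual output: `coverage_signals` and its keys are hypothetical names, and the timestamps are toy values.

```python
from collections import Counter
from datetime import datetime


def coverage_signals(timestamps):
    """Summarize the collection window and how concentrated responses are."""
    days = [ts.date() for ts in timestamps]
    per_day = Counter(days)
    first, last = min(days), max(days)
    peak_day, peak_count = per_day.most_common(1)[0]
    return {
        "window_days": (last - first).days + 1,  # inclusive span
        "active_days": len(per_day),             # days with >= 1 response
        "peak_day": peak_day.isoformat(),
        "peak_day_share": peak_count / len(days),
    }


# Toy timestamps for illustration only (not taken from the dataset).
stamps = [
    datetime(2024, 1, 5, 10),
    datetime(2024, 1, 5, 11),
    datetime(2024, 1, 5, 15),
    datetime(2024, 1, 8, 9),
]
signals = coverage_signals(stamps)
print(signals)
```

A high peak_day_share combined with a short window is the pattern that points to time-window bias.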

6) Sleep hours distribution

sleep_distribution

What this shows

Self-reported sleep category counts.

Why it matters

Sleep is often used as an explanatory factor, but here it is categorical and broad, which reduces precision.

Safe interpretation

  • Use it to describe the sample.
  • Avoid overinterpreting differences unless subgroup sizes are reasonable.

Recommendation

If you want stronger analysis:

  • collect sleep as a numeric value (hours)
  • optionally collect “sleep quality” too

7) Stress frequency distribution

stress_frequency_distribution

What this shows

Counts of stress frequency responses.

Measurement bias note (important)

If the survey scale lacks “Never” / “Rarely”, responses can be forced upward. That can inflate perceived stress frequency.

What to do in reporting

  • Explicitly mention scale design limitations.
  • Use phrases like “on this response scale” rather than “in reality”.

Improvement recommendation

Use a balanced scale:

  • Never / Rarely / Sometimes / Often / Always

8) Stress cause taxonomy (cleaned)

taxonomy_summary

What this shows

Open-text “main cause of stress” answers normalized into a small set of categories.

Why this step is powerful

Raw open-text answers often contain many unique strings, even when they describe the same idea (“exams”, “tests”, “quiz pressure”).

A taxonomy makes the signal readable and reduces noise.
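
A rule-based mapping of this kind can be sketched with keyword patterns. The rules below are hypothetical and are not the contents of src/text_taxonomy.py; only the category names "Exams & grades", "Financial", and "Workload & time" come from this README, while "Other" and "Unspecified" are assumed fallbacks.

```python
import re

# Hypothetical keyword rules; the first matching rule wins.
TAXONOMY_RULES = [
    ("Exams & grades", r"\b(exams?|tests?|quiz(zes)?|grades?)\b"),
    ("Financial", r"\b(money|fees|tuition|financial|rent)\b"),
    ("Workload & time", r"\b(workload|assignments?|deadlines?|homework)\b"),
]


def normalize_text(raw):
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", raw.strip().lower())


def categorize(raw):
    """Map one free-text answer to a taxonomy category."""
    text = normalize_text(raw)
    if not text:
        return "Unspecified"
    for category, pattern in TAXONOMY_RULES:
        if re.search(pattern, text):
            return category
    return "Other"


# Toy answers for illustration only (not taken from the dataset).
answers = ["  Exams ", "quiz pressure", "Tuition fees", "too many deadlines", "family"]
mapping = [(answer, categorize(answer)) for answer in answers]
for raw_text, category in mapping:
    print(f"{raw_text!r} -> {category}")
```

Keeping the raw string next to its assigned category, as in `mapping`, is what makes the decisions reviewable.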

Transparency / reproducibility

The raw-to-category mapping is exported to:

  • outputs/taxonomy_mapping.csv

So you can:

  • review decisions,
  • adjust rules,
  • and rerun the pipeline.

Recommendation for future surveys

Use a hybrid approach:

  • multi-select structured options + optional free text

This preserves interpretability while still capturing new themes.

Decision safety rules

These are rules you can apply to keep analysis honest:

  • Always include subgroup counts in subgroup charts.

  • Avoid interpreting subgroup differences when n < 5.

  • Use wording like “In this sample…” rather than “Students are…”.
  • Treat the dataset as hypothesis-generating, not population definitive.

  • When in doubt, prefer counts + ranges and clear limitations over confident claims.
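
The first two rules can be enforced in code rather than left to reviewer discipline. The sketch below always reports the subgroup count and withholds the median when n < 5. `grouped_medians`, `MIN_N`, and the row layout are assumptions for illustration, not the repository's implementation of outputs/grouped_stats.csv.

```python
from collections import defaultdict
from statistics import median

MIN_N = 5  # threshold from the rule above


def grouped_medians(rows, group_key, value_key, min_n=MIN_N):
    """Median per group, with n always shown and small groups suppressed."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return [
        {
            "group": name,
            "n": len(values),
            # None marks "too few responses to interpret".
            "median": median(values) if len(values) >= min_n else None,
        }
        for name, values in sorted(groups.items())
    ]


# Toy rows for illustration only (not taken from the dataset).
rows = [
    {"education_level": "College", "academic_pressure": p} for p in [3, 4, 4, 5, 2]
] + [
    {"education_level": "School", "academic_pressure": p} for p in [5, 1]
]
stats = grouped_medians(rows, "education_level", "academic_pressure")
print(stats)
```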

Outputs

After running:

python -m src.pipeline --input data/raw/student_survey.csv

You get:

Files

  • reports/audit_report.md | generated audit report
  • reports/figures/ | all charts
  • outputs/audit_summary.json | summary metadata + checks
  • outputs/bias_risk_matrix.csv | structured bias table
  • outputs/grouped_stats.csv | safe grouped summaries (means/medians)
  • outputs/taxonomy_mapping.csv | transparent text mapping

Customize / extend

Improve the taxonomy

Edit rules in:

  • src/text_taxonomy.py

Then rerun:

python -m src.pipeline --input data/raw/student_survey.csv

Add uncertainty (advanced but impressive)

If you want next-level reporting:

  • bootstrap confidence intervals for shares
  • Bayesian credible intervals for proportions

Both are well suited to small N, where a point estimate alone overstates how much the data can support.
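
Both interval types can be sketched with the standard library alone. The example assumes a percentile bootstrap and a uniform Beta(1, 1) prior, with the credible interval approximated by sampling from the Beta posterior; the function names and toy data are hypothetical and none of this is part of the current pipeline.

```python
import random


def bootstrap_share_ci(flags, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the share of 1s in a 0/1 list."""
    rng = random.Random(seed)
    n = len(flags)
    shares = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return shares[lo_idx], shares[hi_idx]


def beta_credible_interval(successes, n, level=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval for a proportion, via posterior draws."""
    rng = random.Random(seed)
    # Assumed prior: uniform Beta(1, 1), giving a Beta(1 + k, 1 + n - k) posterior.
    a, b = 1 + successes, 1 + (n - successes)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return samples[lo_idx], samples[hi_idx]


# Toy data for illustration only: 12 of 20 respondents chose pressure 4-5.
flags = [1] * 12 + [0] * 8
boot_lo, boot_hi = bootstrap_share_ci(flags)
cred_lo, cred_hi = beta_credible_interval(sum(flags), len(flags))
print(f"bootstrap 95% interval: {boot_lo:.2f} to {boot_hi:.2f}")
print(f"credible 95% interval:  {cred_lo:.2f} to {cred_hi:.2f}")
```

With only 20 responses both intervals are wide, which is exactly the message a share like "60%" hides when reported alone.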

Fix Pandas warning (optional)

If you see a warning about infer_datetime_format:

  • remove that argument from pd.to_datetime(...) in src/clean_schema.py. It has been deprecated since pandas 2.0, where consistent format inference is the default behavior.
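
A call without the deprecated argument might look like the sketch below. The series contents are made up, and the exact call in src/clean_schema.py may carry other arguments.

```python
import pandas as pd

# Toy values for illustration only (not taken from the dataset).
raw = pd.Series(["2024-01-05 10:00:00", "2024-01-06 09:30:00", "not a date"])

# No infer_datetime_format argument; unparseable entries become NaT.
parsed = pd.to_datetime(raw, errors="coerce")
parse_rate = float(parsed.notna().mean())
print(parsed)
print(parse_rate)
```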

Notes & limitations

  • Small dataset → subgroup comparisons can be unstable.
  • Self-report → subject to social desirability and reporting bias.
  • Recruitment channel unknown → selection bias likely.
  • Missing scale options can inflate reported frequencies.

These limitations are not “bad news”. Stating them openly is what makes the audit credible: it shows which conclusions the data can support and which it cannot.

About

Portfolio-grade audit of a student mental health & academic pressure survey. Measures coverage and sample imbalance, runs validity checks, highlights measurement and selection bias risks, and converts messy open-text “stress causes” into a transparent taxonomy. Ships a Markdown report, figures, and a Streamlit dashboard.
