This project explores the socio-economic and health-related factors influencing household tobacco consumption in the context of a national budget survey. Through a series of descriptive statistics, regression analyses, and diagnostic tests, it identifies key relationships and evaluates model performance while accounting for potential specification errors.
- `data1.dta`: Primary dataset containing household budget data
- `health.dta`: Health expenditure records to proxy tobacco-related harm
- Analyze tobacco consumption across households using economic and demographic indicators
- Determine the statistical significance and magnitude of those indicators using regression modeling
- Address modeling limitations such as normality, heteroscedasticity, and omitted variable bias
- Include health-related proxies to enrich the model
- Loaded `.dta` datasets using `haven`
- Identified non-binary variables for statistical analysis
- Merged health data with primary household dataset using `hhid`
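
The loading and merging steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the toy values, the `health_exp` column, the left-join choice (`all.x = TRUE`), and the "more than two distinct values" rule for non-binary variables are all assumptions made so the example runs without the real files.

```r
library(haven)

# Toy stand-ins for data1.dta and health.dta (all values are invented).
household <- data.frame(
  hhid   = c(1, 2, 3, 4),
  income = c(1200, 800, 1500, 950),
  female = c(0, 1, 0, 1)
)
health <- data.frame(
  hhid       = c(1, 2, 3, 5),
  health_exp = c(50, 20, 75, 10)   # hypothetical column name
)

# Round-trip through Stata format so the read_dta() calls mirror the workflow.
dir <- tempdir()
write_dta(household, file.path(dir, "data1.dta"))
write_dta(health,    file.path(dir, "health.dta"))

data1      <- read_dta(file.path(dir, "data1.dta"))
health_dta <- read_dta(file.path(dir, "health.dta"))

# Left join on the household identifier: every household in data1 is kept,
# and households without a health record get NA for health_exp.
merged <- merge(data1, health_dta, by = "hhid", all.x = TRUE)

# Assumed definition: a variable is non-binary if it has more than two
# distinct non-missing values.
is_non_binary <- sapply(merged, function(x) length(unique(na.omit(x))) > 2)
non_binary    <- names(merged)[is_non_binary]

print(merged)
print(non_binary)
```

A left join keeps the household sample intact at the cost of missing health values; an inner join (`all = FALSE`) would instead drop unmatched households.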
- Summarized income, age, education, unit price, and tobacco consumption
- Notable correlations:
  - Age ↔️ Education: −0.3486
  - Education ↔️ Income: 0.4200
  - Education ↔️ Unit Price: 0.5353
  - Unit Price ↔️ Income: 0.5028
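
A pairwise correlation matrix like the one summarized above could be produced along these lines. The data here are synthetic and will not reproduce the reported coefficients; the column names `leduc` (education) and `unitvalue` (unit price) are assumed to correspond to the variables in the model formula.

```r
set.seed(42)
n <- 300

# Synthetic stand-ins for the survey variables (invented relationships).
age       <- round(runif(n, 20, 80))
leduc     <- pmax(0, 12 - 0.08 * (age - 20) + rnorm(n, sd = 3))
income    <- exp(6 + 0.08 * leduc + rnorm(n, sd = 0.5))
unitvalue <- 2 + 0.15 * leduc + 0.0005 * income + rnorm(n, sd = 0.5)

df <- data.frame(age, leduc, income, unitvalue)

# Descriptive summary of each variable.
print(summary(df))

# Pearson correlations, using all complete pairs of observations.
corr <- round(cor(df, use = "pairwise.complete.obs"), 4)
print(corr)
```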
- Household Income vs Tobacco Consumption: No strong linear pattern; higher variance for low-income households
- Unit Price vs Tobacco Consumption: Clear inverse relationship, indicating that consumption is sensitive to price
Used Shapiro-Wilk tests and histograms for:
- `income`: Strongly non-normal (p ≈ 0.444)
- `unitvalue`: Also non-normal distribution
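
As a rough sketch of this normality check, the snippet below runs a Shapiro-Wilk test and draws a histogram on a synthetic, right-skewed `income` variable. The log-normal data are an assumption for illustration only and are not the survey values.

```r
set.seed(123)

# Synthetic right-skewed income (log-normal), purely for illustration.
income <- rlnorm(500, meanlog = 7, sdlog = 0.8)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal,
# so a small p-value is evidence against normality.
sw <- shapiro.test(income)
print(sw)

# A histogram gives a visual check of the skew.
hist(income, breaks = 30, main = "Income (synthetic)", xlab = "income")
```

Note that `shapiro.test()` in base R accepts between 3 and 5000 observations, so a larger survey sample would need to be subsampled or checked with another method.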
`weight ~ income + unitvalue + age + female + leduc + own + child_less_14`
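
One way to fit this specification is with base R's `lm()`, as sketched below. The data frame is simulated so the example runs on its own; the distributions and the coefficients used to generate `weight` are invented and say nothing about the actual survey results.

```r
set.seed(2024)
n <- 500

# Simulated covariates carrying the same names as the model formula.
df <- data.frame(
  income        = rlnorm(n, meanlog = 7, sdlog = 0.6),
  unitvalue     = runif(n, 1, 5),
  age           = round(runif(n, 20, 80)),
  female        = rbinom(n, 1, 0.3),
  leduc         = round(runif(n, 0, 16)),
  own           = rbinom(n, 1, 0.6),
  child_less_14 = rpois(n, 1)
)

# Invented data-generating process: the outcome falls with unit price.
df$weight <- 10 - 1.5 * df$unitvalue + 0.001 * df$income + rnorm(n, sd = 2)

# Ordinary least squares with the specification from the text.
fit <- lm(
  weight ~ income + unitvalue + age + female + leduc + own + child_less_14,
  data = df
)
print(summary(fit))
```

`summary(fit)` reports each coefficient with its standard error and p-value, which is where the significance and magnitude of the indicators would be read off.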