Data Science using Stata

Material for the YouTube course "Data Science using Stata: Complete Beginners Course", which is available on my channel YUNIKARN.

The full course (including data, code, and slides) is available for USD 19.99 on Udemy.

INTRO:

Lecture 1: Introduction to statistical models and Stata

We explore statistical models and start playing with Stata. I show you how to load data into Stata from Excel or csv files.
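
As a small taste of what this covers, a minimal sketch of loading data (the file names below are placeholders, not the course files):

```stata
* Import an Excel sheet where the first row holds variable names (placeholder file)
import excel using "mydata.xlsx", sheet("Sheet1") firstrow clear

* Or import a comma-separated file (placeholder file)
import delimited using "mydata.csv", clear

* Inspect what was loaded
describe
```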

Lecture 2: Exploring data

I cover data types, sampling issues, outliers, and missing values.
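
A few commands that cover the same ground, assuming a generic dataset with a placeholder variable called income:

```stata
* Variable types, labels, and a compact overview
describe
codebook, compact

* Summary statistics; the detail option shows percentiles, useful for spotting outliers
summarize
summarize income, detail

* Overview of missing values
misstable summarize
```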

2a: Transforming variables

I discuss whether you should transform variables (e.g., log transformation), how transformation affects linear relationships, and whether variables should be normally distributed for regression analysis. I show you how to AVOID COMMON MISTAKES when transforming variables.
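
For instance, a minimal sketch of a log transformation and one of its pitfalls (income is a placeholder variable; ln() returns missing for zero or negative values):

```stata
* Log transformation: observations with income <= 0 become missing
generate ln_income = ln(income)

* Count how many observations were lost to the transformation
count if missing(ln_income) & !missing(income)

* Compare the distributions before and after
histogram income
histogram ln_income
```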

2b: Exporting tables

I introduce the estpost and esttab commands, which enable you to export tables from Stata to Word, Excel, or other applications. I show you how to modify formats and optimise the layout. This produces production-ready tables for your dissertation project, consulting report or academic paper. NO NEED TO ADJUST TABLES BY HAND - LET STATA TAKE CARE OF IT!
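
A minimal sketch of the workflow, assuming the estout package is installed and using placeholder variable and file names:

```stata
* estpost and esttab are part of the user-written estout package (install once)
ssc install estout

* Post summary statistics and export them as a rich-text table for Word
estpost summarize income age education
esttab using "descriptives.rtf", cells("mean sd min max") label replace
```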

Workshop 1: Descriptive Analysis: Worked Example

Now it is your turn! Download the data and try to answer the questions for Workshop 1 (see slides). This video will walk you through a Descriptive Data Analysis step-by-step. We generate new variables, display descriptive statistics, and explore large survey data.

Lecture 3: Regression analysis

This video explains Regression Analysis without using theory. We will conduct a regression analysis in Stata and interpret the output. In particular, we explore correlations, scatter plots, linear models, OLS, dummies, and predictions.

Chapters

  • 0:00 Welcome & Overview
  • 1:19 Correlations & Scatter Plots
  • 6:23 Distributions & Transformations
  • 7:04 Linear Model
  • 9:48 Ordinary Least Squares (OLS)
  • 16:59 Application using Stata
  • 38:12 Regression Output & Interpretation
  • 46:53 Dummy Variables
  • 52:52 Fitted Values
  • 55:55 Model Assumptions
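
A minimal sketch of the commands involved (wage, education, experience, and female are placeholder variables):

```stata
* Correlation and scatter plot
correlate wage education
scatter wage education

* OLS regression with a dummy variable via factor-variable notation
regress wage education experience i.female

* Fitted values from the estimated model
predict wage_hat, xb
```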

3a: What are Degrees of Freedom?

This video explains the concept of degrees of freedom. Using artificial data, we illustrate the minimum number of observations needed to determine a regression line in two dimensions (or higher dimensions). We show the impact on R-squared and demonstrate the adjusted R-squared. Using examples, we highlight the impact of additional observations and explanatory variables on degrees of freedom and R-squared.
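
For reference, after any regression Stata stores the residual degrees of freedom (observations minus estimated parameters) and the adjusted R-squared; y, x1, and x2 are placeholder variables:

```stata
regress y x1 x2
display "Residual degrees of freedom: " e(df_r)
display "Adjusted R-squared: " e(r2_a)
```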

Lecture 4: Post estimation analysis

4a: What is Multicollinearity?

This video explains multicollinearity and its consequences for regression analysis. We discuss how to detect multicollinearity and how to address the problem. Finally, we demonstrate multicollinearity using data on commodity prices.

Chapters

  • 0:00 Multicollinearity
  • 0:24 What is the problem?
  • 2:43 How to fix it?
  • 4:50 How to detect multicollinearity?
  • 7:30 Example in Stata
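
A minimal sketch of the standard check via variance inflation factors (the commodity-price variable names are placeholders):

```stata
* Variance inflation factors after OLS; values well above 10 suggest multicollinearity
regress price oil_price gas_price coal_price
estat vif
```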

4b: What is Heteroskedasticity?

This video explains heteroskedasticity and its consequences for regression analysis. We discuss how to detect heteroskedasticity and how to address the problem. Finally, we demonstrate heteroskedasticity using data on yields in farming.

Chapters

  • 0:00 Welcome
  • 0:44 Impact on p-values
  • 2:07 Detecting the problem
  • 4:56 How to fix it?
  • 5:42 Worked example in Stata
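
A minimal sketch of detecting and addressing the problem (yield, rainfall, and fertiliser are placeholder variables):

```stata
* Breusch-Pagan test for heteroskedasticity after OLS
regress yield rainfall fertiliser
estat hettest

* A common remedy: heteroskedasticity-robust standard errors
regress yield rainfall fertiliser, vce(robust)
```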

4c: How to fix an Omitted Variable Bias?

This video explains how to detect and fix an omitted variable bias. If you forget to include an important explanatory variable in your regression model, an omitted variable bias can occur. I explain how you can detect this problem using the Ramsey RESET test. This test also indicates non-linear relationships. We will explore how we can distinguish between non-linear effects and omitted variables using fitted values.

Chapters

  • 0:00 Omitted Variable Bias
  • 1:34 Worked Example in Stata
  • 3:55 Log Transformation
  • 5:08 Regression Model
  • 6:50 Ramsey RESET Test
  • 9:10 Higher Orders
  • 15:36 Collapse Command
  • 17:01 Visualisation

4d: Detecting Endogeneity

This video explains how to detect endogeneity. Endogeneity is a common problem in regression analysis. I explain how you can detect this problem using an auxiliary regression approach. We discuss strategies to address endogeneity.

Chapters

  • 0:00 Welcome
  • 0:15 What is Endogeneity?
  • 1:42 Detecting Endogeneity
  • 3:08 Worked Example in Stata
  • 11:33 How to fix Endogeneity?
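
One common auxiliary-regression check is a control-function style test, sketched below with placeholder names (x is the suspect regressor, z an available instrument); the details may differ from the approach shown in the video:

```stata
* First stage: regress the suspect regressor on the instrument and controls
regress x z x2
predict v_hat, residuals

* Include the first-stage residuals in the main regression;
* a significant coefficient on v_hat points to endogeneity of x
regress y x x2 v_hat
```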

Lecture 5: Analysing panel data

5a: Introduction to Panel Data

This video explains how to work with panel data. We discuss the benefits of using panel data, including Granger causality and the assessment of policy changes. We introduce fixed and random effects models, which we implement in Stata. The regression outputs are explained and compared.

Chapters

  • 0:00 Introduction to Panel Data
  • 0:26 Benefits of Panel Data
  • 1:23 Analysing Policy Changes
  • 1:58 Causality
  • 3:01 Time Lags
  • 3:19 Panel Data Models
  • 3:59 SOLS or POLS
  • 4:20 Fixed & Random Effects
  • 7:11 Worked Example in Stata
  • 8:40 Panel Regressions in Stata
  • 9:36 The tsset Command
  • 11:07 Interpretation of Output
  • 13:46 Model Comparison
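
A minimal sketch of the estimation steps (id, year, y, x1, and x2 are placeholder variables):

```stata
* Declare the panel structure
tsset id year

* Pooled OLS, fixed effects, and random effects
regress y x1 x2
xtreg y x1 x2, fe
xtreg y x1 x2, re
```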

5b: Fixed or Random Effects? Does the Hausman Test fail?

This video discusses whether you should use fixed or random effects for your panel data analysis. We explain how the Hausman test works and - most importantly - when the Hausman test fails! We cover biased estimators, the efficiency of estimators, and the implementation in Stata. Again, I focus on an intuitive understanding of the methods - no theory - just data fun!

Chapters

  • 0:00 Fixed or Random Effects
  • 0:26 Worked Example
  • 0:53 How does the Hausman Test work?
  • 1:12 Bias
  • 1:45 Efficiency
  • 3:21 Implementation in Stata
  • 4:42 Interpretation of Output
  • 6:23 Warning: Hausman Test fails!
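
A minimal sketch of running the test (placeholder variable names):

```stata
* Estimate and store both models, then compare them with the Hausman test
xtreg y x1 x2, fe
estimates store fe
xtreg y x1 x2, re
estimates store re
hausman fe re
```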

5c: Serial Correlation in Panel Data

This video explains the impact of serial correlation in panel data analysis. We discuss the underlying reasons for serial correlation. Then we introduce a test based on Wooldridge (2002). To fix serial correlation, we explore the Newey-West Estimator (robust estimation) and Dynamic Panel Data Estimation. Finally, we have some fun in Stata.

Chapters

  • 0:00 Serial Correlation in Panel Data
  • 0:40 Reasons for Serial Correlation
  • 1:19 Testing for Serial Correlation
  • 2:46 Newey-West Estimator
  • 3:47 Dynamic Panel Data Estimation
  • 4:12 Worked Example in Stata
  • 5:26 Interpretation of Output
  • 6:13 Solutions in Stata
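
A minimal sketch of the test plus one common fix via cluster-robust standard errors (placeholder variable names; the Newey-West and dynamic panel estimators discussed in the video are not shown here):

```stata
* Wooldridge test for serial correlation in panel data (user-written command)
ssc install xtserial
xtserial y x1 x2

* One practical fix: cluster-robust standard errors, which allow for arbitrary
* serial correlation within each panel
xtreg y x1 x2, fe vce(cluster id)
```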

5d: Interaction Effects in Panel Data

This video explains interaction effects in panel data. It is common that certain groups of observations (e.g., companies, countries) exhibit differences in behaviour. These differences can be modelled using interaction effects. We explore shifts in the intercept and slope coefficient. In addition, I demonstrate how these models can be implemented in Stata.

Chapters

  • 0:00 Interaction Effects
  • 1:13 Shift in Intercept
  • 2:21 Illustration of Shift
  • 2:40 Interaction Term
  • 4:08 Illustration of Interaction Effect
  • 4:31 Implementation in Stata
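
A generic sketch of intercept and slope shifts using factor-variable notation (y, x, and group are placeholder variables; plain OLS is used here for simplicity):

```stata
* i.group alone shifts the intercept; c.x#i.group lets the slope on x differ by group
regress y c.x##i.group
```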

5e: How does the Test for Serial Correlation work?

This video comes with a TRIGGER WARNING! It contains mathematics, which some viewers might find distressing. I explain how the serial correlation test developed by Wooldridge (2002) can be derived. We cover the null hypothesis and related assumptions, iid distributed error terms, covariance and variance formulas. We also highlight linear operators and their properties. There is a little surprise at the end of the video!
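
For reference, the relation at the heart of the test: if the idiosyncratic errors are iid with variance sigma squared, the first-differenced errors have a correlation of exactly -0.5, which is what the test checks. A sketch of the key step:

```latex
\operatorname{Cov}(\Delta\varepsilon_{it},\,\Delta\varepsilon_{i,t-1})
  = \operatorname{Cov}(\varepsilon_{it}-\varepsilon_{i,t-1},\,
                        \varepsilon_{i,t-1}-\varepsilon_{i,t-2})
  = -\operatorname{Var}(\varepsilon_{i,t-1}) = -\sigma^2,
\qquad
\operatorname{Corr}(\Delta\varepsilon_{it},\,\Delta\varepsilon_{i,t-1})
  = \frac{-\sigma^2}{2\sigma^2} = -\tfrac{1}{2}.
```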

Lecture 6: Binary choice models

6a: Logistic Regression: An Introduction using Stata

This video introduces logistic regressions. We discuss binary choice models, where the dependent variable is either a positive or negative outcome (e.g., a decision). The problem is illustrated graphically - how to map a linear model to an interval suitable for modelling a probability. Most decision processes remain unobserved; hence, we briefly discuss latent variables. Finally, I demonstrate how these models can be implemented in Stata. Predicted probabilities are plotted to visualise the model, and we explore classifications.

Chapters

  • 0:00 Binary Choice
  • 1:34 Illustration of Problem
  • 5:38 Latent Variable
  • 9:48 Implementation in Stata
  • 14:43 Plot Predicted Probabilities
  • 17:51 Classification
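
A minimal sketch of the estimation and classification steps (decision, income, age, and region are placeholder variables):

```stata
* Logit model for a binary outcome
logit decision income age i.region

* Predicted probabilities and a simple classification table
predict p_hat, pr
estat classification
```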

6b: How to Predict Mergers using Logistic Regressions?

This video explores a dataset of mergers (companies buying other companies). It is often interesting to predict whether a merger occurs, as share prices tend to move. First, we explore the data, select variables, and visualise the trend of mergers in the US. You will learn new Stata commands, such as collapse, for summarising data. Second, we run several logit models and derive predicted probabilities. Finally, we compare predictions based on firm-level data and macro data (merger wave). If you want to know more about mergers, have a look at our paper on "Endogenous mergers: bidder momentum and market reaction."

Chapters

  • 0:00 Predicting Mergers
  • 1:14 Exploring Data
  • 2:44 Sum Command
  • 4:12 Density Plot
  • 4:30 Tabstat Command
  • 5:46 Collapse Command
  • 9:01 Sorting and By Command
  • 11:07 Logit Models
  • 17:05 Compare Predictions
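
A minimal sketch of aggregating with collapse and fitting a logit (all variable names are placeholders):

```stata
* Aggregate a firm-level merger dummy to an annual count and plot the trend
preserve
collapse (sum) n_mergers = merger_dummy, by(year)
line n_mergers year
restore

* Firm-level logit and predicted probabilities
logit merger_dummy size leverage return
predict p_merger, pr
```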

Lecture 7: Model specification

7a: How to find the 'Best Model' for your Data?

This video explains the process of model specification, which is often overlooked in textbooks and many online courses. However, it is essential to understand how you actually derive the 'best model' for your data. We start by exploring different aims of studies, including forecasting and identification. The main approaches, General-to-Specific and Specific-to-General, are introduced. We discuss the pros and cons of each approach. We explain the use of information criteria (AIC, BIC). Finally, we apply our knowledge to predicting stock market returns using a set of macroeconomic shock variables.

Chapters

  • 0:00 Model Specification
  • 0:31 Aims of Video
  • 1:59 What is the 'Best Model'?
  • 3:36 How to start?
  • 5:03 Specification Methods
  • 7:47 Information Criteria
  • 9:10 Predicting Stock Market Returns
  • 11:23 General-to-Specific Approach
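
A minimal sketch of comparing two candidate models with information criteria (variable names are placeholders):

```stata
* Smaller model
regress returns inflation_shock oil_shock
estat ic

* Larger model; lower AIC/BIC values favour the better trade-off of fit and parsimony
regress returns inflation_shock oil_shock term_spread
estat ic
```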

7b: Parameter Stability & Time-varying Coefficients

This video goes deeper into Stata programming. We illustrate time-varying coefficients in regressions. This is an issue in time series analysis aimed at forecasting. How can you forecast if your model exhibits parameter instability? We illustrate the problem and our approach using overlapping periods. The implementation in Stata highlights the differences between the matrix and variable environment. We move between the two using the svmat command. Time-varying coefficients are plotted, and a structural break is highlighted.

Chapters

  • 0:00 Parameter Stability
  • 0:32 Illustration of Problem
  • 2:52 Worked Example in Stata
  • 4:06 Obtain Coefficients
  • 5:03 Variable or Matrix in Stata
  • 7:38 The svmat Command
  • 9:52 The egen max() Trick
  • 10:55 Forvalues Loop
  • 14:55 Plotting Rolling Regression
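
A rough sketch of a rolling regression that stores coefficients in a matrix and brings them back as a variable with svmat (the window length and all names are placeholders; the implementation in the video may differ in detail):

```stata
tsset time
local window = 60
matrix B = J(_N, 1, .)

* Re-estimate the model over overlapping windows and store the slope on x
forvalues t = `window'/`=_N' {
    quietly regress y x in `=`t'-`window'+1'/`t'
    matrix B[`t', 1] = _b[x]
}

* Convert the matrix into a variable and plot the time-varying coefficient
svmat B, names(beta_x)
line beta_x1 time
```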

Live 1: Exploring Data & Regression Analysis

This is our first live event dedicated to data analysis using Stata. We explore a cross-country dataset of macroeconomic variables. We try to model the impact of inflation on economic growth and explore non-linear effects.
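
A minimal sketch of one way to explore a non-linear effect (growth and inflation are placeholder variables):

```stata
* Quadratic specification via factor-variable notation
regress growth c.inflation##c.inflation

* Plot the implied relationship over a grid of inflation values
margins, at(inflation = (0(2)20))
marginsplot
```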

Lecture 8: Measuring the immeasurable: CFA and SEM!

8a: An Introduction to Confirmatory Factor Analysis in 8 min

This video provides a brief introduction to Confirmatory Factor Analysis (CFA). We discuss social constructs that cannot be easily measured. In practice, many concepts (e.g., overconfidence) cannot be observed directly (latent variables). These latent variables can be measured indirectly based on a set of factors that can be observed. We show that index construction, which is common, can be misleading. We discuss various ways to reduce dimensions, which is nowadays part of the machine learning (ML) literature. The methods include principal component analysis (PCA) and confirmatory factor analysis (CFA). Examples refer to our paper "Defining and measuring financial inclusion: A systematic review and confirmatory factor analysis".

Chapters

  • 0:00 Introduction to CFA
  • 0:11 Example: Financial Inclusion
  • 0:41 Measure Latent Variable
  • 1:59 Factors
  • 2:57 Index Construction
  • 3:59 Reduce Dimensions
  • 4:46 PCA
  • 5:50 CFA
  • 6:13 Measurement Model
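
A minimal sketch of a principal component analysis on a set of observed indicators (the indicator names are placeholders):

```stata
* PCA, scree plot, and the first component score
pca accounts branches atms
screeplot
predict pc1, score
```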

8b: Conducting a Confirmatory Factor Analysis in Stata

This video provides a step-by-step guide to conducting a Confirmatory Factor Analysis (CFA) in Stata. We introduce the sem command and explain the syntax for a measurement model. The models are estimated, and post estimation analysis based on goodness of fit measures is conducted. If the RMSEA is larger than 0.05 and the CFI is below 0.95, adding covariances between error terms can be beneficial. To identify the most promising covariances to add, we calculate the Modification Index (MI). Examples refer to our paper "Defining and measuring financial inclusion: A systematic review and confirmatory factor analysis."

Chapters

  • 0:00 How to estimate a CFA in Stata?
  • 0:29 Illustration of Model
  • 1:03 Model Fit
  • 2:05 Modification Index
  • 2:39 Advanced Topics in SEM
  • 3:06 Data on Financial Inclusion
  • 4:06 The sem Command
  • 6:49 Post Estimation Analysis
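
A minimal sketch of the commands involved (FinInclusion and the indicator names are placeholders):

```stata
* Measurement model: one latent variable measured by three observed indicators
sem (FinInclusion -> accounts branches atms)

* Goodness of fit (including RMSEA and CFI) and modification indices
estat gof, stats(all)
estat mindices
```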
