The full course (including data, code, and slides) is available for USD 19.99 on Udemy.
We explore statistical models and start playing with Stata. I show you how to load data into Stata from Excel or csv files.
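As a quick reference (the file names below are placeholders, not the course data), the two standard import commands are:

```stata
* Placeholder file names; point these at your own Excel or CSV files
import excel "mydata.xlsx", firstrow clear    // firstrow: variable names are in row 1
import delimited "mydata.csv", clear          // reads comma-separated files
```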
I cover data types, sampling issues, outliers, and missing values.
I discuss whether you should transform variables (e.g., log transformation), how transformation affects linear relationships, and whether variables should be normally distributed for regression analysis. I show you how to AVOID COMMON MISTAKES when transforming variables.
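As a hedged illustration of the kind of transformation discussed (the variable name is hypothetical), the typical pattern in Stata is:

```stata
* Hypothetical variable name; a common mistake is taking ln() of zero or negative values
count if income <= 0              // check for problematic observations first
gen ln_income = ln(income)        // log transformation
histogram ln_income               // inspect the transformed distribution
```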
I introduce the estpost and esttab commands, which enable you to export tables from Stata to Word, Excel, or other applications. I show you how to modify formats and optimise the layout. This produces production-ready tables for your dissertation project, consulting report or academic paper. NO NEED TO ADJUST TABLES BY HAND - LET STATA TAKE CARE OF IT!
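A minimal sketch of the export workflow (variable names and the output file are placeholders; estpost and esttab are part of the user-written estout package):

```stata
* ssc install estout              // provides estpost and esttab
estpost summarize income age education
esttab using "descriptives.rtf", cells("mean sd min max") replace
```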
Now it is your turn! Download the data and try to answer the questions for Workshop 1 (see slides). This video will walk you through a Descriptive Data Analysis step-by-step. We generate new variables, display descriptive statistics, and explore large survey data.
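If you want a starting point before watching the walkthrough, a generic sketch (hypothetical variable names) looks like this:

```stata
* Hypothetical survey variables
gen age2 = age^2                  // generate a new variable
summarize income age, detail      // descriptive statistics
tabulate region, missing          // frequency table including missing values
```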
This video explains Regression Analysis without using theory. We will conduct a regression analysis in Stata and interpret the output. In particular, we explore correlations, scatter plots, linear models, OLS, dummies, and predictions.
Chapters
- 0:00 Welcome & Overview
- 1:19 Correlations & Scatter Plots
- 6:23 Distributions & Transformations
- 7:04 Linear Model
- 9:48 Ordinary Least Squares (OLS)
- 16:59 Application using Stata
- 38:12 Regression Output & Interpretation
- 46:53 Dummy Variables
- 52:52 Fitted Values
- 55:55 Model Assumptions
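A compact sketch of the commands behind these steps (variable names are hypothetical, not the course data):

```stata
* Hypothetical variables: outcome y, regressor x, categorical region
correlate y x                     // correlation
scatter y x                       // scatter plot
regress y x i.region              // OLS with dummy variables for each region
predict yhat, xb                  // fitted values / predictions
```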
This video explains the concept of degrees of freedom. Using artificial data, we illustrate the minimum number of observations needed to determine a regression line in two dimensions (or higher dimensions). We show the impact on R-squared and demonstrate the adjusted R-squared. Using examples, we highlight the impact of additional observations and explanatory variables on degrees of freedom and R-squared.
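For reference, the standard definition behind this discussion (stated from the usual textbook formula, not quoted from the video):

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1},$$

where n is the number of observations and k the number of explanatory variables (excluding the constant). Adding regressors always raises R-squared, but it lowers the degrees of freedom n - k - 1, which is why the adjusted version can fall.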
This video explains multicollinearity and its consequences for regression analysis. We discuss how to detect multicollinearity and how to address the problem. Finally, we demonstrate multicollinearity using data on commodity prices.
Chapters
- 0:00 Multicollinearity
- 0:24 What is the problem?
- 2:43 How to fix it?
- 4:50 How to detect multicollinearity?
- 7:30 Example in Stata
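A minimal sketch of the detection step (the commodity variable names are hypothetical):

```stata
* Hypothetical commodity price variables
regress oil gold copper
estat vif                         // variance inflation factors after the regression
correlate gold copper             // pairwise correlation between the regressors
```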
This video explains heteroskedasticity and its consequences for regression analysis. We discuss how to detect heteroskedasticity and how to address the problem. Finally, we demonstrate heteroskedasticity using data on yields in farming.
Chapters
- 0:00 Welcome
- 0:44 Impact on p-values
- 2:07 Detecting the problem
- 4:56 How to fix it?
- 5:42 Worked example in Stata
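A minimal sketch, assuming hypothetical farming variables rather than the course dataset:

```stata
regress yield rainfall
estat hettest                         // Breusch-Pagan test for heteroskedasticity
regress yield rainfall, vce(robust)   // heteroskedasticity-robust standard errors
```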
This video explains how to detect and fix an omitted variable bias. If you forget to include an important explanatory variable in your regression model, an omitted variable bias can occur. I explain how you can detect this problem using the Ramsey RESET test. This test also indicates non-linear relationships. We will explore how we can distinguish between non-linear effects and omitted variables using fitted values.
Chapters
- 0:00 Omitted Variable Bias
- 1:34 Worked Example in Stata
- 3:55 Log Transformation
- 5:08 Regression Model
- 6:50 Ramsey RESET Test
- 9:10 Higher Orders
- 15:36 Collapse Command
- 17:01 Visualisation
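A minimal sketch of the test (hypothetical variable names):

```stata
regress y x1 x2
estat ovtest                      // Ramsey RESET test using powers of the fitted values
predict yhat, xb                  // fitted values, useful for spotting non-linear patterns
scatter y yhat
```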
This video explains how to detect endogeneity. Endogeneity is a common problem in regression analysis. I explain how you can detect this problem using an auxiliary regression approach. We discuss strategies to address endogeneity.
Chapters
- 0:00 Welcome
- 0:15 What is Endogeneity?
- 1:42 Detecting Endogeneity
- 3:08 Worked Example in Stata
- 11:33 How to fix Endogeneity?
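One common residual-based (control function) version of an auxiliary regression check is sketched below; it assumes an external instrument z is available and may differ in detail from the approach shown in the video:

```stata
* Hypothetical variables: outcome y, suspect regressor x, instrument z
regress x z                       // auxiliary (first-stage) regression
predict vhat, residuals           // store the residuals
regress y x vhat                  // a significant vhat coefficient points to endogeneity
```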
This video explains how to work with panel data. We discuss the benefits of using panel data, including Granger causality and the assessment of policy changes. We introduce fixed and random effects models, which we implement in Stata. The regression outputs are explained and compared.
Chapters
- 0:00 Introduction to Panel Data
- 0:26 Benefits of Panel Data
- 1:23 Analysing Policy Changes
- 1:58 Causality
- 3:01 Time Lags
- 3:19 Panel Data Models
- 3:59 SOLS or POLS
- 4:20 Fixed & Random Effects
- 7:11 Worked Example in Stata
- 8:40 Panel Regressions in Stata
- 9:36 The tsset Command
- 11:07 Interpretation of Output
- 13:46 Model Comparison
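A minimal sketch of the estimation commands (panel identifiers and variables are hypothetical):

```stata
xtset firm year                   // declare the panel structure (tsset firm year also works)
xtreg y x, fe                     // fixed effects
xtreg y x, re                     // random effects
```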
This video discusses whether you should use fixed or random effects for your panel data analysis. We explain how the Hausman test works and - most importantly - when the Hausman test fails! We cover biased estimators, the efficiency of estimators, and the implementation in Stata. Again, I focus on an intuitive understanding of the methods - no theory - just data fun!
Chapters
- 0:00 Fixed or Random Effects
- 0:26 Worked Example
- 0:53 How does the Hausman Test work?
- 1:12 Bias
- 1:45 Efficiency
- 3:21 Implementation in Stata
- 4:42 Interpretation of Output
- 6:23 Warning: Hausman Test fails!
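The standard Stata sequence for the test, with hypothetical variable names:

```stata
xtreg y x, fe
estimates store fe
xtreg y x, re
estimates store re
hausman fe re                     // compares the stored fixed and random effects estimates
```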
This video explains the impact of serial correlation in panel data analysis. We discuss the underlying reasons for serial correlation. Then we introduce a test based on Wooldridge (2002). To fix serial correlation, we explore the Newey-West Estimator (robust estimation) and Dynamic Panel Data Estimation. Finally, we have some fun in Stata.
Chapters
- 0:00 Serial Correlation in Panel Data
- 0:40 Reasons for Serial Correlation
- 1:19 Testing for Serial Correlation
- 2:46 Newey-West Estimator
- 3:47 Dynamic Panel Data Estimation
- 4:12 Worked Example in Stata
- 5:26 Interpretation of Output
- 6:13 Solutions in Stata
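A minimal sketch of the test and one common remedy (xtserial is the user-written implementation of the Wooldridge test; variable names are hypothetical, and the video additionally covers the Newey-West and dynamic panel estimators):

```stata
* ssc install xtserial            // Wooldridge (2002) test for serial correlation
xtset firm year
xtserial y x                      // H0: no first-order serial correlation
xtreg y x, fe vce(cluster firm)   // cluster-robust standard errors as one remedy
```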
This video explains interaction effects in panel data. It is common that certain groups of observations (e.g., companies, countries) exhibit differences in behaviour. These differences can be modelled using interaction effects. We explore shifts in the intercept and slope coefficient. In addition, I demonstrate how these models can be implemented in Stata.
Chapters
- 0:00 Interaction Effects
- 1:13 Shift in Intercept
- 2:21 Illustration of Shift
- 2:40 Interaction Term
- 4:08 Illustration of Interaction Effect
- 4:31 Implementation in Stata
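Using Stata's factor-variable notation, an interaction model can be written in one line (hypothetical variable names):

```stata
* Continuous x interacted with a categorical group variable
regress y c.x##i.group            // expands to x, group dummies, and their interactions
```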
This video comes with a TRIGGER WARNING! It contains mathematics, which some viewers might find distressing. I explain how the serial correlation test developed by Wooldridge (2002) can be derived. We cover the null hypothesis and related assumptions, i.i.d. error terms, and covariance and variance formulas. We also highlight linear operators and their properties. There is a little surprise at the end of the video!
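For reference, the key step of the derivation is a standard result (stated here under the usual assumptions rather than quoted from the video): if the level errors $\varepsilon_{it}$ are i.i.d. with variance $\sigma^2$, then

$$\operatorname{Cov}(\Delta\varepsilon_{it}, \Delta\varepsilon_{i,t-1}) = \operatorname{Cov}(\varepsilon_{it} - \varepsilon_{i,t-1},\ \varepsilon_{i,t-1} - \varepsilon_{i,t-2}) = -\operatorname{Var}(\varepsilon_{i,t-1}) = -\sigma^2,$$

and since $\operatorname{Var}(\Delta\varepsilon_{it}) = 2\sigma^2$, the correlation of the first-differenced errors is $-0.5$. The test therefore regresses the first-differenced residuals on their own lag and checks whether the coefficient equals $-0.5$.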
This video introduces logistic regressions. We discuss binary choice models, where the dependent variable is either a positive or negative outcome (e.g., a decision). The problem is illustrated graphically - how to map a linear model to an interval suitable for modelling a probability. Most decision processes remain unobserved; hence, we briefly discuss latent variables. Finally, I demonstrate how these models can be implemented in Stata. Predicted probabilities are plotted to visualise the model, and we explore classifications.
Chapters
- 0:00 Binary Choice
- 1:34 Illustration of Problem
- 5:38 Latent Variable
- 9:48 Implementation in Stata
- 14:43 Plot Predicted Probabilities
- 17:51 Classification
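A minimal sketch of the implementation (the binary outcome and regressors are hypothetical):

```stata
logit decision x1 x2
predict phat, pr                  // predicted probabilities
scatter phat x1                   // visualise the predicted probabilities
estat classification              // classification table at the 0.5 cutoff
```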
This video explores a dataset of mergers (companies buying other companies). It is often interesting to predict whether a merger occurs, as share prices tend to move. First, we explore the data, select variables, and visualise the trend of mergers in the US. You will learn new Stata commands for summarising data, including collapse. Second, we run several logit models and derive predicted probabilities. Finally, we compare predictions based on firm-level data and macro data (merger wave). If you want to know more about mergers, have a look at our paper on "Endogenous mergers: bidder momentum and market reaction."
Chapters
- 0:00 Predicting Mergers
- 1:14 Exploring Data
- 2:44 Sum Command
- 4:12 Density Plot
- 4:30 Tabstat Command
- 5:46 Collapse Command
- 9:01 Sorting and By Command
- 11:07 Logit Models
- 17:05 Compare Predictions
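As a hedged illustration of the collapse step (the deal-level variable names are hypothetical):

```stata
* One row per deal with a merger dummy; aggregate to a yearly merger-wave series
collapse (sum) deals=merger, by(year)
line deals year                   // visualise the merger wave
```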
This video explains the process of model specification, which is often overlooked in textbooks and many online courses. However, it is essential to understand how you actually derive the 'best model' for your data. We start by exploring different aims of studies, including forecasting and identification. The two main approaches, General-to-Specific and Specific-to-General, are introduced, and we discuss the pros and cons of each. We explain the use of information criteria (AIC, BIC). Finally, we apply our knowledge to predicting stock market returns using a set of macroeconomic shock variables.
Chapters
- 0:00 Model Specification
- 0:31 Aims of Video
- 1:59 What is the 'Best Model'?
- 3:36 How to start?
- 5:03 Specification Methods
- 7:47 Information Criteria
- 9:10 Predicting Stock Market Returns
- 11:23 General-to-Specific Approach
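A minimal sketch of how the information criteria are obtained (return and shock variables are hypothetical):

```stata
regress ret shock1 shock2 shock3
estat ic                          // reports AIC and BIC for model comparison
```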
This video goes deeper into Stata programming. We illustrate time-varying coefficients in regressions, a common issue in time-series analysis aimed at forecasting. How can you forecast if your model exhibits parameter instability? We illustrate the problem and our approach using overlapping periods. The implementation in Stata highlights the differences between the matrix and variable environments. We move between the two using the svmat command. Time-varying coefficients are plotted, and a structural break is highlighted.
Chapters
- 0:00 Parameter Stability
- 0:32 Illustration of Problem
- 2:52 Worked Example in Stata
- 4:06 Obtain Coefficients
- 5:03 Variable or Matrix in Stata
- 7:38 The svmat Command
- 9:52 The egen max() Trick
- 10:55 Forvalues Loop
- 14:55 Plotting Rolling Regression
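A rough sketch of the manual rolling-regression approach, assuming hypothetical variables y, x, a time index t, 60-observation windows and 200 windows in total:

```stata
sort t
matrix B = J(200, 1, .)                       // matrix environment: storage for the slopes
forvalues i = 1/200 {
    quietly regress y x in `i'/`=`i' + 59'    // regression on one rolling window
    matrix B[`i', 1] = _b[x]
}
svmat B, names(beta)                          // move the matrix into the variable environment
gen window = _n
line beta1 window if !missing(beta1)          // plot the time-varying coefficient
```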
This is our first live event dedicated to data analysis using Stata. We explore a cross-country dataset of macroeconomic variables. We try to model the impact of inflation on economic growth and explore non-linear effects.
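A hedged sketch of one way to capture such a non-linear effect with a quadratic term (variable names are hypothetical):

```stata
regress growth c.inflation##c.inflation   // includes inflation and its square
```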
This video provides a brief introduction to Confirmatory Factor Analysis (CFA). We discuss social constructs that cannot be easily measured. In practice, many concepts (e.g., overconfidence) cannot be observed directly (latent variables). These latent variables can be measured indirectly based on a set of factors that can be observed. We show that index construction, which is common, can be misleading. We discuss various ways to reduce dimensions, which is nowadays part of the machine learning (ML) literature. The methods include principal component analysis (PCA) and confirmatory factor analysis (CFA). Examples refer to our paper "Defining and measuring financial inclusion: A systematic review and confirmatory factor analysis".
Chapters
- 0:00 Introduction to CFA
- 0:11 Example: Financial Inclusion
- 0:41 Measure Latent Variable
- 1:59 Factors
- 2:57 Index Construction
- 3:59 Reduce Dimensions
- 4:46 PCA
- 5:50 CFA
- 6:13 Measurement Model
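A minimal PCA sketch (the indicator names are hypothetical placeholders, not the variables from the paper):

```stata
pca access usage quality          // principal component analysis of the indicators
predict pc1, score                // first principal component as a simple index
```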
This video provides a step-by-step guide to conducting a Confirmatory Factor Analysis (CFA) in Stata. We introduce the sem command and explain the syntax for a measurement model. The models are estimated, and post-estimation analysis based on goodness-of-fit measures is conducted. If the RMSEA is larger than 0.05 and the CFI is below 0.95, adding covariances between error terms can be beneficial. To identify the most promising covariances to add, we calculate the Modification Index (MI). Examples refer to our paper "Defining and measuring financial inclusion: A systematic review and confirmatory factor analysis."
Chapters
- 0:00 How to estimate a CFA in Stata?
- 0:29 Illustration of Model
- 1:03 Model Fit
- 2:05 Modification Index
- 2:39 Advanced Topics in SEM
- 3:06 Data on Financial Inclusion
- 4:06 The sem Command
- 6:49 Post Estimation Analysis
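A minimal sketch of the workflow (the latent construct and indicator names are hypothetical placeholders):

```stata
sem (Inclusion -> access usage quality)       // measurement model for one latent variable
estat gof, stats(all)                         // goodness of fit, including RMSEA and CFI
estat mindices                                // modification indices for candidate covariances
* if needed, re-estimate with an added error covariance, e.g.:
* sem (Inclusion -> access usage quality), cov(e.access*e.usage)
```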