Dataset updates #30

@briatte

Closes #21, #22 and #23 (copied below), #27.

Update from 2023

Stop updating the data, really.

Detailed notes

  • QOG: qog2023 -- since QOG 2023 is out
    • freeze: qog2019
    • would require rewriting code and looking at less clear results… see code at end of section
    • only advantage would be lower codebook size → just downsample the 2019 one, it only loses the intra-doc links
    • note the codebook issue! QOG 2020: make sure GDP documentation has been corrected #27
    • Perhaps simply drop the eu_* variables
  • GSS: gss7221 -- since GSS has updated too
    • freeze: gss7616 (but see below)
    • not fun to keep only one year: keep one older year too
    • possibly break down single data into yearly ones? restrict to 1976 and 2016
  • ESS: ess2008 -- in order to continue using torture question?
    • freeze: ess0816, or ess2008 and ess2016 (different codebooks, so it's fine)
    • keep using Round 4 for both the torture example and the health services ones (results are not as clear-cut with Round 8)
    • keep Round 8 to cover e.g. climate change
    • problem: DTA file is too large -- divide, to avoid _merge problem
    • document the existence of ess2016 even though it is not used anywhere in the course do-files
  • WVS: wvs9904 -- keep old version for sharia law question
    • update to last version, check encoding
    • possibly also include a more recent wave? (raises same question as ess2016)
  • NHIS: update to a recent year (nhis202*), or to nhis1020?
    • check if sampling frame and variables have changed first
    • see below on how URL structure for fetching has changed

Note on QOG -- the 2023 release offers only this as a replacement, which is not ideal:

// fertility against school life expectancy (wef_lse), all countries, with linear fit
sc wdi_fertility wef_lse, ms(i) mlab(ccodealp) || lfit wdi_fertility wef_lse, ///
	name(g1, replace)
// same linear fit on all countries, showing only Sub-Saharan Africa data points (underpredicted)
sc wdi_fertility wef_lse if ht_region == 4, ms(i) mlab(ccodealp) || ///
	lfit wdi_fertility wef_lse, ///
	name(g2, replace)
// one graph per region (ht_region takes values 1 to 10), same linear fit on all countries
forv i = 1/10 {
	sc wdi_fertility wef_lse if ht_region == `i', ms(i) mlab(ccodealp) || ///
	lfit wdi_fertility wef_lse, ///
	name("region`i'", replace)
}

The plan for 2021:

Additional things to consider:

Dataset names

I like the initial "acronym + year" convention, but it produces strange names for multiple-year survey datasets:

  • ess1214 (not used) and ess0816
  • wvs9904 (unavoidable)
  • nhis1017 (unavoidable, unless we use a single year, but that removes any demo of keep if year)
  • gss7616 (unavoidable, unless we separate the years)

Merged datasets

Is it still a good idea to merge survey rounds for e.g. ESS? Probably not, esp. if we need to keep datasets under the 2,048-variable limit of Stata/IC.

  • Keep NHIS with multiple years. Use it to demo keep if year.
  • Keep WVS with multiple years (country-dependent).
  • Break down GSS.
  • Break down ESS.
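
A minimal sketch of what "breaking down" a merged dataset could look like, with a check against the 2,048-variable limit. It uses the auto dataset that ships with Stata as a stand-in; the grouping variable (foreign) and the output file names are placeholders, not the names used in the course data, where the grouping variable would be the survey round or year.

```stata
* Stand-in data: the auto dataset that ships with Stata.
* In the course data, the grouping variable would be the survey round or year.
sysuse auto, clear

* Stata/IC accepts at most 2,048 variables; describe stores the count in r(k).
quietly describe, short
display "variables: " r(k)
assert r(k) <= 2048

* Save one file per value of the grouping variable (hypothetical file names).
levelsof foreign, local(groups)
foreach g of local groups {
	preserve
	keep if foreign == `g'
	save "auto_foreign`g'.dta", replace
	restore
}
```

The preserve/restore pair keeps the full dataset in memory between iterations, so each output file is cut from the original rather than from the previous subset.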

Both WVS and ESS are used to demo keep if inlist(country, …), the other subset we want to show.
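
The two subsetting demos could be sketched as follows. The datasets below ship with Stata and are only stand-ins for NHIS, WVS and ESS; the year cutoff and the country names are illustrative assumptions, not values taken from the course do-files.

```stata
* Subset by year, using a shipped dataset with a numeric year variable.
sysuse uslifeexp, clear
keep if year >= 1990
count

* Subset by country, using a shipped dataset with a string country variable.
sysuse lifeexp, clear
keep if inlist(country, "France", "Germany", "Italy")
list country lexp
```

Note that inlist() accepts at most 10 arguments after the first when they are strings, which matters if the country identifier is stored as text rather than as a labelled numeric code.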

Additional datasets

It would make a lot of sense to have more datasets for the students to use than those used in the do-files.

Currently, the do-files are selective anyway: we provide ESS 2016 (Round 8) but do not use the data, even though the dependent variable also exists in that round.

  • GSS has a single codebook, so bundling many years would duplicate the codebook in the ZIP archives. Not ideal.
  • ESS could be broken down to Rounds 4 (2008), 8 (2016) and 9 (2018).
