ETL Pipeline

How the 36-year CIA World Factbook archive was built from three distinct source formats.

Overview

The CIA changed their publication format multiple times over 36 years, requiring three separate ETL pipelines:

Era	Years	Source	Parser	Script
Text	1990-1999, 2001	Project Gutenberg	6 text format variants	`load_gutenberg_years.py`
HTML	2000, 2002-2020	Wayback Machine	5 HTML parser generations	`build_archive.py`
JSON	2021-2025	GitHub (factbook/cache.factbook.json)	Direct JSON parsing	`reload_json_years.py`

Pipeline 1: Project Gutenberg Text (1990-1999, 2001)

Source

The CIA World Factbook was published as plain text via Project Gutenberg throughout the 1990s. Each year has a different ebook ID.

Format Variants

The text format changed frequently:

Years	Format	Country Marker	Section Marker	Field Format
1990	old	`Country: Name`	`- Section`	`Field: value`
1991	tagged	`_@_Name`	`_*_Section`	`_#_Field: value`
1992	colon	`:Name Section`	(embedded)	Indented values
1993-1994	asterisk	`*Name, Section`	(embedded)	Indented values
1995-1999	atsign/atsign_bare	`@Name:Section` or `@Name` + bare headers	Inline/indented mix	Mixed
2001	equals	`@Name`	`Name Section` (tab-separated)	Inline fields

Process

Download text file from Project Gutenberg
Strip PG header/footer wrapper
Parse with year-specific parser
Match country names to MasterCountries using fuzzy matching
Insert into database

Why 2001 uses text

The 2001 HTML zip from the Wayback Machine was corrupted. The Project Gutenberg text edition serves as a reliable fallback.

1996 Data Repair: CIA Original Text

The Gutenberg edition for 1996 (ebook #27675) is truncated for 7 sovereign countries (Venezuela, Armenia, Greece, Luxembourg, Malta, Monaco, Tuvalu). Their entries end mid-sentence in the source file.

The CIA's own complete text file (wfb-96.txt.gz) was recovered from the Wayback Machine at odci.gov/cia/publications/nsolo/ (captured May 1997). This 3.8 MB file uses a different format from the Gutenberg version: page headers (FACTBOOK COUNTRY REPORT Page NNN), centered country/section names, and 5-space indented fields with 10-space indented values.

The repair script (etl/repair_1996_truncated.py) parses the CIA original and replaces the 7 truncated entries in the database. The CIA original is stored at samples/text_samples/1996_cia_original.txt.

Note: 1996 is the only year where the CIA published a downloadable text file. For 1990-1995 and 1997-1999, no CIA-original download archives exist -- Project Gutenberg is the only bulk text source. Validation confirmed those years have no truncation issues.

Pipeline 2: Wayback Machine HTML (2000, 2002-2020)

Source

CIA published downloadable zip archives of the World Factbook at cia.gov/the-world-factbook/about/archives/download/factbook-YYYY.zip. These are preserved in the Wayback Machine.

HTML Format Generations

Years	Format	Key Markers
2000	Classic	`<b>FieldName:</b>` with `<a name="Geo">` section anchors
2001-2008	Table	`<td class="FieldLabel">` with `<a name="...">` sections
2009-2014	CollapsiblePanel	`<div class="CollapsiblePanel">` with `<span class="category">`
2015-2017	ExpandCollapse	`<h2 class="question" sectiontitle="...">` with field divs
2018-2020	Modern	`<li id="...-category-section-anchor">` with `<div id="field-...">`

Process

Download zip from Wayback Machine (using known-good timestamps)
If known timestamp fails, CDX API lookup for alternative
Extract HTML files from geos/ directory within zip
Parse with year-appropriate parser (auto-selected)
Extract country name from <title> tag (format varies by year)
Insert categories and fields into database
Delete zip to save space

Wayback Machine Timestamps

Each year has a verified Wayback Machine timestamp that returns a valid zip. These were discovered through manual testing and CDX API searches.

Pipeline 3: GitHub JSON (2021-2025)

Source

The factbook/cache.factbook.json repository was auto-updated weekly (every Thursday) from August 2021 until the Factbook's discontinuation.

Year-Specific Snapshots

Rather than loading the same JSON snapshot for all years, we use git history to find the last commit before each year-end:

Year	Cutoff Date	Description
2021	2022-01-01	Last commit of 2021
2022	2023-01-01	Last commit of 2022
2023	2024-01-01	Last commit of 2023
2024	2025-01-01	Last commit of 2024
2025	2026-02-04	Final commit (CIA discontinued the Factbook)

Process

Clone or fetch the factbook/cache.factbook.json repository
For each year, find the last git commit before the cutoff date
Check out that commit
Parse all {region}/{code}.json files
Extract name, code, categories[].title, categories[].fields[].name, categories[].fields[].content
Strip HTML tags from content
Snapshot existing MasterCountryID links, delete old year data, insert new
Restore repo to master branch

Post-ETL Processing

Field Name Standardization (`build_field_mappings.py`)

After all data is loaded, this script analyzes the 1,090 distinct field names across all years and maps them to 414 canonical names. The mapping uses a priority-ordered rule system — the first rule that matches wins.

Rule Priority (applied in order)

Priority	Rule	Description
1	Identity	If the field name exists in 2024/2025 data and isn't in `KNOWN_RENAMES`, keep it as-is
2	Dash normalization	Fix formatting: `Economy-overview` or `Economy--overview` → `Economy - overview`
3	Known CIA renames (`KNOWN_RENAMES` dict)	159 manually curated old→new mappings based on CIA's own rename history
4	Consolidation (`CONSOLIDATION_MAP` dict)	48 sub-fields that should roll up to a parent (e.g. oil sub-fields → Petroleum)
5	Government body detection	Country-specific legislative body names (containing "Assembly", "Senate", "Parliament", etc.) → `Legislative branch`
6	Regional sub-entries (`REGIONAL_ENTRIES` set)	Cyprus Turkish/Greek area splits, Serbia/Montenegro splits, etc. — kept as-is but flagged
7	Noise detection (`is_noise()` function)	Parser artifacts, fragments, sub-field labels — flagged `IsNoise=1`
8	Manual fallback	Anything unmatched is kept as-is and flagged `manual` for review

The `KNOWN_RENAMES` Dictionary

This is the core mapping of 159 CIA field name changes across the 36-year span. It was built by:

Comparing the field names in 2024/2025 (the "modern" set) against earlier years
Identifying fields that disappeared and matching them to their modern equivalents
Consulting CIA Factbook documentation notes about field restructuring

Examples from the dictionary:

KNOWN_RENAMES = {
    # Pre-standard era (1990s) → modern names
    "National product":                     "Real GDP (purchasing power parity)",
    "National product per capita":          "Real GDP per capita",
    "Defense expenditures":                 "Military expenditures",
    "Ethnic divisions":                     "Ethnic groups",
    "Comparative area":                     "Area - comparative",
    "Telecommunications":                   "Telecommunication systems",

    # Modern-era renames (CIA renamed for clarity)
    "Economy - overview":                   "Economic overview",
    "GDP (purchasing power parity)":        "Real GDP (purchasing power parity)",
    "GDP - real growth rate":               "Real GDP growth rate",
    "Military branches":                    "Military and security forces",
    "Elevation extremes":                   "Elevation",
    "Telephone system":                     "Telecommunication systems",

    # Broadcast media consolidation
    "Radio broadcast stations":             "Broadcast media",
    "Television broadcast stations":        "Broadcast media",
    ...  # 159 entries total
}

The `CONSOLIDATION_MAP` Dictionary

48 sub-fields that should be grouped under a parent field for analysis. The sub-field data is preserved as-is in the database, but the ConsolidatedTo column in FieldNameMappings indicates the parent:

CONSOLIDATION_MAP = {
    # Oil/Petroleum (12 sub-fields → "Petroleum")
    "Oil - production":                     "Petroleum",
    "Crude oil - production":               "Petroleum",
    "Refined petroleum products - exports": "Petroleum",

    # Natural gas (5 sub-fields → "Natural gas")
    "Natural gas - production":             "Natural gas",
    "Natural gas - consumption":            "Natural gas",

    # Electricity (5 sub-fields → "Electricity")
    "Electricity - production":             "Electricity",
    "Electricity - consumption":            "Electricity",

    # Military personnel (7 sub-fields)
    "Military manpower - availability":     "Military and security service personnel strengths",
    "Manpower fit for military service":    "Military and security service personnel strengths",

    # Maritime claims (6 sub-fields)
    "Contiguous zone":                      "Maritime claims",
    "Continental shelf":                    "Maritime claims",
    "Exclusive economic zone":              "Maritime claims",
    ...  # 48 entries total
}

Noise Detection (`is_noise()` function)

The is_noise() function uses multiple heuristics to identify parser artifacts:

Single/two-letter entries — fragments like "A", "US" (as field names, not country data)
Lowercase-starting entries — almost always sub-field fragments from 1990s parsing
Entries with known noise phrases — "consists mainly of", "includes the following", etc.
Very long entries (>80 chars) with low use count — descriptive text accidentally captured as field names
Sub-field labels from 1994 — "adjective", "cabinet", "chancery" etc. appeared as standalone fields due to the HTML structure that year
Abbreviation entries — short entries ending in . (glossary artifacts)
Political party names — country-specific party/faction names from 1990s data that were parsed as field names

Additionally, a SUB_FIELD_LABELS set (34 known sub-field fragments) and REGIONAL_ENTRIES set (80+ regional sub-entries like "Turkish Area", "Serbia", "Sabah") provide explicit pattern matching.

Entity Classification (`classify_entities.py`)

Classifies each of the 281 MasterCountry entries into one of 9 entity types.

Classification Logic

Check OVERRIDES dictionary first — 40+ manually classified entries for edge cases:

OVERRIDES = {
    # Oceans and non-countries
    "XQ": "misc",      # Arctic Ocean
    "XX": "misc",      # World
    "EE": "misc",      # European Union

    # Disputed territories
    "KV": "disputed",  # Kosovo
    "GZ": "disputed",  # Gaza Strip
    "WE": "disputed",  # West Bank

    # Special Administrative Regions
    "HK": "special_admin",  # Hong Kong
    "MC": "special_admin",  # Macau

    # Crown dependencies
    "GK": "crown_dependency",  # Guernsey
    "JE": "crown_dependency",  # Jersey

    # Dissolved entities
    "NT": "dissolved",  # Netherlands Antilles
    "YI": "dissolved",  # Serbia and Montenegro
    ...
}

Check "Dependency status" field — If the most recent year's data contains a dependency status with keywords like "territory", "dependency", "overseas", "self-governing", classify as territory
Check "Government type" field — If it contains sovereign-state keywords like "republic", "monarchy", "parliamentary", "presidential", classify as sovereign
Fallback — If government type exists but doesn't match known patterns, default to sovereign

Entity Type Results

Type	Count	How Determined
sovereign	192	Government type contains republic/monarchy/parliamentary/etc.
territory	65	Has dependency status field, or manual override
misc	7	Manual override (oceans, World, EU)
disputed	6	Manual override (Kosovo, Gaza, West Bank, etc.)
crown_dependency	3	Manual override (Jersey, Guernsey, Isle of Man)
freely_associated	3	Manual override (Marshall Islands, Micronesia, Palau)
special_admin	2	Manual override (Hong Kong, Macau)
dissolved	2	Manual override (Netherlands Antilles, Serbia and Montenegro)
antarctic	1	Manual override (Antarctica)

Validation (`validate_integrity.py`)

Runs 9 automated checks against known ground truth:

Check	Method	Pass Criteria
1. Structural overview	Count countries/categories/fields per year	All years present with reasonable counts
2. US population benchmark	Compare extracted population against known Census values	Within 10M for each year 2000-2025
3. US GDP benchmark	Spot-check GDP field content	Contains plausible dollar amounts
4. Country count deltas	Year-over-year country count changes	No changes >10 countries (flags mergers/splits)
5. Source provenance	Verify source column matches expected pipeline	text/html/json assigned correctly
6. Field progression	Year-over-year field count % change	No anomalies >15% change
7. Category coverage	Check key categories (People, Economy) exist	<10 countries missing per year
8. China population	Spot-check China's population data	Plausible values across sampled years
9. Null/empty fields	Count NULL or empty Content values	<5% empty per year

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ETL Pipeline

Overview

Pipeline 1: Project Gutenberg Text (1990-1999, 2001)

Source

Format Variants

Process

Why 2001 uses text

1996 Data Repair: CIA Original Text

Pipeline 2: Wayback Machine HTML (2000, 2002-2020)

Source

HTML Format Generations

Process

Wayback Machine Timestamps

Pipeline 3: GitHub JSON (2021-2025)

Source

Year-Specific Snapshots

Process

Post-ETL Processing

Field Name Standardization (`build_field_mappings.py`)

Rule Priority (applied in order)

The `KNOWN_RENAMES` Dictionary

The `CONSOLIDATION_MAP` Dictionary

Noise Detection (`is_noise()` function)

Entity Classification (`classify_entities.py`)

Classification Logic

Entity Type Results

Validation (`validate_integrity.py`)

FilesExpand file tree

ETL_PIPELINE.md

Latest commit

History

ETL_PIPELINE.md

File metadata and controls

ETL Pipeline

Overview

Pipeline 1: Project Gutenberg Text (1990-1999, 2001)

Source

Format Variants

Process

Why 2001 uses text

1996 Data Repair: CIA Original Text

Pipeline 2: Wayback Machine HTML (2000, 2002-2020)

Source

HTML Format Generations

Process

Wayback Machine Timestamps

Pipeline 3: GitHub JSON (2021-2025)

Source

Year-Specific Snapshots

Process

Post-ETL Processing

Field Name Standardization (build_field_mappings.py)

Rule Priority (applied in order)

The KNOWN_RENAMES Dictionary

The CONSOLIDATION_MAP Dictionary

Noise Detection (is_noise() function)

Entity Classification (classify_entities.py)

Classification Logic

Entity Type Results

Validation (validate_integrity.py)

Field Name Standardization (`build_field_mappings.py`)

The `KNOWN_RENAMES` Dictionary

The `CONSOLIDATION_MAP` Dictionary

Noise Detection (`is_noise()` function)

Entity Classification (`classify_entities.py`)

Validation (`validate_integrity.py`)