An exploratory NLP pipeline that analyzes how language usage changes over time in a long-running YouTube channel, using publicly available video transcripts.
Status: Prototype / Exploratory analysis
This is a data exploration project — not formal research. All findings are correlational.
- Collects video transcripts and metadata from a YouTube channel (default: PewDiePie)
- Computes interpretable NLP metrics per video: language detection, lexical diversity (type-token ratio), word count
- Aggregates results year-wise, sampling up to 50 videos per year (seed=42) for normalized comparison
- Measures vocabulary drift across years using TF-IDF cosine similarity
- Identifies emerging and declining words — which words grew or faded across years
- Tracks top phrases (bigrams) per year and vocabulary overlap between year pairs
- Produces reproducible trend plots and a plain-text language change report
youtube-language-drift/
├── scripts/
│ ├── 01_collect_data.py # Fetch metadata + transcripts via YouTube API
│ ├── 02_compute_metrics.py # Compute NLP metrics, sample, aggregate
│ ├── 03_plot_results.py # Generate trend visualizations
│ └── 04_vocabulary_analysis.py # Deep vocabulary & language change analysis
├── notebooks/
│ └── exploration.ipynb # Interactive exploration of results
├── data/ # Generated data (excluded from git)
├── results/ # Plots and figures (committed)
├── requirements.txt
├── .env.example # Template for API key
├── .gitignore
├── LICENSE
└── README.md
git clone https://github.com/YOUR_USERNAME/youtube-language-drift.git
cd youtube-language-drift
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt
- Go to the Google Cloud Console
- Create a project (or use an existing one)
- Enable the YouTube Data API v3
- Create an API key
- Copy .env.example to .env and paste your key:
cp .env.example .env
# Edit .env and replace your_api_key_here with your actual key

# Step 1: Collect data (use --limit for testing)
python scripts/01_collect_data.py --limit 100
# Step 2: Compute NLP metrics
python scripts/02_compute_metrics.py
# Step 3: Generate trend plots
python scripts/03_plot_results.py
# Step 4: Vocabulary & language change analysis
python scripts/04_vocabulary_analysis.py

Plots will be saved to results/. A plain-text language change report is saved to data/language_change_report.txt.
Note: Script 01 supports resume — if interrupted (e.g., by YouTube rate-limiting), re-run it and it will skip already-fetched transcripts.
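A minimal sketch of that skip-if-exists pattern. The data/transcripts/<video_id>.json layout below is purely illustrative; the actual cache layout in the script may differ:

```python
import json
from pathlib import Path

# Hypothetical cache layout: one JSON file per video ID under data/transcripts/
TRANSCRIPT_DIR = Path("data/transcripts")

def fetch_missing(video_ids, fetch_transcript):
    """Fetch only transcripts that are not already cached on disk."""
    TRANSCRIPT_DIR.mkdir(parents=True, exist_ok=True)
    for video_id in video_ids:
        out_path = TRANSCRIPT_DIR / f"{video_id}.json"
        if out_path.exists():
            continue  # already fetched on a previous run
        transcript = fetch_transcript(video_id)  # returns None if no captions
        if transcript is None:
            continue  # no captions available; logged and skipped
        out_path.write_text(json.dumps(transcript), encoding="utf-8")
```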
- Video metadata (title, publish date, video ID) fetched via YouTube Data API v3
- Transcripts fetched via youtube-transcript-api (prefers manual captions, falls back to auto-generated); a sketch of this collection step follows the list
- Videos without available transcripts are logged and skipped
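A condensed sketch of how this collection step can look. The environment variable name (YOUTUBE_API_KEY), the English language code, and the uploads-playlist handling are illustrative assumptions, and the transcript calls use the pre-1.0 youtube-transcript-api interface:

```python
import os

from dotenv import load_dotenv
from googleapiclient.discovery import build
from youtube_transcript_api import (
    NoTranscriptFound,
    TranscriptsDisabled,
    YouTubeTranscriptApi,
)

load_dotenv()  # reads .env; the variable name below is an assumption
youtube = build("youtube", "v3", developerKey=os.environ["YOUTUBE_API_KEY"])

def list_uploads(uploads_playlist_id, page_token=None):
    """One page of metadata (title, publish date, video ID) from the uploads playlist."""
    return youtube.playlistItems().list(
        part="snippet",
        playlistId=uploads_playlist_id,
        maxResults=50,
        pageToken=page_token,
    ).execute()

def get_transcript(video_id, language="en"):
    """Prefer manual captions, fall back to auto-generated, return None if unavailable."""
    try:
        transcripts = YouTubeTranscriptApi.list_transcripts(video_id)
        try:
            transcript = transcripts.find_manually_created_transcript([language])
        except NoTranscriptFound:
            transcript = transcripts.find_generated_transcript([language])
        return " ".join(chunk["text"] for chunk in transcript.fetch())
    except (TranscriptsDisabled, NoTranscriptFound):
        return None  # logged and skipped by the pipeline
```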
For each year, up to 50 videos are randomly sampled (seed=42) from those with available transcripts. This normalizes cross-year comparisons and reduces bias from years with disproportionately many uploads.
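In pandas terms, that sampling step is essentially the following (the year and transcript column names are illustrative):

```python
import pandas as pd

MAX_PER_YEAR = 50
SEED = 42

def sample_per_year(df: pd.DataFrame) -> pd.DataFrame:
    """Sample up to MAX_PER_YEAR videos per year, reproducibly, from rows with transcripts."""
    with_transcripts = df[df["transcript"].notna()]
    return (
        with_transcripts
        .groupby("year", group_keys=False)
        .apply(lambda g: g.sample(n=min(len(g), MAX_PER_YEAR), random_state=SEED))
        .reset_index(drop=True)
    )
```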
| Metric | Method | Purpose |
|---|---|---|
| Language detection | langdetect library | Identify language consistency across videos |
| Type-token ratio (TTR) | unique tokens / total tokens (alpha-only, lowercased) | Measure lexical diversity |
| Vocabulary drift | TF-IDF cosine similarity between consecutive years | Track vocabulary change over time |
| Word count | Total alpha tokens per transcript | Contextualize TTR (length sensitivity) |
| Top words per year | Word frequency (per 10k tokens), stopwords removed | Track common vocabulary per year |
| Emerging/declining words | Frequency change between earliest and latest year | Identify what changed most |
| Bigrams | Top 2-word phrases per year | Capture common phrases and topics |
| Vocabulary overlap | Jaccard index between year-pair vocab sets | Measure shared vs. unique vocabulary |
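A compact sketch of two computations from the table above: per-transcript type-token ratio and year-to-year vocabulary drift. It assumes one TF-IDF document per year, formed by concatenating that year's sampled transcripts:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

TOKEN_RE = re.compile(r"[a-z]+")

def tokenize(text: str) -> list[str]:
    """Lowercased, alphabetic-only tokens."""
    return TOKEN_RE.findall(text.lower())

def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens (0.0 for empty transcripts)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def yearly_drift(year_docs: dict[int, str]) -> dict[tuple[int, int], float]:
    """Cosine similarity of TF-IDF vectors for consecutive years."""
    years = sorted(year_docs)
    matrix = TfidfVectorizer(stop_words="english").fit_transform(
        [year_docs[y] for y in years]
    )
    sims = cosine_similarity(matrix)
    return {
        (a, b): float(sims[i, i + 1])
        for i, (a, b) in enumerate(zip(years, years[1:]))
    }
```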
Core metrics (Script 03):
- coverage.png — Available vs. sampled transcripts per year
- lexical_diversity.png — Mean TTR per year with ±1 std dev
- vocabulary_drift.png — Cosine similarity between consecutive years
- combined_dashboard.png — All three metrics in one figure
Vocabulary analysis (Script 04):
- top_words_heatmap.png — Word frequency heatmap across years
- word_trends.png — Frequency trends for emerging and declining words
- emerging_declining.png — Bar chart of biggest vocabulary shifts
- bigram_trends.png — Top phrases per year
- vocabulary_overlap_matrix.png — Jaccard similarity between year pairs
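The overlap matrix and the emerging/declining chart come down to two small computations; a sketch under the same assumptions (per-year token lists, frequencies normalized per 10,000 tokens):

```python
from collections import Counter

def jaccard(vocab_a: set[str], vocab_b: set[str]) -> float:
    """Jaccard index between two per-year vocabularies."""
    if not vocab_a and not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

def rate_per_10k(tokens: list[str]) -> dict[str, float]:
    """Word frequency normalized per 10,000 tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 10_000 * n / total for word, n in counts.items()}

def emerging_declining(first_year_tokens, last_year_tokens, top_n=15):
    """Words with the largest rate change between the earliest and latest year."""
    first, last = rate_per_10k(first_year_tokens), rate_per_10k(last_year_tokens)
    deltas = {w: last.get(w, 0.0) - first.get(w, 0.0) for w in set(first) | set(last)}
    ranked = sorted(deltas.items(), key=lambda kv: kv[1])
    return ranked[-top_n:][::-1], ranked[:top_n]  # (emerging, declining)
```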
The following limitations are explicitly acknowledged:
- Transcript quality varies. Auto-generated captions contain errors, especially for informal speech, music, or non-English content.
- Missing transcripts. Some videos have no available captions, reducing coverage in certain years.
- TTR is length-sensitive. Longer transcripts tend to have lower TTR. Sampling helps but doesn't fully control for this.
- Sampling introduces variance. Different random seeds or sample sizes may yield different results.
- Correlational only. No causal claims are made. Observed trends could reflect content changes, audience shifts, platform policy changes, or transcript quality differences.
- Single channel. Results are specific to the analyzed channel and may not generalize.
- Hypothesis-driven analysis with statistical controls
- Cross-channel comparison
- Improved lexical diversity measures (MTLD, HD-D)
- Topic-conditioned style analysis
- Statistical significance testing
- Controlling for transcript length effects
| Component | Tool |
|---|---|
| Video metadata | YouTube Data API v3 |
| Transcripts | youtube-transcript-api |
| Language detection | langdetect |
| Text processing | NLTK, scikit-learn |
| Data handling | pandas |
| Visualization | matplotlib |
Resume Positioning
Describe this project as an exploratory NLP / text analysis pipeline. Emphasize:
- End-to-end data pipeline: collection → preprocessing → analysis → visualization
- Reproducible methodology with documented limitations
- Year-wise trend analysis using interpretable NLP metrics
- Clean, modular code with clear separation of concerns
Avoid calling it "research" until controlled validation is added. Frame it as a data exploration and engineering exercise that demonstrates practical NLP skills.
GPL-3.0 — see LICENSE for details.