YouTube Language Drift (Prototype)

An exploratory NLP pipeline that analyzes how language usage changes over time in a long-running YouTube channel, using publicly available video transcripts.

Status: Prototype / Exploratory analysis
This is a data exploration project — not formal research. All findings are correlational.


What It Does

  • Collects video transcripts and metadata from a YouTube channel (default: PewDiePie)
  • Computes interpretable NLP metrics per video: language detection, lexical diversity (type-token ratio), word count
  • Aggregates results year-wise, sampling up to 50 videos per year (seed=42) for normalized comparison
  • Measures vocabulary drift across years using TF-IDF cosine similarity
  • Identifies emerging and declining words — which words grew or faded across years
  • Tracks top phrases (bigrams) per year and vocabulary overlap between year pairs
  • Produces reproducible trend plots and a plain-text language change report

Repository Structure

youtube-language-drift/
├── scripts/
│   ├── 01_collect_data.py          # Fetch metadata + transcripts via YouTube API
│   ├── 02_compute_metrics.py       # Compute NLP metrics, sample, aggregate
│   ├── 03_plot_results.py          # Generate trend visualizations
│   └── 04_vocabulary_analysis.py   # Deep vocabulary & language change analysis
├── notebooks/
│   └── exploration.ipynb           # Interactive exploration of results
├── data/                           # Generated data (excluded from git)
├── results/                        # Plots and figures (committed)
├── requirements.txt
├── .env.example                    # Template for API key
├── .gitignore
├── LICENSE
└── README.md

Setup

1. Clone and install

git clone https://github.com/YOUR_USERNAME/youtube-language-drift.git
cd youtube-language-drift
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
# source .venv/bin/activate
pip install -r requirements.txt

2. Get a YouTube Data API key

  1. Go to Google Cloud Console
  2. Create a project (or use an existing one)
  3. Enable the YouTube Data API v3
  4. Create an API key
  5. Copy .env.example to .env and paste your key:
cp .env.example .env
# Edit .env and replace your_api_key_here with your actual key
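
To confirm the key is picked up before running the full pipeline, a quick sanity check along these lines can help. This is only a sketch: it assumes python-dotenv and google-api-python-client are available (check requirements.txt) and that the variable in .env is named YOUTUBE_API_KEY; see .env.example for the actual name.

# check_key.py -- optional sanity check (sketch; the variable name YOUTUBE_API_KEY is an assumption)
import os

from dotenv import load_dotenv                 # python-dotenv
from googleapiclient.discovery import build    # google-api-python-client

load_dotenv()  # reads .env from the current directory
api_key = os.getenv("YOUTUBE_API_KEY")
if not api_key:
    raise SystemExit("No API key found -- did you copy .env.example to .env?")

# One cheap request to confirm the key is accepted by the YouTube Data API v3.
youtube = build("youtube", "v3", developerKey=api_key)
response = youtube.channels().list(part="snippet", forHandle="PewDiePie").execute()
print(response["items"][0]["snippet"]["title"])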

3. Run the pipeline

# Step 1: Collect data (use --limit for testing)
python scripts/01_collect_data.py --limit 100

# Step 2: Compute NLP metrics
python scripts/02_compute_metrics.py

# Step 3: Generate trend plots
python scripts/03_plot_results.py

# Step 4: Vocabulary & language change analysis
python scripts/04_vocabulary_analysis.py

Plots will be saved to results/. A plain-text language change report is saved to data/language_change_report.txt.

Note: Script 01 supports resume — if interrupted (e.g., by YouTube rate-limiting), re-run it and it will skip already-fetched transcripts.
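
The resume behaviour boils down to checking which transcripts are already on disk before fetching more. A minimal sketch of the idea, assuming one JSON file per video ID under data/transcripts/ (the layout actually used by the script may differ):

# Sketch of the resume check: skip video IDs that already have a saved transcript.
# The data/transcripts/*.json layout is an assumption made for illustration.
from pathlib import Path

TRANSCRIPT_DIR = Path("data/transcripts")

def pending_video_ids(all_video_ids):
    """Return only the IDs whose transcripts have not been fetched yet."""
    done = {p.stem for p in TRANSCRIPT_DIR.glob("*.json")}
    return [vid for vid in all_video_ids if vid not in done]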


Methodology

Data Collection

  • Video metadata (title, publish date, video ID) fetched via YouTube Data API v3
  • Transcripts fetched via youtube-transcript-api, preferring manual captions and falling back to auto-generated ones (see the sketch after this list)
  • Videos without available transcripts are logged and skipped
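
The fallback logic can be sketched as follows, using the pre-1.0 interface of youtube-transcript-api; the actual calls in 01_collect_data.py may differ.

# Sketch: prefer a manually created transcript, fall back to auto-generated captions.
# Written against the pre-1.0 youtube-transcript-api interface; newer versions differ.
from youtube_transcript_api import (
    YouTubeTranscriptApi,
    NoTranscriptFound,
    TranscriptsDisabled,
)

def fetch_transcript_text(video_id, languages=("en",)):
    """Return the transcript as plain text, or None if no captions are available."""
    try:
        available = YouTubeTranscriptApi.list_transcripts(video_id)
        try:
            transcript = available.find_manually_created_transcript(list(languages))
        except NoTranscriptFound:
            transcript = available.find_generated_transcript(list(languages))
        return " ".join(chunk["text"] for chunk in transcript.fetch())
    except (TranscriptsDisabled, NoTranscriptFound):
        return None  # logged and skipped upstream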

Sampling

For each year, up to 50 videos are randomly sampled (seed=42) from those with available transcripts. This normalizes cross-year comparisons and reduces bias from years with disproportionately many uploads.
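
In pandas terms, the sampling step is roughly the following sketch (it assumes the per-video records sit in a DataFrame with a year column; that column name is an assumption, not the script's actual schema):

# Sketch of per-year sampling: at most 50 videos per year, fixed seed for reproducibility.
import pandas as pd

def sample_per_year(df: pd.DataFrame, per_year: int = 50, seed: int = 42) -> pd.DataFrame:
    """df is expected to carry one row per video with a 'year' column (an assumption)."""
    return (
        df.groupby("year", group_keys=False)
          .apply(lambda g: g.sample(n=min(len(g), per_year), random_state=seed))
          .reset_index(drop=True)
    )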

Metrics

Metric | Method | Purpose
Language detection | langdetect library | Identify language consistency across videos
Type-token ratio (TTR) | unique tokens / total tokens (alpha-only, lowercased) | Measure lexical diversity
Vocabulary drift | TF-IDF cosine similarity between consecutive years | Track vocabulary change over time
Word count | total alpha tokens per transcript | Contextualize TTR (length sensitivity)
Top words per year | word frequency (per 10k tokens), stopwords removed | Track common vocabulary per year
Emerging/declining words | frequency change between earliest and latest year | Identify what changed most
Bigrams | top 2-word phrases per year | Capture common phrases and topics
Vocabulary overlap | Jaccard index between year-pair vocab sets | Measure shared vs. unique vocabulary
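
To make the table concrete, here is a hedged sketch of three of the core computations (TTR, year-to-year TF-IDF cosine similarity, and Jaccard vocabulary overlap). Function names and details are illustrative, not the scripts' actual API.

# Illustrative versions of three core metrics; names and details are assumptions.
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tokens(text):
    """Lowercased, alpha-only tokens, matching the TTR definition above."""
    return re.findall(r"[a-z]+", text.lower())

def type_token_ratio(text):
    toks = tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0

def yearly_drift(year_texts):
    """year_texts maps year -> all sampled transcripts for that year joined into one string.
    Returns the TF-IDF cosine similarity for each pair of consecutive years."""
    years = sorted(year_texts)
    tfidf = TfidfVectorizer().fit_transform([year_texts[y] for y in years])
    sims = cosine_similarity(tfidf)
    return {(a, b): sims[i, i + 1] for i, (a, b) in enumerate(zip(years, years[1:]))}

def vocabulary_overlap(text_a, text_b):
    """Jaccard index between the two vocabularies."""
    a, b = set(tokens(text_a)), set(tokens(text_b))
    return len(a & b) / len(a | b) if (a | b) else 0.0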

Visualizations

Core metrics (Script 03):

  • coverage.png — Available vs. sampled transcripts per year
  • lexical_diversity.png — Mean TTR per year with ±1 std dev
  • vocabulary_drift.png — Cosine similarity between consecutive years
  • combined_dashboard.png — All three metrics in one figure

Vocabulary analysis (Script 04):

  • top_words_heatmap.png — Word frequency heatmap across years
  • word_trends.png — Frequency trends for emerging and declining words
  • emerging_declining.png — Bar chart of biggest vocabulary shifts
  • bigram_trends.png — Top phrases per year
  • vocabulary_overlap_matrix.png — Jaccard similarity between year pairs
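
As an illustration only, a figure like vocabulary_drift.png could be produced along these lines (a sketch, not the code in 03_plot_results.py):

# Sketch: plot year-to-year drift and save it under results/.
# Expects a dict like {(2011, 2012): 0.74, ...}, e.g. from the drift sketch above.
import matplotlib.pyplot as plt

def plot_drift(drift, out_path="results/vocabulary_drift.png"):
    labels = [f"{a}-{b}" for a, b in drift]
    plt.figure(figsize=(8, 4))
    plt.plot(labels, list(drift.values()), marker="o")
    plt.xlabel("Consecutive year pair")
    plt.ylabel("TF-IDF cosine similarity")
    plt.title("Vocabulary drift between consecutive years")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(out_path, dpi=150)
    plt.close()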

Limitations

The following limitations are explicitly acknowledged:

  • Transcript quality varies. Auto-generated captions contain errors, especially for informal speech, music, or non-English content.
  • Missing transcripts. Some videos have no available captions, reducing coverage in certain years.
  • TTR is length-sensitive. Longer transcripts tend to have lower TTR. Sampling helps but doesn't fully control for this.
  • Sampling introduces variance. Different random seeds or sample sizes may yield different results.
  • Correlational only. No causal claims are made. Observed trends could reflect content changes, audience shifts, platform policy changes, or transcript quality differences.
  • Single channel. Results are specific to the analyzed channel and may not generalize.

Future Work

  • Hypothesis-driven analysis with statistical controls
  • Cross-channel comparison
  • Improved lexical diversity measures (MTLD, HD-D)
  • Topic-conditioned style analysis
  • Statistical significance testing
  • Controlling for transcript length effects

Tech Stack

Component | Tool
Video metadata | YouTube Data API v3
Transcripts | youtube-transcript-api
Language detection | langdetect
Text processing | NLTK, scikit-learn
Data handling | pandas
Visualization | matplotlib

Resume Positioning

Describe this project as an exploratory NLP / text analysis pipeline. Emphasize:

  • End-to-end data pipeline: collection → preprocessing → analysis → visualization
  • Reproducible methodology with documented limitations
  • Year-wise trend analysis using interpretable NLP metrics
  • Clean, modular code with clear separation of concerns

Avoid calling it "research" until controlled validation is added. Frame it as a data exploration and engineering exercise that demonstrates practical NLP skills.


License

GPL-3.0 — see LICENSE for details.
