An exploratory NLP pipeline that analyzes how language usage changes over time in a long-running YouTube channel, using publicly available video transcripts.
Status: Prototype / Exploratory analysis
This is a data exploration project — not formal research. All findings are correlational.
- Collects video transcripts and metadata from a YouTube channel (default: PewDiePie)
- Computes interpretable NLP metrics per video: language detection, lexical diversity (type-token ratio), word count
- Aggregates results year-wise, sampling up to 50 videos per year (seed=42) for normalized comparison
- Measures vocabulary drift across years using TF-IDF cosine similarity
- Identifies emerging and declining words — which words grew or faded across years
- Tracks top phrases (bigrams) per year and vocabulary overlap between year pairs
- Produces reproducible trend plots and a plain-text language change report
youtube-language-drift/
├── scripts/
│ ├── 01_collect_data.py # Fetch metadata + transcripts via YouTube API
│ ├── 02_compute_metrics.py # Compute NLP metrics, sample, aggregate
│ ├── 03_plot_results.py # Generate trend visualizations
│ └── 04_vocabulary_analysis.py # Deep vocabulary & language change analysis
├── notebooks/
│ └── exploration.ipynb # Interactive exploration of results
├── data/ # Generated data (excluded from git)
├── results/ # Plots and figures (committed)
├── requirements.txt
├── .env.example # Template for API key
├── .gitignore
├── LICENSE
└── README.md
git clone https://github.com/YOUR_USERNAME/youtube-language-drift.git
cd youtube-language-drift
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt
- Go to the Google Cloud Console
- Create a project (or use an existing one)
- Enable the YouTube Data API v3
- Create an API key
- Copy .env.example to .env and paste your key:
cp .env.example .env
# Edit .env and replace your_api_key_here with your actual key

# Step 1: Collect data (use --limit for testing)
python scripts/01_collect_data.py --limit 100
# Step 2: Compute NLP metrics
python scripts/02_compute_metrics.py
# Step 3: Generate trend plots
python scripts/03_plot_results.py
# Step 4: Vocabulary & language change analysis
python scripts/04_vocabulary_analysis.py

Plots will be saved to results/. A plain-text language change report is saved to data/language_change_report.txt.
Note: Script 01 supports resume — if interrupted (e.g., by YouTube rate-limiting), re-run it and it will skip already-fetched transcripts.
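A minimal sketch of that skip-if-exists pattern. The data/transcripts/<video_id>.json layout below is purely illustrative; the actual cache layout in the script may differ:

```python
import json
from pathlib import Path

# Hypothetical cache layout: one JSON file per video ID under data/transcripts/
TRANSCRIPT_DIR = Path("data/transcripts")

def fetch_missing(video_ids, fetch_transcript):
    """Fetch only transcripts that are not already cached on disk."""
    TRANSCRIPT_DIR.mkdir(parents=True, exist_ok=True)
    for video_id in video_ids:
        out_path = TRANSCRIPT_DIR / f"{video_id}.json"
        if out_path.exists():
            continue  # already fetched on a previous run
        transcript = fetch_transcript(video_id)  # returns None if no captions
        if transcript is None:
            continue  # no captions available; logged and skipped
        out_path.write_text(json.dumps(transcript), encoding="utf-8")
```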
- Video metadata (title, publish date, video ID) fetched via YouTube Data API v3
- Transcripts fetched via youtube-transcript-api (prefers manual captions, falls back to auto-generated); a sketch of this collection step follows the list
- Videos without available transcripts are logged and skipped
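A condensed sketch of how this collection step can look. The environment variable name (YOUTUBE_API_KEY), the English language code, and the uploads-playlist handling are illustrative assumptions, and the transcript calls use the pre-1.0 youtube-transcript-api interface:

```python
import os

from dotenv import load_dotenv
from googleapiclient.discovery import build
from youtube_transcript_api import (
    NoTranscriptFound,
    TranscriptsDisabled,
    YouTubeTranscriptApi,
)

load_dotenv()  # reads .env; the variable name below is an assumption
youtube = build("youtube", "v3", developerKey=os.environ["YOUTUBE_API_KEY"])

def list_uploads(uploads_playlist_id, page_token=None):
    """One page of metadata (title, publish date, video ID) from the uploads playlist."""
    return youtube.playlistItems().list(
        part="snippet",
        playlistId=uploads_playlist_id,
        maxResults=50,
        pageToken=page_token,
    ).execute()

def get_transcript(video_id, language="en"):
    """Prefer manual captions, fall back to auto-generated, return None if unavailable."""
    try:
        transcripts = YouTubeTranscriptApi.list_transcripts(video_id)
        try:
            transcript = transcripts.find_manually_created_transcript([language])
        except NoTranscriptFound:
            transcript = transcripts.find_generated_transcript([language])
        return " ".join(chunk["text"] for chunk in transcript.fetch())
    except (TranscriptsDisabled, NoTranscriptFound):
        return None  # logged and skipped by the pipeline
```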
For each year, up to 50 videos are randomly sampled (seed=42) from those with available transcripts. This normalizes cross-year comparisons and reduces bias from years with disproportionately many uploads.
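In pandas terms, that sampling step is essentially the following (the year and transcript column names are illustrative):

```python
import pandas as pd

MAX_PER_YEAR = 50
SEED = 42

def sample_per_year(df: pd.DataFrame) -> pd.DataFrame:
    """Sample up to MAX_PER_YEAR videos per year, reproducibly, from rows with transcripts."""
    with_transcripts = df[df["transcript"].notna()]
    return (
        with_transcripts
        .groupby("year", group_keys=False)
        .apply(lambda g: g.sample(n=min(len(g), MAX_PER_YEAR), random_state=SEED))
        .reset_index(drop=True)
    )
```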
| Metric | Method | Purpose |
|---|---|---|
| Language detection | langdetect library | Identify language consistency across videos |
| Type-token ratio (TTR) | unique tokens / total tokens (alpha-only, lowercased) | Measure lexical diversity |
| Vocabulary drift | TF-IDF cosine similarity between consecutive years | Track vocabulary change over time |
| Word count | Total alpha tokens per transcript | Contextualize TTR (length sensitivity) |
| Top words per year | Word frequency (per 10k tokens), stopwords removed | Track common vocabulary per year |
| Emerging/declining words | Frequency change between earliest and latest year | Identify what changed most |
| Bigrams | Top 2-word phrases per year | Capture common phrases and topics |
| Vocabulary overlap | Jaccard index between year-pair vocab sets | Measure shared vs. unique vocabulary |
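A compact sketch of two computations from the table above: per-transcript type-token ratio and year-to-year vocabulary drift. It assumes one TF-IDF document per year, formed by concatenating that year's sampled transcripts:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

TOKEN_RE = re.compile(r"[a-z]+")

def tokenize(text: str) -> list[str]:
    """Lowercased, alphabetic-only tokens."""
    return TOKEN_RE.findall(text.lower())

def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens (0.0 for empty transcripts)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def yearly_drift(year_docs: dict[int, str]) -> dict[tuple[int, int], float]:
    """Cosine similarity of TF-IDF vectors for consecutive years."""
    years = sorted(year_docs)
    matrix = TfidfVectorizer(stop_words="english").fit_transform(
        [year_docs[y] for y in years]
    )
    sims = cosine_similarity(matrix)
    return {
        (a, b): float(sims[i, i + 1])
        for i, (a, b) in enumerate(zip(years, years[1:]))
    }
```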
Core metrics (Script 03):
- coverage.png — Available vs. sampled transcripts per year
- lexical_diversity.png — Mean TTR per year with ±1 std dev
- vocabulary_drift.png — Cosine similarity between consecutive years
- combined_dashboard.png — All three metrics in one figure
Vocabulary analysis (Script 04):
- top_words_heatmap.png — Word frequency heatmap across years
- word_trends.png — Frequency trends for emerging and declining words
- emerging_declining.png — Bar chart of biggest vocabulary shifts
- bigram_trends.png — Top phrases per year
- vocabulary_overlap_matrix.png — Jaccard similarity between year pairs
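The overlap matrix and the emerging/declining chart come down to two small computations; a sketch under the same assumptions (per-year token lists, frequencies normalized per 10,000 tokens):

```python
from collections import Counter

def jaccard(vocab_a: set[str], vocab_b: set[str]) -> float:
    """Jaccard index between two per-year vocabularies."""
    if not vocab_a and not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

def rate_per_10k(tokens: list[str]) -> dict[str, float]:
    """Word frequency normalized per 10,000 tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 10_000 * n / total for word, n in counts.items()}

def emerging_declining(first_year_tokens, last_year_tokens, top_n=15):
    """Words with the largest rate change between the earliest and latest year."""
    first, last = rate_per_10k(first_year_tokens), rate_per_10k(last_year_tokens)
    deltas = {w: last.get(w, 0.0) - first.get(w, 0.0) for w in set(first) | set(last)}
    ranked = sorted(deltas.items(), key=lambda kv: kv[1])
    return ranked[-top_n:][::-1], ranked[:top_n]  # (emerging, declining)
```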
The following limitations are explicitly acknowledged:
- Transcript quality varies. Auto-generated captions contain errors, especially for informal speech, music, or non-English content.
- Missing transcripts. Some videos have no available captions, reducing coverage in certain years.
- TTR is length-sensitive. Longer transcripts tend to have lower TTR. Sampling helps but doesn't fully control for this.
- Sampling introduces variance. Different random seeds or sample sizes may yield different results.
- Correlational only. No causal claims are made. Observed trends could reflect content changes, audience shifts, platform policy changes, or transcript quality differences.
- Single channel. Results are specific to the analyzed channel and may not generalize.
- Hypothesis-driven analysis with statistical controls
- Cross-channel comparison
- Improved lexical diversity measures (MTLD, HD-D)
- Topic-conditioned style analysis
- Statistical significance testing
- Controlling for transcript length effects
| Component | Tool |
|---|---|
| Video metadata | YouTube Data API v3 |
| Transcripts | youtube-transcript-api |
| Language detection | langdetect |
| Text processing | NLTK, scikit-learn |
| Data handling | pandas |
| Visualization | matplotlib |
Resume Positioning
Describe this project as an exploratory NLP / text analysis pipeline. Emphasize:
- End-to-end data pipeline: collection → preprocessing → analysis → visualization
- Reproducible methodology with documented limitations
- Year-wise trend analysis using interpretable NLP metrics
- Clean, modular code with clear separation of concerns
Avoid calling it "research" until controlled validation is added. Frame it as a data exploration and engineering exercise that demonstrates practical NLP skills.
GPL-3.0 — see LICENSE for details.