Analyze long-run listening behavior using personal Last.fm history (~146k scrobbles). The project emphasizes a conventional, reproducible ETL pipeline and disciplined time-based aggregation rather than novelty or optimization.
Listening history is sourced from the Last.fm API via the user.getRecentTracks endpoint. Each scrobble represents a completed listening event. Data are retrieved via paginated API requests and stored locally as raw JSON responses.
Raw API data are not committed to this repository; all downstream analysis is performed on derived tabular datasets.
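The paginated retrieval described above can be sketched roughly as follows. This is an illustration, not the code in 01_fetch_lastfm.py: the `fetch_page` callable, the `page_NNNN.json` file naming, and the assumption that paging metadata sits under `recenttracks["@attr"]["totalPages"]` are all hypothetical choices made for the example.

```python
import json
from pathlib import Path
from typing import Callable

API_URL = "https://ws.audioscrobbler.com/2.0/"  # Last.fm API root


def page_params(api_key: str, username: str, page: int, limit: int = 200) -> dict:
    """Query parameters for one page of user.getRecentTracks (limit is an assumption)."""
    return {
        "method": "user.getRecentTracks",
        "user": username,
        "api_key": api_key,
        "format": "json",
        "limit": limit,
        "page": page,
    }


def fetch_all_pages(fetch_page: Callable[[int], dict], out_dir: Path) -> int:
    """Write one raw JSON file per page and return the number of pages written.

    fetch_page(page) is a hypothetical callable that performs the HTTP request
    and returns the decoded JSON body unchanged.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    page, total_pages = 1, 1
    while page <= total_pages:
        payload = fetch_page(page)
        # Assumed response shape: paging metadata under recenttracks["@attr"].
        total_pages = int(payload["recenttracks"]["@attr"]["totalPages"])
        path = out_dir / f"page_{page:04d}.json"  # hypothetical file naming
        # The payload is stored as received: no cleaning or filtering here.
        path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
        page += 1
    return page - 1
```

Passing the request function in as an argument keeps the loop independent of the HTTP client, so it can be exercised offline with a stub.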
- The Last.fm API can be flaky and intermittently returns server errors (500/502/503/504).
- The endpoint represents “recent” listening activity, so total scrobble counts will change between runs as listening continues.
- To preserve source fidelity, no cleaning or transformation is performed during data ingestion.
Retry logic and staged processing are used to mitigate these issues.
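As a minimal sketch of the retry idea, assuming exponential backoff is acceptable: the helper below retries only on the server errors listed above. The function name `call_with_retry`, the attempt count, and the delay values are illustrative assumptions, not the repository's actual settings.

```python
import random
import time
from typing import Callable

RETRYABLE_STATUS = {500, 502, 503, 504}  # transient server errors listed above


def call_with_retry(
    fetch: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call fetch() until it returns a non-retryable status or attempts run out.

    fetch is any zero-argument callable returning an object with a
    .status_code attribute, for example
    lambda: requests.get(url, params=params, timeout=30).
    """
    last_status = None
    for attempt in range(1, max_attempts + 1):
        response = fetch()
        last_status = response.status_code
        if last_status not in RETRYABLE_STATUS:
            return response
        if attempt < max_attempts:
            # Exponential backoff with a little jitter: roughly 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))
    raise RuntimeError(
        f"gave up after {max_attempts} attempts (last status {last_status})"
    )
```

Non-retryable responses (including client errors such as 403) are returned to the caller untouched, so the decision about what to do with them stays outside the retry loop.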
This repository is organized as a linear, reproducible pipeline:
- 01_fetch_lastfm.py
    Fetch paginated listening history from the Last.fm API and write one raw JSON file per page to disk. No transformation or filtering is performed at this stage.
- 02_flatten_lastfm.py
    Read all raw JSON page files, extract one row per scrobble, and write a single flattened interim dataset as CSV. Lineage fields (source_page, source_file) are included for traceability.
- 03_validate_interim.py
    Validate structural assumptions about the interim dataset (schema, required fields, timestamp parseability, duplicate detection). This script enforces pipeline contracts before any transformation is applied.
- 04_make_processed.py
    Read the validated interim dataset, derive explicit UTC timestamp and time-part fields (date, year, month, day-of-week, hour), and write a processed CSV for downstream analysis. No filtering, deduplication, or analytical aggregation is performed.
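The flattening step (02) might look roughly like the sketch below. It assumes the raw response nests tracks under `recenttracks["track"]`, with artist and album names under a `"#text"` key and the play time as epoch seconds under `date["uts"]`. Storing the epoch string in played_at_utc is an assumption made for this example; the actual script may store a different representation.

```python
import csv
from pathlib import Path

FIELDS = [
    "played_at_utc", "track_name", "artist_name", "album_name",
    "track_mbid", "artist_mbid", "album_mbid", "source_page", "source_file",
]


def flatten_page(payload: dict, source_file: str) -> list[dict]:
    """Turn one raw page (assumed response shape) into one row per track entry."""
    recent = payload["recenttracks"]
    tracks = recent.get("track", [])
    if isinstance(tracks, dict):  # defensive: tolerate a single object instead of a list
        tracks = [tracks]
    source_page = recent.get("@attr", {}).get("page", "")
    rows = []
    for track in tracks:
        rows.append({
            # An entry without a date (e.g. "now playing") keeps an empty
            # timestamp; nothing is dropped at this stage.
            "played_at_utc": track.get("date", {}).get("uts", ""),
            "track_name": track.get("name", ""),
            "artist_name": track.get("artist", {}).get("#text", ""),
            "album_name": track.get("album", {}).get("#text", ""),
            "track_mbid": track.get("mbid", ""),
            "artist_mbid": track.get("artist", {}).get("mbid", ""),
            "album_mbid": track.get("album", {}).get("mbid", ""),
            "source_page": source_page,
            "source_file": source_file,
        })
    return rows


def write_interim(rows: list[dict], out_csv: Path) -> None:
    """Write the flattened rows as a single CSV with a fixed column order."""
    with out_csv.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Carrying source_page and source_file on every row means any suspicious record can be traced back to the exact raw file it came from.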
Running these scripts in order reproduces the interim and processed datasets used for analysis.
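The contract checks in step 03 could be approximated as below. This is a sketch under assumptions: the set of required fields, the duplicate key (timestamp, artist, track), and the treatment of played_at_utc as epoch seconds are all illustrative and may differ from what 03_validate_interim.py enforces.

```python
from collections import Counter
from datetime import datetime, timezone

EXPECTED_COLUMNS = [
    "played_at_utc", "track_name", "artist_name", "album_name",
    "track_mbid", "artist_mbid", "album_mbid", "source_page", "source_file",
]
REQUIRED_FIELDS = ["played_at_utc", "track_name", "artist_name"]  # assumption


def validate_interim(columns: list[str], rows: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks pass."""
    if list(columns) != EXPECTED_COLUMNS:
        return [f"unexpected columns: {list(columns)}"]
    problems = []
    seen = Counter()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the CSV header
        for field in REQUIRED_FIELDS:
            if not row[field].strip():
                problems.append(f"line {line_no}: empty {field}")
        raw_ts = row["played_at_utc"].strip()
        if raw_ts:
            try:
                # Assumes epoch seconds; adjust if the interim format differs.
                datetime.fromtimestamp(int(raw_ts), tz=timezone.utc)
            except (ValueError, OverflowError, OSError):
                problems.append(f"line {line_no}: unparseable played_at_utc {raw_ts!r}")
        seen[(raw_ts, row["artist_name"], row["track_name"])] += 1
    # Duplicates are reported, not removed: validation never alters the data.
    duplicates = sum(n - 1 for n in seen.values() if n > 1)
    if duplicates:
        problems.append(f"{duplicates} duplicate scrobble row(s)")
    return problems
```

A caller would typically read the CSV with `csv.DictReader`, pass `reader.fieldnames` and the row list to `validate_interim`, and exit non-zero if anything is returned.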
- Flattened scrobble dataset
data/interim/lastfm_scrobbles_interim.csv
Schema:
played_at_utc, track_name, artist_name, album_name, track_mbid, artist_mbid, album_mbid, source_page, source_file
- Time-enriched scrobble dataset
data/processed/lastfm_scrobbles_processed.csv
Derived fields include:
- UTC timestamp (played_at_ts_utc)
- Calendar date (date_utc)
- Year, month, day-of-week, and hour (UTC)
The processed dataset preserves all interim fields and adds only derived time features. No records are filtered or modified.
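A minimal sketch of the time-feature derivation, assuming played_at_utc holds epoch seconds: only played_at_ts_utc and date_utc are names taken from this README; the remaining column names (year_utc, month_utc, day_of_week_utc, hour_utc) are hypothetical.

```python
from datetime import datetime, timezone


def add_time_parts(row: dict) -> dict:
    """Return a copy of the row with derived UTC time fields appended."""
    # Assumes played_at_utc is epoch seconds stored as a string.
    ts = datetime.fromtimestamp(int(row["played_at_utc"]), tz=timezone.utc)
    return {
        **row,  # every interim field is preserved unchanged
        "played_at_ts_utc": ts.isoformat(),
        "date_utc": ts.date().isoformat(),
        "year_utc": ts.year,                # hypothetical column name
        "month_utc": ts.month,              # hypothetical column name
        "day_of_week_utc": ts.weekday(),    # hypothetical; Monday=0 ... Sunday=6
        "hour_utc": ts.hour,                # hypothetical column name
    }
```

Using an explicit `timezone.utc` avoids any dependence on the machine's local time zone, which matters for hour-of-day and day-of-week aggregation.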
- Python 3.11 (conda / Anaconda)
- Key packages: requests, python-dotenv
- Secrets are managed via a local .env file (not committed)
- .env.example documents required environment variables
To reproduce:
- Clone the repository
- Create a .env file with:
    LASTFM_API_KEY=...
    LASTFM_USERNAME=...
- Run:
    python src/01_fetch_lastfm.py
    python src/02_flatten_lastfm.py
    python src/03_validate_interim.py
    python src/04_make_processed.py
A Tableau dashboard built on the processed dataset analyzes longitudinal listening behavior across three dimensions:
- Volume — total scrobbles per year
- Breadth — distinct artists per year
- Intensity — average scrobbles per artist per year
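The dashboard itself is built in Tableau, but the three measures can be reproduced in a few lines for a sanity check. The sketch below assumes intensity is computed as yearly scrobbles divided by yearly distinct artists, and that date_utc is an ISO date string (YYYY-MM-DD); the function name `yearly_metrics` is hypothetical.

```python
from collections import defaultdict


def yearly_metrics(rows: list[dict]) -> dict[int, dict[str, float]]:
    """Compute volume, breadth, and intensity per calendar year (UTC)."""
    volume = defaultdict(int)    # year -> total scrobbles
    artists = defaultdict(set)   # year -> distinct artist names
    for row in rows:
        year = int(row["date_utc"][:4])  # assumes ISO dates, e.g. "2013-05-01"
        volume[year] += 1
        artists[year].add(row["artist_name"])
    return {
        year: {
            "volume": volume[year],
            "breadth": len(artists[year]),
            # Assumed definition: average scrobbles per distinct artist.
            "intensity": volume[year] / len(artists[year]),
        }
        for year in sorted(volume)
    }
```

Under this definition intensity falls whenever breadth grows faster than volume, so the three measures should be read together rather than in isolation.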
Key structural findings:
- Listening volume increased materially after 2018.
- Distinct artist breadth expanded sharply beginning around 2013.
- Average scrobbles per artist declined as breadth increased.
Tableau Public dashboard: Long-run listening behavior
