Advanced Multi-Class Classification Workflow using Class-Balanced Voting Ensemble
Build a robust machine learning classification model to accurately predict the track_genre of Spotify songs based on distinct audio features and metadata (without AutoML tools, external APIs, or pre-trained models).
DataSprint - Data Science Club, NIST University — Kaggle Hackathon Link
| Attributes | Description |
|---|---|
| `track_genre` | Target categorical variable (e.g., pop, rock, classical) |
| `danceability`, `energy`, `tempo` | Core acoustic and rhythmic features |
| `acousticness`, `liveness`, `valence` | Tonal and mood indicators |
| `duration_ms`, `time_signature` | Track structural and length data |
| Metric | Value |
|---|---|
| Training tracks | 84,800 |
| Test tracks | 34,200 (unlabeled flat test.csv) |
| Format | .csv (Tabular numeric and categorical data) |
Features Overview: The datasets (`train.csv`, `test.csv`) contain numerous acoustic features describing each track's sound profile, alongside entity metadata (e.g., `artists`). The target `track_genre` is present only in the train split.
- Missing values were imputed separately by column type: median for numeric columns, mode for categorical columns.
- Applied Frequency Encoding to the `artists` feature.
- Normalization: Applied `StandardScaler` to put all numeric features on a common scale.
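
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `preprocess`, its parameters, and the derived column name `artists_freq` are hypothetical, and fitting every statistic on the train split only is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(train, test, numeric_cols, categorical_cols, artist_col="artists"):
    """Impute, frequency-encode the artist column, and scale (hypothetical helper)."""
    train, test = train.copy(), test.copy()

    # Median imputation for numeric columns; medians come from the train split only.
    medians = train[numeric_cols].median()
    train[numeric_cols] = train[numeric_cols].fillna(medians)
    test[numeric_cols] = test[numeric_cols].fillna(medians)

    # Mode imputation for categorical columns.
    for col in categorical_cols:
        mode = train[col].mode().iloc[0]
        train[col] = train[col].fillna(mode)
        test[col] = test[col].fillna(mode)

    # Frequency encoding: replace each artist by its relative frequency in train.
    # Artists that never appear in train get a frequency of 0 (assumption).
    freq = train[artist_col].value_counts(normalize=True)
    train["artists_freq"] = train[artist_col].map(freq)
    test["artists_freq"] = test[artist_col].map(freq).fillna(0.0)

    # Standardize to zero mean and unit variance using train statistics.
    feature_cols = list(numeric_cols) + ["artists_freq"]
    scaler = StandardScaler()
    X_train = pd.DataFrame(
        scaler.fit_transform(train[feature_cols]), columns=feature_cols, index=train.index
    )
    X_test = pd.DataFrame(
        scaler.transform(test[feature_cols]), columns=feature_cols, index=test.index
    )
    return X_train, X_test
```

Fitting the medians, modes, frequencies, and scaler on the train split alone keeps information from `test.csv` out of the fitted statistics.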
Derived new features from the raw inputs to strengthen the predictive signal.
- Built interaction terms representing acoustic dynamics (e.g., `energy_x_danceability`).
- Derived `mood_divergence` metrics from overlapping feature spaces.

Enriching the base feature space let the tree-based algorithms find more informative splits early in their decision paths.
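
A small sketch of what such derived features might look like. Only the feature names `energy_x_danceability` and `mood_divergence` come from this README; the function name is hypothetical, and the formula used for `mood_divergence` (absolute gap between `valence` and `energy`) is purely an assumption for illustration, since the exact definition lives in the notebook.

```python
import pandas as pd


def add_derived_features(df):
    """Return a copy of df with two illustrative derived columns (hypothetical helper)."""
    out = df.copy()

    # Interaction term: product of two raw acoustic features.
    out["energy_x_danceability"] = out["energy"] * out["danceability"]

    # ASSUMPTION: mood_divergence taken as |valence - energy|.
    # The notebook may define it differently.
    out["mood_divergence"] = (out["valence"] - out["energy"]).abs()
    return out
```

The same function would be applied to both the train and test frames so that the two splits share identical columns.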
Tackled complex, overlapping music genre clusters with a Class-Balanced Voting Ensemble of two complementary tree-based models.
- LightGBM (Gradient Boosting) -> fits trees sequentially on the gradients of the loss, capturing non-linear structure in dense genre clusters.
- Extra Trees (Extremely Randomized Trees) -> mitigates overfitting by choosing split thresholds at random, which decorrelates the individual trees.

Voting combines the sharply fitted boundaries of gradient boosting with the smoother, randomized boundaries of Extra Trees for more stable, generalized performance.
Used stratified cross-validation instead of a single validation holdout. Each training track is predicted exactly once by a model that never saw it during fitting (Out-of-Fold), so the metrics cover the entire training set and serve as an estimate of performance on the 34,200 unseen test tracks.
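
The out-of-fold logic can be sketched as below. The helper name `oof_predict`, the number of folds, and the seed are illustrative assumptions; any scikit-learn compatible classifier can be passed as `model`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold


def oof_predict(model, X, y, n_splits=5, seed=42):
    """Return one out-of-fold prediction per training row (hypothetical helper)."""
    X, y = np.asarray(X), np.asarray(y)
    oof = np.empty(len(y), dtype=y.dtype)

    # Stratification keeps the genre proportions similar in every fold.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in skf.split(X, y):
        fold_model = clone(model)  # fresh, unfitted copy for each fold
        fold_model.fit(X[train_idx], y[train_idx])
        # Each row is predicted only by the model that did not train on it.
        oof[valid_idx] = fold_model.predict(X[valid_idx])
    return oof
```

Because every row lands in exactly one validation fold, comparing `oof` against `y` scores the model on the full training set without ever evaluating a row the fold's model was fitted on.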
Evaluated across the entire dataset via Out-of-Fold (OOF) validation strategy
| Metric | Score |
|---|---|
| Accuracy | 0.3321 |
| Precision | 0.3291 |
| F1 Score | 0.3276 |
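
Metrics like these can be computed from out-of-fold predictions with scikit-learn, as in the sketch below. The README does not state which averaging was used for precision and F1, so the `average="macro"` default here is an assumption, and `summarize_oof` is a hypothetical helper name.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score


def summarize_oof(y_true, y_pred, average="macro"):
    """Compute accuracy, precision, and F1 for multi-class predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 scores a class that is never predicted as 0
        # instead of emitting a warning.
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "f1": f1_score(y_true, y_pred, average=average, zero_division=0),
    }
```

Macro averaging weights every genre equally, whereas `average="weighted"` would weight each genre by its number of tracks.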
Spotify Track Genre Classification
├── NOTICE
├── LICENSE
├── README.md -> High-level summary (This file)
├── data/ -> Provided datasets (`train.csv`, `test.csv`)
├── report.txt -> Detailed approach and methodology report
├── submission.csv -> Final Kaggle prediction submission (34,200 rows)
└── track_genre_classification.ipynb -> Full Jupyter Notebook - preprocessing, training, evaluation