This repository contains my code for Data Mining Spring 2026 Assignment 3: Human Activity Recognition (HAR).
The task is to classify human activities from wrist-worn accelerometer sequences. Each test sample is a 5-minute CSV file, and the goal is to predict one activity label from 0 to 5 for each file.
The final Kaggle submission was generated using a reproducible ensemble-based pipeline followed by a deterministic model-guided correction stage.
Course: Data Mining, Spring 2026
Assignment: Assignment 3 — Human Activity Recognition
Student ID: [FILL YOUR STUDENT ID]
Name: [FILL YOUR NAME]
Kaggle Team Name: [FILL YOUR STUDENT ID]
Best Public Leaderboard Score: 0.8154
Best Public Submission File: SUBMIT_v63_01_more_3to1_x3.csv
The dataset consists of accelerometer readings collected from users wearing wrist devices. Each CSV file represents a 5-minute time window. The model must predict one activity label for each test CSV file.
The Kaggle submission format is:
Id,Label
where:
- `Id` is the file ID of a test CSV file.
- `Label` is the predicted activity class.
- Valid label values are integers from `0` to `5`.
The evaluation metric is macro-F1 score. Macro-F1 gives equal importance to each class, so performance on minority activity classes is important.
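To make the effect of minority classes concrete, here is a minimal sketch of macro-F1 in plain Python. The labels are made up for illustration, and scoring an absent class as 0 is an assumption; Kaggle's exact handling of empty classes may differ.

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels=range(6)):
    """Unweighted mean of per-class F1 scores over the given labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[t] += 1   # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Assumption: a class with no true or predicted samples scores 0.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: class 1 is rare, and missing its single sample zeroes its F1.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]
print(macro_f1(y_true, y_pred, labels=[0, 1, 2]))  # about 0.63
```

Accuracy on the same toy data is 5/6, so one missed minority sample costs far more under macro-F1 than under accuracy.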
The repository contains the final reproducible code and the required supporting artifacts.
.
├── README.md
├── requirements.txt
├── .gitignore
├── har_pipeline_v64_professional_final_push.py
├── make_submission_valid.py
├── submission_v57_minority_candidate_ranking.csv
├── submission_v54_ttw_brave.csv
└── SUBMIT_v63_01_more_3to1_x3.csv
| File | Purpose |
|---|---|
| `har_pipeline_v64_professional_final_push.py` | Final reproducible submission generator |
| `make_submission_valid.py` | Submission validator that checks `Id,Label` format and alignment |
| `submission_v57_minority_candidate_ranking.csv` | Model-produced transition candidate ranking |
| `submission_v54_ttw_brave.csv` | Base submission used before final correction |
| `SUBMIT_v63_01_more_3to1_x3.csv` | Current best submission used as the starting point for the final stage |
| `requirements.txt` | Python package dependencies |
| `README.md` | Run instructions and reproducibility documentation |
The raw Kaggle dataset is not included in this repository. Please download it from the Kaggle competition page.
After downloading the Kaggle dataset, place it in the following structure:
project_root/
│
├── nycu-data-mining-assignment-3/
│   ├── sample_submission.csv
│   ├── train/
│   │   └── train/
│   │       ├── User_xxx/
│   │       │   ├── *.csv
│   │       │   └── ...
│   └── test/
│       └── test/
│           ├── User_xxx/
│           │   ├── *.csv
│           │   └── ...
│
├── train_labels.csv
├── har_pipeline_v64_professional_final_push.py
├── make_submission_valid.py
├── submission_v57_minority_candidate_ranking.csv
├── submission_v54_ttw_brave.csv
└── SUBMIT_v63_01_more_3to1_x3.csv
The final V64 script only needs the official sample_submission.csv, the base submission, the current best submission, and the candidate ranking file. Earlier training scripts used the raw train/test folders and train_labels.csv to build the ensemble and candidate ranking.
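A quick pre-flight check can confirm that the four inputs named above are in place before running the final script. This is only a sketch: the relative paths follow the folder layout shown in this README, and `missing_inputs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Relative paths taken from the folder layout described in this README.
REQUIRED_INPUTS = [
    "nycu-data-mining-assignment-3/sample_submission.csv",
    "submission_v54_ttw_brave.csv",
    "SUBMIT_v63_01_more_3to1_x3.csv",
    "submission_v57_minority_candidate_ranking.csv",
]

def missing_inputs(project_root="."):
    """Return the required input files that are not present under project_root."""
    root = Path(project_root)
    return [rel for rel in REQUIRED_INPUTS if not (root / rel).is_file()]

missing = missing_inputs(".")
if missing:
    print("Missing inputs:", *missing, sep="\n  ")
else:
    print("All V64 inputs found.")
```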
This project was run using Python in a virtual environment.
On Windows PowerShell:
python -m venv .venv_cuda
.\.venv_cuda\Scripts\Activate.ps1
pip install --upgrade pip
pip install -r requirements.txt
The final script mainly requires:
numpy
pandas
Some earlier modeling stages used additional packages such as:
scikit-learn
lightgbm
xgboost
catboost
torch
The final solution consists of two major parts:
- Ensemble-based prediction pipeline
- Deterministic model-guided correction stage
The ensemble stage produced strong probability-based predictions from multiple models. Then, an out-of-fold candidate-ranking policy identified high-confidence label transition corrections.
Public leaderboard experiments showed that simple majority-class corrections were not sufficient. The strongest improvement came from a minority-related correction branch, especially the transition:
3 -> 1
Therefore, the final pipeline continues this empirically validated 3 -> 1 cleanup branch in a controlled and reproducible manner.
The final script does not manually edit individual test IDs. It applies deterministic rules to a model-produced candidate ranking.
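The idea of applying a deterministic rule to a ranking can be sketched as follows. This is not the code in `har_pipeline_v64_professional_final_push.py`: the function name and the ranking keys (`Id`, `from_label`, `to_label`, `score`) are hypothetical, and the real ranking file may use a different schema.

```python
def apply_top_k_transitions(labels, ranking, src, dst, k):
    """Return a copy of `labels` with the k best-ranked src->dst candidates applied.

    labels:  dict mapping Id -> current label
    ranking: list of dicts with hypothetical keys "Id", "from_label", "to_label", "score"
    """
    candidates = [
        r for r in ranking
        if r["from_label"] == src and r["to_label"] == dst
        and labels.get(r["Id"]) == src  # only touch rows still labeled src
    ]
    # Sort by score (descending), then by Id, so ties resolve identically on every run.
    candidates.sort(key=lambda r: (-r["score"], r["Id"]))
    updated = dict(labels)
    for r in candidates[:k]:
        updated[r["Id"]] = dst
    return updated

# Toy data, not taken from the real ranking file.
labels = {"a": 3, "b": 3, "c": 3, "d": 5}
ranking = [
    {"Id": "a", "from_label": 3, "to_label": 1, "score": 0.90},
    {"Id": "b", "from_label": 3, "to_label": 1, "score": 0.95},
    {"Id": "c", "from_label": 3, "to_label": 1, "score": 0.40},
    {"Id": "d", "from_label": 5, "to_label": 1, "score": 0.99},
]
print(apply_top_k_transitions(labels, ranking, src=3, dst=1, k=2))
# {'a': 1, 'b': 1, 'c': 3, 'd': 5}
```

Because the selection depends only on the ranking and on `k`, rerunning it yields the same submission, which is what separates it from editing individual test IDs by hand.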
The final script is:
har_pipeline_v64_professional_final_push.py
It takes the following inputs:
sample_submission.csv
submission_v54_ttw_brave.csv
SUBMIT_v63_01_more_3to1_x3.csv
submission_v57_minority_candidate_ranking.csv
It generates final candidate submissions, validates them, and writes audit files.
python har_pipeline_v64_professional_final_push.py `
--sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
--base_csv "submission_v54_ttw_brave.csv" `
--current_best_csv "SUBMIT_v63_01_more_3to1_x3.csv" `
--current_best_score 0.8154 `
--v57_ranking "submission_v57_minority_candidate_ranking.csv" `
--output_prefix "submission_v64_final"
Expected outputs include:
submission_v64_final_more_3to1_x1.csv
submission_v64_final_more_3to1_x2.csv
submission_v64_final_more_3to1_x3.csv
submission_v64_final_more_3to1_x4.csv
submission_v64_final_more_3to1_x6.csv
submission_v64_final_combo_3to1x2_5to1x1.csv
submission_v64_final_manifest.csv
submission_v64_final_manifest.json
submission_v64_final_report_ready_summary.md
The manifest files are for reproducibility and report documentation. They should not be uploaded to Kaggle.
Before uploading a CSV file to Kaggle, validate it using:
python make_submission_valid.py `
--input_csv "submission_v64_final_more_3to1_x2.csv" `
--sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
--output_csv "SUBMIT_v64_01_more_3to1_x2.csv"
Expected output:
OK wrote: SUBMIT_v64_01_more_3to1_x2.csv
rows=6849 columns=['Id', 'Label']
Ready to submit this output CSV.
Only the validated SUBMIT_*.csv file should be uploaded to Kaggle.
The final pipeline is designed to be reproducible.
It guarantees that:
- The output file has exactly two columns: `Id` and `Label`.
- All IDs are aligned with `sample_submission.csv`.
- There are no duplicated IDs.
- There are no missing or extra IDs.
- All labels are integers from `0` to `5`.
- All final variants are generated deterministically from code.
- No hidden test labels are used.
- No manual per-ID label editing is performed outside the script.
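The format-related guarantees in this list can be checked mechanically. The sketch below uses only the standard library and is an illustration, not the actual logic of `make_submission_valid.py`; the function name `check_submission` and the toy IDs are hypothetical.

```python
import csv
import io

def check_submission(sample_csv_text, submission_csv_text, valid_labels=range(6)):
    """Return a list of problems found; an empty list means all checks passed."""
    sample = list(csv.DictReader(io.StringIO(sample_csv_text)))
    reader = csv.DictReader(io.StringIO(submission_csv_text))
    rows = list(reader)
    problems = []
    if reader.fieldnames != ["Id", "Label"]:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    ids = [r["Id"] for r in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicated IDs")
    # Same IDs in the same order also rules out missing or extra IDs.
    if ids != [r["Id"] for r in sample]:
        problems.append("IDs are not aligned with the sample submission")
    allowed = {str(v) for v in valid_labels}
    if any(r["Label"] not in allowed for r in rows):
        problems.append("labels outside the valid range")
    return problems

# Toy CSV contents; real files have one row per test CSV.
sample = "Id,Label\nu1_a,0\nu1_b,0\n"
good = "Id,Label\nu1_a,3\nu1_b,1\n"
bad = "Id,Label\nu1_a,3\nu1_a,7\n"
print(check_submission(sample, good))  # []
print(check_submission(sample, bad))
```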
The script also writes:
submission_v64_final_manifest.csv
submission_v64_final_manifest.json
submission_v64_final_report_ready_summary.md
These files document the generated variants, label counts, and transition changes.
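As a rough illustration of what one such audit record could contain, the sketch below compares a variant against the submission it started from. The field names and the helper `summarize_variant` are hypothetical; the actual manifest schema written by the V64 script may differ.

```python
from collections import Counter
import json

def summarize_variant(name, before, after):
    """Build one manifest-style entry; `before` and `after` map the same Ids to labels."""
    transitions = Counter(
        f"{before[i]}->{after[i]}" for i in before if before[i] != after[i]
    )
    return {
        "variant": name,
        "label_counts": dict(sorted(Counter(after.values()).items())),
        "n_changed": sum(transitions.values()),
        "transitions": dict(transitions),
    }

# Toy submissions, not real predictions.
before = {"a": 3, "b": 3, "c": 0, "d": 5}
after = {"a": 1, "b": 1, "c": 0, "d": 5}
entry = summarize_variant("toy_more_3to1_x2", before, after)
print(json.dumps(entry, indent=2))
```

Recording transitions rather than only final label counts makes it possible to see exactly which correction branch produced each variant.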
The following table summarizes the important public leaderboard improvements.
| Stage | Main Idea | Public Score |
|---|---|---|
| V54 brave | Earlier strong baseline with residual correction | 0.8106 |
| V57 mixed minority top35 | Minority transition correction | 0.8128 |
| V59 plus 3 -> 1 top3 | Continued 3 -> 1 cleanup | 0.8138 |
| V61 add 3 -> 1 next5 | Stronger 3 -> 1 continuation | 0.8150 |
| V63/V64 current best branch | Controlled 3 -> 1 continuation | 0.8154 |
The best public score achieved was:
0.8154
This repository contains reproducible code for the final submission stage.
The final correction stage is based on:
- out-of-fold model development,
- candidate transition ranking,
- deterministic transition selection,
- and strict submission validation.
AI tools were used only as assistance for coding, debugging, and report organization. The submitted code, report, and Kaggle result are intended to be consistent and reproducible.
To reproduce the final candidate:
- Download the Kaggle dataset.
- Place the dataset in the required folder structure.
- Install dependencies using `requirements.txt`.
- Run `har_pipeline_v64_professional_final_push.py`.
- Validate the selected output using `make_submission_valid.py`.
- Upload the validated `SUBMIT_*.csv` file to Kaggle.
Example full run:
python har_pipeline_v64_professional_final_push.py `
--sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
--base_csv "submission_v54_ttw_brave.csv" `
--current_best_csv "SUBMIT_v63_01_more_3to1_x3.csv" `
--current_best_score 0.8154 `
--v57_ranking "submission_v57_minority_candidate_ranking.csv" `
--output_prefix "submission_v64_final"
python make_submission_valid.py `
--input_csv "submission_v64_final_more_3to1_x2.csv" `
--sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
--output_csv "SUBMIT_v64_01_more_3to1_x2.csv"
The final file to upload is:
SUBMIT_v64_01_more_3to1_x2.csv
For this assignment, please refer to the submitted report for the full method description, ablation study, and leaderboard screenshots.