Data Mining Assignment 3: Human Activity Recognition

This repository contains my code for Data Mining Spring 2026 Assignment 3: Human Activity Recognition (HAR).

The task is to classify human activities from wrist-worn accelerometer sequences. Each test sample is a CSV file covering a 5-minute window, and the goal is to predict one activity label (an integer from 0 to 5) for each file.

The final Kaggle submission was generated using a reproducible ensemble-based pipeline followed by a deterministic model-guided correction stage.

1. Project Information

Course: Data Mining, Spring 2026
Assignment: Assignment 3: Human Activity Recognition
Student ID: [FILL YOUR STUDENT ID]
Name: [FILL YOUR NAME]
Kaggle Team Name: [FILL YOUR STUDENT ID]
Best Public Leaderboard Score: 0.8154
Best Public Submission File: SUBMIT_v63_01_more_3to1_x3.csv

2. Task Description

The dataset consists of accelerometer readings collected from users wearing wrist devices. Each CSV file represents a 5-minute time window. The model must predict one activity label for each test CSV file.
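As an illustration of how one 5-minute window might be reduced to a fixed-length input for a classifier, here is a minimal sketch. It assumes each row of a CSV holds three axis readings (x, y, z); the actual column layout and the features used by the earlier training scripts are not documented in this README, so `window_features` and its feature names are hypothetical.

```python
import math
import statistics


def window_features(samples):
    """Summarize one window of (x, y, z) readings (assumed layout)."""
    # Orientation-independent signal magnitude for every sample.
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    return {
        "mag_mean": statistics.fmean(magnitudes),
        "mag_std": statistics.pstdev(magnitudes),
        "mag_min": min(magnitudes),
        "mag_max": max(magnitudes),
    }


# Tiny made-up window with three samples.
features = window_features([(0.0, 0.0, 1.0), (0.0, 3.0, 4.0), (1.0, 0.0, 0.0)])
```

Summary statistics like these are only one possible representation; sequence models can consume the raw readings directly.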

The Kaggle submission format is:

Id,Label

where:

Id is the file ID of a test CSV file.
Label is the predicted activity class.
Valid label values are integers from 0 to 5.

The evaluation metric is the macro-F1 score. Macro-F1 averages the per-class F1 scores with equal weight, so an error on a minority activity class costs as much as an error on a majority class.
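To make the effect on minority classes concrete, the sketch below computes macro-F1 in plain Python. It is an illustration only: how Kaggle scores a class that never appears is not stated here, and this version assigns such a class an F1 of 0.

```python
def macro_f1(y_true, y_pred, labels=range(6)):
    """Unweighted mean of per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true or predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two-class toy case: always predicting the majority class.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
score = macro_f1(y_true, y_pred, labels=[0, 1])
```

In this toy case accuracy is 0.8, but the missed minority class pulls macro-F1 down to 4/9 (about 0.44).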

3. Repository Structure

The repository contains the final reproducible code and the required supporting artifacts.

.
├── README.md
├── requirements.txt
├── .gitignore
├── har_pipeline_v64_professional_final_push.py
├── make_submission_valid.py
├── submission_v57_minority_candidate_ranking.csv
├── submission_v54_ttw_brave.csv
└── SUBMIT_v63_01_more_3to1_x3.csv

File Description

| File | Purpose |
| --- | --- |
| `har_pipeline_v64_professional_final_push.py` | Final reproducible submission generator |
| `make_submission_valid.py` | Submission validator that checks `Id,Label` format and alignment |
| `submission_v57_minority_candidate_ranking.csv` | Model-produced transition candidate ranking |
| `submission_v54_ttw_brave.csv` | Base submission used before final correction |
| `SUBMIT_v63_01_more_3to1_x3.csv` | Current best submission used as the starting point for the final stage |
| `requirements.txt` | Python package dependencies |
| `README.md` | Run instructions and reproducibility documentation |

The raw Kaggle dataset is not included in this repository. Please download it from the Kaggle competition page.

4. Dataset Placement

After downloading the Kaggle dataset, place it in the following structure:

project_root/
│
├── nycu-data-mining-assignment-3/
│   ├── sample_submission.csv
│   ├── train/
│   │   └── train/
│   │       ├── User_xxx/
│   │       │   ├── *.csv
│   │       │   └── ...
│   └── test/
│       └── test/
│           ├── User_xxx/
│           │   ├── *.csv
│           │   └── ...
│
├── train_labels.csv
├── har_pipeline_v64_professional_final_push.py
├── make_submission_valid.py
├── submission_v57_minority_candidate_ranking.csv
├── submission_v54_ttw_brave.csv
└── SUBMIT_v63_01_more_3to1_x3.csv

The final V64 script only needs the official sample_submission.csv, the base submission, the current best submission, and the candidate ranking file. Earlier training scripts used the raw train/test folders and train_labels.csv to build the ensemble and candidate ranking.
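A quick way to confirm that the dataset landed in the expected layout is to list the CSV files under each `User_*` folder. The helper below is a hypothetical sketch and not part of the repository's scripts; it assumes only the folder structure shown above.

```python
from pathlib import Path


def list_user_csvs(split_root):
    """Map each User_* folder name to its sorted CSV file names."""
    root = Path(split_root)
    return {
        user_dir.name: sorted(p.name for p in user_dir.glob("*.csv"))
        for user_dir in sorted(root.glob("User_*"))
        if user_dir.is_dir()
    }
```

For example, `list_user_csvs("nycu-data-mining-assignment-3/test/test")` should return one entry per test user; an empty result usually means the nested `test/test` level is missing.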

5. Environment Setup

This project was run using Python in a virtual environment.

5.1 Create virtual environment

On Windows PowerShell:

python -m venv .venv_cuda
.\.venv_cuda\Scripts\Activate.ps1

5.2 Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

The final script mainly requires:

numpy
pandas

Some earlier modeling stages used additional packages such as:

scikit-learn
lightgbm
xgboost
catboost
torch

6. Final Method Summary

The final solution consists of two major parts:

Ensemble-based prediction pipeline
Deterministic model-guided correction stage

The ensemble stage produced strong probability-based predictions from multiple models. Then, an out-of-fold candidate-ranking policy identified high-confidence label transition corrections.
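This README does not specify how the ensemble combines its models. One common option is soft voting, sketched below under that assumption: each model's class probabilities are averaged and the class with the highest mean wins. The function names and the example probabilities are illustrative.

```python
def average_probabilities(model_probs):
    """Average per-class probabilities across models for one sample."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]


def predict_label(model_probs):
    """Return the class index with the highest averaged probability."""
    avg = average_probabilities(model_probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Three hypothetical models scoring one sample over labels 0..5.
example = [
    [0.1, 0.5, 0.1, 0.3, 0.0, 0.0],
    [0.1, 0.2, 0.1, 0.6, 0.0, 0.0],
    [0.1, 0.3, 0.1, 0.5, 0.0, 0.0],
]
label = predict_label(example)
```

Here the first model alone would pick class 1, but the averaged probabilities favor class 3.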

Public leaderboard experiments showed that simple majority-class corrections were not sufficient. The strongest improvement came from a minority-related correction branch, especially the transition:

3 -> 1

Therefore, the final pipeline continues this empirically validated 3 -> 1 cleanup branch in a controlled and reproducible manner.

The final script does not manually edit individual test IDs. It applies deterministic rules to a model-produced candidate ranking.
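The idea of applying deterministic rules to a candidate ranking can be sketched as follows. This is not the code in `har_pipeline_v64_professional_final_push.py`: the ranking layout (`Id`, source label, target label, score) and the function `apply_top_k_transitions` are assumptions made for illustration.

```python
def apply_top_k_transitions(labels, ranking, src, dst, k):
    """Relabel the k highest-ranked src -> dst candidates.

    labels:  dict mapping Id -> current predicted label
    ranking: list of (Id, src, dst, score) tuples (assumed layout)
    Returns a corrected copy of labels and the list of changed IDs.
    """
    candidates = [
        r for r in ranking
        if r[1] == src and r[2] == dst and labels.get(r[0]) == src
    ]
    # Sort by score descending, then by Id, so ties break the same way every run.
    candidates.sort(key=lambda r: (-r[3], r[0]))
    corrected = dict(labels)
    changed = []
    for sample_id, _, _, _ in candidates[:k]:
        corrected[sample_id] = dst
        changed.append(sample_id)
    return corrected, changed


labels = {"a": 3, "b": 3, "c": 3, "d": 5}
ranking = [("a", 3, 1, 0.9), ("b", 3, 1, 0.4), ("c", 3, 1, 0.7), ("d", 5, 1, 0.95)]
corrected, changed = apply_top_k_transitions(labels, ranking, src=3, dst=1, k=2)
```

Because the selection depends only on the ranking and on `k`, rerunning it with the same inputs always relabels the same IDs (here `a` and `c`), with no per-ID manual edits.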

7. Final Reproducible Pipeline

The final script is:

har_pipeline_v64_professional_final_push.py

It takes the following inputs:

sample_submission.csv
submission_v54_ttw_brave.csv
SUBMIT_v63_01_more_3to1_x3.csv
submission_v57_minority_candidate_ranking.csv

It generates final candidate submissions, validates them, and writes audit files.

7.1 Run final generator

python har_pipeline_v64_professional_final_push.py `
  --sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
  --base_csv "submission_v54_ttw_brave.csv" `
  --current_best_csv "SUBMIT_v63_01_more_3to1_x3.csv" `
  --current_best_score 0.8154 `
  --v57_ranking "submission_v57_minority_candidate_ranking.csv" `
  --output_prefix "submission_v64_final"

Expected outputs include:

submission_v64_final_more_3to1_x1.csv
submission_v64_final_more_3to1_x2.csv
submission_v64_final_more_3to1_x3.csv
submission_v64_final_more_3to1_x4.csv
submission_v64_final_more_3to1_x6.csv
submission_v64_final_combo_3to1x2_5to1x1.csv
submission_v64_final_manifest.csv
submission_v64_final_manifest.json
submission_v64_final_report_ready_summary.md

The manifest files are for reproducibility and report documentation. They should not be uploaded to Kaggle.

8. Validate Kaggle Submission

Before uploading a CSV file to Kaggle, validate it using:

python make_submission_valid.py `
  --input_csv "submission_v64_final_more_3to1_x2.csv" `
  --sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
  --output_csv "SUBMIT_v64_01_more_3to1_x2.csv"

Expected output:

OK wrote: SUBMIT_v64_01_more_3to1_x2.csv
rows=6849 columns=['Id', 'Label']
Ready to submit this output CSV.

Only the validated SUBMIT_*.csv file should be uploaded to Kaggle.
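The kind of checks such a validator performs can be sketched in a few lines. This is a minimal illustration and not the contents of `make_submission_valid.py`; the function `validate_submission` and its messages are hypothetical, and it takes CSV text rather than file paths to stay self-contained.

```python
import csv
import io

VALID_LABELS = {"0", "1", "2", "3", "4", "5"}


def _read(text):
    """Parse CSV text into (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return reader.fieldnames or [], rows


def validate_submission(submission_text, sample_text):
    """Return a list of problems; an empty list means the file passed."""
    columns, rows = _read(submission_text)
    _, sample_rows = _read(sample_text)
    if columns != ["Id", "Label"]:
        return [f"unexpected columns: {columns}"]
    problems = []
    ids = [r["Id"] for r in rows]
    expected = [r["Id"] for r in sample_rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicated IDs")
    if set(ids) != set(expected):
        problems.append("missing or extra IDs")
    elif ids != expected:
        problems.append("IDs not in sample_submission order")
    bad = [r["Id"] for r in rows if r["Label"] not in VALID_LABELS]
    if bad:
        problems.append(f"invalid labels for IDs: {bad}")
    return problems


sample = "Id,Label\na,0\nb,0\n"
ok_result = validate_submission("Id,Label\na,0\nb,5\n", sample)
bad_result = validate_submission("Id,Label\na,0\na,7\n", sample)
```

With real files, the text can be loaded with `Path(...).read_text()` before calling the function.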

9. Reproducibility Guarantees

The final pipeline is designed to be reproducible.

It guarantees that:

The output file has exactly two columns: Id and Label.
All IDs are aligned with sample_submission.csv.
There are no duplicated IDs.
There are no missing or extra IDs.
All labels are integers from 0 to 5.
All final variants are generated deterministically from code.
No hidden test labels are used.
No manual per-ID label editing is performed outside the script.

The script also writes:

submission_v64_final_manifest.csv
submission_v64_final_manifest.json
submission_v64_final_report_ready_summary.md

These files document the generated variants, label counts, and transition changes.
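An audit summary of this kind can be derived by comparing two submissions. The sketch below is illustrative only: the actual manifest schema is not described in this README, so the keys returned by the hypothetical `summarize_variant` are assumptions.

```python
from collections import Counter


def summarize_variant(before, after):
    """Compare two Id -> label mappings that share the same IDs."""
    transitions = Counter(
        (before[i], after[i]) for i in before if before[i] != after[i]
    )
    return {
        "label_counts": dict(sorted(Counter(after.values()).items())),
        "transitions": {
            f"{s} -> {d}": n for (s, d), n in sorted(transitions.items())
        },
        "n_changed": sum(transitions.values()),
    }


before = {"a": 3, "b": 3, "c": 1, "d": 5}
after = {"a": 1, "b": 3, "c": 1, "d": 1}
summary = summarize_variant(before, after)
```

The resulting dictionary can be written out with `json.dump` to give a record of which transitions each variant applied.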

10. Public Leaderboard Progress

The following table summarizes the important public leaderboard improvements.

| Stage | Main Idea | Public Score |
| --- | --- | --- |
| V54 brave | Earlier strong baseline with residual correction | 0.8106 |
| V57 mixed minority top35 | Minority transition correction | 0.8128 |
| V59 plus `3 -> 1` top3 | Continued `3 -> 1` cleanup | 0.8138 |
| V61 add `3 -> 1` next5 | Stronger `3 -> 1` continuation | 0.8150 |
| V63/V64 current best branch | Controlled `3 -> 1` continuation | 0.8154 |

The best public score achieved was:

0.8154

11. Notes on Academic Integrity

This repository contains reproducible code for the final submission stage.

The final correction stage is based on:

out-of-fold model development,
candidate transition ranking,
deterministic transition selection,
and strict submission validation.

AI tools were used only as assistance for coding, debugging, and report organization. The submitted code, report, and Kaggle result are intended to be consistent and reproducible.

12. How to Reproduce the Final Candidate

To reproduce the final candidate:

Download the Kaggle dataset.
Place the dataset in the required folder structure.
Install dependencies using requirements.txt.
Run har_pipeline_v64_professional_final_push.py.
Validate the selected output using make_submission_valid.py.
Upload the validated SUBMIT_*.csv file to Kaggle.

Example full run:

python har_pipeline_v64_professional_final_push.py `
  --sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
  --base_csv "submission_v54_ttw_brave.csv" `
  --current_best_csv "SUBMIT_v63_01_more_3to1_x3.csv" `
  --current_best_score 0.8154 `
  --v57_ranking "submission_v57_minority_candidate_ranking.csv" `
  --output_prefix "submission_v64_final"

python make_submission_valid.py `
  --input_csv "submission_v64_final_more_3to1_x2.csv" `
  --sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
  --output_csv "SUBMIT_v64_01_more_3to1_x2.csv"

The final file to upload is:

SUBMIT_v64_01_more_3to1_x2.csv

13. Contact

For this assignment, please refer to the submitted report for the full method description, ablation study, and leaderboard screenshots.
