BasithZhang/dm-assignment3-har

Data Mining Assignment 3: Human Activity Recognition

This repository contains my code for Data Mining Spring 2026 Assignment 3: Human Activity Recognition (HAR).

The task is to classify human activities from wrist-worn accelerometer sequences. Each test sample is a 5-minute CSV file, and the goal is to predict one activity label from 0 to 5 for each file.

The final Kaggle submission was generated using a reproducible ensemble-based pipeline followed by a deterministic model-guided correction stage.


1. Project Information

  • Course: Data Mining, Spring 2026
  • Assignment: Assignment 3: Human Activity Recognition
  • Student ID: [FILL YOUR STUDENT ID]
  • Name: [FILL YOUR NAME]
  • Kaggle Team Name: [FILL YOUR STUDENT ID]
  • Best Public Leaderboard Score: 0.8154
  • Best Public Submission File: SUBMIT_v63_01_more_3to1_x3.csv


2. Task Description

The dataset consists of accelerometer readings collected from users wearing wrist devices. Each CSV file represents a 5-minute time window. The model must predict one activity label for each test CSV file.

The Kaggle submission format is:

Id,Label

where:

  • Id is the file ID of a test CSV file.
  • Label is the predicted activity class.
  • Valid label values are integers from 0 to 5.

The evaluation metric is the macro-F1 score. Macro-F1 is the unweighted mean of the per-class F1 scores, so a minority activity class counts as much as a majority class and errors on rare classes are costly.
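
As a worked illustration of why minority classes matter under this metric, the sketch below computes macro-F1 from scratch on a toy two-class problem. The function name macro_f1 and the toy labels are illustrative assumptions and are not taken from the competition code; in practice, scikit-learn's f1_score with average="macro" is the usual library route.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples is scored 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: class 0 is the majority, class 1 is a minority.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_all_majority = [0] * 10        # never predicts the minority class
y_one_fixed = [0] * 9 + [1]      # recovers one of the two minority samples

accuracy = sum(t == p for t, p in zip(y_true, y_all_majority)) / len(y_true)
score_majority = macro_f1(y_true, y_all_majority, labels=[0, 1])
score_fixed = macro_f1(y_true, y_one_fixed, labels=[0, 1])

print(f"accuracy={accuracy:.2f}")
print(f"macro-F1 (all majority)={score_majority:.4f}")
print(f"macro-F1 (one minority fixed)={score_fixed:.4f}")
```

In this toy case the all-majority prediction reaches 80% accuracy but a macro-F1 of only about 0.44, and correcting a single minority sample raises macro-F1 to about 0.80.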


3. Repository Structure

The repository contains the final reproducible code and the required supporting artifacts.

.
├── README.md
├── requirements.txt
├── .gitignore
├── har_pipeline_v64_professional_final_push.py
├── make_submission_valid.py
├── submission_v57_minority_candidate_ranking.csv
├── submission_v54_ttw_brave.csv
└── SUBMIT_v63_01_more_3to1_x3.csv

File Description

| File | Purpose |
|------|---------|
| har_pipeline_v64_professional_final_push.py | Final reproducible submission generator |
| make_submission_valid.py | Submission validator that checks Id,Label format and alignment |
| submission_v57_minority_candidate_ranking.csv | Model-produced transition candidate ranking |
| submission_v54_ttw_brave.csv | Base submission used before final correction |
| SUBMIT_v63_01_more_3to1_x3.csv | Current best submission used as the starting point for the final stage |
| requirements.txt | Python package dependencies |
| README.md | Run instructions and reproducibility documentation |

The raw Kaggle dataset is not included in this repository. Please download it from the Kaggle competition page.


4. Dataset Placement

After downloading the Kaggle dataset, place it in the following structure:

project_root/
│
├── nycu-data-mining-assignment-3/
│   ├── sample_submission.csv
│   ├── train/
│   │   └── train/
│   │       ├── User_xxx/
│   │       │   ├── *.csv
│   │       │   └── ...
│   └── test/
│       └── test/
│           ├── User_xxx/
│           │   ├── *.csv
│           │   └── ...
│
├── train_labels.csv
├── har_pipeline_v64_professional_final_push.py
├── make_submission_valid.py
├── submission_v57_minority_candidate_ranking.csv
├── submission_v54_ttw_brave.csv
└── SUBMIT_v63_01_more_3to1_x3.csv

The final V64 script only needs the official sample_submission.csv, the base submission, the current best submission, and the candidate ranking file. Earlier training scripts used the raw train/test folders and train_labels.csv to build the ensemble and candidate ranking.
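
Before running the final script, it can help to confirm that these four inputs are where the layout above expects them. The following is a minimal sketch; the helper name missing_inputs is hypothetical and is not part of the repository, and the relative paths are copied from the layout above.

```python
from pathlib import Path

# Inputs read by the final V64 script, relative to project_root.
REQUIRED_INPUTS = [
    "nycu-data-mining-assignment-3/sample_submission.csv",
    "submission_v54_ttw_brave.csv",
    "SUBMIT_v63_01_more_3to1_x3.csv",
    "submission_v57_minority_candidate_ranking.csv",
]


def missing_inputs(project_root="."):
    """Return the required input paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in REQUIRED_INPUTS if not (root / rel).is_file()]


missing = missing_inputs(".")
if missing:
    print("Missing inputs:")
    for rel in missing:
        print("  " + rel)
else:
    print("All required inputs found.")
```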


5. Environment Setup

This project was run using Python in a virtual environment.

5.1 Create virtual environment

On Windows PowerShell:

python -m venv .venv_cuda
.\.venv_cuda\Scripts\Activate.ps1

5.2 Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

The final script mainly requires:

numpy
pandas

Some earlier modeling stages used additional packages such as:

scikit-learn
lightgbm
xgboost
catboost
torch

6. Final Method Summary

The final solution consists of two major parts:

  1. Ensemble-based prediction pipeline
  2. Deterministic model-guided correction stage

The ensemble stage produced strong probability-based predictions from multiple models. Then, an out-of-fold candidate-ranking policy identified high-confidence label transition corrections.

Public leaderboard experiments showed that simple majority-class corrections were not sufficient. The strongest improvement came from a minority-related correction branch, especially the transition:

3 -> 1

Therefore, the final pipeline continues this empirically validated 3 -> 1 cleanup branch in a controlled and reproducible manner.

The final script does not manually edit individual test IDs. It applies deterministic rules to a model-produced candidate ranking.
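
The idea of applying a deterministic rule to a ranked candidate list can be sketched as follows. This is not the code in har_pipeline_v64_professional_final_push.py: the function name apply_top_k_transitions and the ranking fields (Id, from_label, to_label, score) are assumptions made for illustration, since this README does not document the schema of the ranking file.

```python
def apply_top_k_transitions(labels, ranking, src, dst, k):
    """Relabel the k highest-scoring src -> dst candidates.

    labels  : dict mapping Id -> current label
    ranking : list of dicts with the hypothetical keys
              "Id", "from_label", "to_label", "score"
    """
    candidates = [
        r for r in ranking
        if r["from_label"] == src
        and r["to_label"] == dst
        and labels.get(r["Id"]) == src
    ]
    # Sort by score (descending), then by Id, so ties break identically on every run.
    candidates.sort(key=lambda r: (-r["score"], r["Id"]))
    updated = dict(labels)  # leave the input mapping untouched
    for r in candidates[:k]:
        updated[r["Id"]] = dst
    return updated


# Toy data, invented for this example.
labels = {"a": 3, "b": 3, "c": 3, "d": 5}
ranking = [
    {"Id": "a", "from_label": 3, "to_label": 1, "score": 0.91},
    {"Id": "b", "from_label": 3, "to_label": 1, "score": 0.97},
    {"Id": "c", "from_label": 3, "to_label": 1, "score": 0.60},
    {"Id": "d", "from_label": 5, "to_label": 1, "score": 0.99},
]

result = apply_top_k_transitions(labels, ranking, src=3, dst=1, k=2)
print(result)
```

Because the selection depends only on the ranking and on k, the same inputs always yield the same output, and no individual ID is chosen by hand.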


7. Final Reproducible Pipeline

The final script is:

har_pipeline_v64_professional_final_push.py

It takes the following inputs:

sample_submission.csv
submission_v54_ttw_brave.csv
SUBMIT_v63_01_more_3to1_x3.csv
submission_v57_minority_candidate_ranking.csv

It generates final candidate submissions, validates them, and writes audit files.

7.1 Run final generator

python har_pipeline_v64_professional_final_push.py `
  --sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
  --base_csv "submission_v54_ttw_brave.csv" `
  --current_best_csv "SUBMIT_v63_01_more_3to1_x3.csv" `
  --current_best_score 0.8154 `
  --v57_ranking "submission_v57_minority_candidate_ranking.csv" `
  --output_prefix "submission_v64_final"

Expected outputs include:

submission_v64_final_more_3to1_x1.csv
submission_v64_final_more_3to1_x2.csv
submission_v64_final_more_3to1_x3.csv
submission_v64_final_more_3to1_x4.csv
submission_v64_final_more_3to1_x6.csv
submission_v64_final_combo_3to1x2_5to1x1.csv
submission_v64_final_manifest.csv
submission_v64_final_manifest.json
submission_v64_final_report_ready_summary.md

The manifest files are for reproducibility and report documentation. They should not be uploaded to Kaggle.


8. Validate Kaggle Submission

Before uploading a CSV file to Kaggle, validate it using:

python make_submission_valid.py `
  --input_csv "submission_v64_final_more_3to1_x2.csv" `
  --sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
  --output_csv "SUBMIT_v64_01_more_3to1_x2.csv"

Expected output:

OK wrote: SUBMIT_v64_01_more_3to1_x2.csv
rows=6849 columns=['Id', 'Label']
Ready to submit this output CSV.

Only the validated SUBMIT_*.csv file should be uploaded to Kaggle.


9. Reproducibility Guarantees

The final pipeline is designed to be reproducible.

It guarantees that:

  • The output file has exactly two columns: Id and Label.
  • All IDs are aligned with sample_submission.csv.
  • There are no duplicated IDs.
  • There are no missing or extra IDs.
  • All labels are integers from 0 to 5.
  • All final variants are generated deterministically from code.
  • No hidden test labels are used.
  • No manual per-ID label editing is performed outside the script.
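
The format-related checks in this list can be sketched with the standard library alone. This is an illustration of the checks, not the contents of make_submission_valid.py; the function name validate_submission and the in-memory CSV strings are assumptions made for the example.

```python
import csv
import io

VALID_LABELS = {"0", "1", "2", "3", "4", "5"}


def validate_submission(submission_text, sample_text):
    """Return a list of problems; an empty list means every check passed."""
    sub = list(csv.reader(io.StringIO(submission_text)))
    sample = list(csv.reader(io.StringIO(sample_text)))
    if not sub or sub[0] != ["Id", "Label"]:
        return ["header must be exactly Id,Label"]
    if any(len(row) != 2 for row in sub[1:]):
        return ["every row must have exactly two fields"]

    problems = []
    sub_ids = [row[0] for row in sub[1:]]
    sample_ids = [row[0] for row in sample[1:]]
    if len(set(sub_ids)) != len(sub_ids):
        problems.append("duplicated IDs")
    if set(sub_ids) != set(sample_ids):
        problems.append("missing or extra IDs")
    elif sub_ids != sample_ids:
        problems.append("IDs are not in sample_submission order")
    if any(row[1] not in VALID_LABELS for row in sub[1:]):
        problems.append("labels must be integers from 0 to 5")
    return problems


# Toy inputs, invented for this example.
sample = "Id,Label\nf1,0\nf2,0\nf3,0\n"
good = "Id,Label\nf1,3\nf2,1\nf3,5\n"
bad = "Id,Label\nf1,3\nf1,7\nf3,5\n"

good_problems = validate_submission(good, sample)
bad_problems = validate_submission(bad, sample)
print(good_problems)
print(bad_problems)
```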

The script also writes:

submission_v64_final_manifest.csv
submission_v64_final_manifest.json
submission_v64_final_report_ready_summary.md

These files document the generated variants, label counts, and transition changes.
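
A summary of this kind can be derived by comparing a variant against the base submission. The sketch below shows one way to compute label counts and transition counts; the function name summarize_variant and the output keys are hypothetical, and the actual manifest schema written by the script may differ.

```python
import json
from collections import Counter


def summarize_variant(base, variant):
    """Summarize a variant relative to a base submission.

    base, variant: dicts mapping Id -> label, covering the same Ids.
    """
    label_counts = Counter(variant.values())
    transitions = Counter(
        (base[i], variant[i]) for i in base if base[i] != variant[i]
    )
    return {
        "label_counts": {str(k): label_counts[k] for k in sorted(label_counts)},
        "transitions": {f"{a}->{b}": n for (a, b), n in sorted(transitions.items())},
        "n_changed": sum(transitions.values()),
    }


# Toy data, invented for this example.
base = {"f1": 3, "f2": 3, "f3": 5, "f4": 0}
variant = {"f1": 1, "f2": 1, "f3": 5, "f4": 0}

summary = summarize_variant(base, variant)
print(json.dumps(summary, indent=2))
```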


10. Public Leaderboard Progress

The following table summarizes the important public leaderboard improvements.

| Stage | Main Idea | Public Score |
|-------|-----------|--------------|
| V54 brave | Earlier strong baseline with residual correction | 0.8106 |
| V57 mixed minority top35 | Minority transition correction | 0.8128 |
| V59 plus 3 -> 1 top3 | Continued 3 -> 1 cleanup | 0.8138 |
| V61 add 3 -> 1 next5 | Stronger 3 -> 1 continuation | 0.8150 |
| V63/V64 current best branch | Controlled 3 -> 1 continuation | 0.8154 |

The best public score achieved was:

0.8154

11. Notes on Academic Integrity

This repository contains reproducible code for the final submission stage.

The final correction stage is based on:

  • out-of-fold model development,
  • candidate transition ranking,
  • deterministic transition selection,
  • and strict submission validation.

AI tools were used only as assistance for coding, debugging, and report organization. The submitted code, report, and Kaggle result are intended to be consistent and reproducible.


12. How to Reproduce the Final Candidate

To reproduce the final candidate:

  1. Download the Kaggle dataset.
  2. Place the dataset in the required folder structure.
  3. Install dependencies using requirements.txt.
  4. Run har_pipeline_v64_professional_final_push.py.
  5. Validate the selected output using make_submission_valid.py.
  6. Upload the validated SUBMIT_*.csv file to Kaggle.

Example full run:

python har_pipeline_v64_professional_final_push.py `
  --sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
  --base_csv "submission_v54_ttw_brave.csv" `
  --current_best_csv "SUBMIT_v63_01_more_3to1_x3.csv" `
  --current_best_score 0.8154 `
  --v57_ranking "submission_v57_minority_candidate_ranking.csv" `
  --output_prefix "submission_v64_final"

python make_submission_valid.py `
  --input_csv "submission_v64_final_more_3to1_x2.csv" `
  --sample_submission "nycu-data-mining-assignment-3\sample_submission.csv" `
  --output_csv "SUBMIT_v64_01_more_3to1_x2.csv"

The final file to upload is:

SUBMIT_v64_01_more_3to1_x2.csv

13. Contact

For this assignment, please refer to the submitted report for the full method description, ablation study, and leaderboard screenshots.
