Canonical CIDX Refactored Dataset

AI-assisted workflow and dataset engineering pipeline for constructing a canonical Chain Indexing (CIDX) refactoring dataset, focused on Pandas chained indexing patterns in ML/DL repositories.


Project Overview

This repository contains:

  • extracted Chain Indexing samples
  • taxonomy refinement workflow
  • semantic filtering pipeline
  • refactored code samples
  • validation utilities
  • dataset reconstruction scripts
  • runtime verification workflow

The project was developed as part of the internship:

AI4SE: Application of LLMs in Software Engineering


Research Objective

The primary objective of this work is to construct a high-quality canonical Chain Indexing refactoring dataset by:

  • identifying valid Pandas chained indexing patterns
  • filtering out non-canonical traversal/indexing patterns
  • generating semantically meaningful refactorings
  • validating runtime behavior
  • preserving reproducibility through automated workflows

Canonical Chain Indexing Definition

This project follows a strict canonical interpretation of Chain Indexing.

VALID Canonical Examples

df['A'][0]
df['B'][1] = 10
data['x']['y']

These patterns may introduce:

  • intermediate object creation
  • SettingWithCopyWarning ambiguity (an assignment may write to a temporary copy instead of the original DataFrame)
  • readability concerns
  • unintended dataframe modification behavior
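
As an illustration of what a refactoring for these patterns can look like, the sketch below rewrites `base['col'][row]` into a single `.loc` access using only the standard `ast` module (Python 3.9+). It is not the project's actual refactoring tool: the `ChainToLoc` and `refactor` names are hypothetical, and the rewrite assumes the first key is a column label and the second key is a row label. Positional access would call for `.iloc` instead.

```python
import ast


class ChainToLoc(ast.NodeTransformer):
    """Rewrite `base['col'][row]` into `base.loc[row, 'col']`.

    Assumption: the first key is a column label and the second key is a
    row label, which is what `.loc` expects.
    """

    def visit_Subscript(self, node):
        self.generic_visit(node)  # handle nested expressions first
        inner = node.value
        if (
            isinstance(inner, ast.Subscript)
            and isinstance(inner.value, ast.Name)
            and isinstance(inner.slice, ast.Constant)
            and isinstance(inner.slice.value, str)
        ):
            loc = ast.Attribute(value=inner.value, attr="loc", ctx=ast.Load())
            key = ast.Tuple(elts=[node.slice, inner.slice], ctx=ast.Load())
            # Keep the original context so assignments stay assignments.
            return ast.Subscript(value=loc, slice=key, ctx=node.ctx)
        return node


def refactor(source: str) -> str:
    """Return `source` with canonical chained indexing rewritten to `.loc`."""
    tree = ChainToLoc().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


print(refactor("df['A'][0]"))       # df.loc[0, 'A']
print(refactor("df['B'][1] = 10"))  # df.loc[1, 'B'] = 10
```

A single `.loc` call performs the row and column selection in one step, so an assignment targets the original DataFrame rather than an intermediate object.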

Non-Canonical Examples (Rejected)

The following patterns are NOT considered canonical Chain Indexing under the refined taxonomy.

Tensor Traversal

tensor[0][0]

Shape Access

shape[0]

Framework Traversal

model.state_dict()[key]

Attribute Traversal

result.logits.shape[0]
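
The distinction between the accepted and rejected examples can be approximated statically. The sketch below is one possible heuristic, not the project's actual filter: it accepts two stacked subscripts on a bare variable name whose first key is a string label, which matches the valid examples and excludes the four rejected ones. The function names are hypothetical, and since no type information is available the check cannot confirm that the variable really is a DataFrame.

```python
import ast


def is_canonical_chain(node: ast.AST) -> bool:
    """Heuristic: `name['label'][...]`, i.e. two stacked subscripts on a bare
    variable where the first key is a string (column-like) label."""
    if not isinstance(node, ast.Subscript):
        return False
    inner = node.value
    return (
        isinstance(inner, ast.Subscript)
        and isinstance(inner.value, ast.Name)
        and isinstance(inner.slice, ast.Constant)
        and isinstance(inner.slice.value, str)
    )


def find_canonical_chains(source: str) -> list:
    """Return (line number, expression text) for each heuristic match."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if is_canonical_chain(node):
            hits.append((node.lineno, ast.unparse(node)))
    return sorted(hits)


sample = (
    "a = df['A'][0]\n"
    "b = tensor[0][0]\n"
    "c = shape[0]\n"
    "d = model.state_dict()[key]\n"
    "e = result.logits.shape[0]\n"
    "f = data['x']['y']\n"
)
print(find_canonical_chains(sample))
```

Only lines 1 and 6 of `sample` are reported; the tensor, shape, framework, and attribute traversals are skipped because they either have a single subscript or use a non-string first key.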

Repository Structure

canonical_cidx_refactored_dataset/
│
├── data/
│   ├── raw/
│   └── processed/
│
├── scripts/
│
├── notes/
│
├── outputs/
│
└── README.md

Dataset Pipeline

Raw Extraction
    ↓
Chain Sample Identification
    ↓
AI Assisted Refactoring
    ↓
Semantic Filtering
    ↓
Canonical Taxonomy Refinement
    ↓
Runtime Verification
    ↓
Final Dataset Merge

Validation Workflow

The validation pipeline combines:

  • AST-based syntactic validation
  • semantic/manual inspection
  • runtime verification using Google Colab
  • duplicate detection
  • smell location verification
  • canonical Pandas filtering
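
Two of these checks, syntactic validation and smell location verification, lend themselves to a short sketch with the standard `ast` module. The entry fields `before`, `after`, and `smell_line` are hypothetical names chosen for illustration; the real dataset schema may differ.

```python
import ast


def parses(code: str) -> bool:
    """AST-based syntactic check: True if `code` is valid Python."""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True


def smell_on_line(code: str, line: int) -> bool:
    """True if a stacked subscript (`x[...][...]`) starts on `line`."""
    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Subscript)
            and node.lineno == line
        ):
            return True
    return False


def validate_entry(entry: dict) -> list:
    """Collect problems for one entry (field names are hypothetical)."""
    problems = []
    for field in ("before", "after"):
        if not parses(entry[field]):
            problems.append(f"{field}: syntax error")
    # Only check the location when the original code parsed cleanly.
    if not problems and not smell_on_line(entry["before"], entry["smell_line"]):
        problems.append("smell_line: no chained index on that line")
    return problems


entry = {
    "before": "import pandas as pd\nx = df['A'][0]\n",
    "after": "import pandas as pd\nx = df.loc[0, 'A']\n",
    "smell_line": 2,
}
print(validate_entry(entry))  # [] means no problems were found
```

Parsing alone never imports or executes the sample, so this step can run before the runtime verification stage.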

Scripts

The scripts/ directory contains utilities for:

  • dataset extraction
  • refactoring entry generation
  • rejection handling
  • duplicate checking
  • smell location inspection
  • runtime verification support
  • dataset merging
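
As a rough idea of how duplicate checking might work, the sketch below fingerprints each sample by hashing its AST dump, so that differences in whitespace, quoting, or comments do not hide a duplicate. This is an assumed approach for illustration, not a description of the scripts in this repository, and the function names are hypothetical.

```python
import ast
import hashlib


def fingerprint(code: str) -> str:
    """Hash the AST dump so formatting and comments do not affect the result."""
    dumped = ast.dump(ast.parse(code))  # line/column info is omitted by default
    return hashlib.sha256(dumped.encode("utf-8")).hexdigest()


def find_duplicates(samples: list) -> list:
    """Return (first_index, duplicate_index) pairs for repeated samples."""
    seen = {}
    pairs = []
    for i, code in enumerate(samples):
        key = fingerprint(code)
        if key in seen:
            pairs.append((seen[key], i))
        else:
            seen[key] = i
    return pairs


samples = [
    "x = df['A'][0]",
    'x = df["A"][ 0 ]  # same code, different formatting',
    "y = df['B'][1]",
]
print(find_duplicates(samples))  # [(0, 1)]
```

Samples that differ only in variable names are still treated as distinct here; catching those would require an extra normalization pass.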

Validation Categories

Quantitative Validation

Includes:

  • extracted sample statistics
  • rejected sample counts
  • duplicate analysis
  • final merge statistics

Qualitative Validation

Includes:

  • canonical smell verification
  • semantic preservation checks
  • runtime behavior validation
  • AST-based structural validation

Dataset Files

Main Files

| File | Purpose |
| --- | --- |
| merged_dataset.json | Master merged recovery dataset |
| refactoredCIDXStrict_dataset.json | Canonical validated dataset |
| noID_refactored_chain_samples.json | Legacy refactored samples without IDs |
| nonCIDX_refactored_chain_samples.json | Mixed taxonomy recovery dataset |

Workflow Documentation

Additional workflow notes and validation procedures are available in:

notes/cidx_workflow_document.md


Authors

Robin Bijo

Research Student
VIT Vellore

Answin Mariya

Research Student
NIT Calicut


Mentors

Indulekha K K, PhD

NIT Calicut

Dr. Shweta

Assistant Professor
NIT Calicut


Internship

AI4SE: Application of LLMs in Software Engineering


Notes

This repository represents an iterative dataset engineering workflow involving:

  • AI-assisted semantic analysis
  • human-verified validation
  • taxonomy refinement
  • canonical filtering
  • runtime verification procedures

The final dataset focuses specifically on canonical Pandas chained indexing refactoring patterns.