AI-assisted workflow and dataset engineering pipeline for constructing a canonical Chain Indexing (CIDX) refactoring dataset, focused on Pandas chained indexing patterns in ML/DL repositories.
This repository contains:
- extracted Chain Indexing samples
- taxonomy refinement workflow
- semantic filtering pipeline
- refactored code samples
- validation utilities
- dataset reconstruction scripts
- runtime verification workflow
The project was developed as part of the internship:
AI4SE: Application of LLMs in Software Engineering
The primary objective of this work is to construct a high-quality canonical Chain Indexing refactoring dataset by:
- identifying valid Pandas chained indexing patterns
- filtering non-canonical traversal/indexing patterns
- generating semantically meaningful refactorings
- validating runtime behavior
- preserving reproducibility through automated workflows
This project follows a strict canonical interpretation of Chain Indexing.
- `df['A'][0]`
- `df['B'][1] = 10`
- `data['x']['y']`

These patterns may introduce:
- intermediate object creation
- `SettingWithCopyWarning` ambiguity (whether an assignment writes to a view or to a copy)
- readability concerns
- unintended DataFrame modification behavior
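As an illustration of the kind of refactoring this dataset targets, the sketch below rewrites a chained read and a chained write as single-step `.loc` calls. The DataFrame contents are made up for the example, and it assumes the default integer index, where `df['A'][0]` and `df.loc[0, 'A']` refer to the same cell. It is not taken from the dataset itself.

```python
import pandas as pd

# Hypothetical sample data; not part of the dataset.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Chained read `df['A'][0]` expressed as one label-based lookup.
value = df.loc[0, "A"]

# Chained write `df['B'][1] = 10` expressed as one assignment.
# A single .loc call always targets the original DataFrame, so there is
# no intermediate object that might be a copy.
df.loc[1, "B"] = 10

print(value)
print(df)
```

With a non-default index, the chained form and the `.loc` form may not be equivalent, so each refactoring still needs to be checked against the original code.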
The following patterns are NOT considered canonical Chain Indexing under the refined taxonomy.
- `tensor[0][0]`
- `shape[0]`
- `model.state_dict()[key]`
- `result.logits.shape[0]`

canonical_cidx_refactored_dataset/
│
├── data/
│   ├── raw/
│   └── processed/
│
├── scripts/
│
├── notes/
│
├── outputs/
│
└── README.md
Raw Extraction
↓
Chain Sample Identification
↓
AI-Assisted Refactoring
↓
Semantic Filtering
↓
Canonical Taxonomy Refinement
↓
Runtime Verification
↓
Final Dataset Merge
The validation pipeline combines:
- AST-based syntactic validation
- semantic/manual inspection
- runtime verification using Google Colab
- duplicate detection
- smell location verification
- canonical Pandas filtering
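The AST-based step can be sketched roughly as follows, using only Python's standard `ast` module (Python 3.9+ for `ast.unparse`). The function name `find_chained_subscripts` is hypothetical and is not one of the utilities in `scripts/`. It reports every subscript whose base is itself a subscript, and raises `SyntaxError` for code that does not parse.

```python
import ast


def find_chained_subscripts(source: str) -> list[str]:
    """Return the source text of each subscript applied directly to another subscript.

    Hypothetical helper for illustration only.
    """
    tree = ast.parse(source)  # raises SyntaxError if the snippet is not valid Python
    hits = []
    for node in ast.walk(tree):
        # `a[i][j]` parses as Subscript(value=Subscript(value=a, ...), ...)
        if isinstance(node, ast.Subscript) and isinstance(node.value, ast.Subscript):
            hits.append(ast.unparse(node))
    return hits


if __name__ == "__main__":
    for snippet in (
        "x = df['A'][0]",
        "t = tensor[0][0]",
        "n = result.logits.shape[0]",
        "v = model.state_dict()[key]",
    ):
        print(snippet, "->", find_chained_subscripts(snippet))
```

A purely syntactic check like this flags `tensor[0][0]` as well as `df['A'][0]`, because the AST carries no type information. Separating Pandas objects from tensors, tuples, and dictionaries is left to the semantic/manual inspection and canonical Pandas filtering steps.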
The scripts/ directory contains utilities for:
- dataset extraction
- refactoring entry generation
- rejection handling
- duplicate checking
- smell location inspection
- runtime verification support
- dataset merging
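A minimal sketch of how duplicate checking and merging could fit together is shown below. It is not the repository's actual script: the entry field `original_code` and the function names are assumptions made for the example. Duplicates are detected by hashing the AST dump of each snippet, so differences in whitespace, comments, or quote style do not produce distinct entries.

```python
import ast
import hashlib


def fingerprint(code: str) -> str:
    """Hash the AST dump of a snippet; formatting and comments do not affect it."""
    tree = ast.parse(code)
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()


def merge_unique(*datasets: list[dict]) -> list[dict]:
    """Concatenate datasets, keeping the first entry for each distinct snippet."""
    seen: set[str] = set()
    merged: list[dict] = []
    for dataset in datasets:
        for entry in dataset:
            # "original_code" is a hypothetical field name.
            key = fingerprint(entry["original_code"])
            if key not in seen:
                seen.add(key)
                merged.append(entry)
    return merged


if __name__ == "__main__":
    first = [{"original_code": "x = df['A'][0]"}]
    second = [
        {"original_code": 'x  =  df["A"][0]  # same statement, different formatting'},
        {"original_code": "df['B'][1] = 10"},
    ]
    print(len(merge_unique(first, second)))
```

Hashing the AST rather than the raw text is one possible design choice; it treats reformatted copies as duplicates but still keeps snippets that differ only in variable names.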
Includes:
- extracted sample statistics
- rejected sample counts
- duplicate analysis
- final merge statistics
Includes:
- canonical smell verification
- semantic preservation checks
- runtime behavior validation
- AST-based structural validation
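One way to approximate a semantic preservation check locally is to run the original and the refactored snippet against copies of the same DataFrame and compare the results. The sketch below assumes each snippet stores its result in a variable named `value` and operates on a DataFrame named `df`; both conventions, and the function names, are hypothetical. Because it relies on `exec`, it should only be used on trusted snippets.

```python
import pandas as pd


def run_snippet(code: str, frame: pd.DataFrame) -> dict:
    """Execute a snippet with its own copy of the DataFrame bound to `df`."""
    namespace = {"pd": pd, "df": frame.copy()}
    exec(code, namespace)  # trusted input only
    return namespace


def behaves_the_same(original: str, refactored: str, frame: pd.DataFrame,
                     result_name: str = "value") -> bool:
    """Compare the result variable and the final DataFrame state of both snippets."""
    before = run_snippet(original, frame)
    after = run_snippet(refactored, frame)
    same_result = bool(before[result_name] == after[result_name])
    same_frame = before["df"].equals(after["df"])
    return same_result and same_frame


if __name__ == "__main__":
    sample = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})  # made-up data
    print(behaves_the_same("value = df['A'][0]", "value = df.loc[0, 'A']", sample))
```

A check on one sample input cannot prove equivalence in general; it only catches refactorings that change behavior on the data that was tried.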
| File | Purpose |
|---|---|
| merged_dataset.json | Master merged recovery dataset |
| refactoredCIDXStrict_dataset.json | Canonical validated dataset |
| noID_refactored_chain_samples.json | Legacy refactored samples without IDs |
| nonCIDX_refactored_chain_samples.json | Mixed taxonomy recovery dataset |
Additional workflow notes and validation procedures are available in:
notes/cidx_workflow_document.md
Research Student
VIT Vellore
Research Student
NIT Calicut
NIT Calicut
Assistant Professor
NIT Calicut
AI4SE: Application of LLMs in Software Engineering
This repository represents an iterative dataset engineering workflow involving:
- AI-assisted semantic analysis
- human-verified validation
- taxonomy refinement
- canonical filtering
- runtime verification procedures
The final dataset focuses specifically on canonical Pandas chained indexing refactoring patterns.