Canonical CIDX Refactored Dataset

An AI-assisted workflow and dataset engineering pipeline for constructing a canonical Chain Indexing (CIDX) refactoring dataset, focused on Pandas chained indexing patterns in ML/DL repositories.


Project Overview

This repository contains:

  • extracted Chain Indexing samples
  • taxonomy refinement workflow
  • semantic filtering pipeline
  • refactored code samples
  • validation utilities
  • dataset reconstruction scripts
  • runtime verification workflow

The project was developed as part of the internship:

AI4SE: Application of LLMs in Software Engineering


Research Objective

The primary objective of this work is to construct a high-quality canonical Chain Indexing refactoring dataset by:

  • identifying valid Pandas chained indexing patterns
  • filtering out non-canonical traversal/indexing patterns
  • generating semantically meaningful refactorings
  • validating runtime behavior
  • preserving reproducibility through automated workflows

Canonical Chain Indexing Definition

This project follows a strict canonical interpretation of Chain Indexing.

VALID Canonical Examples

df['A'][0]
df['B'][1] = 10
data['x']['y']

These patterns may introduce:

  • intermediate object creation
  • SettingWithCopyWarning ambiguity (it is unclear whether a write reaches the original DataFrame)
  • readability concerns
  • unintended DataFrame modification
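As an illustration of the concerns listed above, here is a minimal sketch of how a chained read and a chained write might be rewritten with a single `.loc` call. The DataFrame contents and variable names are made up for this example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample frame with the default integer row labels 0, 1, 2.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Chained read: df["A"] builds an intermediate Series, then [0] indexes it by label.
chained_value = df["A"][0]

# Single-step equivalent: one .loc call with (row label, column label).
loc_value = df.loc[0, "A"]

# A chained write such as df["B"][1] = 10 may or may not reach df, depending on
# the pandas version and its copy-on-write settings, so the refactored form
# writes through .loc instead.
df.loc[1, "B"] = 10
```

The `.loc` form names the row and the column in one indexing operation, so no intermediate object sits between the DataFrame and the assignment.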

Non-Canonical Examples (Rejected)

The following patterns are NOT considered canonical Chain Indexing under the refined taxonomy.

Tensor Traversal

tensor[0][0]

Shape Access

shape[0]

Framework Traversal

model.state_dict()[key]

Attribute Traversal

result.logits.shape[0]
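The distinction between accepted and rejected patterns can be approximated syntactically. The sketch below is only an illustrative heuristic, not the project's actual filter: it assumes a canonical chain is a subscript of a subscript that starts at a plain variable name and whose first key is a string label. The function name `is_canonical_chain` is hypothetical, and the code assumes Python 3.9 or later, where `ast.Subscript.slice` holds the key expression directly.

```python
import ast


def is_canonical_chain(expr: str) -> bool:
    """Heuristic check: does expr look like name['label'][...]?"""
    node = ast.parse(expr, mode="eval").body
    # The outer node must be a subscript applied to another subscript.
    if not (isinstance(node, ast.Subscript) and isinstance(node.value, ast.Subscript)):
        return False
    inner = node.value
    # The chain must start at a plain variable, not a call or an attribute.
    if not isinstance(inner.value, ast.Name):
        return False
    # Assumption: the first key is a string label (e.g. a column name).
    key = inner.slice
    return isinstance(key, ast.Constant) and isinstance(key.value, str)


accepted = [e for e in ["df['A'][0]", "data['x']['y']"] if is_canonical_chain(e)]
rejected = [
    e
    for e in ["tensor[0][0]", "shape[0]", "model.state_dict()[key]", "result.logits.shape[0]"]
    if not is_canonical_chain(e)
]
```

A purely syntactic rule like this cannot tell whether `df` is really a DataFrame, which is one reason the workflow also relies on semantic and manual inspection.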

Repository Structure

canonical_cidx_refactored_dataset/
│
├── data/
│   ├── raw/
│   └── processed/
│
├── scripts/
│
├── notes/
│
├── outputs/
│
└── README.md

Dataset Pipeline

Raw Extraction
    ↓
Chain Sample Identification
    ↓
AI-Assisted Refactoring
    ↓
Semantic Filtering
    ↓
Canonical Taxonomy Refinement
    ↓
Runtime Verification
    ↓
Final Dataset Merge

Validation Workflow

The validation pipeline combines:

  • AST-based syntactic validation
  • semantic/manual inspection
  • runtime verification using Google Colab
  • duplicate detection
  • smell location verification
  • canonical Pandas filtering
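Two of the checks above, AST-based syntactic validation and smell location verification, can be sketched with Python's standard `ast` module. The helper names `parses` and `chained_subscript_lines` are hypothetical and do not correspond to files in scripts/.

```python
import ast


def parses(source: str) -> bool:
    """Syntactic validation: True if the snippet is valid Python."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def chained_subscript_lines(source: str) -> list[int]:
    """Smell location: line numbers containing a subscript of a subscript."""
    tree = ast.parse(source)
    lines = {
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Subscript) and isinstance(node.value, ast.Subscript)
    }
    return sorted(lines)


snippet = "x = 1\ny = df['A'][0]\ndf['B'][1] = 10\nz = df.loc[0, 'A']\n"
locations = chained_subscript_lines(snippet)
```

Comparing the reported lines against the smell location recorded for a sample is one way to confirm that the entry points at the right statement.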

Scripts

The scripts/ directory contains utilities for:

  • dataset extraction
  • refactoring entry generation
  • rejection handling
  • duplicate checking
  • smell location inspection
  • runtime verification support
  • dataset merging
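As one possible approach to duplicate checking, snippets can be compared by a hash of their parsed AST, so that differences in whitespace, comments, or quote style do not hide a duplicate. This is a sketch under that assumption; the function names are hypothetical and the repository's scripts may use a different method.

```python
import ast
import hashlib


def fingerprint(source: str) -> str:
    """Hash the AST dump, which omits line numbers, comments, and spacing."""
    normalized = ast.dump(ast.parse(source))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(snippets: list[str]) -> list[str]:
    """Keep the first occurrence of each structurally identical snippet."""
    seen: set[str] = set()
    unique: list[str] = []
    for snippet in snippets:
        digest = fingerprint(snippet)
        if digest not in seen:
            seen.add(digest)
            unique.append(snippet)
    return unique


samples = ["df['A'][0]", 'df["A"][0]  # same access', "df['B'][0]"]
unique_samples = drop_duplicates(samples)
```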

Validation Categories

Quantitative Validation

Includes:

  • extracted sample statistics
  • rejected sample counts
  • duplicate analysis
  • final merge statistics

Qualitative Validation

Includes:

  • canonical smell verification
  • semantic preservation checks
  • runtime behavior validation
  • AST-based structural validation
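A semantic preservation check for read-only accesses could look like the sketch below: both the original and the refactored expression are evaluated against separate copies of the same DataFrame and their scalar results are compared. The helper `same_read_result` and the sample frame are hypothetical, and `eval` is used only because the inputs here are trusted, hand-written strings.

```python
import pandas as pd


def same_read_result(original: str, refactored: str, frame: pd.DataFrame) -> bool:
    """Evaluate both expressions on fresh copies and compare scalar results."""
    before = eval(original, {"df": frame.copy()})
    after = eval(refactored, {"df": frame.copy()})
    return bool(before == after)


# Hypothetical test frame with the default integer row labels.
frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

preserved = same_read_result("df['A'][0]", "df.loc[0, 'A']", frame)
broken = same_read_result("df['A'][0]", "df.loc[1, 'A']", frame)
```

This sketch only covers expressions that return a scalar; comparing Series or DataFrame results would need an element-wise check such as `equals`.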

Dataset Files

Main Files

File                                     Purpose
merged_dataset.json                      Master merged recovery dataset
refactoredCIDXStrict_dataset.json        Canonical validated dataset
noID_refactored_chain_samples.json       Legacy refactored samples without IDs
nonCIDX_refactored_chain_samples.json    Mixed taxonomy recovery dataset

Workflow Documentation

Additional workflow notes and validation procedures are available in:

notes/cidx_workflow_document.md


Authors

Robin Bijo

Research Student
VIT Vellore

Answin Mariya

Research Student
NIT Calicut


Mentors

Indulekha K K, PhD

NIT Calicut

Dr. Shweta

Assistant Professor
NIT Calicut


Internship

AI4SE: Application of LLMs in Software Engineering


Notes

This repository represents an iterative dataset engineering workflow involving:

  • AI-assisted semantic analysis
  • human-verified validation
  • taxonomy refinement
  • canonical filtering
  • runtime verification procedures

The final dataset focuses specifically on canonical Pandas chained indexing refactoring patterns.

About

This repository holds the workflow scripts and the AI agentic workflow document that support Chain Indexing code smell detection and validation.
