codegen-nlp: Automatic Code Generation from Natural Language Descriptions

codegen-nlp is a research project focused on generating Python code from natural-language algorithmic descriptions, with special attention to time and space complexity constraints.

Dataset: BigO(Bench)
Models: T5-small (trained from scratch), CodeT5-small (fine-tuned), CodeGen-2B-mono, DeepSeek 1.3B Code Instruct, Gemini 2.0 Flash
Evaluation: BLEU, CodeBLEU, ROUGE, and functional unit tests
Extensions: Complexity classifier, prompt engineering, retrieval-augmented generation (RAG)


Repository Structure

codegen-nlp/
├── FullNotebook.ipynb   ← Jupyter notebook: analysis → training → evaluation
├── README.md            ← You are here
├── TASK.md              ← Project plan & task breakdown
├── .gitignore           ← Standard gitignore
└── experiments/         ← Modular pipeline code
    ├── analysis/        ← Dataset exploration and stats
    ├── preprocessing/   ← Cleaning, tokenization, formatting
    └── training/        ← Model fine-tuning and comparison

Project plan: see TASK.md for full phase details and extension ideas.


Pipeline Overview

  1. Analysis

    • BigO(Bench) dataset inspection via HuggingFace
    • Visualizations: problem lengths, token stats, solution distribution
    • Unsupervised clustering of topics (e.g., graphs, trees, DP)
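     The statistics step above can be sketched as follows; a toy sample stands in for the real data, since the exact HuggingFace dataset id and field names are not shown here and would be assumptions:

     ```python
     from collections import Counter
     from statistics import mean, median

     # Toy stand-in for BigO(Bench) problem descriptions; the notebook loads
     # the real data from HuggingFace via datasets.load_dataset(...).
     descriptions = [
         "Given a weighted graph with n nodes, find the shortest path between two nodes.",
         "Compute the longest increasing subsequence of an integer array.",
         "Count the number of distinct binary trees with n nodes.",
     ]

     # Simple whitespace-token length statistics over problem descriptions
     lengths = [len(d.split()) for d in descriptions]
     print(f"problems: {len(lengths)}")
     print(f"mean tokens: {mean(lengths):.1f}, median: {median(lengths)}")

     # Crude topic signal: most frequent words across all descriptions
     words = Counter(w.lower().strip(".,") for d in descriptions for w in d.split())
     print(words.most_common(3))
     ```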
  2. Preprocessing

    • Deduplication, text normalization, filtering
    • Train/val/test split
    • Input formatting for generation models
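     A minimal, stdlib-only sketch of the preprocessing step (the `description` field name and split fractions are assumptions, not the notebook's exact values):

     ```python
     import hashlib
     import random

     def normalize(text: str) -> str:
         """Lowercase and collapse whitespace runs to single spaces."""
         return " ".join(text.lower().split())

     def deduplicate(samples):
         """Drop samples whose normalized description was already seen."""
         seen, unique = set(), []
         for s in samples:
             key = hashlib.sha256(normalize(s["description"]).encode()).hexdigest()
             if key not in seen:
                 seen.add(key)
                 unique.append(s)
         return unique

     def train_val_test_split(samples, val_frac=0.1, test_frac=0.1, seed=42):
         """Deterministic shuffled split into train/val/test lists."""
         shuffled = samples[:]
         random.Random(seed).shuffle(shuffled)
         n_test = int(len(shuffled) * test_frac)
         n_val = int(len(shuffled) * val_frac)
         test = shuffled[:n_test]
         val = shuffled[n_test:n_test + n_val]
         train = shuffled[n_test + n_val:]
         return train, val, test
     ```

     Fixing the shuffle seed keeps the split reproducible across runs, which matters when comparing models trained at different times.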
  3. Code Generation

    • T5-small (scratch): trained end-to-end on BigO prompts
    • CodeT5-small: fine-tuned and compared across multiple seeds
    • LLMs: evaluated zero-, one-, and few-shot with:
      • CodeGen-2B-mono
      • DeepSeek 1.3B Code Instruct
      • Gemini 2.0 Flash
    • Few-shot prompting yielded +15% CodeBLEU and doubled test pass rate
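     The few-shot setup for the LLMs can be sketched as a simple prompt builder; the `### Problem / ### Solution` delimiters and field names are illustrative assumptions, not the exact prompt format used in the notebook:

     ```python
     def build_few_shot_prompt(examples, query_description, shots=2):
         """Concatenate `shots` solved examples before the query problem."""
         parts = []
         for ex in examples[:shots]:
             parts.append(
                 f"### Problem:\n{ex['description']}\n"
                 f"### Solution:\n{ex['code']}\n"
             )
         # The model is expected to complete the final Solution section
         parts.append(f"### Problem:\n{query_description}\n### Solution:\n")
         return "\n".join(parts)
     ```

     With `shots=0` the same builder degenerates to the zero-shot prompt, so one function covers all three evaluation regimes.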
  4. Evaluation

    • Text-based: BLEU, CodeBLEU, ROUGE
    • Functional: Unit tests with auto-execution and timeout checks
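     The functional check, running a candidate solution against unit tests with a timeout, could look like the following hypothetical helper (a sketch, not the notebook's actual harness):

     ```python
     import subprocess
     import sys
     import tempfile

     def passes_tests(candidate_code: str, test_code: str, timeout: float = 5.0) -> bool:
         """Run candidate + tests in a fresh interpreter; fail on error or timeout."""
         with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
             f.write(candidate_code + "\n\n" + test_code)
             path = f.name
         try:
             result = subprocess.run(
                 [sys.executable, path], capture_output=True, timeout=timeout
             )
             return result.returncode == 0
         except subprocess.TimeoutExpired:
             return False  # infinite loops / too-slow solutions count as failures
     ```

     Running each candidate in a subprocess isolates crashes and lets the timeout double as a coarse guard against solutions that blow the complexity budget.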
  5. Extensions

    • Complexity classification (code-only, description+code, fine-tuned CodeBERT)
    • Gemini prompt engineering with explicit complexity constraints
    • RAG prototype (retrieval of similar solved problems)
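  The retrieval step of the RAG prototype could be sketched with a bag-of-words cosine similarity over problem descriptions (stdlib only; the actual prototype's retriever and field names may differ):

  ```python
  import math
  from collections import Counter

  def _cosine(a: str, b: str) -> float:
      """Cosine similarity between two texts as bags of lowercase words."""
      ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
      dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
      norm = math.sqrt(sum(v * v for v in ca.values())) \
           * math.sqrt(sum(v * v for v in cb.values()))
      return dot / norm if norm else 0.0

  def retrieve_similar(query: str, corpus: list[dict], k: int = 2) -> list[dict]:
      """Return the k solved problems whose descriptions best match the query."""
      return sorted(corpus, key=lambda d: _cosine(query, d["description"]),
                    reverse=True)[:k]
  ```

  The retrieved examples can then be fed to the few-shot prompt as in-context demonstrations.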

How to Run

Clone the repository and open the main notebook:

git clone https://github.com/giovanni-vaccarino/codegen-nlp.git
cd codegen-nlp
jupyter lab FullNotebook.ipynb

Authors

  • Giovanni Vaccarino
  • Niccolo Salvi
  • Nicolò Vacis
  • Vittorio Palladino
