codegen-nlp is a research project focused on generating Python code from natural-language algorithmic descriptions, with special attention to time and space complexity constraints.
- Dataset: BigO(Bench)
- Models: T5-small (trained from scratch), CodeT5-small (fine-tuned), CodeGen-2B-mono, DeepSeek 1.3B, Gemini 2.0 Flash
- Evaluation: CodeBLEU (plus BLEU, ROUGE, and functional unit tests)
- Extensions: Complexity classifier, prompt engineering, retrieval-augmented generation (RAG)
```
codegen-nlp/
├── FullNotebook.ipynb   ← Jupyter notebook: analysis → training → evaluation
├── README.md            ← You are here
├── TASK.md              ← Project plan & task breakdown
├── .gitignore           ← Standard gitignore
└── experiments/         ← Modular pipeline code
    ├── analysis/        ← Dataset exploration and stats
    ├── preprocessing/   ← Cleaning, tokenisation, formatting
    └── training/        ← Model fine-tuning and comparison
```
Project plan: see TASK.md for full phase details and extension ideas.
### Analysis
- BigO(Bench) dataset inspection via HuggingFace
- Visualizations: problem lengths, token stats, solution distribution
- Unsupervised clustering of topics (e.g., graphs, trees, DP)
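The length/token statistics step can be sketched with a small helper. This is an illustrative sketch, not the notebook's actual code: it uses whitespace splitting as a rough tokenisation proxy, and the toy strings stand in for real BigO(Bench) problem statements.

```python
def length_stats(descriptions):
    """Token-count statistics for a list of problem descriptions
    (whitespace tokenisation as a rough proxy for a real tokeniser)."""
    lengths = sorted(len(d.split()) for d in descriptions)
    n = len(lengths)
    return {
        "min": lengths[0],
        "median": lengths[n // 2],
        "max": lengths[-1],
        "mean": sum(lengths) / n,
    }

# Toy examples standing in for BigO(Bench) problem statements.
probs = [
    "sort an array of n integers in O(n log n) time",
    "find the shortest path in a weighted graph",
    "count subsets summing to k with dynamic programming",
]
stats = length_stats(probs)
```

In the notebook the same idea would be applied to the full dataset, optionally with the model tokeniser in place of `str.split`.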
### Preprocessing
- Deduplication, text normalization, filtering
- Train/val/test split
- Input formatting for generation models
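A minimal sketch of the deduplication, splitting, and input-formatting steps; the prompt template, split fractions, and field names here are assumptions for illustration, not the project's actual choices.

```python
import random

def format_example(desc, complexity):
    # Illustrative prompt template; the project's actual template may differ.
    return f"Generate Python code. Target complexity: {complexity}. Problem: {desc}"

def dedup_and_split(records, seed=42, val_frac=0.1, test_frac=0.1):
    """Deduplicate by whitespace-normalized description, then shuffle
    deterministically and split into train/val/test."""
    seen, unique = set(), []
    for r in records:
        key = " ".join(r["description"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    rng = random.Random(seed)
    rng.shuffle(unique)
    n = len(unique)
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    train = unique[n_test + n_val:]
    val = unique[n_test:n_test + n_val]
    test = unique[:n_test]
    return train, val, test

recs = [{"description": f"problem number {i}"} for i in range(20)]
recs.append({"description": "Problem  number 0"})  # duplicate after normalization
train, val, test = dedup_and_split(recs)
```

Seeding the shuffle keeps the split reproducible across runs, which matters when comparing models trained at different times.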
### Code Generation
- T5-small (scratch): trained end-to-end on BigO prompts
- CodeT5-small: fine-tuned and compared across multiple seeds
- LLMs evaluated in zero-, one-, and few-shot settings:
  - CodeGen-2B-mono
  - DeepSeek 1.3B Code Instruct
  - Gemini 2.0 Flash
- Few-shot prompting yielded a +15% CodeBLEU improvement and doubled the unit-test pass rate
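The few-shot setup boils down to prepending solved (description, solution) pairs to the query. A hedged sketch, with a made-up template (the actual prompts sent to CodeGen/DeepSeek/Gemini may be formatted differently):

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (description, solution) pairs,
    ending with the unsolved query so the model completes the solution."""
    parts = []
    for desc, code in examples:
        parts.append(f"### Problem:\n{desc}\n### Solution:\n{code}\n")
    parts.append(f"### Problem:\n{query}\n### Solution:\n")
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    [("sum a list of numbers", "def f(xs):\n    return sum(xs)")],
    "reverse a list in place",
)
```

Zero-shot is the same call with an empty `examples` list; one-shot passes a single pair.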
### Evaluation
- Text-based: BLEU, CodeBLEU, ROUGE
- Functional: Unit tests with auto-execution and timeout checks
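The functional check can be sketched as running the generated code plus its unit test in a subprocess with a timeout, so non-terminating candidates (e.g. code that misses the complexity target badly) are marked as failures rather than hanging the harness. Function name and return shape are illustrative:

```python
import subprocess
import sys
import tempfile

def run_candidate(code, test, timeout=5.0):
    """Execute generated code together with its unit test in a fresh
    Python subprocess; return (passed, reason)."""
    src = code + "\n\n" + test + "\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(src)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode == 0:
        return True, "passed"
    return False, proc.stderr.decode()[-200:]  # tail of the traceback
```

Running in a subprocess also isolates crashes (segfaults, `sys.exit`) from the evaluation loop itself.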
### Extensions
- Complexity classification (code-only, description+code, fine-tuned CodeBERT)
- Gemini prompt engineering with explicit complexity constraints
- RAG prototype (retrieval of similar solved problems)
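The RAG prototype's retrieval step can be illustrated with a bag-of-words cosine similarity; this is a deliberately simple stand-in, the actual prototype may use embeddings or a proper TF-IDF index.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, corpus, k=2):
    """Return the k solved problems most similar to the query."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [doc for _, doc in scored[:k]]

corpus = [
    "shortest path in a graph",
    "sort an array",
    "binary search in a sorted array",
]
hits = retrieve("find shortest path graph", corpus, k=1)
```

The retrieved problems (and their solutions) would then be spliced into the prompt as few-shot examples.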
Clone the repository and open the main notebook:

```shell
git clone https://github.com/your-user/codegen-nlp.git
cd codegen-nlp
jupyter lab FullNotebook.ipynb
```

Authors:

- Giovanni Vaccarino
- Niccolo Salvi
- Nicolò Vacis
- Vittorio Palladino