@VibhuJawa VibhuJawa commented Jan 7, 2026

PR: Code Annotation (Add code annotation library with Rust-based quality signals)

Description

This PR introduces a new code annotation library for NeMo Curator that provides fast, Rust-based annotation functions for code data curation. The library enables language detection, basic statistics, software metrics, comment fraction analysis, and tokenization.

Features

  • Rust Library (nemo_curator/code_annotation/): High-performance annotation functions using PyO3 bindings

    • detect_language: Programming language detection via hyperpolyglot
    • basic: Basic statistics (bytes, lines, patterns, XML detection)
    • software_metrics: Code complexity metrics via rust-code-analysis
    • opencoder_rs: Comment line/character fractions
    • tokenize: BPE tokenization (github_o200k_base, tiktoken_o200k_base)
  • Document Modifiers (nemo_curator/stages/code/):

    • CodeLanguageDetector
    • CodeBasicStats
    • CodeSoftwareMetrics
    • CodeOpenCoderMetrics
    • CodeTokenizer
    • CodeAnnotator (all-in-one convenience modifier)
  • Document Filters (nemo_curator/stages/text/filters/code.py):

    • CommentFractionFilter
    • MaxLineLengthFilter
    • AverageLineLengthFilter
    • AlphaPercentFilter
    • HexContentFilter
    • Base64ContentFilter
    • TokenCountFilter
    • CyclomaticComplexityFilter
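
To make the filter names above more concrete, here is a minimal pure-Python sketch of the kind of per-document signals they suggest (line lengths, alphabetic fraction, comment fraction, hex content). It is an illustration only, not the Rust implementation in this PR: the function names, the `#`-only comment rule, and the 16-character hex-run threshold are all assumptions made for the example.

```python
import re


def line_stats(text: str) -> dict:
    """Illustrative per-document signals (hypothetical helper, not the PR's API)."""
    # Treat an empty document as a single empty line so the ratios stay defined.
    lines = text.splitlines() or [""]
    lengths = [len(line) for line in lines]
    # Assumption: '#' starts a comment. A real implementation is language-aware.
    comment_lines = sum(1 for line in lines if line.lstrip().startswith("#"))
    alpha_chars = sum(ch.isalpha() for ch in text)
    return {
        "max_line_length": max(lengths),
        "avg_line_length": sum(lengths) / len(lines),
        "alpha_fraction": alpha_chars / len(text) if text else 0.0,
        "comment_line_fraction": comment_lines / len(lines),
    }


# Assumption: a "hex run" is 16 or more consecutive hexadecimal digits.
HEX_RUN = re.compile(r"[0-9a-fA-F]{16,}")


def hex_fraction(text: str) -> float:
    """Fraction of characters that sit inside long hexadecimal runs."""
    if not text:
        return 0.0
    return sum(len(m.group()) for m in HEX_RUN.finditer(text)) / len(text)


if __name__ == "__main__":
    print(line_stats("# greet\ndef hello():\n    pass\n"))
    print(hex_fraction("key = deadbeefdeadbeefdeadbeef"))
```

A filter would then compare one of these values against a threshold, for example dropping documents whose maximum line length or hex fraction is unusually high.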

Usage Example

import pandas as pd
from nemo_curator.stages.code import CodeAnnotator

# One row per source file: the code text plus a filename for each.
df = pd.DataFrame({
    'content': ['def hello(): pass', 'fn main() {}'],
    'representative_filename': ['test.py', 'main.rs'],
})

# Enable all five annotation passes.
annotator = CodeAnnotator(
    detect_language=True,
    basic_stats=True,
    software_metrics=True,
    opencoder_metrics=True,
    tokenize=True,
)
result = annotator.modify_document(df)
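
Once documents carry annotation columns, filtering can be expressed as a threshold check per row. The sketch below shows that idea with plain pandas on a hand-built frame. The column names `max_line_length` and `num_tokens`, the `keep_mask` helper, and the thresholds are hypothetical; they are not taken from this PR and may not match what `CodeAnnotator` actually emits.

```python
import pandas as pd

# Hypothetical annotated output: these column names are assumptions,
# not the columns CodeAnnotator actually produces.
annotated = pd.DataFrame({
    "content": ["def hello(): pass", "fn main() {}", "x" * 2000],
    "max_line_length": [17, 12, 2000],
    "num_tokens": [5, 6, 400],
})


def keep_mask(df: pd.DataFrame,
              max_line_length: int = 1000,
              min_tokens: int = 1,
              max_tokens: int = 100_000) -> pd.Series:
    """Boolean mask of rows that pass both illustrative thresholds."""
    short_enough = df["max_line_length"] <= max_line_length
    token_ok = df["num_tokens"].between(min_tokens, max_tokens)
    return short_enough & token_ok


# Keep only the rows that pass; the 2000-character single-line row is dropped.
filtered = annotated[keep_mask(annotated)].reset_index(drop=True)
print(filtered["content"].tolist())
```

Keeping the thresholds as keyword arguments makes it easy to tune them per language or per dataset without touching the masking logic.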

Testing

  • 21 unit tests added in tests/code_annotation/test_annotate.py
  • All tests passing

python -m pytest tests/code_annotation/test_annotate.py -v

Files Changed

New Files:

  • nemo_curator/code_annotation/ - Rust library with PyO3 bindings
  • nemo_curator/stages/code/__init__.py - Code stages module
  • nemo_curator/stages/code/modifiers.py - Document modifiers
  • tests/code_annotation/test_annotate.py - Unit tests
  • examples/code_annotation/annotate_code.py - Annotation example
  • examples/code_annotation/filter_code.py - Filtering example
  • docs/code_annotation_plan.md - Documentation

Modified Files:

  • nemo_curator/stages/text/filters/code.py - Added 8 new filters

Build Instructions

cd nemo_curator/code_annotation
maturin develop  # Development build
# OR
maturin build --release  # Release wheel

Dependencies

  • Python: pandas, pyarrow, maturin
  • Rust: pyo3, hyperpolyglot, software-metrics, tiktoken-rs, bpe-openai

Checklist

  • Code compiles without errors
  • Tests added and passing (21 tests)
  • Documentation added
  • Example scripts provided
  • Follows NeMo Curator coding standards

copy-pr-bot bot commented Jan 7, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

VibhuJawa (author) commented

Temp file to aid development; will be removed.
