[DRAFT] Code Annotation WIP #1356
PR: Code Annotation (Add code annotation library with Rust-based quality signals)
Description
This PR introduces a new code annotation library for NeMo Curator that provides fast, Rust-based annotation functions for code data curation. The library enables language detection, basic statistics, software metrics, comment fraction analysis, and tokenization.
Features
Rust Library (nemo_curator/code_annotation/): High-performance annotation functions using PyO3 bindings
- detect_language: Programming language detection via hyperpolyglot
- basic: Basic statistics (bytes, lines, patterns, XML detection)
- software_metrics: Code complexity metrics via rust-code-analysis
- opencoder_rs: Comment line/character fractions
- tokenize: BPE tokenization (github_o200k_base, tiktoken_o200k_base)
Document Modifiers (nemo_curator/stages/code/):
- CodeLanguageDetector
- CodeBasicStats
- CodeSoftwareMetrics
- CodeOpenCoderMetrics
- CodeTokenizer
- CodeAnnotator (all-in-one convenience modifier)
Document Filters (nemo_curator/stages/text/filters/code.py):
- CommentFractionFilter
- MaxLineLengthFilter
- AverageLineLengthFilter
- AlphaPercentFilter
- HexContentFilter
- Base64ContentFilter
- TokenCountFilter
- CyclomaticComplexityFilter
Usage Example
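As a rough illustration of the kind of per-document signals that filters such as MaxLineLengthFilter, AverageLineLengthFilter, AlphaPercentFilter, and CommentFractionFilter threshold on, here is a minimal pure-Python sketch. It does not use the NeMo Curator API or the Rust library: `LineStats`, `compute_line_stats`, `keep_document`, the `#`-prefix comment heuristic, and all threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LineStats:
    """Hypothetical container for a few per-document quality signals."""
    max_line_length: int
    avg_line_length: float
    alpha_fraction: float
    comment_line_fraction: float


def compute_line_stats(source: str, comment_prefix: str = "#") -> LineStats:
    """Compute simple line-based signals for one source file.

    Assumption: a line counts as a comment if, after stripping leading
    whitespace, it starts with `comment_prefix`. A real implementation
    would need language-aware comment parsing.
    """
    lines = source.splitlines()
    if not lines:
        return LineStats(0, 0.0, 0.0, 0.0)

    lengths = [len(line) for line in lines]
    alpha_chars = sum(ch.isalpha() for ch in source)
    comment_lines = sum(
        line.lstrip().startswith(comment_prefix) for line in lines
    )
    return LineStats(
        max_line_length=max(lengths),
        avg_line_length=sum(lengths) / len(lines),
        alpha_fraction=alpha_chars / len(source),
        comment_line_fraction=comment_lines / len(lines),
    )


def keep_document(
    stats: LineStats,
    *,
    max_line: int = 1000,        # hypothetical threshold
    max_avg_line: float = 100.0,  # hypothetical threshold
    min_alpha: float = 0.25,      # hypothetical threshold
    max_comment: float = 0.8,     # hypothetical threshold
) -> bool:
    """Return True if the document passes every threshold."""
    return (
        stats.max_line_length <= max_line
        and stats.avg_line_length <= max_avg_line
        and stats.alpha_fraction >= min_alpha
        and stats.comment_line_fraction <= max_comment
    )


if __name__ == "__main__":
    sample = "# add two numbers\ndef add(a, b):\n    return a + b\n"
    stats = compute_line_stats(sample)
    print(stats)
    print("keep:", keep_document(stats))
```

Separating signal computation from the keep/drop decision mirrors the split between modifiers (which annotate) and filters (which threshold) listed above, so the same annotations can be reused with different cutoffs.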
Testing
tests/code_annotation/test_annotate.py
Files Changed
New Files:
- nemo_curator/code_annotation/ - Rust library with PyO3 bindings
- nemo_curator/stages/code/__init__.py - Code stages module
- nemo_curator/stages/code/modifiers.py - Document modifiers
- tests/code_annotation/test_annotate.py - Unit tests
- examples/code_annotation/annotate_code.py - Annotation example
- examples/code_annotation/filter_code.py - Filtering example
- docs/code_annotation_plan.md - Documentation
Modified Files:
- nemo_curator/stages/text/filters/code.py - Added 8 new filters
Build Instructions
Dependencies
- pandas, pyarrow, maturin
- pyo3, hyperpolyglot, software-metrics, tiktoken-rs, bpe-openai
Checklist