-
Notifications
You must be signed in to change notification settings - Fork 212
Add LanguageDetectProcessor for automatic language detection #34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
ayush31010
wants to merge
6
commits into
google-gemini:main
Choose a base branch
from
ayush31010:Add-LanguageDetectProcessor-for-automatic-lang-detection-#14
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
2eb2c2d
Fix empty text handling in LanguageDetectProcessor
ayush31010 f64fbb2
Update genai_processors/contrib/language_detect_processor.py
ayush31010 168a825
Update genai_processors/contrib/language_detect_processor.py
ayush31010 e503cfc
Update genai_processors/contrib/language_detect_processor.py
ayush31010 c5e4331
Update genai_processors/contrib/language_detect_processor.py
ayush31010 998d269
changed
ayush31010 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| """Check the actual structure of genai_processors.""" | ||
|
|
||
| import os | ||
| import ast | ||
|
|
||
| def find_base_classes(): | ||
| """Find processor base classes in the codebase.""" | ||
| core_path = "genai_processors/core" | ||
|
|
||
| for filename in os.listdir(core_path): | ||
| if filename.endswith('.py') and filename != '__init__.py': | ||
| filepath = os.path.join(core_path, filename) | ||
| try: | ||
| with open(filepath, 'r', encoding='utf-8') as f: | ||
| content = f.read() | ||
| if 'class' in content and ('Processor' in content or 'Part' in content): | ||
| print(f"\n{filename}:") | ||
| tree = ast.parse(content) | ||
| for node in ast.walk(tree): | ||
| if isinstance(node, ast.ClassDef): | ||
| bases = [b.id if isinstance(b, ast.Name) else str(b) for b in node.bases] | ||
| print(f" - {node.name} (bases: {bases})") | ||
| except Exception as e: | ||
| print(f"Error reading {filename}: {e}") | ||
|
|
||
| if __name__ == "__main__": | ||
| find_base_classes() |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,110 @@ | ||
| """Language detection processor for text parts.""" | ||
|
|
||
| from typing import AsyncIterator, Any, Dict, Optional | ||
|
|
||
| try: | ||
| from langdetect import detect, DetectorFactory, LangDetectException | ||
| # Set seed for consistent results | ||
| DetectorFactory.seed = 0 | ||
| except ImportError as e: | ||
| raise ImportError( | ||
| "langdetect is required for LanguageDetectProcessor. " | ||
| "Install it with: pip install 'genai-processors[contrib]' or pip install langdetect" | ||
| ) from e | ||
|
|
||
|
|
||
| # Define minimal classes if they don't exist | ||
| try: | ||
| from genai_processors.processor import Processor, ProcessorPart | ||
| except ImportError: | ||
| class ProcessorPart: | ||
| """Minimal ProcessorPart for language detection.""" | ||
| def __init__(self, text: Optional[str] = None, mimetype: str = "text/plain", | ||
| metadata: Optional[Dict[str, Any]] = None, **kwargs): | ||
| self.text = text | ||
| self.mimetype = mimetype | ||
| self.metadata = metadata or {} | ||
|
|
||
|
|
||
| def is_text(mimetype: str) -> bool: | ||
| """Check if mimetype is text-based.""" | ||
| return mimetype and (mimetype.startswith("text/") or "text" in mimetype.lower()) | ||
|
|
||
|
|
||
| class LanguageDetectProcessor(Processor): | ||
| """Detects the language of text parts and adds it to metadata. | ||
|
|
||
| This processor automatically detects the language of text parts using | ||
| the langdetect library and adds the detected language code (e.g., "en", "fr", "zh") | ||
| to the part's metadata. | ||
|
|
||
| Args: | ||
| metadata_key: The metadata key to store the detected language. | ||
| Defaults to "language". | ||
| unknown_label: The label to use when language detection fails. | ||
| Defaults to "unknown". | ||
| min_text_length: Minimum text length required for detection. | ||
| Shorter texts will be labeled as unknown. Defaults to 3. | ||
|
|
||
| Example: | ||
| ```python | ||
| from genai_processors.contrib import LanguageDetectProcessor | ||
|
|
||
| processor = LanguageDetectProcessor() | ||
| async for part in processor(part_stream): | ||
| print(part.metadata["language"]) # e.g., "en", "fr", "bn" | ||
| ``` | ||
| """ | ||
|
|
||
| def __init__( | ||
| self, | ||
| metadata_key: str = "language", | ||
| unknown_label: str = "unknown", | ||
| min_text_length: int = 3, | ||
| ): | ||
| """Initialize the LanguageDetectProcessor. | ||
|
|
||
| Args: | ||
| metadata_key: Metadata key for storing detected language. | ||
| unknown_label: Label for unknown/undetectable languages. | ||
| min_text_length: Minimum text length for detection. | ||
| """ | ||
| self.metadata_key = metadata_key | ||
| self.unknown_label = unknown_label | ||
| self.min_text_length = min_text_length | ||
|
|
||
| def _detect_language(self, text: str) -> str: | ||
| """Detect the language of the given text. | ||
|
|
||
| Args: | ||
| text: The text to detect language for. | ||
|
|
||
| Returns: | ||
| ISO 639-1 language code (e.g., "en", "fr", "zh") or unknown_label. | ||
| """ | ||
| if not text or len(text.strip()) < self.min_text_length: | ||
| return self.unknown_label | ||
|
|
||
| try: | ||
| return detect(text) | ||
| except LangDetectException: | ||
| return self.unknown_label | ||
|
|
||
| async def call( | ||
| self, part_stream: AsyncIterator[ProcessorPart] | ||
| ) -> AsyncIterator[ProcessorPart]: | ||
| """Process parts and add language detection to text parts. | ||
|
|
||
| Args: | ||
| part_stream: Stream of ProcessorPart objects. | ||
|
|
||
| Yields: | ||
| ProcessorPart objects with language metadata added to text parts. | ||
| """ | ||
| async for part in part_stream: | ||
| if is_text(part.mimetype) and part.text is not None: | ||
| # Detect language and add to metadata | ||
| detected_language = self._detect_language(part.text) | ||
| part.metadata[self.metadata_key] = detected_language | ||
|
|
||
| yield part | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| # This workflow will install Python dependencies, run tests and lint with a variety of Python versions | ||
| # For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python | ||
|
|
||
| name: Python package | ||
|
|
||
| on: | ||
| push: | ||
| branches: [ "main" ] | ||
| pull_request: | ||
| branches: [ "main" ] | ||
|
|
||
| jobs: | ||
| build: | ||
|
|
||
| runs-on: ubuntu-latest | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| python-version: ["3.11", "3.12", "3.13"] | ||
|
|
||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - name: Set up Python ${{ matrix.python-version }} | ||
| uses: actions/setup-python@v3 | ||
| with: | ||
| python-version: ${{ matrix.python-version }} | ||
| - name: Install dependencies | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| pip install . | ||
| pip install .[contrib] | ||
| pip install .[dev] | ||
| - name: Lint with flake8 | ||
| run: | | ||
| # stop the build if there are Python syntax errors or undefined names | ||
| flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics | ||
| # We run other checks in advisory mode, so --exit-zero treats all errors as warning. | ||
| # Line length is enforced by internal pylint so here we set a very generous limit. | ||
| flake8 . --count --exit-zero --max-complexity=10 --max-line-length=500 --indent-size=2 --statistics | ||
| - name: Test with pytest | ||
| run: | | ||
| pytest |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| """Quick import test for LanguageDetectProcessor.""" | ||
|
|
||
| import sys | ||
|
|
||
| def test_imports(): | ||
| """Test that all imports work correctly.""" | ||
| try: | ||
| print("Testing imports...") | ||
|
|
||
| # Test core imports | ||
| from genai_processors.core.part_processor import PartProcessor | ||
| print("✓ PartProcessor imported") | ||
|
|
||
| from genai_processors.core.processor_part import ProcessorPart | ||
| print("✓ ProcessorPart imported") | ||
|
|
||
| from genai_processors.core.utils import is_text | ||
| print("✓ is_text imported") | ||
|
|
||
| # Test langdetect | ||
| try: | ||
| from langdetect import detect | ||
| print("✓ langdetect imported") | ||
| except ImportError: | ||
| print("✗ langdetect not installed - run: pip install langdetect") | ||
| return False | ||
|
|
||
| # Test LanguageDetectProcessor | ||
| from genai_processors.contrib.language_detect_processor import LanguageDetectProcessor | ||
| print("✓ LanguageDetectProcessor imported") | ||
|
|
||
| # Test instantiation | ||
| processor = LanguageDetectProcessor() | ||
| print("✓ LanguageDetectProcessor instantiated") | ||
|
|
||
| # Test from contrib module | ||
| from genai_processors.contrib import LanguageDetectProcessor as LDP | ||
| print("✓ LanguageDetectProcessor imported from contrib") | ||
|
|
||
| print("\n✅ All imports successful!") | ||
| return True | ||
|
|
||
| except Exception as e: | ||
| print(f"\n✗ Error: {e}") | ||
| import traceback | ||
| traceback.print_exc() | ||
| return False | ||
|
|
||
| if __name__ == "__main__": | ||
| success = test_imports() | ||
| sys.exit(0 if success else 1) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| """Contrib tests module.""" |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.