Add LanguageDetectProcessor for automatic lang detection #14

@ArchishmanSengupta

Description of the feature request:

I propose adding a LanguageDetectProcessor. This processor would automatically detect the language of each text ProcessorPart and add the detected language code (e.g., "en", "fr", "zh") to the part’s metadata.

To keep dependencies minimal, we could use the langdetect package (a Python port of the language-detection library originally hosted on Google Code) for the detection.

Proposed API:
Location: genai_processors/contrib/language_detect_processor.py
Class: LanguageDetectProcessor
Inherits from: PartProcessor

Logic:
If the part is text (is_text(part.mimetype)), detect the language using a library like langdetect and add the detected language code (e.g., "en", "fr") to the part’s metadata (default key: "language"). Optionally, store detection confidence. If detection fails, set language to "unknown" (configurable). All other metadata is preserved. Non-text parts are yielded unchanged.
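To make the logic above concrete, here is a minimal sketch of the per-part behaviour. It is only an illustration under stated assumptions: `Part` is a hypothetical stand-in for ProcessorPart, `toy_detector` is a placeholder rather than a real detector, and the parameter names (`metadata_key`, `unknown_label`, `confidence_key`) are suggestions, not an existing API. In an actual implementation the detector would presumably wrap a library call such as langdetect's `detect_langs`, and the mimetype check would use the library's own `is_text` helper.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Callable, Optional, Tuple


# Hypothetical stand-in for ProcessorPart, used only to keep the sketch self-contained.
@dataclass
class Part:
    text: str
    mimetype: str = "text/plain"
    metadata: dict = field(default_factory=dict)


# A detector returns (language_code, confidence) or raises if detection fails.
Detector = Callable[[str], Tuple[str, float]]


def toy_detector(text: str) -> Tuple[str, float]:
    """Placeholder detector; a real one would wrap a library such as langdetect."""
    if not text.strip():
        raise ValueError("no features in text")
    if any("\u0980" <= ch <= "\u09ff" for ch in text):  # Bengali Unicode block
        return "bn", 1.0
    return "en", 0.5


async def detect_language(
    part: Part,
    detector: Detector = toy_detector,
    metadata_key: str = "language",
    unknown_label: str = "unknown",
    confidence_key: Optional[str] = None,
) -> AsyncIterator[Part]:
    """Yield the part with a language code added to its metadata."""
    if not part.mimetype.startswith("text/"):
        yield part  # non-text parts pass through unchanged
        return
    try:
        lang, confidence = detector(part.text)
    except Exception:
        # Detection failed: fall back to the configurable "unknown" label.
        lang, confidence = unknown_label, 0.0
    new_metadata = dict(part.metadata)  # copy so existing metadata is preserved
    new_metadata[metadata_key] = lang
    if confidence_key is not None:
        new_metadata[confidence_key] = confidence
    yield Part(part.text, part.mimetype, new_metadata)


async def run(parts):
    """Apply detect_language to each part and collect the results."""
    out = []
    for p in parts:
        async for result in detect_language(p, confidence_key="language_confidence"):
            out.append(result)
    return out
```

Passing the detector in as a callable keeps the detection library swappable and makes the fallback path easy to test without installing it.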

Usage:

from genai_processors.contrib import LanguageDetectProcessor

processor = LanguageDetectProcessor()
async for part in processor(part_stream):
    print(part.metadata["language"])  # e.g., "en", "fr"

What problem are you trying to solve with this feature?

Many pipelines process data in multiple languages. In my use case, for example, we receive a lot of text in Bengali and other Indic languages. Automatic language detection enables downstream processors to route, filter, or apply language-specific logic.

Any other information you'd like to share?

If this seems like a good and reasonable feature to implement, I would like to have this issue assigned to me and complete it following the API design proposed above.

Open to discussions.
