Description of the feature request:
It would be helpful to have a simple processor in genai_processors/contrib that converts all text in incoming ProcessorParts to lowercase. This would make it easier for users to build normalization pipelines.
Proposed API:
Location: genai_processors/contrib/lowercase_text_processor.py
Class: LowercaseTextProcessor
Inherits from: PartProcessor
Logic: If the part is text (is_text(part.mimetype)), convert to lowercase; else, yield unchanged. All metadata is preserved.
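The logic above could be sketched roughly as follows. This is a library-independent illustration, not the actual implementation: `Part` is a hypothetical stand-in for `ProcessorPart`, the local `is_text` is an assumed prefix check standing in for the library helper, and the real class would inherit from `PartProcessor` rather than being a plain class.

```python
import asyncio
import dataclasses
from typing import Any, AsyncIterator


@dataclasses.dataclass(frozen=True)
class Part:
    """Hypothetical stand-in for ProcessorPart (not the library class)."""
    text: str = ""
    mimetype: str = "text/plain"
    metadata: dict[str, Any] = dataclasses.field(default_factory=dict)


def is_text(mimetype: str) -> bool:
    # Assumption: a simple prefix check standing in for the library helper.
    return mimetype.startswith("text/")


class LowercaseTextProcessor:
    """Sketch only; the proposed class would inherit from PartProcessor."""

    async def call(self, part: Part) -> AsyncIterator[Part]:
        if is_text(part.mimetype):
            # Copy the part, changing only the text, so metadata is preserved.
            yield dataclasses.replace(part, text=part.text.lower())
        else:
            # Non-text parts pass through untouched.
            yield part


async def run(parts: list[Part]) -> list[Part]:
    # Helper for illustration: apply the processor to each part in order.
    processor = LowercaseTextProcessor()
    return [out for p in parts async for out in processor.call(p)]
```

For example, `asyncio.run(run([Part(text="Hello WORLD")]))` would return a single part whose text is `"hello world"`, with its mimetype and metadata unchanged.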
What problem are you trying to solve with this feature?
- Tokenizers may split "Hello", "hello", and "HELLO" into different numbers of tokens. Lowercasing ensures that all three are treated the same.
- Improved search and matching
Any other information you'd like to share?
No response