
BREAKING FEAT: introduce word-level converter #847


Open
wants to merge 48 commits into
base: main

Conversation

paulinek13
Contributor

@paulinek13 paulinek13 commented Mar 31, 2025

Description

This PR introduces a new base class called WordLevelConverter, which simplifies the creation of word-level converters by providing a reusable foundation that standardizes word selection for transformation and reduces code duplication across similar converters.

The key benefit is that a new converter only needs to implement the word transformation logic (convert_word_async), while the base class handles word selection, iteration, and assembling the final result.
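To illustrate the division of labor described above, here is a minimal sketch, not the actual PyRIT implementation. The names WordLevelConverterSketch, UppercaseSketch, and select_indices are hypothetical, and splitting on a single space is an assumption made to keep the example short.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import List


class WordLevelConverterSketch(ABC):
    """Hypothetical stand-in for the base class; not PyRIT's actual code."""

    @abstractmethod
    async def convert_word_async(self, word: str) -> str:
        """Transform a single word; the only method a subclass has to write."""

    def select_indices(self, words: List[str]) -> List[int]:
        # Assumed default: select every word (the "all" mode).
        return list(range(len(words)))

    async def convert_async(self, prompt: str) -> str:
        # Assumption: words are separated by single spaces.
        words = prompt.split(" ")
        for i in self.select_indices(words):
            words[i] = await self.convert_word_async(words[i])
        return " ".join(words)


class UppercaseSketch(WordLevelConverterSketch):
    """Toy subclass: only the per-word logic lives here."""

    async def convert_word_async(self, word: str) -> str:
        return word.upper()


result = asyncio.run(UppercaseSketch().convert_async("hello big world"))
print(result)  # HELLO BIG WORLD
```

The subclass contains no splitting, looping, or joining code, which is the duplication the base class is meant to remove.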

Word selection strategies/modes

The base class supports various word selection modes through the select_word_indices util function:

  • all - convert/transform every word in the prompt (default)
  • keywords - only specific words provided in a keywords list
  • random - a random subset of words based on a specified percentage
  • regex - words that match a regular expression pattern
  • custom - only specifically chosen word indices
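The five modes above could be dispatched roughly as follows. This is a sketch under stated assumptions: the function name select_word_indices_sketch, its parameter names (keywords, proportion, pattern, indices, rng), and the rounding rule for the random mode are all hypothetical and may differ from the real select_word_indices helper.

```python
import random
import re
from typing import List, Optional, Sequence


def select_word_indices_sketch(
    words: Sequence[str],
    mode: str = "all",
    *,
    keywords: Optional[Sequence[str]] = None,
    proportion: float = 0.5,
    pattern: str = "",
    indices: Optional[Sequence[int]] = None,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return the indices of the words to transform (hypothetical signature)."""
    if mode == "all":
        return list(range(len(words)))
    if mode == "keywords":
        wanted = set(keywords or [])
        return [i for i, word in enumerate(words) if word in wanted]
    if mode == "random":
        rng = rng or random.Random()
        # Assumption: round the sample size down.
        count = int(len(words) * proportion)
        return sorted(rng.sample(range(len(words)), count))
    if mode == "regex":
        compiled = re.compile(pattern)
        return [i for i, word in enumerate(words) if compiled.search(word)]
    if mode == "custom":
        # Assumption: silently drop out-of-range indices.
        return [i for i in (indices or []) if 0 <= i < len(words)]
    raise ValueError(f"unknown selection mode: {mode!r}")


words = "tell me a secret story".split()
all_idx = select_word_indices_sketch(words)
kw_idx = select_word_indices_sketch(words, "keywords", keywords=["secret"])
rx_idx = select_word_indices_sketch(words, "regex", pattern=r"^s")
custom_idx = select_word_indices_sketch(words, "custom", indices=[0, 9])
rand_idx = select_word_indices_sketch(
    words, "random", proportion=0.4, rng=random.Random(0)
)
print(all_idx, kw_idx, rx_idx, custom_idx, len(rand_idx))
```

Passing a seeded random.Random keeps the random mode reproducible in tests without touching global state.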

List of refactored prompt converters

The following converters have been refactored to use the new base class:

  • BinaryConverter
  • CharSwapGenerator
  • EmojiConverter
  • LeetspeakConverter
  • ROT13Converter
  • StringJoinConverter
  • TextToHexConverter
  • UnicodeReplacementConverter

Note: I'm not sure if all the prompt converters that I've refactored should be word-level based, or if there are other converters that haven't been refactored that would benefit from this base class.

Related: #818 (comment)


Tests and Documentation

Updated docs and tests

@romanlutz romanlutz requested a review from rlundeen2 March 31, 2025 19:53
@paulinek13 paulinek13 force-pushed the feat/add_word_level_converter branch 2 times, most recently from 863fc37 to 8d88759 Compare April 12, 2025 18:06
@paulinek13 paulinek13 force-pushed the feat/add_word_level_converter branch from 8d88759 to 12b79e1 Compare April 12, 2025 18:09
@paulinek13
Contributor Author

Thanks for the reviews! I'll make the requested changes and let you know once they’re all ready (might be a few days 😃)

@romanlutz romanlutz mentioned this pull request Apr 18, 2025
@paulinek13 paulinek13 force-pushed the feat/add_word_level_converter branch from 54c574e to 14cb1a5 Compare April 27, 2025 17:30
@romanlutz
Contributor

Zalgo is merged now so you can add it as well 🙂

@romanlutz
Contributor

@paulinek13 there are still two open comment threads as far as I can see. Let me know if I should elaborate on anything!

@paulinek13
Contributor Author

paulinek13 commented Apr 29, 2025

@romanlutz Thanks!

Yes, I'm aware of the remaining threads. I haven't resolved them yet because I'm just not quite satisfied with my original approach to initializing the word-level converters 🙂

After giving it some more thought, I believe it might be cleaner to move the selection configuration into separate methods, for example:

converter = CharSwapConverter(max_iterations=1).select_random(proportion=0.5)

This seems to reduce repetition in both docs and code, and could offer some other maintainability benefits as well (like making it easier to add a new converter based on WordLevelConverter without having to copy-paste the args part of the docstring every time).

So this should address the following: #847 (comment) and #847 (comment)

Seems like a nicer pattern overall. I have to say I like it 😄 No mode_kwargs, just methods that change which words are selected:

class WordLevelConverter(PromptConverter):
    # ...
    @final
    def select_keywords(self: T, keywords: List[str]) -> T:
        """Configure the converter to only process words matching specific keywords."""
        self._selection_mode = "keywords"
        self._selection_keywords = keywords
        return self
    # ...

I hope it makes sense. I'll push this change shortly for you to review (if it doesn't work out, we can always revert it 😃)

BTW, this PR’s taking a bit more time than planned, so thanks for bearing with me 😅

Edit:
The related commit: use special methods instead of kwargs for word selection configuration
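For illustration, a self-contained sketch of the chaining pattern proposed in this comment. It is not the code from the commit: FluentSelectionSketch and CharSwapSketch are hypothetical stand-ins, and the proportion range check is an assumption.

```python
from typing import List, Optional, TypeVar

# Self-typed TypeVar so chained calls keep the subclass type.
T = TypeVar("T", bound="FluentSelectionSketch")


class FluentSelectionSketch:
    """Hypothetical base holding only the selection configuration."""

    def __init__(self) -> None:
        self._selection_mode = "all"
        self._selection_keywords: List[str] = []
        self._selection_proportion: Optional[float] = None

    def select_keywords(self: T, keywords: List[str]) -> T:
        """Only process words that match the given keywords."""
        self._selection_mode = "keywords"
        self._selection_keywords = keywords
        return self

    def select_random(self: T, proportion: float) -> T:
        """Only process a random share of the words."""
        # Assumption: proportion is a fraction between 0 and 1.
        if not 0.0 <= proportion <= 1.0:
            raise ValueError("proportion must be between 0 and 1")
        self._selection_mode = "random"
        self._selection_proportion = proportion
        return self


class CharSwapSketch(FluentSelectionSketch):
    """Hypothetical converter; its constructor takes only its own options."""

    def __init__(self, max_iterations: int = 1) -> None:
        super().__init__()
        self.max_iterations = max_iterations


converter = CharSwapSketch(max_iterations=1).select_random(proportion=0.5)
print(type(converter).__name__, converter._selection_mode)
```

Because each select_* method returns self, the constructor signature and docstring of a subclass stay free of selection arguments, which is the maintainability benefit mentioned above.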

4 participants