Apply special cases to tokens split by infixes #5772
Conversation
* Combine methods for special case and cache checking to simplify code
* Check for special cases after splitting on infixes
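To make the second point concrete, here is a minimal pure-Python sketch of the behavior change, not spaCy's actual implementation: the names `SPECIAL_CASES`, `INFIX_RE`, and `tokenize` are illustrative, and the infix pattern is reduced to just dashes for the example.

```python
import re

# Illustrative only: special cases map an exact string to a predefined split.
SPECIAL_CASES = {"don't": ["do", "n't"]}
# Illustrative infix pattern (real tokenizers use a much larger set).
INFIX_RE = re.compile(r"(--|—)")

def tokenize(text):
    tokens = []
    for word in text.split():
        # Split the whitespace-delimited substring on infixes first.
        substrings = [s for s in INFIX_RE.split(word) if s]
        for sub in substrings:
            # The change in this PR: each substring produced by infix
            # splitting is also checked against the special cases,
            # rather than only the whole whitespace-delimited string.
            tokens.extend(SPECIAL_CASES.get(sub, [sub]))
    return tokens
```

With this sketch, `tokenize("don't—stop")` yields `["do", "n't", "—", "stop"]`, whereas checking special cases only before infix splitting would leave `"don't"` intact next to the dash.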
Looks good, and it'll be nice to have this inconsistency resolved. The infixes weren't part of the original tokenizer design, so I was never 100% sure how they should interact with the special cases. Have you evaluated the change to check what tokenizes differently in a sample of text? I don't think the runtime will be affected, but it's good to check that quickly as well. I'm also a little worried about how breaking this might be for people's tokenization rules. It's awkward to add an option, but if this makes a big difference, I suppose that's something we could consider too.
I'll put the "needs evaluation (and maybe benchmark)" under a 'request changes' mark to remind myself not to merge this yet.
I benchmarked the speed in advance (no measurable difference), but you're right that it needs a more thorough evaluation on all the training corpora. (I think it's going to be rare enough that it's hard to find cases where it makes a difference.) The other option is to extend the matcher-based special case handling. Right now the matcher-based special cases are only those that contain affix punctuation (for speed), but you could add an option to use the matcher-based handling for all the special cases, which would address this. It would be slower overall, but potentially more consistent. I'm not sure whether fast-without-infixes vs. slow-with-infixes is the setting I'd want, though. Because the matcher special cases match over all the tokens, you couldn't get slow-without-infixes; the possible combinations are fast-with-infixes, fast-without-infixes, or slow-with-infixes, if that makes sense.
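The matcher-based idea described above can be sketched as a post-pass over the finished token stream; this is a hedged illustration, not spaCy's actual matcher code, and `SPECIAL` and `MAX_LEN` are assumed names. Because it scans the complete token sequence, it catches special cases however the tokens were originally split, which is why slow-without-infixes isn't a possible combination.

```python
# Illustrative only: special cases keyed by their joined surface text.
SPECIAL = {"don't": ["do", "n't"]}
MAX_LEN = 3  # assumed longest special case, measured in tokens

def apply_special_cases(tokens):
    # Post-pass: retokenize any span of up to MAX_LEN adjacent tokens
    # whose concatenated text matches a special case. (Simplified: this
    # ignores whitespace and character offsets between tokens.)
    out, i = [], 0
    while i < len(tokens):
        for n in range(MAX_LEN, 0, -1):
            joined = "".join(tokens[i:i + n])
            if joined in SPECIAL:
                out.extend(SPECIAL[joined])
                i += n
                break
        else:
            # No special case starts at token i; keep it unchanged.
            out.append(tokens[i])
            i += 1
    return out
```

For example, `apply_special_cases(["don", "'t", "—", "stop"])` merges the first two tokens and re-splits them as `["do", "n't", "—", "stop"]`, regardless of which rule produced the original split.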
Wow, I was really wrong about this:
I guess we could have a slow-with-infixes option for people who really want the more consistent behavior in a particular instance, but I don't think we should make this the default. I ran it on some English Wikipedia text and in a lot of cases it's better, like contractions next to an em dash or
Ouch :( Okay, I think this will be too breaking, especially for users with special cases. Should we just close this? I don't want to have too many of these obscure options, and the tokenizer is already complex enough; I'd rather not carry unused code.
I would keep the simplified code bits and consider adding an option for the thorough matcher-based special case handling, since I think it could be useful in some cases to have this be consistent rather than behave like the current tokenizer.
Description
Fixes #5598.