Apply special cases to tokens split by infixes #5772


Closed

Conversation

adrianeboyd
Contributor

Description

  • Combine methods for special case and cache checking to simplify code
  • Check for special cases after splitting on infixes

Fixes #5598.
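For illustration, here is a minimal pure-Python sketch of the second bullet (this is not spaCy's actual implementation; the infix pattern and the special-case table are toy assumptions): after a string is split on infix punctuation, each resulting substring is checked against the special-case table before being emitted.

```python
import re

# Toy infix pattern and special-case table (assumptions for this sketch,
# not spaCy's real rules).
INFIX_RE = re.compile(r"([/\u2014])")  # slash or em dash, kept as tokens
SPECIAL_CASES = {"don't": ["do", "n't"], "won't": ["wo", "n't"]}

def tokenize(text):
    tokens = []
    for word in text.split():
        # Split on infixes; the capture group keeps the infix itself.
        for piece in (p for p in INFIX_RE.split(word) if p):
            # The change in this PR: special cases are applied to the
            # substrings produced by the infix split as well.
            tokens.extend(SPECIAL_CASES.get(piece, [piece]))
    return tokens

print(tokenize("don't/won't"))  # ['do', "n't", '/', 'wo', "n't"]
```

Without the special-case check after the infix split, the first and last pieces would surface as the unsplit strings `don't` and `won't`.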

Types of change

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@adrianeboyd adrianeboyd added enhancement Feature requests and improvements feat / tokenizer Feature: Tokenizer labels Jul 17, 2020
@svlandeg svlandeg linked an issue Jul 17, 2020 that may be closed by this pull request
@honnibal
Member

Looks good, and it'll be nice to have this inconsistency resolved. The infixes weren't part of the original tokenizer design, so I was never 100% sure how they should interact with the special cases.

Have you evaluated the change to check what tokenizes differently in a sample of text? I don't think the runtime will be affected but it's good to check that quickly as well.

I'm also a little worried about how breaking this might be for people's tokenization rules. It's awkward to add an option, but if this makes a big difference I suppose that's something we could consider too.


@honnibal honnibal left a comment


I'll put the "needs evaluation (and maybe benchmark)" under a 'request changes' mark to remind myself not to merge this yet.

@adrianeboyd
Contributor Author

I benchmarked the speed in advance (no measurable difference), but you're right that it needs a more thorough evaluation for all the training corpora. (I think it's going to be rare enough that it's hard to find cases where it makes a difference.)

The other option is to extend the matcher-based special case handling. Right now the matcher-based special cases are only those that contain affix punctuation (for speed), but an option could apply the matcher-based handling to all the special cases, which would address this. It would be slower overall, but potentially more consistent. I'm not sure I'd want the choice to be between fast-without-infixes and slow-with-infixes, though. Because the matcher special cases match over all the tokens, slow-without-infixes isn't possible: the feasible modes would be fast-with-infixes, fast-without-infixes, or slow-with-infixes, if that makes sense.

@adrianeboyd
Contributor Author

Wow, I was really wrong about this:

| Corpus | current | this PR |
| --- | --- | --- |
| da-core-news | 99.85 | 99.84 |
| de-dep-news | 99.97 | 99.85 |
| el-dep-news | 99.98 | 99.97 |
| en-core-web | 99.92 | 99.43 |
| es-dep-news | 99.99 | 99.99 |
| fr-dep-news | 99.82 | 99.16 |
| it-dep-news | 99.93 | 98.10 |
| lt-dep-news | 100 | 100 |
| nb-core-news | 99.8 | 99.78 |
| nl-dep-news | 99.96 | 99.96 |
| pl-dep-sz | 100 | 99.91 |
| pt-dep-news | 99.93 | 99.86 |
| ro-dep-mixed | 99.84 | 99.84 |

I guess we could have a slow-with-infixes option for people who really wanted the more consistent behavior in a particular instance, but I don't think we should make this the default.

I ran it on some English Wikipedia text, and in a lot of cases it's better, like contractions next to an em dash, slash, or ellipsis, but there are some cases, like German names with -im- in the middle, where it's analyzed like I'm. (Possibly the im, wed, and id cases should be removed as default exceptions, though. Personally, I'd be annoyed by this default.)
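To make the -im- regression concrete, here is a hypothetical pure-Python sketch (the hyphen infix and the contraction-style exception im → i + m are simplified assumptions, not spaCy's exact tables) of how a special case fires inside a hyphenated name once special cases apply after the infix split:

```python
import re

# Assumed contraction-style exception ("im" -> "i" + "m") and a hyphen
# infix; both are simplifications of the real English rules.
SPECIAL_CASES = {"im": ["i", "m"]}

def split_with_special_cases(word):
    out = []
    # Split on hyphens, keeping the hyphen tokens via the capture group.
    for piece in (p for p in re.split(r"(-)", word) if p):
        out.extend(SPECIAL_CASES.get(piece, [piece]))
    return out

# A hyphenated name loses its "im" segment to the contraction rule:
print(split_with_special_cases("Haus-im-Wald"))
# ['Haus', '-', 'i', 'm', '-', 'Wald']
```

Under the pre-PR behavior, the `im` substring produced by the hyphen split would be left alone, which is the outcome one would expect for a name.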

@adrianeboyd adrianeboyd marked this pull request as draft July 20, 2020 09:08
@honnibal
Member

honnibal commented Jul 20, 2020

Ouch :(

Okay, I think this will be too breaking, especially for users with special cases. Should we just close this? I don't want to have too many of these obscure options, and the tokenizer is already complex enough; I'd rather not carry unused code.

@adrianeboyd
Contributor Author

I would keep the simplified code bits and consider adding an option for the thorough matcher-based special case handling, since it could be useful in some cases for this behavior to be consistent rather than matching the current tokenizer.

Successfully merging this pull request may close these issues.

Tokenizer special cases do not work around infix punctuation