Apply special cases to tokens split by infixes #5772
Conversation
* Combine methods for special case and cache checking to simplify code
* Check for special cases after splitting on infixes
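To make the second point concrete, here is a minimal pure-Python sketch of the behavior change, not spaCy's actual implementation: the names `SPECIAL_CASES`, `INFIX_RE`, and `tokenize` are illustrative, and the infix pattern is reduced to just dashes for the example.

```python
import re

# Illustrative only: special cases map an exact string to a predefined split.
SPECIAL_CASES = {"don't": ["do", "n't"]}
# Illustrative infix pattern (real tokenizers use a much larger set).
INFIX_RE = re.compile(r"(--|—)")

def tokenize(text):
    tokens = []
    for word in text.split():
        # Split the whitespace-delimited substring on infixes first.
        substrings = [s for s in INFIX_RE.split(word) if s]
        for sub in substrings:
            # The change in this PR: each substring produced by infix
            # splitting is also checked against the special cases,
            # rather than only the whole whitespace-delimited string.
            tokens.extend(SPECIAL_CASES.get(sub, [sub]))
    return tokens
```

With this sketch, `tokenize("don't—stop")` yields `["do", "n't", "—", "stop"]`, whereas checking special cases only before infix splitting would leave `"don't"` intact next to the dash.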
Looks good, and it'll be nice to have this inconsistency resolved. The infixes weren't part of the original tokenizer design, so I was never 100% sure how they should interact with the special cases. Have you evaluated the change to check what tokenizes differently in a sample of text? I don't think the runtime will be affected, but it's good to check that quickly as well. I'm also a little worried about how breaking this might be for people's tokenization rules. It's awkward to add an option, but if this makes a big difference, I suppose that's something we could consider too.
I'll put the "needs evaluation (and maybe benchmark)" under a 'request changes' mark to remind myself not to merge this yet.
I benchmarked the speed in advance (no measurable difference), but you're right that it needs a more thorough evaluation on all the training corpora. (I think it's going to be rare enough that it's hard to find cases where it makes a difference.) The other option is to extend the matcher-based special case handling. Right now the matcher-based special cases are only those that contain affix punctuation (for speed), but you could add an option to use the matcher-based handling for all the special cases, which would address this. It would be slower overall, but potentially more consistent. I'm not sure whether fast-without-infixes vs. slow-with-infixes is the setting I'd want, though. Because the matcher special cases match over all the tokens, you couldn't get slow-without-infixes; the possible combinations are fast-with-infixes, fast-without-infixes, or slow-with-infixes, if that makes sense.
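The matcher-based idea described above can be sketched as a post-pass over the finished token stream; this is a hedged illustration, not spaCy's actual matcher code, and `SPECIAL` and `MAX_LEN` are assumed names. Because it scans the complete token sequence, it catches special cases however the tokens were originally split, which is why slow-without-infixes isn't a possible combination.

```python
# Illustrative only: special cases keyed by their joined surface text.
SPECIAL = {"don't": ["do", "n't"]}
MAX_LEN = 3  # assumed longest special case, measured in tokens

def apply_special_cases(tokens):
    # Post-pass: retokenize any span of up to MAX_LEN adjacent tokens
    # whose concatenated text matches a special case. (Simplified: this
    # ignores whitespace and character offsets between tokens.)
    out, i = [], 0
    while i < len(tokens):
        for n in range(MAX_LEN, 0, -1):
            joined = "".join(tokens[i:i + n])
            if joined in SPECIAL:
                out.extend(SPECIAL[joined])
                i += n
                break
        else:
            # No special case starts at token i; keep it unchanged.
            out.append(tokens[i])
            i += 1
    return out
```

For example, `apply_special_cases(["don", "'t", "—", "stop"])` merges the first two tokens and re-splits them as `["do", "n't", "—", "stop"]`, regardless of which rule produced the original split.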
Wow, I was really wrong about this:
I guess we could have a slow-with-infixes option for people who really want the more consistent behavior in a particular instance, but I don't think we should make this the default. I ran it on some English Wikipedia text and in a lot of cases it's better, like contractions next to an em dash or
Ouch :( Okay, I think this will be too breaking, especially for users with special cases. Should we just close this? I don't want to have too many of these obscure options, and the tokenizer is already complex enough; I'd rather not carry unused code.
I would keep the simplified code bits and consider adding an option for the thorough matcher-based special case handling, since I think it could be useful in some cases to have this be consistent rather than behave like the current tokenizer.
Description
Fixes #5598.