Description
I'm not sure whether this is intended behaviour or a misunderstanding on my part. I'm assuming the latter, so I'm filing this as a documentation report.

The `Language` (`nlp`) class has a `max_length` attribute that seems to behave differently for some languages, e.g. Japanese.

I'm currently trying to chunk texts that are too long by considering `max_length` and splitting based on that. For English texts, for example, this works without issues.
Basic approach:

```python
if len(content) > nlp.max_length:
    for chunk in __chunk_text(content, nlp.max_length - 100):
        doc = nlp(chunk)
        # ...
```

However, for the pipeline `ja_core_news_sm` this doesn't work.
After some analysis I noticed that it's not the character length but the byte count that needs to be considered:

```python
def __utf8len(s: str) -> int:
    return len(s.encode('utf-8'))

if __utf8len(content) > nlp.max_length:
    # ...
```

However, even with the byte-based approach I run into an error that looks `max_length`-related, but maybe isn't.
Slightly reduced error trace:

```
doc = nlp(content)
  File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1014, in __call__
    doc = self._ensure_doc(text)
  File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1105, in _ensure_doc
    return self.make_doc(doc_like)
  File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1097, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.9/site-packages/spacy/lang/ja/__init__.py", line 56, in __call__
    sudachipy_tokens = self.tokenizer.tokenize(text)
Exception: Tokenization error: Input is too long, it can't be more than 49149 bytes, was 63960
```
I also double-checked the values: `max_length` (1000000), string length (63876), and byte length (63960).

Setting `max_length` by hand to 1100000 didn't change the error message, so I'm assuming something else (maybe Sudachi itself?) defines the "Input is too long" limit.
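As a workaround sketch, chunking could be done against a UTF-8 byte budget rather than a character count, so that no chunk exceeds the byte limit reported in the error. This is a hypothetical helper, not part of spaCy; the budget value would come from whatever limit the tokenizer actually enforces:

```python
def chunk_text_by_bytes(text: str, max_bytes: int) -> list[str]:
    # Split text into chunks whose UTF-8 encoding stays within
    # max_bytes. Slicing happens on character boundaries, so a
    # multi-byte character is never cut in half.
    chunks: list[str] = []
    current: list[str] = []
    current_bytes = 0
    for ch in text:
        ch_bytes = len(ch.encode("utf-8"))
        if current and current_bytes + ch_bytes > max_bytes:
            chunks.append("".join(current))
            current, current_bytes = [], 0
        current.append(ch)
        current_bytes += ch_bytes
    if current:
        chunks.append("".join(current))
    return chunks
```

Each chunk could then be passed to `nlp(chunk)` individually, as in the character-based approach above.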
It would be great if the documentation explained what the actual issue is and how to solve it (for such input-size limits).
Which page or section is this issue related to?
I'm not sure where to add it, since I'm not sure whether it's specific to Japanese. However, a note might be useful at https://spacy.io/models/ja or https://spacy.io/usage/models#japanese.

Furthermore, the note on `max_length` in general might need extending: if my assumption is correct, the relevant "length" isn't the classic Python `len(<string>)` but the UTF-8 byte size (e.g. the letter "I": len 1, 1 byte; the kanji "私": len 1, 3 bytes).
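The character-vs-byte distinction above can be checked directly in Python:

```python
# Character count vs. UTF-8 byte count for the examples above.
for s in ("I", "私"):
    print(f"{s!r}: len={len(s)}, utf8_bytes={len(s.encode('utf-8'))}")
# → 'I': len=1, utf8_bytes=1
# → '私': len=1, utf8_bytes=3
```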