max_length of nlp pipeline for e.g. Japanese #13207

@JWittmeyer

Description

Not sure if this is meant to happen or whether it's a misunderstanding on my part. I'm assuming it's a misunderstanding, so I'm filing this as a documentation report.

The Language (nlp) class has a max_length parameter that seems to work differently for e.g. Japanese.

I'm currently trying to chunk texts that are too long by considering max_length and splitting based on it. For e.g. English texts this seems to work without any issues.

Basic approach code:

if len(content) > nlp.max_length:
    # split into chunks a bit below max_length and process each one separately
    for chunk in __chunk_text(content, nlp.max_length - 100):
        doc = nlp(chunk)
        # ...
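
For context, __chunk_text is just my own small helper; a rough sketch of what it does (simple character-based slicing, nothing spaCy-specific, shown here only for illustration):

def __chunk_text(text: str, chunk_size: int):
    # illustrative sketch only: naive slicing by character count
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]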

However, for the Japanese model ja_core_news_sm this doesn't work.
After a bit of analysis I noticed that it's not the character length but the byte count that needs to be considered.

def __utf8len(s: str) -> int:
    # length of the string in UTF-8 bytes rather than characters
    return len(s.encode("utf-8"))

if __utf8len(content) > nlp.max_length:
    # ...
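
A byte-aware variant of the chunking helper (again just a sketch: it still slices by characters, but tracks the encoded size so that no chunk exceeds a given byte budget) could look like this:

def __chunk_text_by_bytes(text: str, max_bytes: int):
    # illustrative sketch: accumulate characters until the UTF-8 size
    # of the current chunk would exceed max_bytes
    chunk, size = [], 0
    for ch in text:
        ch_bytes = len(ch.encode("utf-8"))
        if size + ch_bytes > max_bytes and chunk:
            yield "".join(chunk)
            chunk, size = [], 0
        chunk.append(ch)
        size += ch_bytes
    if chunk:
        yield "".join(chunk)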

However, even with the byte-based approach I run into an error that looks max_length-related, but maybe isn't.

Slightly reduced error trace:

    doc = nlp(content)
  File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1014, in __call__
    doc = self._ensure_doc(text)
  File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1105, in _ensure_doc
    return self.make_doc(doc_like)
  File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1097, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.9/site-packages/spacy/lang/ja/__init__.py", line 56, in __call__
    sudachipy_tokens = self.tokenizer.tokenize(text)
Exception: Tokenization error: Input is too long, it can't be more than 49149 bytes, was 63960

I also double-checked the values for max_length (1000000), string length (63876) & byte length (63960).
Setting max_length by hand to 1100000 didn't change the error message, so I'm assuming something else (maybe Sudachi itself?) defines the "Input is too long" limit.
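
If that assumption is correct, the only workaround I currently see is to chunk against the byte limit from the error message instead of nlp.max_length, roughly like this (just a sketch, reusing the byte-aware helper from above; the limit constant is simply taken from the error text):

SUDACHI_BYTE_LIMIT = 49149  # value taken from the error message above

if __utf8len(content) > SUDACHI_BYTE_LIMIT:
    for chunk in __chunk_text_by_bytes(content, SUDACHI_BYTE_LIMIT - 100):
        doc = nlp(chunk)
        # ...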

An explanation of what the actual issue is and how to solve it (e.g. how to look up the relevant size limits) would be great to have in the documentation.

Which page or section is this issue related to?

Not sure where to add this, since I'm not sure whether it's specifically Japanese-related. However, a note might be interesting at https://spacy.io/models/ja or https://spacy.io/usage/models#japanese.

Further, the note on max_length in general might need an extension (if my assumption is correct): the relevant length isn't the classic Python len(<string>) character count but the byte size (e.g. the letter "I" is len 1 / 1 byte, the kanji "私" is len 1 / 3 bytes).
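
A quick illustration of the character-count vs. byte-size difference in plain Python:

for s in ("I", "私"):
    print(s, len(s), len(s.encode("utf-8")))
# I 1 1
# 私 1 3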

Labels: docs (Documentation and website), lang / ja (Japanese language data and models)
