Description
I'm not sure whether this is intended behaviour or a misunderstanding on my part. I'm assuming the latter, so I'm filing this as a documentation report.

The `Language` (`nlp`) class has a `max_length` attribute that seems to behave differently for some languages, e.g. Japanese.

I'm currently trying to chunk texts that are too long by considering `max_length` and splitting based on that. For English texts, for example, this works without issues.
Basic approach:

```python
if len(content) > nlp.max_length:
    for chunk in __chunk_text(content, nlp.max_length - 100):
        doc = nlp(chunk)
        # ...
```

However, for the pipeline `ja_core_news_sm` this doesn't work.
After some analysis I noticed that it's not the character length but the byte count that needs to be considered:

```python
def __utf8len(s: str) -> int:
    return len(s.encode('utf-8'))

if __utf8len(content) > nlp.max_length:
    # ...
```

However, even with the byte-based approach I run into an error that looks `max_length`-related, but maybe isn't.
Slightly reduced error trace:

```
doc = nlp(content)
  File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1014, in __call__
    doc = self._ensure_doc(text)
  File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1105, in _ensure_doc
    return self.make_doc(doc_like)
  File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1097, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.9/site-packages/spacy/lang/ja/__init__.py", line 56, in __call__
    sudachipy_tokens = self.tokenizer.tokenize(text)
Exception: Tokenization error: Input is too long, it can't be more than 49149 bytes, was 63960
```
I also double-checked the values: `max_length` (1000000), string length (63876), and byte length (63960).

Setting `max_length` by hand to 1100000 didn't change the error message, so I'm assuming something else (maybe Sudachi itself?) defines the "Input is too long" limit.
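As a workaround sketch, chunking could be done against a UTF-8 byte budget rather than a character count, so that no chunk exceeds the byte limit reported in the error. This is a hypothetical helper, not part of spaCy; the budget value would come from whatever limit the tokenizer actually enforces:

```python
def chunk_text_by_bytes(text: str, max_bytes: int) -> list[str]:
    # Split text into chunks whose UTF-8 encoding stays within
    # max_bytes. Slicing happens on character boundaries, so a
    # multi-byte character is never cut in half.
    chunks: list[str] = []
    current: list[str] = []
    current_bytes = 0
    for ch in text:
        ch_bytes = len(ch.encode("utf-8"))
        if current and current_bytes + ch_bytes > max_bytes:
            chunks.append("".join(current))
            current, current_bytes = [], 0
        current.append(ch)
        current_bytes += ch_bytes
    if current:
        chunks.append("".join(current))
    return chunks
```

Each chunk could then be passed to `nlp(chunk)` individually, as in the character-based approach above.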
It would be great if the documentation explained what the actual issue is and how to solve it (for such input-size limits).
Which page or section is this issue related to?
I'm not sure where to add it, since I'm not sure whether it's specific to Japanese. However, a note might be useful at https://spacy.io/models/ja or https://spacy.io/usage/models#japanese.

Furthermore, the note on `max_length` in general might need extending: if my assumption is correct, the relevant "length" isn't the classic Python `len(<string>)` but the UTF-8 byte size (e.g. the letter "I": len 1, 1 byte; the kanji "私": len 1, 3 bytes).
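The character-vs-byte distinction above can be checked directly in Python:

```python
# Character count vs. UTF-8 byte count for the examples above.
for s in ("I", "私"):
    print(f"{s!r}: len={len(s)}, utf8_bytes={len(s.encode('utf-8'))}")
# → 'I': len=1, utf8_bytes=1
# → '私': len=1, utf8_bytes=3
```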