openzim/mwoffliner#2318 exhibits an issue around Xapian.
Digging deeper, I narrowed down the problem:
- it is linked to the item title
- when the item title is the `ي` character repeated 122 times or fewer, everything is fine
- when the item title is the `ي` character (or any other 2-byte UTF-8 character) repeated 123 times or more, we get the Xapian error
- when the item title is the `ࠄ` character (or any other 3-byte UTF-8 character) repeated 82 times or more, we get the Xapian error
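A quick check of the UTF-8 byte lengths suggests that the thresholds above are about bytes, not characters: the largest working title is 244 bytes, and both failing cases are 246 bytes. This would be consistent with Xapian's documented maximum term length of 245 bytes, but that link is an assumption here, not something confirmed in libzim. The sketch below only does the arithmetic and does not touch libzim or Xapian; `utf8_len` is a hypothetical helper name.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# The three cases reported above: "ي" is 2 bytes in UTF-8, "ࠄ" is 3 bytes.
cases = {
    "2-byte char x 122 (works)": "ي" * 122,
    "2-byte char x 123 (fails)": "ي" * 123,
    "3-byte char x 82 (fails)": "ࠄ" * 82,
}

for label, title in cases.items():
    print(f"{label}: {utf8_len(title)} bytes")
```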
I reproduced the issue with both python-libzim and node-libzim.
Here is a minimal Python snippet reproducing the error with a 2-byte character (it fails when `i` is 123):
```python
from pathlib import Path

from zimscraperlib.zim import Creator, metadata

creator = Creator(Path("tests.zim"), "index.html").config_metadata(
    std_metadata=metadata.DEFAULT_DEV_ZIM_METADATA
)
# start creator early to detect any problem early as well
creator.start()
creator.set_mainpath("index")
creator.add_item_for("index", "Main Page", content="any", is_front=True)

for i in range(256):
    print(i)
    path = f"path{i}"
    # title made of i copies of a 2-byte UTF-8 character
    title = "ي" * i
    creator.add_item_for(path, title, content="any", is_front=True)

creator.finish()
```
I will implement an interim fix in mwoffliner, but we probably need to either fix this issue or document this limitation (if that is not already done; I might have missed it).
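One possible shape for such an interim workaround is to cap the title at a byte budget before handing it to the creator. The sketch below is in Python to match the reproduction snippet (mwoffliner itself is TypeScript), and everything in it is an assumption: `truncate_utf8` is a hypothetical helper, and the default of 244 bytes is simply the largest title size observed to work in this report, not a documented limit.

```python
def truncate_utf8(title: str, max_bytes: int = 244) -> str:
    """Shorten `title` so its UTF-8 encoding fits in `max_bytes`.

    The cut never leaves a partial multi-byte character behind.
    """
    encoded = title.encode("utf-8")
    if len(encoded) <= max_bytes:
        return title
    # Slicing bytes may cut through a character; errors="ignore"
    # drops that incomplete trailing sequence when decoding.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# A failing title from the report becomes one that fits the budget.
safe_title = truncate_utf8("ي" * 123)
print(len(safe_title), len(safe_title.encode("utf-8")))
```

Note that this changes the title readers see, and it may separate a base character from a following combining mark, so it is only a stopgap until the limitation is fixed or documented.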