The ZIMClient.random_article function also returns media files with new zim files #26

@dylanmccall

With new-style zim files, both articles and their assets appear in the same "C" namespace:

https://www.openzim.org/wiki/ZIM_file_format#Namespaces

ZIMClient.random_article chooses a random index from the "C" namespace, assuming that every entry in that namespace is an article. As a result, it often returns non-article entries, such as images, instead of articles.

The issue is particularly prominent with the following zim file, which contains a very large number of images: https://download.kiwix.org/zim/.hidden/endless/wikihow_en_endless_holidays-and-traditions_2021-12.zim

For reference, here is how this is implemented in libzim: https://github.com/openzim/libzim/blob/master/src/archive.cpp#L267-L284. It looks like we would need to make use of the title index.
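
A rough sketch of what the title-index approach could look like, to make the idea concrete. This is not ZIMClient's current code. It assumes the article-only title listing (which, as I understand the format documentation, new-style files store at `X/listing/titleOrdered/v1`) is a packed array of little-endian 32-bit entry indexes, and that its raw bytes have already been read. The names `parse_title_listing` and `pick_random_article_index` are hypothetical.

```python
import random
import struct
from typing import List, Optional


def parse_title_listing(blob: bytes) -> List[int]:
    """Decode a listing blob into entry indexes.

    Assumption: the blob is a sequence of little-endian uint32 values,
    one per article entry. Trailing bytes that do not form a full
    4-byte value are ignored.
    """
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}I", blob[: count * 4]))


def pick_random_article_index(listing_blob: bytes, rng=random) -> Optional[int]:
    """Pick a random entry index from the article-only listing.

    Returns None when the listing is empty, so the caller can fall
    back to another strategy (hypothetical behaviour, not libzim's).
    """
    indexes = parse_title_listing(listing_blob)
    if not indexes:
        return None
    return rng.choice(indexes)
```

Because the choice is made from the listing rather than from the whole "C" namespace, images and other assets are never candidates, provided the listing really contains only articles.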
