mw.xml_dump crashes when encountering LiquidThreads #12

@he7d3r

When I run the following code:

from mw import xml_dump
import sys

def page_info(dump, path):
    for page in dump:
        yield page.id, page.namespace, page.title

for page_id, page_namespace, page_title in xml_dump.map(["ptwikibooks-20140905-pages-meta-current.xml"], page_info):
    print("\t".join([str(page_id), str(page_namespace), page_title]))

I get the following error after a few moments:

35575   0   Geometria descritiva/Introdução
Failed while processing dump 'ptwikibooks-20140905-pages-meta-current.xml': 
Traceback (most recent call last):
  File "<xml_dump>/processor.py", line 35, in run
    for out in self.process_dump(dump, path):
  File "lqt.py", line 7, in page_info
    for page in dump:
  File "<xml_dump>/iteration/iterator.py", line 112, in load_pages
    yield Page.from_element(sub_element)
  File "<xml_dump>/iteration/page.py", line 110, in from_element
    "a <page>: '{0}'".format(tag))
mw.xml_dump.errors.MalformedXML: Unexpected tag found when processing a <page>: 'DiscussionThreading'

35576   0   Geometria descritiva
Traceback (most recent call last):
  File "lqt.py", line 10, in <module>
    for page_id, page_namespace, page_title in xml_dump.map(["ptwikibooks-20140905-pages-meta-current.xml"], page_info):
  File "<xml_dump>/map.py", line 86, in map
    re_raise(error, processor)
  File "<xml_dump>/map.py", line 12, in re_raise
    raise error
mw.xml_dump.errors.MalformedXML: Unexpected tag found when processing a <page>: 'DiscussionThreading'
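
The traceback above suggests that the parser rejects any child of <page> that it does not recognize, and that LiquidThreads pages carry an extra <DiscussionThreading> element. As a possible stopgap, here is a minimal sketch that reads the same three fields with only the standard library's xml.etree.ElementTree and skips unknown tags. The names local_name and iter_page_info and the sample XML are made up for illustration; none of this is part of mw.xml_dump, and the contents of a real <DiscussionThreading> element are not reproduced here.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_page_info(source):
    """Yield (id, namespace, title) per <page>, ignoring unknown child tags."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        # Only direct children are inspected, so a revision's <id> is
        # never mistaken for the page's <id>.
        for child in elem:
            name = local_name(child.tag)
            if name in ("id", "ns", "title"):
                fields[name] = child.text
        yield int(fields["id"]), int(fields["ns"]), fields["title"]
        elem.clear()  # drop the processed page's children to save memory


# Hypothetical sample; real dumps also declare an XML namespace.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example thread</title>
    <ns>0</ns>
    <id>1</id>
    <DiscussionThreading>placeholder</DiscussionThreading>
    <revision><id>99</id></revision>
  </page>
</mediawiki>"""

pages = list(iter_page_info(io.BytesIO(SAMPLE)))
for page_id, page_namespace, page_title in pages:
    print("\t".join([str(page_id), str(page_namespace), page_title]))
```

This only works around the crash for simple per-page metadata. It does not replace the revision handling or multiprocessing that xml_dump.map provides.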

When I tried again, I got a different error:

35576   0   Geometria descritiva
Traceback (most recent call last):
  File "lqt.py", line 10, in <module>
    for page_id, page_namespace, page_title in xml_dump.map(["ptwikibooks-20140905-pages-meta-current.xml"], page_info):
  File "<xml_dump>/map.py", line 100, in map
    re_raise(error, path)
NameError: name 'path' is not defined
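
This second traceback looks like a separate bug: an error-handling branch in map.py seems to refer to a name that is not defined in that scope, so the NameError hides the original MalformedXML. The sketch below reproduces that general pattern. The names report, run_buggy, run_fixed and outcome are hypothetical and this is not the library's code.

```python
def report(error, path):
    """Hypothetical helper: re-raise with the dump path attached."""
    message = "Failed while processing {0!r}: {1}".format(path, error)
    raise RuntimeError(message) from error


def run_buggy(paths):
    for p in paths:
        try:
            raise ValueError("unexpected tag")
        except ValueError as error:
            # 'path' is not defined in this scope (the loop variable is 'p'),
            # so a NameError is raised and masks the original error.
            report(error, path)


def run_fixed(paths):
    for path in paths:
        try:
            raise ValueError("unexpected tag")
        except ValueError as error:
            report(error, path)


def outcome(func):
    """Return the name of the exception type that func raises, if any."""
    try:
        func(["dump.xml"])
    except Exception as exc:
        return type(exc).__name__
    return None


buggy_result = outcome(run_buggy)
fixed_result = outcome(run_fixed)
print(buggy_result, fixed_result)
```

A static checker such as pyflakes would normally flag the undefined name in the handler before it is ever reached at run time.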
