【Wikidata preprocess_dump error】AttributeError when closing writer during data preprocessing #32

@YYForReal

Description

I encountered an AttributeError during the data preprocessing step. The error occurs after the process has successfully written over 112 million lines. Here is the terminal output:

...
112000000 lines written in 3.91s. 
112200000 lines written in 4.43s. 
112400000 lines written in 4.69s. 
Done! Read 112473858 lines
Process Process-2:
Traceback (most recent call last):
  File "/home/szu/miniconda3/envs/tog/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/szu/miniconda3/envs/tog/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/szu/code/kg_project/ToG/Wikidata/simple_wikidata_db/preprocess_utils/writer_process.py", line 89, in write_data
    writer.close()
  File "/home/szu/code/kg_project/ToG/Wikidata/simple_wikidata_db/preprocess_utils/writer_process.py", line 79, in close
    v.close()
  File "/home/szu/code/kg_project/ToG/Wikidata/simple_wikidata_db/preprocess_utils/writer_process.py", line 51, in close
    self.cur_file_writer.close()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'close'
Finished processing 112473858 in 4191.5129499435425s
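
The traceback shows `close()` being called while `cur_file_writer` is `None`. One plausible cause, not confirmed against the repository, is that a writer that never opened a file (or was already closed) is closed again at shutdown. The sketch below reproduces that pattern in isolation and shows a guard that avoids the `AttributeError`. `LazyWriter` and `demo` are hypothetical names for illustration; only the attribute name `cur_file_writer` is taken from the traceback, and the real code in `writer_process.py` may be structured differently.

```python
import os
import tempfile


class LazyWriter:
    """Hypothetical stand-in for a writer that opens its file lazily."""

    def __init__(self, path):
        self.path = path
        self.cur_file_writer = None  # no file handle until the first write

    def write(self, line):
        if self.cur_file_writer is None:
            self.cur_file_writer = open(self.path, "w", encoding="utf-8")
        self.cur_file_writer.write(line + "\n")

    def close(self):
        # Guard: skip the close when no file was ever opened,
        # or when close() has already been called.
        if self.cur_file_writer is not None:
            self.cur_file_writer.close()
            self.cur_file_writer = None


def demo():
    """Close one unused and one used writer; report which files exist."""
    with tempfile.TemporaryDirectory() as tmp:
        unused = LazyWriter(os.path.join(tmp, "unused.jsonl"))
        unused.close()  # would raise AttributeError without the guard

        used = LazyWriter(os.path.join(tmp, "used.jsonl"))
        used.write("{}")
        used.close()
        used.close()  # a second close is harmless

        return os.path.exists(unused.path), os.path.exists(used.path)
```

If this is indeed the cause, the error would occur only at shutdown, after the last line has been written. Whether the output files are complete would still need to be confirmed.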

Reproduction Steps:

  1. Run the data preprocessing script:
     python3 preprocess_dump.py --input_file ./latest-all.json.gz --out_dir ./wiki_process
  2. Observe the terminal output as the script processes the data.

Additional Questions:

  • I am not sure if this error affects subsequent steps. Can someone confirm?
  • The disk still has free space, but is 1024 GB of space really necessary? Can the process be tested with a subset of the dataset?
  • If a subset can be used for testing, could you please provide instructions on how to do so?
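
For the subset question, one possible approach (an assumption, not something the maintainers have confirmed) is to copy the first N lines of the dump into a smaller gzip file and run the script on that. This assumes the dump stores one entity per line, and that `preprocess_dump.py` reads the file line by line and tolerates a missing closing `]` at the end. `write_subset` and `demo_subset` are hypothetical helpers, not part of the repository.

```python
import gzip
import os
import tempfile


def write_subset(src_path, dst_path, max_lines):
    """Copy the first max_lines lines of a gzip text file to a new gzip file."""
    written = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if written >= max_lines:
                break
            dst.write(line)
            written += 1
    return written


def demo_subset():
    """Build a tiny fake dump, take a 3-line subset, and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "full.json.gz")
        dst = os.path.join(tmp, "subset.json.gz")
        with gzip.open(src, "wt", encoding="utf-8") as f:
            for i in range(10):
                f.write('{"id": "Q%d"},\n' % i)
        count = write_subset(src, dst, 3)
        with gzip.open(dst, "rt", encoding="utf-8") as f:
            lines = f.read().splitlines()
        return count, lines
```

A call such as `write_subset("./latest-all.json.gz", "./subset.json.gz", 100000)` would then produce a file that can be passed to the script via `--input_file ./subset.json.gz`, with a separate `--out_dir` so the results do not mix with a full run.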

Thank you in advance for your assistance and for any insights you may offer regarding these queries. Your help is greatly appreciated.
