Open
Description
I encountered an AttributeError during the data preprocessing step. The error is raised when the writer process closes its output files, after all 112,473,858 lines have been read. Here is the terminal output:
...
112000000 lines written in 3.91s. 
112200000 lines written in 4.43s. 
112400000 lines written in 4.69s. 
Done! Read 112473858 lines
Process Process-2:
Traceback (most recent call last):
  File "/home/szu/miniconda3/envs/tog/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/szu/miniconda3/envs/tog/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/szu/code/kg_project/ToG/Wikidata/simple_wikidata_db/preprocess_utils/writer_process.py", line 89, in write_data
    writer.close()
  File "/home/szu/code/kg_project/ToG/Wikidata/simple_wikidata_db/preprocess_utils/writer_process.py", line 79, in close
    v.close()
  File "/home/szu/code/kg_project/ToG/Wikidata/simple_wikidata_db/preprocess_utils/writer_process.py", line 51, in close
    self.cur_file_writer.close()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'close'
Finished processing 112473858 in 4191.5129499435425s
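The traceback shows `close()` being called on `self.cur_file_writer` while it is still `None`. A minimal sketch of a guard is below. It assumes the file handle is opened lazily and so stays `None` for a writer that never received a row; that is a guess from the traceback, not something confirmed against the repository. `TableWriter` and its `write` method are hypothetical stand-ins, and only the attribute name `cur_file_writer` comes from the traceback.

```python
class TableWriter:
    """Hypothetical stand-in for the per-table writer in writer_process.py."""

    def __init__(self):
        # Assumption: the handle is opened lazily on the first write,
        # so it is still None if this writer never received any row.
        self.cur_file_writer = None

    def write(self, path, line):
        # Open the output file the first time a row arrives.
        if self.cur_file_writer is None:
            self.cur_file_writer = open(path, "w", encoding="utf-8")
        self.cur_file_writer.write(line + "\n")

    def close(self):
        # Guard: only close a handle that was actually opened.
        if self.cur_file_writer is not None:
            self.cur_file_writer.close()
            self.cur_file_writer = None
```

With such a guard, closing a writer that never opened a file becomes a no-op instead of raising AttributeError. Whether the real code needs exactly this change depends on how `cur_file_writer` is managed there.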
Reproduction Steps:
- Run the data preprocessing script.
python3 preprocess_dump.py --input_file ./latest-all.json.gz --out_dir ./wiki_process 
- Observe the terminal output as the script processes the data.
Additional Questions:
- I am not sure if this error affects subsequent steps. Can someone confirm?
- The disk still has free space. Is 1024 GB of space actually required, or can the process be tested on a subset of the dataset?
- If a subset can be used for testing, could you please provide instructions on how to do so?
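One way a subset might be produced is to copy only the first lines of the compressed dump into a smaller file. The sketch below assumes the dump stores one entity per line and that `preprocess_dump.py` reads its input line by line (the "lines written" output suggests this, but it is not verified). The function name `write_subset` and the default `max_lines` value are hypothetical. Note that the truncated file will lack the closing `]` of the original JSON array, which may or may not matter to the script.

```python
import gzip
from itertools import islice


def write_subset(src_path, dst_path, max_lines=100_000):
    """Copy the first max_lines lines of a gzipped file into a new gzip file.

    Returns the number of lines actually copied.
    """
    copied = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        # islice stops after max_lines without decompressing the rest.
        for line in islice(src, max_lines):
            dst.write(line)
            copied += 1
    return copied
```

For example, after `write_subset("./latest-all.json.gz", "./subset.json.gz")`, the same command from the reproduction steps could be pointed at the smaller file with `--input_file ./subset.json.gz` and a separate `--out_dir`.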
Thank you in advance for your assistance and for any insights you may offer regarding these queries. Your help is greatly appreciated.