Description
Short description
The Higgs dataset cannot be used: download_and_prepare() fails with a ValueError, probably because the data contains unexpected missing values.
Environment information
- Operating System: Windows 11
- Python version: 3.11.1
- tensorflow-datasets/tfds-nightly version: tensorflow-datasets 4.9.4
- tensorflow/tf-nightly version: tensorflow 2.16.1
- Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yes.
Reproduction instructions
import tensorflow_datasets as tfds

ds_builder = tfds.builder('higgs')
ds_builder.download_and_prepare()
Logs
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\WSUIDGEE\tensorflow_datasets\higgs\2.0.0...
Extraction completed...: 0 file [00:00, ? file/s]████████████████████████████████████████| 1/1 [00:00<00:00, 157.03 url/s]
Dl Size...: 100%|█████████████████████████████████████████████| 2816407858/2816407858 [00:00<00:00, 300620199629.49 MiB/s]
Dl Completed...: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 96.44 url/s]
Generating splits...: 0%| | 0/1 [00:00<?, ? splits/s]
Traceback (most recent call last):
File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\main.py", line 105, in <module>
evaluate_configuration(
File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\main.py", line 87, in evaluate_configuration
ds = Dataset(dataset)
^^^^^^^^^^^^^^^^
File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\datasets.py", line 17, in __init__
trains_ds, vals_ds, test_ds = self.__load_dataset(dataset_name, k_folds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\datasets.py", line 46, in __load_dataset
ds_builder.download_and_prepare()
File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\logging\__init__.py", line 168, in __call__
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\dataset_builder.py", line 691, in download_and_prepare
self._download_and_prepare(
File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\dataset_builder.py", line 1584, in _download_and_prepare
future = split_builder.submit_split_generation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 341, in submit_split_generation
return self._build_from_generator(**build_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 417, in _build_from_generator
utils.reraise(e, prefix=f'Failed to encode example:\n{example}\n')
File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 415, in _build_from_generator
example = self._features.encode_example(example)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\features_dict.py", line 243, in encode_example
utils.reraise(
File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\features_dict.py", line 241, in encode_example
example[k] = feature.encode_example(example_value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\tensor_feature.py", line 175, in encode_example
example_data = np.array(example_data, dtype=np_dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Failed to encode example:
{'class_label': '1.000000000000000000e+00', 'lepton_pT': '3.647371232509613037e-01', 'lepton_eta': '1.489144206047058105e+00', 'lepton_phi': '3.394368290901184082e-01', 'missing_energy_magnitude': '1.493860602378845215e+00', 'missing_energy_phi': '-1.723330497741699219e+00', 'jet_1_pt': '7.524616718292236328e-01', 'jet_1_eta': '-2.802605032920837402e-01', 'jet_1_phi': '-4.207125604152679443e-01', 'jet_1_b-tag': '2.173076152801513672e+00', 'jet_2_pt': '', 'jet_2_eta': None, 'jet_2_phi': None, 'jet_2_b-tag': None, 'jet_3_pt': None, 'jet_3_eta': None, 'jet_3_phi': None, 'jet_3_b-tag': None, 'jet_4_pt': None, 'jet_4_eta': None, 'jet_4_phi': None, 'jet_4_b-tag': None, 'm_jj': None, 'm_jjj': None, 'm_lv': None, 'm_jlv': None, 'm_bb': None, 'm_wbb': None, 'm_wwbb': None}
In <Tensor> with name "jet_2_pt":
could not convert string to float: ''
Expected behavior
I expect the dataset to be downloaded and prepared so that I can quickly load it in the future.
Additional context
I am new to using tfds, but other datasets (e.g. MNIST, CIFAR10) work as intended.
The dataset is not supposed to have missing values, according to https://archive.ics.uci.edu/dataset/280/higgs
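Since the failing example above has an empty 'jet_2_pt' followed only by None values, one possibility is that a row in the downloaded CSV is cut short (for example by an incomplete download) rather than the dataset itself having gaps. A rough sketch for checking this is below. It is only an illustration: find_incomplete_rows is a hypothetical helper, not part of tfds, and the column count of 29 is taken from the keys in the failing example.

```python
import csv
import io

# Assumption: 29 columns (class_label plus 28 features), as in the failing example.
N_COLUMNS = 29


def find_incomplete_rows(lines, n_columns=N_COLUMNS):
    """Yield (row_number, reason) for rows that are short or contain empty fields."""
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != n_columns:
            yield row_number, f"expected {n_columns} fields, got {len(row)}"
        elif any(field.strip() == "" for field in row):
            yield row_number, "empty field"


# Tiny in-memory demo with 3 columns instead of 29.
sample = io.StringIO("1.0,0.5,0.2\n0.0,0.7,\n1.0,0.3\n")
problems = list(find_incomplete_rows(sample, n_columns=3))
print(problems)
```

To run this on the real data, the same function could be given an open text-mode file handle for the downloaded CSV (via gzip.open(path, "rt", newline="") if the file is gzip-compressed). The location of that file depends on the local tfds download directory, so the path is left to the reader.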