Higgs Dataset - ValueError on download_and_prepare() #5428

Open
@zwouter

Description

Short description
The Higgs dataset cannot be prepared: download_and_prepare() fails with a ValueError, most likely because the source CSV contains unexpected empty (missing) values.

Environment information

  • Operating System: Windows 11

  • Python version: 3.11.1

  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets 4.9.4

  • tensorflow/tf-nightly version: tensorflow 2.16.1

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yes.

Reproduction instructions

import tensorflow_datasets as tfds

ds_builder = tfds.builder('higgs')
ds_builder.download_and_prepare()
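
The failure appears to come from the final np.array(...) cast inside tfds' Tensor feature (see the traceback below): an empty CSV field arrives as '' and cannot be converted to a float. A minimal standalone sketch of that underlying error, independent of tfds (the float dtype is my assumption; any float dtype gives the same message):

import numpy as np

# Mimics the cast in tensor_feature.py: np.array(example_data, dtype=np_dtype).
# An empty string cannot be converted to a float.
np.array('', dtype=np.float64)
# -> ValueError: could not convert string to float: ''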

Logs

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\WSUIDGEE\tensorflow_datasets\higgs\2.0.0...
Extraction completed...: 0 file [00:00, ? file/s]████████████████████████████████████████| 1/1 [00:00<00:00, 157.03 url/s] 
Dl Size...: 100%|█████████████████████████████████████████████| 2816407858/2816407858 [00:00<00:00, 300620199629.49 MiB/s] 
Dl Completed...: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 96.44 url/s] 
Generating splits...:   0%|                                                                    | 0/1 [00:00<?, ? splits/s] 
Traceback (most recent call last):
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\main.py", line 105, in <module>
    evaluate_configuration(
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\main.py", line 87, in evaluate_configuration
    ds = Dataset(dataset)
         ^^^^^^^^^^^^^^^^
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\datasets.py", line 17, in __init__
    trains_ds, vals_ds, test_ds = self.__load_dataset(dataset_name, k_folds)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\WSUIDGEE\Documents\FP\AutoSparse\datasets.py", line 46, in __load_dataset
    ds_builder.download_and_prepare()
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\logging\__init__.py", line 168, in __call__
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\dataset_builder.py", line 691, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\dataset_builder.py", line 1584, in _download_and_prepare
    future = split_builder.submit_split_generation(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 341, in submit_split_generation
    return self._build_from_generator(**build_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 417, in _build_from_generator
    utils.reraise(e, prefix=f'Failed to encode example:\n{example}\n')
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\split_builder.py", line 415, in _build_from_generator
    example = self._features.encode_example(example)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\features_dict.py", line 243, in encode_example
    utils.reraise(
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\features_dict.py", line 241, in encode_example
    example[k] = feature.encode_example(example_value)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\WSUIDGEE\Documents\FP\AutoSparse\venv\Lib\site-packages\tensorflow_datasets\core\features\tensor_feature.py", line 175, in encode_example
    example_data = np.array(example_data, dtype=np_dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Failed to encode example:
{'class_label': '1.000000000000000000e+00', 'lepton_pT': '3.647371232509613037e-01', 'lepton_eta': '1.489144206047058105e+00', 'lepton_phi': '3.394368290901184082e-01', 'missing_energy_magnitude': '1.493860602378845215e+00', 'missing_energy_phi': '-1.723330497741699219e+00', 'jet_1_pt': '7.524616718292236328e-01', 'jet_1_eta': '-2.802605032920837402e-01', 'jet_1_phi': '-4.207125604152679443e-01', 'jet_1_b-tag': '2.173076152801513672e+00', 'jet_2_pt': '', 'jet_2_eta': None, 'jet_2_phi': None, 'jet_2_b-tag': None, 'jet_3_pt': None, 'jet_3_eta': None, 'jet_3_phi': None, 'jet_3_b-tag': None, 'jet_4_pt': None, 'jet_4_eta': None, 'jet_4_phi': None, 'jet_4_b-tag': None, 'm_jj': None, 'm_jjj': None, 'm_lv': None, 'm_jlv': None, 'm_bb': None, 'm_wbb': None, 'm_wwbb': None}
In <Tensor> with name "jet_2_pt":
could not convert string to float: ''

Expected behavior
I expect the dataset to be downloaded and prepared so that I can load it quickly afterwards.

Additional context
I am new to tfds, but other datasets (e.g. MNIST, CIFAR-10) work as intended.
According to https://archive.ics.uci.edu/dataset/280/higgs, the dataset is not supposed to contain any missing values.
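
In case it helps triage, here is a rough sketch (standard library only) to check whether a manually downloaded copy of HIGGS.csv.gz from the UCI page above really contains rows with empty fields. The file path is a placeholder and should point at wherever the archive was saved.

import csv
import gzip

# Placeholder path: point this at a manually downloaded HIGGS.csv.gz
# from https://archive.ics.uci.edu/dataset/280/higgs
CSV_GZ = r"C:\path\to\HIGGS.csv.gz"

bad_rows = 0
with gzip.open(CSV_GZ, "rt", newline="") as f:
    for i, row in enumerate(csv.reader(f)):
        if any(field.strip() == "" for field in row):
            bad_rows += 1
            if bad_rows <= 5:  # print only the first few offending rows
                print(f"row {i}: {row}")
print(f"{bad_rows} rows with at least one empty field")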
