Feature details
I have access to an HPC system and wanted to run some simulations using PennyLane datasets. This turned out to be harder than expected: these infrastructures are often not allowed to open connections to the outside internet, so the dataset instances cannot be retrieved at runtime.
The download itself is easily handled by calling qp.data.load on a login node, so that the datasets are stored locally. The problem remains, however, because qp.data.load itself makes multiple calls to S3 to check and parse the parameters and to retrieve the dataset IDs. Even if the data are stored locally, you cannot use them without an internet connection.
This can be easily checked with a simple command like:
python -c "import pennylane as qp; qp.data.load('qchem', molname='H2O', bondlength=0.958)"
If you run this without an internet connection, the code crashes (see the attached log) even if the data are already stored locally.
This is quite inconvenient for researchers. It was also raised in #4376 years ago, but apparently it has not been solved yet.
error_network.log
Implementation
I'd be happy to work on this, but first I wanted to ask for your feedback on my strategy.
My idea is to cache some metadata locally, alongside the dataset, when it is downloaded. Most of the changes would stay inside pennylane/data/data_manager/__init__.py, and the result should be largely compatible with earlier APIs.
Each type of dataset could cache its metadata in folder_path ("./datasets/" by default) in an HDF5 file (JSON files might be easier to deal with, but I guess you prefer HDF5). So there would be qchem.h5, qspin.h5, and so on. The absence of these files would not affect the standard behavior; they would be generated on the first dataset download and updated afterwards.
These metadata files would store:
- the possible attributes for that dataset, so that _validate_attributes(data_name, attributes) can be executed locally;
- the bindings params -> dataset_id. These are currently provided at runtime by get_dataset_urls(data_name, params), which is exactly the point where the exception is raised right now. Note that the params must be properly sorted and checked, so that different inputs to qp.data.load are serialized to the same "key" and retrieve the correct ID;
- the bindings dataset_id -> local_path.
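To make the "same key" requirement concrete, here is a minimal sketch of how the params could be serialized. The helper name canonical_key, the layout of the metadata dictionary, and the IDs and paths in it are all placeholders of mine, not existing PennyLane code. It also assumes that comparing values by their string form is good enough, which would not treat 1 and 1.0 as equal.

```python
import json


def canonical_key(params: dict) -> str:
    """Serialize query params to a string that ignores argument order."""
    normalized = {}
    for name, value in params.items():
        # Treat a scalar and a one-element list as the same query.
        values = value if isinstance(value, (list, tuple)) else [value]
        # Compare values by their string form (assumption: "1" != "1.0").
        normalized[name] = sorted(str(v) for v in values)
    # sort_keys makes the result independent of keyword order.
    return json.dumps(normalized, sort_keys=True)


key = canonical_key({"molname": "H2O", "bondlength": 0.958})

# Hypothetical content of one metadata file (e.g. the one for qchem).
metadata = {
    "attributes": ["molecule", "hamiltonian"],  # placeholder values
    "params_to_id": {key: ["example-dataset-id"]},
    "id_to_path": {"example-dataset-id": "datasets/qchem/example.h5"},
}
```

With this, load('qchem', molname='H2O', bondlength=0.958) and load('qchem', bondlength=[0.958], molname='H2O') would map to the same entry in params_to_id.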
With these changes, everything needed to retrieve the data locally would be available without querying the server. I guess a flag to choose between online and offline mode could be added to qp.data.load, to make the two possibilities explicit to the user. In the online case, the local data are updated.
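A rough sketch of how the online/offline switch could behave is below. Everything here is hypothetical: resolve_dataset_ids, the offline argument, and fetch_remote (a stand-in for whatever currently queries the server) are names I made up, and I use a JSON file only to keep the example dependency-free.

```python
import json
from pathlib import Path


def resolve_dataset_ids(metadata_file, key, offline, fetch_remote=None):
    """Return the dataset IDs for a serialized params key.

    offline=True  -> answer from the local metadata file only.
    offline=False -> ask the server via fetch_remote and refresh the file.
    """
    path = Path(metadata_file)
    cache = json.loads(path.read_text()) if path.exists() else {}
    bindings = cache.setdefault("params_to_id", {})

    if offline:
        if key not in bindings:
            raise FileNotFoundError(
                f"No local metadata for {key!r}; download it once in online mode."
            )
        return bindings[key]

    # Online case: query the server and update the local metadata.
    ids = fetch_remote(key)
    bindings[key] = ids
    path.write_text(json.dumps(cache, indent=2))
    return ids
```

If the metadata file is missing, the online path behaves as it does today and simply creates the file, so the standard behavior is not affected.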
The only potentially problematic part is the mapping params -> dataset_id, which is currently performed by the server. This is easy to handle when downloading a single dataset at a time (say, one specific molecule with one specific configuration), since we could store the query params. Things change when we download multiple datasets at once (say, all the configurations of a molecule using bondlength='full'): doing so does not return the parameter settings that lead to each specific dataset instance. To solve this:
- the server response to get_dataset_urls(data_name, params) would need to be modified so that dataset_ids_and_urls contains, in addition to the IDs and URLs, the parameter configuration of each instance
OR
- probably more interesting: I came across this map of the server, which could be downloaded and parsed locally. My only doubt is about how to parse it: is it correct to assume that in
    "__params": {
        "qchem": [
            "molname",
            "basis",
            "bondlength"
        ],
        "qspin": [
            "sysname",
            "periodicity",
            "lattice",
            "layout"
        ],
        "other": [
            "name"
        ]
    }
the order of the parameters listed for a given dataset is the same as the order of the keys used to access the concrete data?
In the case of qchem, the question is: is it correct to assume my data are in qchem[molname][basis][bondlength] rather than, say, qchem[basis][bondlength][molname]?
If this is guaranteed, then there is no need to modify the server, and the client can safely store everything it needs.
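If the ordering assumption holds, recovering the parameters of every instance is a plain tree walk. The sketch below is only an illustration: flatten_foldermap is a made-up name, and the example map (including its leaves "id-a" and "id-b") is invented, since I don't know what the real leaves look like.

```python
def flatten_foldermap(foldermap: dict, data_name: str) -> list:
    """List (params, leaf) pairs, assuming nesting follows __params order."""
    param_names = foldermap["__params"][data_name]
    entries = []

    def walk(node, values):
        # One nesting level per parameter: stop once all are assigned.
        if len(values) == len(param_names):
            entries.append((dict(zip(param_names, values)), node))
            return
        for key, child in node.items():
            walk(child, values + [key])

    walk(foldermap[data_name], [])
    return entries


# Invented example following the assumed qchem[molname][basis][bondlength] layout.
example = {
    "__params": {"qchem": ["molname", "basis", "bondlength"]},
    "qchem": {"H2O": {"STO-3G": {"0.958": "id-a", "1.1": "id-b"}}},
}
for params, leaf in flatten_foldermap(example, "qchem"):
    print(params, leaf)
```

Each pair gives exactly the params -> dataset binding that is missing when downloading with bondlength='full'.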
How important would you say this feature is?
2: Somewhat important. Needed this quarter.
Additional information
I'd also add a read-only vs. write mode to qp.data.load to solve the problem highlighted in #7235 (defaulting to read-only?). Left as it is, this is also a source of errors in an HPC environment, since multiple SLURM jobs can access the same dataset at the same time. This is actually less uncommon than it seems.
Let me know if you think this strategy can work, or if you have other suggestions.