Allow `qp.data.load` to work fully offline

### Feature details

I have access to an HPC system and wanted to run some simulations using PennyLane datasets. This turned out to be harder than expected: compute nodes on such infrastructure are often not allowed to open connections to the outside internet, so the dataset instances cannot be retrieved at runtime.

The download itself is easy to work around: run `qp.data.load` on a login node to store the datasets locally. The problem remains, however, because `qp.data.load` makes multiple calls to S3 to check and parse the parameters and to retrieve the dataset IDs. Even when the data is stored locally, it cannot be used without an internet connection.
This can be checked with a simple command such as:
`python -c "import pennylane as qp; qp.data.load('qchem', molname='H2O', bondlength=0.958)"`
If you run this without an internet connection, the code crashes (see the attached log), even if the data is already stored locally.
This is quite inconvenient for researchers. It was also raised in #4376 years ago, but apparently it has not been solved yet.



[error_network.log](https://github.com/user-attachments/files/29096386/error_network.log)

### Implementation

I'd be happy to work on this, but first I'd like your feedback on my strategy.

My idea is to cache some metadata locally, alongside the dataset, when it is downloaded. Most of the changes would stay inside `pennylane/data/data_manager/__init__.py`, and the result should be largely compatible with the existing API.
Each dataset type could cache this metadata in `folder_path` (`"./datasets/"` by default) in an HDF5 file (JSON files might be easier to deal with, but I guess you prefer HDF5). So there would be `qchem.h5`, `qspin.h5`, and so on. The absence of these files would not affect the standard behavior: they would be generated on the first dataset download and updated on later ones.

Inside these metadata files, we would store:
+ the possible _attributes_ for that dataset, so that `_validate_attributes(data_name, attributes)` can be executed locally;
+ the bindings _params -> dataset_id_. These are currently provided at runtime by `get_dataset_urls(data_name, params)`, which is exactly where the exception is raised right now. Note that the params must be properly sorted and checked, so that different inputs to `qp.data.load` are serialized to the same "key" and retrieve the correct ID;
+ the bindings _dataset_id -> local_path_.
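To make the three pieces of metadata concrete, here is a minimal sketch of such a cache. It is not PennyLane code: `params_key` and `LocalIndex` are hypothetical names, the layout of the stored dictionary is an assumption, and JSON is used instead of HDF5 only to keep the example dependency-free.

```python
import json
import tempfile
from pathlib import Path


def params_key(params):
    """Serialize query parameters into a canonical, order-independent key."""
    normalized = {}
    for name, value in params.items():
        # Accept both a single value and a list of values for each parameter.
        values = value if isinstance(value, (list, tuple)) else [value]
        # Compare values as sorted strings so 0.958 and "0.958" give the same key.
        normalized[name] = sorted(str(v) for v in values)
    return json.dumps(normalized, sort_keys=True)


class LocalIndex:
    """Hypothetical per-dataset-type metadata cache stored next to the data."""

    def __init__(self, folder_path, data_name):
        self.path = Path(folder_path) / f"{data_name}.json"
        if self.path.exists():
            self.data = json.loads(self.path.read_text())
        else:
            self.data = {"attributes": [], "params_to_ids": {}, "id_to_path": {}}

    def record(self, params, dataset_id, local_path, attributes=()):
        """Remember one downloaded dataset and persist the index to disk."""
        ids = self.data["params_to_ids"].setdefault(params_key(params), [])
        if dataset_id not in ids:
            ids.append(dataset_id)
        self.data["id_to_path"][dataset_id] = str(local_path)
        known = set(self.data["attributes"]) | set(attributes)
        self.data["attributes"] = sorted(known)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.data, indent=2))

    def lookup(self, params):
        """Return the local paths recorded for a query, without any network call."""
        ids = self.data["params_to_ids"].get(params_key(params), [])
        return [Path(self.data["id_to_path"][i]) for i in ids]


# Usage: record a download once, then resolve the same query from a fresh index.
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "H2O.h5"  # placeholder path, no real file needed
    index = LocalIndex(folder, "qchem")
    index.record({"molname": "H2O", "bondlength": 0.958}, "example-id", target)
    reloaded = LocalIndex(folder, "qchem")
    found = reloaded.lookup({"bondlength": "0.958", "molname": ["H2O"]})
```

The key point is the normalization in `params_key`: differently ordered or differently typed inputs collapse to the same string, so the lookup does not depend on how the user spelled the query.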

With these changes, everything needed to retrieve the data locally should be available without querying the server. A flag to choose between online and offline mode could be added to `qp.data.load`, which would also make the available options clearer to users. In online mode, the local metadata would be updated.
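The online/offline switch could behave roughly as in the sketch below. The function name, the `offline` argument, and the `fetch_remote` callable (a stand-in for the server lookup) are all hypothetical and not part of the current `qp.data.load` API.

```python
def resolve_dataset_ids(key, cache, fetch_remote=None, offline=False):
    """Return dataset IDs for a serialized query key.

    cache: dict mapping a serialized parameter key to a list of dataset IDs.
    fetch_remote: callable standing in for the server lookup (hypothetical).
    """
    if offline:
        # Offline mode: never touch the network, fail with a clear message.
        if key not in cache:
            raise FileNotFoundError(
                f"No locally cached dataset matches {key!r}; "
                "run the same query once with network access."
            )
        return cache[key]
    # Online mode: ask the server and refresh the local metadata.
    ids = list(fetch_remote(key))
    cache[key] = ids
    return ids


# Usage: the first call is online and fills the cache, the second one is offline.
cache = {}
online_ids = resolve_dataset_ids("H2O", cache, fetch_remote=lambda key: ["example-id"])
offline_ids = resolve_dataset_ids("H2O", cache, offline=True)
```

Raising an explicit error in offline mode, instead of silently falling back to the network, keeps the failure easy to diagnose on a compute node.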

The only potentially problematic part is the mapping _params -> dataset_id_, which is currently performed by the server. This is easy to handle when downloading a single dataset at a time (say, one specific molecule with one specific configuration), since we can store the query params. Things change when downloading multiple datasets at once (say, all the configurations of a molecule using `bondlength='full'`): the response does not tell us which parameter settings lead to each specific dataset instance. To solve this:
- the server response to `get_dataset_urls(data_name, params)` would need to change so that `dataset_ids_and_urls` contains, besides the IDs and URLs, the parameter configuration of each instance

OR
- probably more interesting: I came across [this](https://datasets.cloud.pennylane.ai/datasets/h5/foldermap.json) map of the server. It could be downloaded and parsed locally. My only doubt is about how to parse it: is it correct to assume that in
```
"__params": {
    "qchem": [
      "molname",
      "basis",
      "bondlength"
    ],
    "qspin": [
      "sysname",
      "periodicity",
      "lattice",
      "layout"
    ],
    "other": [
      "name"
    ]
  }
```
the order of the parameters listed for a dataset type is the same as the order of the keys used to access the concrete data?
For qchem, the question is: is it correct to assume that my data is at `qchem[molname][basis][bondlength]` rather than, say, `qchem[basis][bondlength][molname]`?
If this is guaranteed, then there is no need to modify the server, and the client can safely store everything it needs.
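If the ordering assumption holds, the client-side parsing could look like the sketch below. `flatten_foldermap` is a hypothetical helper, and the sample map is made up for illustration: the leaf format and the paths in it are assumptions, and any extra bookkeeping keys the real `foldermap.json` may contain are not handled.

```python
def flatten_foldermap(foldermap, data_name):
    """Yield (params, leaf) pairs, assuming nesting follows the `__params` order."""
    param_names = foldermap["__params"][data_name]

    def walk(node, depth, values):
        if depth == len(param_names):
            # One nesting level has been consumed per parameter: this is a leaf.
            yield dict(zip(param_names, values)), node
            return
        for value, child in node.items():
            yield from walk(child, depth + 1, values + [value])

    yield from walk(foldermap[data_name], 0, [])


# Made-up sample with the same shape as the `__params` fragment above.
sample = {
    "__params": {"qchem": ["molname", "basis", "bondlength"]},
    "qchem": {"H2O": {"STO-3G": {"0.958": "qchem/H2O/STO-3G/0.958.h5"}}},
}
entries = list(flatten_foldermap(sample, "qchem"))
```

Each entry pairs a full parameter configuration with its leaf, which is exactly the per-instance information missing when several datasets are downloaded at once.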

### How important would you say this feature is?

2: Somewhat important. Needed this quarter.

### Additional information

I'd also add a _read-only_ vs. _write_ mode to `qp.data.load` to solve the problem highlighted in #7235 (defaulting to read-only?). As it stands, this is also a source of errors in an HPC environment, since multiple SLURM jobs can access the same dataset at the same time. This is more common than it might seem.

Let me know if you think this strategy can work, or if you have other suggestions.
