
Optimize storage of hard links that point to the same data #111

@rly

I am documenting this issue, but it is low priority; I think it affects only a few dandisets, and only through an uncommon use case.

Hard links are rarely used in NWB, but they do appear. For example, in https://neurosift.app/nwb?url=https://api.dandiarchive.org/api/assets/91b8937b-c303-484a-962e-d0ac1e4ae630/download/&dandisetId=000350&dandisetVersion=draft

/acquisition/GliaOnePhotonSeries and
/processing/ophys/VolumeSegmentation/GliaVolumeSegmentation/reference_images/GliaOnePhotonSeries
are hard links to the same (very large) group at the same byte location in the HDF5 file. The chunk references for each of the contained datasets (data, timestamps) are identical, and the attributes are identical.
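To make the situation concrete, here is a self-contained sketch (with made-up group names, not the actual dandiset layout) that creates a hard link in a scratch HDF5 file and confirms that both paths resolve to the same on-disk object address:

```python
import os
import tempfile

import h5py
from h5py import h5o
import numpy as np

# Build a tiny HDF5 file with two hard links to one group
# (a simplified stand-in for the dandiset layout described above)
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("/acquisition/Series")
    grp.create_dataset("data", data=np.arange(10))
    # Assigning an existing object to a new path creates a hard link
    f["/reference_images/Series"] = grp

with h5py.File(path, "r") as f:
    addr1 = h5o.get_info(f["/acquisition/Series"].id).addr
    addr2 = h5o.get_info(f["/reference_images/Series"].id).addr

print(addr1 == addr2)  # True: one object, two paths
```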

If the LINDI JSON file is generated with all chunk references:

import pynwb
import lindi

h5_url = "https://api.dandiarchive.org/api/assets/91b8937b-c303-484a-962e-d0ac1e4ae630/download/"

# Generate new LINDI file with all chunk references (will take ~60 min and result in 267 MB file)
lindi_opts = lindi.LindiH5ZarrStoreOpts(num_dataset_chunks_threshold=None)
f = lindi.LindiH5pyFile.from_hdf5_file(h5_url, zarr_store_opts=lindi_opts)
f.write_lindi_file('/Users/rly/Downloads/000350_91b8937_full.nwb.lindi.json')
f.close()

then LINDI is unaware that these objects are the same and writes all of the keys and chunk references for both paths. While this is a faithful representation of the HDF5 file, it would be faster and more space-efficient to encode the duplication in a way similar to soft links. It might also be better for caching if both data objects are accessed, since they currently have different keys (internal paths). I haven't tested this.

We can detect duplicate hard links with code like the following:

import h5py
import h5py.h5o as h5o
from collections import defaultdict

def find_hard_links(filename: str) -> dict[int, list[str]]:
    """Return {object address: [paths]} for objects reachable via more than one hard link."""
    hard_links = defaultdict(list)

    with h5py.File(filename, 'r') as f:

        def visitor(name):
            # If the link is a hard link, record the address of the object it points to
            link = f.get(name, getlink=True)
            if isinstance(link, h5py.HardLink):
                info = h5o.get_info(f[name].id)
                hard_links[info.addr].append(name)

        # Group.visit_links requires h5py >= 3.11; unlike visit(), it also
        # calls the visitor on soft and external links, which are filtered out above
        f.visit_links(visitor)

    # Keep only addresses that are reachable through multiple paths
    duplicates = {addr: paths for addr, paths in hard_links.items()
                  if len(paths) > 1}

    return duplicates

# local_path is the path to a local copy of the HDF5 file
links = find_hard_links(local_path)
for addr, paths in links.items():
    print(f"Address {addr}: {paths}")
