
Optimize storage of hard links that point to the same data #111

@rly

I am documenting this issue, but it is low priority; I think it affects only a few dandisets, and only through an uncommon use case.

Hard links are rarely used in NWB, but they do appear. For example, in https://neurosift.app/nwb?url=https://api.dandiarchive.org/api/assets/91b8937b-c303-484a-962e-d0ac1e4ae630/download/&dandisetId=000350&dandisetVersion=draft

/acquisition/GliaOnePhotonSeries and
/processing/ophys/VolumeSegmentation/GliaVolumeSegmentation/reference_images/GliaOnePhotonSeries
are hard links to the same (very large) group at the same byte location in the HDF5 file. The chunk references for each of the contained datasets (data, timestamps) are identical, and the attributes are identical.
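To make the situation concrete, here is a self-contained sketch (with made-up group names, not the actual dandiset layout) that creates a hard link in a scratch HDF5 file and confirms that both paths resolve to the same on-disk object address:

```python
import os
import tempfile

import h5py
from h5py import h5o
import numpy as np

# Build a tiny HDF5 file with two hard links to one group
# (a simplified stand-in for the dandiset layout described above)
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("/acquisition/Series")
    grp.create_dataset("data", data=np.arange(10))
    # Assigning an existing object to a new path creates a hard link
    f["/reference_images/Series"] = grp

with h5py.File(path, "r") as f:
    addr1 = h5o.get_info(f["/acquisition/Series"].id).addr
    addr2 = h5o.get_info(f["/reference_images/Series"].id).addr

print(addr1 == addr2)  # True: one object, two paths
```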

If the LINDI JSON file is generated with all chunk references:

import pynwb
import lindi

h5_url = "https://api.dandiarchive.org/api/assets/91b8937b-c303-484a-962e-d0ac1e4ae630/download/"

# Generate new LINDI file with all chunk references (will take ~60 min and result in 267 MB file)
lindi_opts = lindi.LindiH5ZarrStoreOpts(num_dataset_chunks_threshold=None)
f = lindi.LindiH5pyFile.from_hdf5_file(h5_url, zarr_store_opts=lindi_opts)
f.write_lindi_file('/Users/rly/Downloads/000350_91b8937_full.nwb.lindi.json')
f.close()

then LINDI is unaware that these objects are the same and writes all of the keys and chunk references for both paths. While this is a faithful representation of the HDF5 file, it would be faster and more space-efficient to encode the duplication in a way similar to soft links. It might also be better for caching if both data objects are accessed, since they currently have different keys (internal paths). I haven't tested this.

We can detect duplicate hard links with code like the following:

import h5py
import h5py.h5o as h5o
from collections import defaultdict

def find_hard_links(filename: str) -> dict[int, list[str]]:
    """Return {object address: [paths]} for objects reachable via more than one hard link."""
    hard_links = defaultdict(list)

    with h5py.File(filename, 'r') as f:

        def visitor(name):
            # If the link is a hard link, record the address of the object it points to
            link = f.get(name, getlink=True)
            if isinstance(link, h5py.HardLink):
                info = h5o.get_info(f[name].id)
                hard_links[info.addr].append(name)

        # Group.visit_links requires h5py >= 3.11; unlike visit(), it also
        # calls the visitor on soft and external links, which are filtered out above
        f.visit_links(visitor)

    # Keep only addresses that are reachable through multiple paths
    duplicates = {addr: paths for addr, paths in hard_links.items()
                  if len(paths) > 1}

    return duplicates

# local_path is the path to a local copy of the HDF5 file
links = find_hard_links(local_path)
for addr, paths in links.items():
    print(f"Address {addr}: {paths}")
