Description
I am documenting this issue, but it is low priority. I think it impacts only a few dandisets, and only through an uncommon use case.
Hard links are rarely used in NWB, but they do appear. For example, in https://neurosift.app/nwb?url=https://api.dandiarchive.org/api/assets/91b8937b-c303-484a-962e-d0ac1e4ae630/download/&dandisetId=000350&dandisetVersion=draft, the groups
`/acquisition/GliaOnePhotonSeries` and
`/processing/ophys/VolumeSegmentation/GliaVolumeSegmentation/reference_images/GliaOnePhotonSeries`
are hard links to the same (very large) group at the same byte address in the HDF5 file. The chunk references for each of the contained datasets (`data`, `timestamps`) are identical, and the attributes are identical.
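To see how this situation arises, here is a minimal sketch (the file name and group names are hypothetical, not the dandiset above) that creates a hard link with h5py and confirms that both paths resolve to the same object address:

```python
import h5py
import h5py.h5o as h5o

# Hypothetical demo file, not the dandiset above
with h5py.File("demo_hardlink.h5", "w") as f:
    grp = f.create_group("/acquisition/series")
    grp.create_dataset("data", data=[1, 2, 3])
    # A hard link: a second name for the exact same group object
    proc = f.create_group("/processing/reference_images")
    proc["series"] = grp

with h5py.File("demo_hardlink.h5", "r") as f:
    a = h5o.get_info(f["/acquisition/series"].id).addr
    b = h5o.get_info(f["/processing/reference_images/series"].id).addr
    print(a == b)  # both names point at the same byte address
```

Both names are equally "real" in HDF5; neither is marked as an alias of the other, which is why a consumer walking the hierarchy sees the group twice.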
If the LINDI JSON file is generated with all chunk references:

```python
import pynwb
import lindi

h5_url = "https://api.dandiarchive.org/api/assets/91b8937b-c303-484a-962e-d0ac1e4ae630/download/"

# Generate a new LINDI file with all chunk references
# (takes ~60 min and results in a 267 MB file)
lindi_opts = lindi.LindiH5ZarrStoreOpts(num_dataset_chunks_threshold=None)
f = lindi.LindiH5pyFile.from_hdf5_file(h5_url, zarr_store_opts=lindi_opts)
f.write_lindi_file('/Users/rly/Downloads/000350_91b8937_full.nwb.lindi.json')
f.close()
```

then LINDI is unaware that these objects are the same and writes out all of the keys and chunk references for both paths. While this is a faithful representation of the HDF5 file, it would be faster and more space-efficient to encode the duplication in a way similar to soft links. It might also be better for caching if both data objects are accessed, since they have different keys (internal paths). I haven't tested this.
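For illustration, here is a sketch of what collapsing the duplication could look like. The refs mapping below is synthetic and simplified (kerchunk-style keys of the form `<internal path>/<chunk key>`), and `deduplicate` is my own helper, not a LINDI API:

```python
# Synthetic, simplified refs mapping: key = "<path>/<chunk key>",
# value = [url, offset, size]
refs = {
    "acquisition/GliaOnePhotonSeries/data/0.0": ["u", 0, 100],
    "acquisition/GliaOnePhotonSeries/data/1.0": ["u", 100, 100],
    "processing/ophys/reference_images/GliaOnePhotonSeries/data/0.0": ["u", 0, 100],
    "processing/ophys/reference_images/GliaOnePhotonSeries/data/1.0": ["u", 100, 100],
}

def deduplicate(refs, canonical, duplicate):
    """Drop entries under `duplicate` whose values match the
    corresponding entries under `canonical`."""
    out = {}
    for key, value in refs.items():
        if key.startswith(duplicate + "/"):
            twin = canonical + "/" + key[len(duplicate) + 1:]
            if refs.get(twin) == value:
                # Redundant copy; a single soft-link-style entry
                # pointing at `canonical` could stand in for all of these
                continue
        out[key] = value
    return out

slim = deduplicate(
    refs,
    canonical="acquisition/GliaOnePhotonSeries",
    duplicate="processing/ophys/reference_images/GliaOnePhotonSeries",
)
print(len(refs), "->", len(slim))  # 4 -> 2
```

For the dandiset above, where the shared group is very large, the savings would be roughly half of the 267 MB of duplicated keys and chunk references.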
We can detect duplicated hard links with code like the following:

```python
import h5py
import h5py.h5o as h5o
from collections import defaultdict

def find_hard_links(filename: str) -> dict[int, list[str]]:
    hard_links = defaultdict(list)

    def visitor(name):
        # If the link is a hard link, record the address of the linked-to object
        link_type = f.get(name, getlink=True)
        if isinstance(link_type, h5py.HardLink):
            obj = f.get(name)
            info = h5o.get_info(obj.id)
            hard_links[info.addr].append(name)

    with h5py.File(filename, 'r') as f:
        # visit_links visits every link, including additional names for
        # objects that were already visited (unlike visit/visititems)
        f.visit_links(visitor)

    # Keep only addresses that are reachable through multiple paths
    duplicates = {addr: paths for addr, paths in hard_links.items()
                  if len(paths) > 1}
    return duplicates

# local_path is a local copy of the HDF5 file
links = find_hard_links(local_path)
for addr, paths in links.items():
    print(f"Address {addr}: {paths}")
```
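One way a writer could use that output is to pick a canonical path per address and treat the remaining paths as soft-link-style aliases. A minimal sketch (`canonicalize` and its tie-breaking rule are my own, not LINDI API):

```python
def canonicalize(duplicates: dict[int, list[str]]) -> dict[str, str]:
    """Map each alias path to a chosen canonical path
    (shortest, then lexicographically first) per hard-linked address."""
    alias_to_canonical = {}
    for paths in duplicates.values():
        canonical = min(paths, key=lambda p: (len(p), p))
        for p in paths:
            if p != canonical:
                alias_to_canonical[p] = canonical
    return alias_to_canonical

# Example with the two paths from the dandiset above
# (the address key is arbitrary for illustration)
dups = {12345: [
    "acquisition/GliaOnePhotonSeries",
    "processing/ophys/VolumeSegmentation/GliaVolumeSegmentation/reference_images/GliaOnePhotonSeries",
]}
print(canonicalize(dups))
```

The keys and chunk references would then be written once, under the canonical path, and each alias would carry only a pointer to it.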