@samuelbray32 samuelbray32 commented Aug 27, 2025

Export Improvements

While running the export pipeline for publication, I came across two common barriers that were not previously handled:

  1. Compound restrictions: Table1 & (Table2 & key)
  2. Excessively long keys

This PR solves both with the following:

In _log_fetch:
If either case applies:
- fetch the entry keys of the restricted table
- chunk the keys into groups whose compiled restriction string stays under the 2048-character limit
- make an entry in ExportSelection.Table for each of these chunks.

I think this minimizes the adjustments users need to make to their code for exporting, while preserving the ability to query the ExportSelection entries.
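The chunking step above can be sketched roughly as follows. This is a minimal illustration, not the actual Spyglass implementation: `chunk_restriction_keys` and the `render` callable are hypothetical names standing in for the real `_log_fetch` logic and DataJoint's `make_condition`.

```python
def chunk_restriction_keys(entry_keys, render, limit=2048):
    """Split fetched primary-key entries into groups whose rendered
    restriction string stays under the varchar limit.

    entry_keys: list of dicts (primary-key entries)
    render: callable turning a list of keys into a restriction string
            (a stand-in for datajoint's make_condition)
    """
    chunks, current = [], []
    for key in entry_keys:
        candidate = current + [key]
        # start a new chunk once the rendered candidate would be too long;
        # a single oversized key is kept alone (and must be handled upstream)
        if current and len(render(candidate)) > limit:
            chunks.append(current)
            current = [key]
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then become its own ExportSelection.Table entry, so the union of chunks reproduces the original restriction.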

Other

  • Fixes #1390: Formatting of restriction string in _log_fetch_nwb
  • Fixes #1392: Logging restrictions on projected tables
    • If a projected table is restricted, leading to a _log_fetch call:
      • apply the restrictions to the projected table
      • undo the projection, restoring column names to their original values
      • fetch keys in the original heading namespace and log these entries using the new methods above
  • Resolves #1393: Globally restrict export to a defined set of nwb_file_names
    • Adds new intersect method for restrGraph (callable via graph1 & graph2)
    • During export allows user to provide a list of nwb_files_included
      • These are used to make a restriction graph cascading down from Nwbfile and limited to these files
      • This is intersected with the restriction graph generated by ExportSelection to get the vertical database slice only dependent on the selected nwb files
    • Helps prevent inclusion of unintended files captured during the Export Selection process due to non-specific restrictions
  • Misc.
    • Enable multiprocessing during unpacking of linked files
    • Condense restrictions for all ExportSelection.Table entries for a given table prior to creating restrGraph
      • For a large project, the number of leaves that need to be cascaded drops from ~50k to ~50.
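The condensing step in the last bullet can be sketched as below. This is an illustrative stand-in, assuming log entries arrive as (table_name, restriction_string) pairs; `condense_restrictions` is a hypothetical name, not the actual ExportSelection method.

```python
from collections import defaultdict


def condense_restrictions(log_entries):
    """Collapse many per-fetch restriction strings into one OR-joined
    restriction per table, so the restriction graph has one leaf per
    table instead of one leaf per logged fetch.

    log_entries: iterable of (table_name, restriction_string) pairs
    """
    by_table = defaultdict(list)
    for table_name, restr in log_entries:
        if restr not in by_table[table_name]:  # drop exact duplicates
            by_table[table_name].append(restr)
    return {
        table_name: "(" + ") OR (".join(restrs) + ")"
        for table_name, restrs in by_table.items()
    }
```

With thousands of logged fetches against a handful of tables, this is what shrinks the leaf count before the restrGraph cascade.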

Dandi Export Improvements

In progress

Checklist:

  • If this PR should be accompanied by a release, I have updated the CITATION.cff
  • If this PR edits table definitions, I have included an alter snippet for release notes.
  • If this PR makes changes to position, I ran the relevant tests locally.
  • If this PR makes user-facing changes, I have added/edited docs/notebooks to reflect the changes
  • I have updated the CHANGELOG.md with PR number and description.

@samuelbray32 samuelbray32 requested a review from CBroz1 August 27, 2025 22:25
@samuelbray32
Collaborator Author

@CBroz1 I'm leaving this as a draft while I try it out, but I would appreciate it if you could flag any issues with the approach when you have time.


codecov bot commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 48.71795% with 180 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.09%. Comparing base (7d39224) to head (11d7349).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
src/spyglass/utils/h5py_helper_fn.py 13.58% 70 Missing ⚠️
src/spyglass/utils/dj_helper_fn.py 10.16% 53 Missing ⚠️
src/spyglass/common/common_dandi.py 6.45% 29 Missing ⚠️
src/spyglass/common/common_usage.py 83.58% 11 Missing ⚠️
src/spyglass/utils/mixins/export.py 82.00% 9 Missing ⚠️
src/spyglass/utils/dj_graph.py 87.30% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1387      +/-   ##
==========================================
+ Coverage   69.79%   70.09%   +0.30%     
==========================================
  Files         104      105       +1     
  Lines       12728    12917     +189     
==========================================
+ Hits         8883     9054     +171     
- Misses       3845     3863      +18     


Member

@CBroz1 CBroz1 left a comment

Thanks for putting this together! Seems on track, I just had some questions about subqueries and chunking

Comment on lines +229 to +230
if restriction
else self
Member

If you have neither a passed restriction nor a self.restriction attr, we might save time by setting the restriction to True and returning early
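A toy sketch of that early return, using stand-in names (`LogFetchMixin`, `_insert_log`) rather than the actual Spyglass mixin:

```python
class LogFetchMixin:
    """Toy stand-in for the export mixin (all names illustrative)."""

    def __init__(self):
        self.restriction = None  # DataJoint-style instance restriction
        self.logged = []

    def _insert_log(self, restr_str):
        self.logged.append(restr_str)

    def _log_fetch(self, restriction=None):
        # Early return: with no restriction from either source, the fetch
        # touches the whole table, so log the trivial "True" restriction
        # without building or measuring a restriction string.
        if not restriction and not self.restriction:
            self._insert_log("True")
            return
        self._insert_log(str(restriction or self.restriction))
```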

if not (
(
isinstance(restr_str, str)
and (len(restr_str) > 2048)
Member

It's still possible we'll bump this 2048 in the future. If so, it would be helpful to have a comment under ExportSelection.Table saying this was hard-coded here

It's also possible to fetch the value with self._export_table.Table.heading.attributes['restriction'].type[8:-1], but I would want to save that as a cached_property like self._export_restr_limit in that case to prevent that kind of table definition fetch on each iteration.
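A hedged sketch of that cached_property idea. The class name and the `_restriction_attr_type` hook below are hypothetical; in Spyglass the type string would come from the heading attribute access quoted above.

```python
from functools import cached_property


class ExportMixinSketch:
    """Illustrative only: parse the varchar length from the stored
    table definition once, instead of hard-coding 2048 everywhere."""

    @cached_property
    def _export_restr_limit(self):
        # e.g. "varchar(2048)" -> 2048; the slice [8:-1] drops the
        # "varchar(" prefix and the trailing ")"
        attr_type = self._restriction_attr_type()
        return int(attr_type[8:-1])

    def _restriction_attr_type(self):
        # Hypothetical hook; in Spyglass this would read
        # self._export_table.Table.heading.attributes["restriction"].type
        return "varchar(2048)"
```

Because it is a cached_property, the table-definition lookup happens once per instance rather than on every _log_fetch call.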

Collaborator Author

I'll also add a raised error in _insert_log if too long a restriction makes it there.

if restriction
else self
)
restricted_entries = restricted_table.fetch("KEY", log_export=False)
Member

@CBroz1 CBroz1 Aug 28, 2025

Just to state the logic: If a restriction exceeds our character limit, we will fetch the restricted table and recompile the restriction from chunks of entries. Yeah?

I wanted to confirm that resulting entries were treated as cumulative, so I dug through the export logic -> RestrGraph logic -> _set_restr method ... and yes, _set_restr by default merges passed entries.

Collaborator Author

Yeah. The new entries should be equivalent to those made by accessing the chunks of file entries in a sequential manner.

It loses information about what the restriction in the code was, but the entries specified should be the same

if chunk_size is None:
# estimate appropriate chunk size
chunk_size = max(
int(2048 // (len(restr_str) / len(restricted_entries))), 1
Member

What kinds of speed gains would we see from increasing the varchar? I would imagine the merging later takes more time than fetching the longer varchar.

Empirical question, likely not worth the deep dive required to answer

i * chunk_size : (i + 1) * chunk_size
]
chunk_restr_str = make_condition(self, chunk_entries, set())
self._insert_log(chunk_restr_str)
Member

There's probably a way to make this recursive to avoid ever hitting the varchar limit

def log_fetch(self, restr, ...):
    if len(restr) < self._export_restr_limit:
        self._insert_log(restr)
        return
    all_entries = (self & restr).fetch('KEY', log_export=False)
    for chunk_entries in all_entries[ ... chunk slicing ... ]:
        _ = (self & chunk_entries).fetch("KEY", log_export=True)

This way, if chunking fails for whatever reason, it'll get re-chunked. The embedded case will be smaller than we want, probably slower, but that seems preferable to hitting this error

Collaborator Author

Reasonable, I'll try something like that. It's also still technically possible to hit the limit if the primary-key restriction for a single entry is >2048.

@samuelbray32 samuelbray32 added the enhancement New feature or request label Sep 8, 2025
@samuelbray32 samuelbray32 marked this pull request as ready for review September 12, 2025 17:54
@edeno edeno requested a review from CBroz1 September 13, 2025 15:14
Member

@CBroz1 CBroz1 left a comment

Thanks for this work @samuelbray32 - esp the speed gains. Had some questions around motivation, and a desire to avoid globals where we can

), "table_to_undo must be a projection of table"

anti_alias_dict = {
attr.attribute_expression.strip("`"): attr.name
Member

There's a property for this here, such that you could ...

Suggested change
attr.attribute_expression.strip("`"): attr.name
attr.original_name : attr.name

or just run this dict for all attrs and run the project for a full attr map: {'a':'a', 'orig':'name', 'c':'c'}
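A minimal sketch of building that full attribute map. `full_attr_map` is a hypothetical helper; it assumes attribute objects expose `name` and `original_name`, with `original_name` equal to `name` for unaliased attributes, so a single `.proj(**mapping)` call would restore the original heading.

```python
from types import SimpleNamespace


def full_attr_map(attributes):
    """Map each attribute's original (pre-projection) name to its
    current, possibly aliased, name for *all* attributes, yielding
    the full map the suggestion describes: {'a': 'a', 'orig': 'name',
    'c': 'c'}. DataJoint's proj(new='old') convention means
    projected_table.proj(**full_attr_map(...)) undoes the rename.
    """
    return {attr.original_name: attr.name for attr in attributes}


# Usage with stand-in attribute objects:
attrs = [
    SimpleNamespace(name="a", original_name="a"),
    SimpleNamespace(name="name", original_name="orig"),
    SimpleNamespace(name="c", original_name="c"),
]
mapping = full_attr_map(attrs)
```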

if not (
isinstance(restr_str, str)
and (
(len(restr_str) > 2048)
Member

Keep 2048 hard coded?

"Export cannot handle subquery restrictions. Please submit a "
+ "bug report on GitHub with the code you ran and this "
+ f"restriction:\n\t{restr_str}"
"Single entry restriction exceeds 2048 characters.\n\t"
Member

Maybe adjust to say 'cannot store' so that a user has more context if they hit it

)
_ = trodes_pos_v1 * (
common.IntervalList & "interval_list_name = 'pos 0 valid times'"
) # Note for PR: table and restriction change because join no longer logs empty results
Member

👌🏼

(pos_merge & merge_key).fetch_nwb()

ExportSelection.start_export(paper_id=1, analysis_id=5)
trodes_pos_v1._export_cache.clear() # Clear cache to ensure proj table is captured
Member

Is this something start_export should do?

Member

@CBroz1 CBroz1 left a comment

Work still ongoing?

dandi_api_key: Optional[str] = None,
dandi_instance: Optional[str] = "dandi",
skip_raw_files: Optional[bool] = False,
n_compile_processes: Optional[int] = 1,
Member

I feel like it's safe to assume a user will want n_processes to match across these cases, no? No harm in parsing them out, but I think tighter signatures are more likely to be leveraged

shutil.copy(file, f"{destination_dir}/{os.path.basename(file)}")
else:
os.symlink(file, f"{destination_dir}/{os.path.basename(file)}")
# for file in source_files:
Member

Can remove?

def _make_file_in_dandi_dir(file, destination_dir, skip_raw_files):
if os.path.exists(f"{destination_dir}/{os.path.basename(file)}"):
return
if skip_raw_files and raw_dir in file:
Member

Here, raw_dir refers to the path fetched from spyglass.settings, yeah? Not now, but I think it's worth adopting a caps convention for those to make it clear they're constants, and not missing func args

Collaborator Author

To confirm, you're saying to use this import:
from spyglass.settings import raw_dir as RAW_DIR

and propagate changes accordingly?

Member

I'm saying that a future PR should edit settings to raw_dir -> RAW_DIR for the whole codebase. This is fine for now

@@ -0,0 +1,247 @@
from __future__ import annotations
Member

Worth adding to h5_helper_fn? Might be confusing to import from both. Or, can rename the existing one to reflect recompute goals

hits: List[str] = []

def _visit(name: str, obj) -> None:
if not isinstance(obj, h5py.Dataset):
Member

Some of the current h5 code is doing this 'if is type, visit' logic, though embedded in the Comparison class



def convert_dataset_type(file: h5py.File, dataset_path: str, target_dtype: str):
"""Convert a dataset to a different dtype 'in place' (-ish).
Member

Please add note on the 'ish' - what makes it not totally in place?

Collaborator Author

Thanks. The nwb file remains in place, but the hdmf dataset inside it is replaced with a new one (which has attributes set to match). There may be a signature left in the hdmf level. I'll clarify in the doc

new_dset.attrs[k] = v


def find_dynamic_tables_missing_id(nwb_path: str | Path) -> List[str]:
Member

Does pipe operator type hinting work if a user is still on 3.9? I'd like to migrate to this, but I think we need to wait until we require 3.10
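For what it's worth, under `from __future__ import annotations` (PEP 563, which the new module imports), annotations are stored as strings and never evaluated, so PEP 604 unions parse fine even on 3.9. The quoted annotation below shows the equivalent effect explicitly; `find_tables` is a hypothetical stand-in, and note that runtime introspection such as `typing.get_type_hints` would still fail on 3.9, so waiting for a 3.10 floor before migrating wholesale is indeed the safer call.

```python
from pathlib import Path


# A quoted annotation behaves the same way PEP 563 makes all
# annotations behave: stored as a string, never evaluated, so the
# 3.10+ `str | Path` syntax causes no error on 3.9.
def find_tables(nwb_path: "str | Path") -> "list[str]":
    # hypothetical stand-in for find_dynamic_tables_missing_id
    return []
```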

