Generalize entities #86

Open
gkiar wants to merge 21 commits into serial-index from generalize-entities

@gkiar
Collaborator

@gkiar gkiar commented May 21, 2026

Downstream of #85 (stacked on top of it)

PR Contribution Summary

This PR generalizes the indexing layer from subject-only discovery to entity-based crawling, and replaces hardcoded patterns with schema-derived logic wherever possible.

Architecture: Subject → Entity

  • _index_bids_subject_dir → _index_bids_entity_dir: indexes any entity directory (sub-*, tpl-*, etc.)
  • _find_bids_subject_dirs → _find_bids_entity_dirs: discovers any entity type at a dataset root
  • _is_bids_subject_dir → _is_bids_entity_dir: checks arbitrary entity type by name
  • format_bids_path now uses schema-derived directory hierarchy (TPL → Cohort → Sub → Ses → Datatype)
  • All schema-related discovery functions centralized in _entities.py
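
To make the subject-to-entity generalization concrete, here is a minimal sketch of a name check that works for any entity type. It is not the code in this PR: the function name, the hardcoded `ENTITY_PREFIXES` table (the PR derives prefixes from the schema), and the alphanumeric-only label rule are all assumptions made for illustration.

```python
import re

# Hypothetical prefix table for illustration only; the PR derives these
# from the BIDS schema rather than hardcoding them.
ENTITY_PREFIXES = {
    "subject": "sub",
    "session": "ses",
    "template": "tpl",
    "cohort": "cohort",
}


def is_entity_dir_name(name: str, entity: str) -> bool:
    """Return True if `name` has the form `<prefix>-<label>` for `entity`."""
    prefix = ENTITY_PREFIXES[entity]
    # Assumption: labels are alphanumeric only, so a combined name such as
    # "sub-01_ses-01" does not match.
    return re.fullmatch(rf"{re.escape(prefix)}-[A-Za-z0-9]+", name) is not None
```

With this shape, checking a template directory is the same call as checking a subject directory, only with a different entity name.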

Template and Cohort Support

  • tpl-* and cohort-* directories are indexed alongside sub-*
  • _is_bids_dataset derivative checks look for subject OR template entity dirs
  • Verified with real TemplateFlow datasets (1590 files, 30 templates)

Schema-Driven Discovery

  • get_entity_child_dirs(dataset_type, parent_rule) — reads valid entity subdirectories from rules.directories
  • get_file_entity_prefixes() — root-level entity name prefixes derived from schema
  • get_all_root_entity_types() — deduplicated root entity types across all dataset types
  • get_all_dataset_types() — enumerates schema-defined dataset types
  • _BIDS_JSON_SIDECAR_EXCEPTION_SUFFIXES — derived from rules.files (currently coordsystem, description)
  • _BIDS_DATATYPE_PATTERN — built from entity names at schema init
  • _ensure_dict() helper — centralizes bidsschematools Namespace→dict conversion
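
As a rough illustration of what a Namespace-to-dict helper can look like, the sketch below normalizes a plain dict, any mapping, or an object exposing `to_dict()`. The name `ensure_dict` and the `FakeNamespace` stand-in are hypothetical; the only thing assumed about schema objects is that they offer a `to_dict()` method.

```python
from collections.abc import Mapping
from typing import Any


def ensure_dict(obj: Any) -> dict:
    """Return a plain dict from a dict, a Mapping, or an object with to_dict()."""
    if isinstance(obj, dict):
        return obj
    to_dict = getattr(obj, "to_dict", None)
    if callable(to_dict):
        return to_dict()
    if isinstance(obj, Mapping):
        return dict(obj)
    raise TypeError(f"Cannot convert {type(obj).__name__} to dict")


class FakeNamespace:
    """Hypothetical stand-in for a schema namespace; only mimics to_dict()."""

    def __init__(self, **items: Any) -> None:
        self._items = items

    def to_dict(self) -> dict:
        return dict(self._items)
```

Centralizing the conversion in one place means callers never need to know which of the three shapes they were handed.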

Derivative Detection

  • _is_bids_dataset() and _get_dataset_type() detect derivative datasets without dataset_description.json by checking inside derivatives/ for valid entity subdirectories
  • Correctly rejects combined sub-*_ses-* directories (spec-invalid)
  • Fallback iterates all dataset types when the detected type yields no entity dirs
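
The derivative heuristic can be sketched roughly as follows. This is an assumption-heavy illustration, not the PR's implementation: it treats a directory as a derivative dataset when it sits directly under a `derivatives/` folder and contains at least one entity directory, and it hardcodes the `sub` and `tpl` prefixes that the PR reads from the schema. Both function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed prefixes for illustration; the PR derives them from the schema.
_ENTITY_DIR = re.compile(r"(sub|tpl)-[A-Za-z0-9]+")


def contains_entity_dirs(path: Path) -> bool:
    """True if `path` has an immediate child directory named like an entity."""
    return any(
        child.is_dir() and _ENTITY_DIR.fullmatch(child.name) is not None
        for child in path.iterdir()
    )


def looks_like_derivative(path: Path) -> bool:
    """Heuristic: `path` lives under derivatives/ and holds entity directories."""
    return path.parent.name == "derivatives" and contains_entity_dirs(path)
```

Because the pattern uses a full match on alphanumeric labels, a combined directory name such as `sub-01_ses-01` is not counted as an entity directory.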

Generic Filtering and .bidsignore

  • include_subjects → generic filters dict mapping any entity name to glob patterns
  • --filter / -f CLI argument replaces --subjects (deprecated, backward-compatible)
  • .bidsignore support via _is_bidsignored with cached upward search
  • Filters forwarded through batch_index_dataset to workers
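
A filters dict of this kind could be evaluated along these lines. This is only a sketch: the function name is hypothetical, and two behaviors are assumptions rather than facts about the PR, namely that multiple patterns for one entity are combined with OR, and that a file lacking a filtered entity is excluded.

```python
from fnmatch import fnmatchcase


def matches_filters(entities: dict[str, str], filters: dict[str, list[str]]) -> bool:
    """AND across entity names, OR across the glob patterns for one entity."""
    for name, patterns in filters.items():
        value = entities.get(name)
        # Assumption: a missing entity never satisfies a filter on that entity.
        if value is None or not any(fnmatchcase(value, p) for p in patterns):
            return False
    return True
```

For example, `{"sub": ["0*"], "ses": ["pre"]}` keeps only files whose subject label starts with `0` and whose session label is exactly `pre`.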

Dataset Metadata Columns

  • dataset_name, dataset_type, bids_version added to Arrow schema, populated from dataset_description.json
  • clear_schema_caches() exposed as public API for schema reload safety

Code Cleanup

  • Removed dead code: get_all_entity_prefixes, get_required_entity_types
  • Deduplicated _get_subdir_names() for oneOf expansion
  • _read_dataset_description with @lru_cache to deduplicate reads
  • Simplified _resolve_entity_dirs — extracts entity discovery into _discover_entity_dirs
  • Updated stale comments and removed redundant wrapper functions
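
For readers unfamiliar with the caching pattern, a cached description reader feeding the new metadata columns might look like the sketch below. The function names and the mapping of `dataset_name` to the `Name` field are assumptions; `DatasetType` defaulting to `"raw"` mirrors the check quoted later in this thread.

```python
import json
from functools import lru_cache
from pathlib import Path
from typing import Optional


@lru_cache(maxsize=None)
def read_dataset_description(root: Path) -> Optional[dict]:
    """Parse <root>/dataset_description.json once per root; None if unusable."""
    try:
        data = json.loads((root / "dataset_description.json").read_text())
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON.
        return None
    return data if isinstance(data, dict) else None


def dataset_columns(root: Path) -> dict:
    """Build the per-dataset column values from the cached description."""
    desc = read_dataset_description(root) or {}
    return {
        "dataset_name": desc.get("Name"),
        "dataset_type": desc.get("DatasetType", "raw"),
        "bids_version": desc.get("BIDSVersion"),
    }
```

Note that the cached dict is shared between callers, so it should be treated as read-only, and `read_dataset_description.cache_clear()` is needed if the file changes on disk.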

Testing

  • test_derivative_detection — 5 scenarios including no-description derivatives and invalid combined entity dirs
  • test_index_dataset_filters — single, multi-value, glob, and cross-entity AND filters
  • test_batch_index_dataset_filters — filter forwarding through parallel workers
  • test_index_dataset_bidsignore: .bidsignore exclusion
  • Template integration tests gated by @templateflow_available
  • Renamed test_is_bids_subject_dir → test_is_bids_entity_dir
  • test_find_bids_datasets is now skipped (@pytest.mark.skip); the rglob("dataset_description.json") baseline no longer matches the schema-correct derivative detection

Impact

The indexer should now be less fragile under schema updates, and it can correctly index derivative datasets that use entity types other than subject and session (namely template and cohort), so it can be used across a wider range of the field's projects.

kaitj and others added 9 commits May 11, 2026 09:59
Splits the commenting from the CI by:
  1. Performing the testing and uploading the coverage files in ci.yaml workflow
  2. Downloading coverage and commenting, never checking out the code.

This should allow contributors who are not part of the organization to open PRs while still supporting coverage comments, without changing the trigger from "pull_request" to "pull_request_target", which may introduce security vulnerabilities targeting the tests.
- Pass run-id/github-token so download-artifact reaches the CI run
- Pipe PR number through pr-number.txt (workflow_run.pull_requests is
  empty for fork PRs)
- Pass issue-number explicitly to the comment action
- Gate coverage job on pull_request event + successful CI conclusion
- Restore COVERAGE_PYTHON env var; cap artifact retention at 1 day
Fix coverage commenting from pull requests
- Add dependabot CI with monthly interval
- Bump ruff
- Remove pandas from dev dependencies, already used in pybids extra
@github-actions

Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| __init__.py | 7 | 0 | 100% | |
| __main__.py | 64 | 5 | 92% | 101, 127, 155, 159, 163 |
| _entities.py | 112 | 1 | 99% | 129 |
| _indexing.py | 228 | 7 | 96% | 154, 163–164, 179, 357, 407, 445 |
| _logging.py | 31 | 4 | 87% | 30, 37, 39–40 |
| _metadata.py | 48 | 4 | 91% | 39–40, 66, 71 |
| _pathlib.py | 17 | 3 | 82% | 12–13, 15 |
| _version.py | 11 | 0 | 100% | |
| pybids/__init__.py | 4 | 0 | 100% | |
| pybids/_bidsfile.py | 38 | 13 | 65% | 71–73, 77–79, 83–85, 89–91, 95 |
| pybids/_layout.py | 156 | 45 | 71% | 63, 72, 81, 104, 114–115, 118, 140–141, 156–157, 173–174, 177–181, 186, 188–189, 192–193, 228, 233, 241, 322–324, 389–394, 396, 399–404, 406, 462, 482 |
| pybids/_utils.py | 13 | 5 | 61% | 47–50, 52 |
| TOTAL | 729 | 87 | 88% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 100 | 1 💤 | 0 ❌ | 0 🔥 | 23.612s ⏱️ |

@gkiar gkiar changed the base branch from main to serial-index May 21, 2026 13:07
@gkiar gkiar marked this pull request as ready for review May 21, 2026 20:30
Comment thread bids2table/_entities.py Outdated
Comment thread bids2table/_indexing.py Outdated
@gkiar gkiar changed the title [WIP] Generalize entities Generalize entities May 28, 2026
@gkiar
Collaborator Author

gkiar commented May 29, 2026

Thanks for the feedback, @effigies! I think we're in pretty good shape now. I know @kaitj is clearing a few things off his plate before he evaluates the upstream PR to serialize the indexer, which this one depends on; then we can get his review here too. Getting there.

In the meantime, would it make sense to rebase your PR from this one, to get ahead of that?

@effigies
Contributor

Sure, if this will go in first.

Comment thread bids2table/_indexing.py
Comment on lines +551 to +561
if description_exists:
    desc = _read_dataset_description(path)
    if desc:
        dataset_type = desc.get("DatasetType", "raw")
        if dataset_type in get_all_dataset_types():
            if dataset_type == "raw":
                return True
            entity_types = get_entity_child_dirs(dataset_type, "root")
            if entity_types:
                return _contains_bids_entity_dirs(path, entity_types)
    return True
Contributor

This strikes me as overly complicated. I think you just want to verify that desc['BIDSVersion'] exists, and then it's BIDS. IMO, if someone wants to index something without a visible dataset_description.json, you can provide a force option. But I also don't have a lot of context for why you were writing these heuristics in the first place.
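
For illustration, the simpler check suggested here might look like the sketch below. The function name and the `force` parameter are hypothetical, and the description is assumed to be passed in as an already-parsed dict (or `None` when `dataset_description.json` is missing).

```python
from typing import Optional


def is_bids_dataset(description: Optional[dict], force: bool = False) -> bool:
    """Treat a directory as BIDS if its description declares BIDSVersion.

    `force` is a hypothetical opt-in for indexing a directory that has no
    visible dataset_description.json.
    """
    if force:
        return True
    return bool(description) and "BIDSVersion" in description
```

The trade-off is that derivative datasets without a description are no longer detected automatically and must be indexed with the explicit opt-in.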

Comment thread bids2table/_entities.py
  entity_schema = {
      entity: schema.objects.entities[entity].to_dict()
-     for entity in schema.rules.entities
+     for entity in schema.objects.entities
Contributor

This was specifically used to enforce order. schema.rules.entities is a list with the global entity ordering.

Suggested change
- for entity in schema.objects.entities
+ for entity in schema.rules.entities
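
The ordering point can be shown with a small self-contained sketch. The two containers below are hypothetical stand-ins for `schema.objects.entities` (keyed lookups, in no meaningful order) and `schema.rules.entities` (the global ordering list); a dict comprehension preserves the order of whatever it iterates over.

```python
# Hypothetical stand-ins for the schema containers discussed above.
objects_entities = {
    "task": {"name": "task"},
    "subject": {"name": "sub"},
    "session": {"name": "ses"},
}
rules_entities = ["subject", "session", "task"]

# Iterating the ordered list fixes the key order of the resulting dict.
entity_schema = {entity: objects_entities[entity] for entity in rules_entities}

# Iterating the object table instead inherits its (arbitrary) key order.
unordered_schema = {entity: objects_entities[entity] for entity in objects_entities}
```

Both dicts hold the same entries, but only the first one follows the global entity ordering.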

Comment thread tests/test_indexing.py
Comment on lines +39 to +40
# find_bids_datasets now strictly follows BIDS schema for subject directories
# and only finds datasets with dataset_description.json
Contributor

Here it says you only find datasets with dataset_description.json, so maybe those heuristics should be dropped? OTOH, this test is now skipped.
