Generalize entities by gkiar · Pull Request #86 · childmindresearch/bids2table

gkiar · 2026-05-21T01:26:15Z

Downstream/On top of #85

PR Contribution Summary

This PR generalizes the indexing layer from subject-only discovery to entity-based crawling, and replaces hardcoded patterns with schema-derived logic wherever possible.

Architecture: Subject → Entity

_index_bids_subject_dir → _index_bids_entity_dir — indexes any entity directory (sub-*, tpl-*, etc.)
_find_bids_subject_dirs → _find_bids_entity_dirs — discovers any entity type at a dataset root
_is_bids_subject_dir → _is_bids_entity_dir — checks arbitrary entity type by name
format_bids_path now uses schema-derived directory hierarchy (TPL → Cohort → Sub → Ses → Datatype)
All schema-related discovery functions centralized in _entities.py
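
To make the subject-to-entity rename concrete, here is a minimal sketch of what a generalized directory check might look like. The function name, the regular expression, and the signature are assumptions for illustration; they are not taken from the code in this PR.

```python
import re

# Hypothetical pattern: "<prefix>-<label>" with an alphanumeric label.
# Combined names such as "sub-01_ses-02" do not match.
_ENTITY_DIR_RE = re.compile(r"(?P<prefix>[a-z]+)-(?P<label>[a-zA-Z0-9]+)")


def is_entity_dir(name: str, prefix: str) -> bool:
    """Return True if a directory name is valid for the given entity prefix."""
    match = _ENTITY_DIR_RE.fullmatch(name)
    return match is not None and match.group("prefix") == prefix
```

With this shape, `is_entity_dir("tpl-MNI152NLin2009cAsym", "tpl")` and `is_entity_dir("sub-01", "sub")` both pass through the same code path, which is the point of the generalization.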

Template and Cohort Support

tpl-* and cohort-* directories are indexed alongside sub-*
_is_bids_dataset derivative checks look for subject OR template entity dirs
Verified with real TemplateFlow datasets (1590 files, 30 templates)

Schema-Driven Discovery

get_entity_child_dirs(dataset_type, parent_rule) — reads valid entity subdirectories from rules.directories
get_file_entity_prefixes() — root-level entity name prefixes derived from schema
get_all_root_entity_types() — deduplicated root entity types across all dataset types
get_all_dataset_types() — enumerates schema-defined dataset types
_BIDS_JSON_SIDECAR_EXCEPTION_SUFFIXES — derived from rules.files (currently coordsystem, description)
_BIDS_DATATYPE_PATTERN — built from entity names at schema init
_ensure_dict() helper — centralizes bidsschematools Namespace→dict conversion
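
As a rough illustration of schema-driven discovery, the sketch below reads entity subdirectories from a directory-rules mapping, including `oneOf` expansion. `TOY_DIRECTORIES` is a hand-written stand-in loosely modeled on `rules.directories`, and `entity_child_dirs` is a hypothetical name; neither is the real schema nor the PR's implementation.

```python
# Hand-written toy stand-in for rules.directories; NOT the real BIDS schema.
TOY_DIRECTORIES = {
    "raw": {
        "root": {"subdirs": ["code", "derivatives", "subject"]},
        "subject": {
            "entity": "subject",
            "subdirs": [{"oneOf": ["session", "datatype"]}],
        },
        "session": {"entity": "session", "subdirs": ["datatype"]},
        "datatype": {"subdirs": []},
    },
}


def entity_child_dirs(directories: dict, dataset_type: str, parent_rule: str) -> list:
    """List child rules of parent_rule that correspond to an entity directory."""
    rules = directories[dataset_type]
    found = []
    for item in rules[parent_rule].get("subdirs", []):
        # A subdir entry is either a rule name or a {"oneOf": [...]} group.
        candidates = item["oneOf"] if isinstance(item, dict) else [item]
        for name in candidates:
            # Only rules that declare an "entity" key count as entity dirs.
            if "entity" in rules.get(name, {}) and name not in found:
                found.append(name)
    return found
```

Because the allowed children come from the mapping rather than from hardcoded patterns, adding a new entity directory to the mapping changes the result without touching the function.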

Derivative Detection

_is_bids_dataset() and _get_dataset_type() detect derivative datasets without dataset_description.json by checking inside derivatives/ for valid entity subdirectories
Correctly rejects combined sub-*_ses-* directories (spec-invalid)
Fallback iterates all dataset types when the detected type yields no entity dirs
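
The entity-directory heuristic described above could be sketched roughly as follows. The prefix tuple, the function name, and the directory layout in the demo are assumptions made for illustration; the PR derives its prefixes from the schema instead of hardcoding them.

```python
import tempfile
from pathlib import Path

# Assumed root-level entity prefixes, hardcoded here only for the sketch.
ROOT_PREFIXES = ("sub", "tpl")


def contains_entity_dirs(path: Path, prefixes=ROOT_PREFIXES) -> bool:
    """True if any child directory is named '<prefix>-<label>' with a plain label."""
    for child in path.iterdir():
        prefix, sep, label = child.name.partition("-")
        # label.isalnum() rejects combined names like "sub-01_ses-02".
        if child.is_dir() and sep and prefix in prefixes and label.isalnum():
            return True
    return False


# Demo on a throwaway directory tree with no dataset_description.json.
with tempfile.TemporaryDirectory() as tmp:
    valid = Path(tmp) / "derivatives" / "pipeline_a"
    (valid / "sub-01").mkdir(parents=True)
    invalid = Path(tmp) / "derivatives" / "pipeline_b"
    (invalid / "sub-01_ses-02").mkdir(parents=True)
    valid_found = contains_entity_dirs(valid)
    invalid_found = contains_entity_dirs(invalid)
```

In the demo, `pipeline_a` is accepted and `pipeline_b` is rejected because its only child is a combined `sub-*_ses-*` directory.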

Generic Filtering and .bidsignore

include_subjects → generic filters dict mapping any entity name to glob patterns
--filter / -f CLI argument replaces --subjects (deprecated, backward-compatible)
.bidsignore support via _is_bidsignored with cached upward search
Filters forwarded through batch_index_dataset to workers
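
A minimal sketch of how a generic filters dict might be evaluated is shown below. The AND-across-entities behavior follows the test description in this PR; treating multiple patterns for one entity as OR, and the function name `matches_filters`, are assumptions.

```python
from fnmatch import fnmatchcase


def matches_filters(entities: dict, filters: dict) -> bool:
    """Check entity values against glob patterns.

    Patterns for one entity are OR-ed; different entities are AND-ed.
    """
    for name, patterns in filters.items():
        value = entities.get(name)
        if value is None or not any(fnmatchcase(value, p) for p in patterns):
            return False
    return True
```

For example, `{"sub": ["0*"], "ses": ["B"]}` keeps only rows whose subject label starts with `0` and whose session label is exactly `B`.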

Dataset Metadata Columns

dataset_name, dataset_type, bids_version added to Arrow schema, populated from dataset_description.json
clear_schema_caches() exposed as public API for schema reload safety
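
One plausible way to populate the three new columns from a parsed `dataset_description.json` is sketched here. The mapping from the `Name`, `DatasetType`, and `BIDSVersion` fields to the column names is an assumption, as is the helper name `dataset_columns`; the `"raw"` default mirrors the `desc.get("DatasetType", "raw")` call quoted later in this thread.

```python
def dataset_columns(description: dict) -> dict:
    """Map dataset_description.json fields to the dataset-level columns."""
    return {
        "dataset_name": description.get("Name"),
        # DatasetType is optional; fall back to "raw" when it is absent.
        "dataset_type": description.get("DatasetType", "raw"),
        "bids_version": description.get("BIDSVersion"),
    }
```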

Code Cleanup

Removed dead code: get_all_entity_prefixes, get_required_entity_types
Deduplicated _get_subdir_names() for oneOf expansion
_read_dataset_description with @lru_cache to deduplicate reads
Simplified _resolve_entity_dirs — extracts entity discovery into _discover_entity_dirs
Updated stale comments and removed redundant wrapper functions
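
The cached description read mentioned above could look roughly like this. It is a sketch under assumptions: the function name drops the leading underscore, the error handling is a guess, and the demo data is made up.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path
from typing import Optional


@lru_cache(maxsize=None)
def read_dataset_description(root: Path) -> Optional[dict]:
    """Parse <root>/dataset_description.json once per root; None if unreadable."""
    try:
        return json.loads((root / "dataset_description.json").read_text())
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON.
        return None


# Demo: the second call is served from the cache instead of re-reading the file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dataset_description.json").write_text(
        json.dumps({"Name": "demo", "BIDSVersion": "1.10.0"})
    )
    first = read_dataset_description(root)
    second = read_dataset_description(root)
    hits = read_dataset_description.cache_info().hits
```

Note that the cached dict is shared between callers, so callers should not mutate it, and `read_dataset_description.cache_clear()` is needed if the file changes on disk.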

Testing

test_derivative_detection — 5 scenarios including no-description derivatives and invalid combined entity dirs
test_index_dataset_filters — single, multi-value, glob, and cross-entity AND filters
test_batch_index_dataset_filters — filter forwarding through parallel workers
test_index_dataset_bidsignore — .bidsignore exclusion
Template integration tests gated by @templateflow_available
Renamed test_is_bids_subject_dir → test_is_bids_entity_dir
test_find_bids_datasets is now skipped (@pytest.mark.skip); the rglob("dataset_description.json") baseline no longer matches the schema-correct derivative detection

Impact

We should now be less fragile to schema updates, and can correctly index derivative datasets that use entity types other than subject and session (namely template and cohort), so this can be used across a wider range of projects in the field.

Splits the coverage commenting from the CI by:
1. Performing the testing and uploading the coverage files in the ci.yaml workflow
2. Downloading the coverage and commenting, without ever checking out the code.

This should allow contributors who are not part of the organization to open PRs while still getting coverage comments, without changing the trigger from "pull_request" to "pull_request_target", which may introduce security vulnerabilities targeting the tests.

- Pass run-id/github-token so download-artifact reaches the CI run
- Pipe PR number through pr-number.txt (workflow_run.pull_requests is empty for fork PRs)
- Pass issue-number explicitly to the comment action
- Gate coverage job on pull_request event + successful CI conclusion
- Restore COVERAGE_PYTHON env var; cap artifact retention at 1 day

github-actions · 2026-05-21T01:38:43Z

Coverage Report

File	Stmts	Miss	Cover	Missing
__init__.py	7	0	100%
__main__.py	64	5	92%	101, 127, 155, 159, 163
_entities.py	112	1	99%	129
_indexing.py	228	7	96%	154, 163–164, 179, 357, 407, 445
_logging.py	31	4	87%	30, 37, 39–40
_metadata.py	48	4	91%	39–40, 66, 71
_pathlib.py	17	3	82%	12–13, 15
_version.py	11	0	100%
pybids
__init__.py	4	0	100%
_bidsfile.py	38	13	65%	71–73, 77–79, 83–85, 89–91, 95
_layout.py	156	45	71%	63, 72, 81, 104, 114–115, 118, 140–141, 156–157, 173–174, 177–181, 186, 188–189, 192–193, 228, 233, 241, 322–324, 389–394, 396, 399–404, 406, 462, 482
_utils.py	13	5	61%	47–50, 52
TOTAL	729	87	88%

Tests	Skipped	Failures	Errors	Time
100	1 💤	0 ❌	0 🔥	23.612s ⏱️

gkiar · 2026-05-29T15:26:55Z

Thanks for the feedback, @effigies! I think we're in pretty good shape now. I know @kaitj is clearing a few things off his plate before he evaluates the upstream PR to serialize the indexer, which this depends on; then we can get his review here too. Getting there.

In the meantime, would it make sense to rebase your PR from this one, to get ahead of that?

effigies · 2026-05-29T15:31:54Z

Sure, if this will go in first.

effigies · 2026-05-31T23:39:10Z

+    if description_exists:
+        desc = _read_dataset_description(path)
+        if desc:
+            dataset_type = desc.get("DatasetType", "raw")
+            if dataset_type in get_all_dataset_types():
+                if dataset_type == "raw":
+                    return True
+                entity_types = get_entity_child_dirs(dataset_type, "root")
+                if entity_types:
+                    return _contains_bids_entity_dirs(path, entity_types)
+                return True


This strikes me as overly complicated. I think you just want to verify that desc['BIDSVersion'] exists, and then it's BIDS. IMO, if someone wants to index something without a visible dataset_description.json, you can provide a force option. But I also don't have a lot of context for why you were writing these heuristics in the first place.
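
For comparison, the simpler check proposed in this comment might be sketched as below. This is an illustration of the suggestion, not code from the PR or from the reviewer; the `force` parameter name and the function name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def is_bids_dataset(path: Path, force: bool = False) -> bool:
    """Treat a directory as BIDS if its description declares BIDSVersion."""
    if force:
        # Caller explicitly asked to index without a visible description.
        return True
    try:
        desc = json.loads((path / "dataset_description.json").read_text())
    except (OSError, ValueError):
        return False
    return isinstance(desc, dict) and "BIDSVersion" in desc


# Demo on throwaway directories.
with tempfile.TemporaryDirectory() as tmp:
    described = Path(tmp) / "described"
    described.mkdir()
    (described / "dataset_description.json").write_text(
        json.dumps({"Name": "demo", "BIDSVersion": "1.10.0"})
    )
    bare = Path(tmp) / "bare"
    bare.mkdir()
    described_ok = is_bids_dataset(described)
    bare_ok = is_bids_dataset(bare)
    bare_forced = is_bids_dataset(bare, force=True)
```

The trade-off is that datasets without a description are never detected automatically; the caller has to opt in.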

effigies · 2026-05-31T23:46:46Z

    entity_schema = {
        entity: schema.objects.entities[entity].to_dict()
-        for entity in schema.rules.entities
+        for entity in schema.objects.entities


This was specifically used to enforce order. schema.rules.entities is a list with the global entity ordering.

Suggested change

-        for entity in schema.objects.entities
+        for entity in schema.rules.entities
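
The ordering point can be illustrated with toy data. The dict and list below are made-up stand-ins for `schema.objects.entities` and `schema.rules.entities`, not values read from the schema.

```python
# Toy stand-ins: a mapping of entity definitions and a list that fixes the order.
objects_entities = {
    "run": {"name": "run"},
    "session": {"name": "ses"},
    "subject": {"name": "sub"},
    "task": {"name": "task"},
}
rules_entities = ["subject", "session", "task", "run"]

# Iterating the mapping follows its own insertion order.
by_objects = {entity: objects_entities[entity] for entity in objects_entities}

# Iterating the ordered list enforces the intended entity order,
# because Python dicts preserve insertion order.
by_rules = {entity: objects_entities[entity] for entity in rules_entities}
```

Both comprehensions produce the same keys and values, but only the second one guarantees that downstream code sees entities in the order given by the list.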

effigies · 2026-05-31T23:53:36Z

+    # find_bids_datasets now strictly follows BIDS schema for subject directories
+    # and only finds datasets with dataset_description.json


Here it says you only find datasets with dataset_description.json, so maybe those heuristics should be dropped? OTOH, this test is now skipped.

kaitj and others added 9 commits May 11, 2026 09:59

Merge pull request #77 from childmindresearch/maint/ci-coverage

8cee006

Fix coverage commenting from pull requests

Add pybids api to docs

f3b0394

Update CI dependencies

bcfd7a6

- Add dependabot CI with monthly interval

Update project dependencies

39b814f

- Bump ruff
- Remove pandas from dev dependencies, already used in pybids extra

generalized indexing to use bids schema rather than hard coded subject patterns, alone

049fb42

ruff ruff 🐶

c7c37e0

apparently i ruffd wrong

a4e2a90

gkiar changed the base branch from main to serial-index May 21, 2026 13:07

gkiar added 8 commits May 21, 2026 11:19

generalized indexing to find template-style bids directories

26bd475

(not yet optimized, but functional?) generalized indexing with tpl and cohort nesting working

c456d5d

minor streamline and refactor, fixed path construction

5525f85

refactor to reduce hardcoding and redundancy from old subject-focused build

d3aa51a

added more tests and fixed bug incorrectly excluding derived datasets without descriptions

7870c69

further removed hardcoded constants

1ce4a6f

removed some dead code, refactored fragile code

f4f3231

minor test and helper function refactor

5788746

gkiar marked this pull request as ready for review May 21, 2026 20:30

reverted unnecessary docstring change

7078907

effigies reviewed May 21, 2026

Comment thread bids2table/_entities.py Outdated

effigies reviewed May 21, 2026

Comment thread bids2table/_indexing.py Outdated

gkiar added 2 commits May 27, 2026 13:12

added bids schema format patterns directly

4bb432c

removed recursive .bidsignore crawling

9ca7b84

gkiar changed the title ~~[WIP] Generalize entities~~ Generalize entities May 28, 2026

Merge branch 'main' into generalize-entities

ddea30d

effigies reviewed May 31, 2026

gkiar wants to merge 21 commits into serial-index from generalize-entities

4 participants
