Generalize entities#86
Conversation
Splits the commenting from the CI by: 1. Performing the testing and uploading the coverage files in ci.yaml workflow 2. Downloading coverage and commenting, never checking out the code. This should allow for contributors (not part of the organization) to make PRs while still facilitating the coverage commenting without requiring the change of trigger from "pull_request" to "pull_request_target", which may introduce security vulunerabilities targetting the tests.
- Pass run-id/github-token so download-artifact reaches the CI run - Pipe PR number through pr-number.txt (workflow_run.pull_requests is empty for fork PRs) - Pass issue-number explicitly to the comment action - Gate coverage job on pull_request event + successful CI conclusion - Restore COVERAGE_PYTHON env var; cap artifact retention at 1 day
Fix coverage commenting from pull requests
- Add dependabot CI with monthly interval
- Bump ruff - Remove pandas from dev dependencies, already used in pybids extra
…t patterns, alone
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
…d cohort nesting working
… without descriptions
|
Thanks for the feedback, @effigies ! I think we're in pretty good shape now. I know @kaitj is clearing a few things off his plate before he evaluates the upstream PR to serialize the indexer which this depends on, then we can get his review here too. Getting there. In the meantime, would it make sense to rebase your PR from this one, to get ahead of that? |
|
Sure, if this will go in first. |
| if description_exists: | ||
| desc = _read_dataset_description(path) | ||
| if desc: | ||
| dataset_type = desc.get("DatasetType", "raw") | ||
| if dataset_type in get_all_dataset_types(): | ||
| if dataset_type == "raw": | ||
| return True | ||
| entity_types = get_entity_child_dirs(dataset_type, "root") | ||
| if entity_types: | ||
| return _contains_bids_entity_dirs(path, entity_types) | ||
| return True |
There was a problem hiding this comment.
This strikes me as overly complicated. I think you just want to verify that desc['BIDSVersion'] exists, and then it's BIDS. IMO, if someone wants to index something without a visible dataset_description.json, you can provide a force option. But I also don't have a lot of context for why you were writing these heuristics in the first place.
| entity_schema = { | ||
| entity: schema.objects.entities[entity].to_dict() | ||
| for entity in schema.rules.entities | ||
| for entity in schema.objects.entities |
There was a problem hiding this comment.
This was specifically used to enforce order. schema.rules.entities is a list with the global entity ordering.
| for entity in schema.objects.entities | |
| for entity in schema.rules.entities |
| # find_bids_datasets now strictly follows BIDS schema for subject directories | ||
| # and only finds datasets with dataset_description.json |
There was a problem hiding this comment.
Here it says you only find datasets with dataset_description.json, so maybe those heuristics should be dropped? OTOH, this test is now skipped.
Downstream/On top of #85
PR Contribution Summary
This PR generalizes the indexing layer from subject-only discovery to entity-based crawling, and replaces hardcoded patterns with schema-derived logic wherever possible.
Architecture: Subject → Entity
_index_bids_subject_dir→_index_bids_entity_dir— indexes any entity directory (sub-*,tpl-*, etc.)_find_bids_subject_dirs→_find_bids_entity_dirs— discovers any entity type at a dataset root_is_bids_subject_dir→_is_bids_entity_dir— checks arbitrary entity type by name_entities.pyTemplate and Cohort Support
tpl-*andcohort-*directories are indexed alongsidesub-*_is_bids_datasetderivative checks look for subject OR template entity dirsSchema-Driven Discovery
get_entity_child_dirs(dataset_type, parent_rule)— reads valid entity subdirectories fromrules.directoriesget_file_entity_prefixes()— root-level entity name prefixes derived from schemaget_all_root_entity_types()— deduplicated root entity types across all dataset typesget_all_dataset_types()— enumerates schema-defined dataset types_BIDS_JSON_SIDECAR_EXCEPTION_SUFFIXES— derived fromrules.files(currentlycoordsystem,description)_BIDS_DATATYPE_PATTERN— built from entity names at schema init_ensure_dict()helper — centralizesbidsschematoolsNamespace→dict conversionDerivative Detection
_is_bids_dataset()and_get_dataset_type()detect derivative datasets withoutdataset_description.jsonby checking insidederivatives/for valid entity subdirectoriessub-*_ses-*directories (spec-invalid)Generic Filtering and .bidsignore
include_subjects→ generic filters dict mapping any entity name to glob patterns--filter/-fCLI argument replaces--subjects(deprecated, backward-compatible).bidsignoresupport via_is_bidsignoredwith cached upward searchbatch_index_datasetto workersDataset Metadata Columns
dataset_name,dataset_type,bids_versionadded to Arrow schema, populated fromdataset_description.jsonclear_schema_caches()exposed as public API for schema reload safetyCode Cleanup
get_all_entity_prefixes,get_required_entity_types_get_subdir_names()for oneOf expansion_read_dataset_descriptionwith@lru_cacheto deduplicate reads_resolve_entity_dirs— extracts entity discovery into_discover_entity_dirsTesting
test_derivative_detection— 5 scenarios including no-description derivatives and invalid combined entity dirstest_index_dataset_filters— single, multi-value, glob, and cross-entity AND filterstest_batch_index_dataset_filters— filter forwarding through parallel workerstest_index_dataset_bidsignore—.bidsignoreexclusion@templateflow_availabletest_is_bids_subject_dir→test_is_bids_entity_dirtest_find_bids_datasetsis now skipped (@pytest.mark.skip); therglob("dataset_description.json")baseline no longer matches the schema-correct derivative detectionImpact
We should now be less fragile in schema updates, and can correctly index derivative datasets using entity types other than subject and session (namely template and cohort), meaning this can be used across a wide range of the field's projects.