Commit 8dac968

Add docstrings and update README.md
1 parent c4868ee commit 8dac968

4 files changed

Lines changed: 133 additions & 14 deletions

README.md

Lines changed: 101 additions & 5 deletions
@@ -1,9 +1,8 @@
 # bids2table
-<!-- [![CI](https://github.com/childmindresearch/bids2table/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/childmindresearch/bids2table/actions/workflows/ci.yaml?query=branch%3Amain)
-[![Documentation](https://img.shields.io/badge/documentation-8CA1AF?logo=readthedocs&logoColor=fff)](https://childmindresearch.github.io/bids2table)
-[![codecov](https://codecov.io/gh/childmindresearch/bids2table/branch/main/graph/badge.svg?token=22HWWFWPW5)](https://codecov.io/gh/childmindresearch/bids2table) -->
+[![CI](https://github.com/childmindresearch/bids2table/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/childmindresearch/bids2table/actions/workflows/ci.yaml?query=branch%3Amain)
+[![codecov](https://codecov.io/gh/childmindresearch/bids2table/branch/main/graph/badge.svg?token=22HWWFWPW5)](https://codecov.io/gh/childmindresearch/bids2table)
 [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
-![Python3](https://img.shields.io/badge/python->=3.11-blue.svg)
+![Python3](https://img.shields.io/badge/python->=3.12-blue.svg)
 [![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
 
 Index BIDS datasets fast, locally or in the cloud.
@@ -13,5 +12,102 @@ Index BIDS datasets fast, locally or in the cloud.
 The latest development version can be installed with
 
 ```sh
-pip install git+https://github.com/childmindresearch/bids2table.git@develop/b2t2
+pip install "bids2table @ git+https://github.com/childmindresearch/bids2table.git@develop/b2t2"
+```
+
+To install with S3 support, include the `s3` extra
+
+```sh
+pip install "bids2table[s3] @ git+https://github.com/childmindresearch/bids2table.git@develop/b2t2"
+```
+
+## Usage
+
+### Finding BIDS datasets
+
+You can search a directory for valid BIDS datasets using `b2t2 find`
+
+```
+(bids2table) clane$ b2t2 find bids-examples | head -n 10
+bids-examples/asl002
+bids-examples/ds002
+bids-examples/ds005
+bids-examples/asl005
+bids-examples/ds051
+bids-examples/eeg_rishikesh
+bids-examples/asl004
+bids-examples/asl003
+bids-examples/ds003
+bids-examples/eeg_cbm
+```
+
+### Indexing datasets from the command line
+
+Indexing datasets is done with `b2t2 index`. Here we index a single example dataset, saving the output as a parquet file.
+
+```
+(bids2table) clane$ b2t2 index -v -o ds102.parquet bids-examples/ds102
+ds102: 100%|███████████████████████████████████████| 26/26 [00:00<00:00, 154.12it/s, sub=26, N=130]
+```
+
+You can also index a list of datasets. Note that each iteration in the progress bar represents one dataset.
+
+```
+(bids2table) clane$ b2t2 index -v -o bids-examples.parquet bids-examples/*
+100%|████████████████████████████████████████████| 87/87 [00:00<00:00, 113.59it/s, ds=None, N=9727]
+```
+
+You can pipe the output of `b2t2 find` to `b2t2 index` to create an index of all datasets under a root directory.
+
+```
+(bids2table) clane$ b2t2 find bids-examples | b2t2 index -v -o bids-examples.parquet
+97it [00:01, 96.05it/s, ds=ieeg_filtered_speech, N=10K]
+```
+
+The resulting index will include both top-level datasets (as in the previous command) as well as nested derivatives datasets.
+
+### Indexing datasets hosted on S3
+
+bids2table supports indexing datasets hosted on S3 via [cloudpathlib](https://github.com/drivendataorg/cloudpathlib). To use this functionality, install cloudpathlib with S3 support
+
+```sh
+pip install cloudpathlib[s3]
+```
+
+You can also install bids2table with the s3 extra
+
+```sh
+pip install "bids2table[s3] @ git+https://github.com/childmindresearch/bids2table.git@develop/b2t2"
+```
+
+As an example, here we index all datasets on [OpenNeuro](https://openneuro.org/)
+
+```
+(bids2table) clane$ b2t2 index -v -o openneuro.parquet \
+    -j 8 --use-threads s3://openneuro.org/ds*
+100%|█████████████████████████████████████| 1408/1408 [12:25<00:00, 1.89it/s, ds=ds006193, N=1.2M]
+```
+
+Using 8 threads, we can index all ~1400 OpenNeuro datasets (1.2M files) in less than 15 minutes.
+
+
+### Indexing datasets from Python
+
+You can also index datasets in Python using the Python API.
+
+```python
+import pyarrow as pa
+import bids2table as b2t2
+
+# Index a single dataset.
+tab = b2t2.index_dataset("bids-examples/ds102")
+
+# Find and index a batch of datasets.
+tabs = b2t2.batch_index_dataset(
+    b2t2.find_bids_datasets("bids-examples"),
+)
+tab = pa.concat_tables(tabs)
+
+# Index a dataset on S3.
+tab = b2t2.index_dataset("s3://openneuro.org/ds000224")
 ```
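
The README notes that piping `b2t2 find` into `b2t2 index` picks up both top-level datasets and nested derivatives datasets. For readers without the package installed, that discovery step can be approximated with the standard library. This is only a sketch: it assumes a dataset root is any directory that directly contains `dataset_description.json` (a file the BIDS specification requires at the dataset root). The criteria actually used by `find_bids_datasets` are not shown in this commit and may differ. `sketch_find_datasets` and `demo` are hypothetical names.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def sketch_find_datasets(root: Path) -> list[str]:
    """Return dataset directories under `root`, as paths relative to `root`.

    Assumption: a directory counts as a dataset root when it directly contains
    a `dataset_description.json` file. This is not bids2table's own logic.
    """
    found = {
        desc.parent.relative_to(root).as_posix()
        for desc in root.rglob("dataset_description.json")
    }
    return sorted(found)


def demo() -> list[str]:
    """Build a small throwaway tree and search it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # A top-level dataset with one nested derivatives dataset.
        for rel in ("ds001", "ds001/derivatives/fmriprep"):
            (root / rel).mkdir(parents=True)
            (root / rel / "dataset_description.json").write_text("{}")
        # A directory without a description file is ignored.
        (root / "notes").mkdir()
        return sketch_find_datasets(root)


if __name__ == "__main__":
    print(demo())
```

Because the search is recursive, the nested derivatives directory is reported alongside its parent dataset, which mirrors the behavior the README describes for the piped command.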

bids2table/__init__.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,12 +7,14 @@
77
set_bids_schema,
88
get_bids_schema,
99
get_bids_entity_arrow_schema,
10+
format_bids_path,
1011
)
1112
from ._indexing import (
1213
find_bids_datasets,
1314
index_dataset,
1415
batch_index_dataset,
1516
get_arrow_schema,
17+
get_column_names,
1618
)
1719
from ._pathlib import Path, cloudpathlib_is_available
1820
from ._version import *

bids2table/_entities.py

Lines changed: 26 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -74,7 +74,7 @@
7474
_logger = setup_logger(__package__)
7575

7676

77-
def set_bids_schema(path: str | Path | None = None):
77+
def set_bids_schema(path: str | Path | None = None) -> None:
7878
"""Set the BIDS schema."""
7979
global _BIDS_SCHEMA, _BIDS_ENTITY_SCHEMA, _BIDS_NAME_ENTITY_MAP
8080
global _BIDS_ENTITY_ARROW_SCHEMA
@@ -134,12 +134,17 @@ def get_bids_entity_arrow_schema() -> pa.Schema:
     return _BIDS_ENTITY_ARROW_SCHEMA
 
 
-@lru_cache()
 def parse_bids_entities(path: str | Path) -> dict[str, str]:
     """Parse entities from BIDS file path.
 
     Parses all BIDS filename `"{key}-{value}"` entities as well as special entities:
     datatype, suffix, ext (extension). Does not validate entities or cast to types.
+
+    Args:
+        path: BIDS path to parse.
+
+    Returns:
+        entities: dict mapping BIDS entity keys to values.
     """
     if isinstance(path, str):
         path = Path(path)
@@ -175,6 +180,11 @@ def parse_bids_entities(path: str | Path) -> dict[str, str]:
     return entities
 
 
+# Version with caching to use internally. Decorating the public function loses the
+# docstring.
+_cache_parse_bids_entities = lru_cache(parse_bids_entities)
+
+
 def _parse_bids_datatype(path: Path) -> str | None:
     """Parse BIDS datatype from file path.
 
@@ -192,8 +202,14 @@ def validate_bids_entities(
     """Validate BIDS entities.
 
     Validates the type and allowed values of each entity against the BIDS schema.
-    Returns a tuple of `valid_entities` as well as any leftover `extra_entities` that
-    don't match a known entity.
+
+    Args:
+        entities: dict mapping BIDS keys to unvalidated entities
+
+    Returns:
+        valid_entities: A mapping of valid BIDS keys to type-casted values.
+        extra_entities: A mapping of any leftover entity mappings that didn't match a
+            known entity or failed validation.
     """
     valid_entities = {}
     extra_entities = {}
@@ -233,7 +249,12 @@ def validate_bids_entities(
 def format_bids_path(entities: dict[str, Any], int_format: str = "%d") -> Path:
     """Construct a formatted BIDS path from entities dict.
 
-    Integer (index) value indices are formatted using the `int_format`.
+    Args:
+        entities: dict mapping BIDS keys to values.
+        int_format: format string for integer (index) BIDS values.
+
+    Returns:
+        path: formatted `Path` instance.
     """
     special = {"datatype", "suffix", "ext"}
 
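
The docstrings above describe a round trip: `parse_bids_entities` splits a path into `"{key}-{value}"` entities plus the special entities `datatype`, `suffix` and `ext`, and `format_bids_path` builds a name back from such a dict, applying `int_format` to integer values. The following standalone sketch illustrates that idea under simplifying assumptions and is not the package's implementation: `sketch_parse_entities` and `sketch_format_name` are hypothetical names, the datatype is guessed from a hyphen-free parent directory name, the extension is taken from the first dot onward, and the formatter returns only a filename string rather than a `Path`.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any

SPECIAL = {"datatype", "suffix", "ext"}


def sketch_parse_entities(path: str | Path) -> dict[str, str]:
    """Split a BIDS-style path into unvalidated string entities."""
    path = Path(path)
    # Assumption: everything from the first dot is the (compound) extension.
    stem, dot, ext = path.name.partition(".")
    parts = stem.split("_")
    entities: dict[str, str] = {}
    for part in parts:
        key, sep, value = part.partition("-")
        if sep:
            entities[key] = value
    # A trailing part without a hyphen is treated as the suffix.
    if parts[-1] and "-" not in parts[-1]:
        entities["suffix"] = parts[-1]
    if dot:
        entities["ext"] = dot + ext
    # Assumption: a hyphen-free parent directory name is the datatype.
    parent = path.parent.name
    if parent and "-" not in parent:
        entities["datatype"] = parent
    return entities


def sketch_format_name(entities: dict[str, Any], int_format: str = "%d") -> str:
    """Join entities back into a filename, formatting integers with `int_format`."""
    parts = []
    for key, value in entities.items():
        if key in SPECIAL:
            continue
        if isinstance(value, int):
            value = int_format % value
        parts.append(f"{key}-{value}")
    if entities.get("suffix"):
        parts.append(entities["suffix"])
    return "_".join(parts) + entities.get("ext", "")


if __name__ == "__main__":
    print(sketch_parse_entities("sub-01/func/sub-01_task-rest_run-1_bold.nii.gz"))
    print(sketch_format_name({"sub": "01", "run": 1, "suffix": "bold", "ext": ".nii.gz"}))
```

Note that the parser keeps every value as a string (`"1"` for `run`), which is consistent with the docstring's statement that parsing does not cast types; casting is left to the validation step.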

bids2table/_indexing.py

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -16,8 +16,8 @@
1616
from tqdm import tqdm
1717

1818
from ._entities import (
19+
_cache_parse_bids_entities,
1920
get_bids_entity_arrow_schema,
20-
parse_bids_entities,
2121
validate_bids_entities,
2222
)
2323
from ._logging import setup_logger
@@ -379,7 +379,7 @@ def _index_bids_subject_dir(
 
     records = []
     for p in _find_bids_files(path):
-        entities = parse_bids_entities(p)
+        entities = _cache_parse_bids_entities(p)
         valid_entities, extra_entities = validate_bids_entities(entities)
         record = {
             "dataset": dataset,
@@ -412,7 +412,7 @@ def _is_bids_file(path: Path) -> bool:
     if path.suffix == "" or not path.name.startswith("sub-"):
         return False
 
-    entities = parse_bids_entities(path)
+    entities = _cache_parse_bids_entities(path)
     # if not (entities.get("suffix") and entities.get("datatype")):
     if not (entities.get("suffix") and entities.get("ext")):
         return False
@@ -435,7 +435,7 @@ def _is_bids_json_sidecar(path: Path) -> bool:
         return False
 
     # Other checks require entities.
-    entities = parse_bids_entities(path)
+    entities = _cache_parse_bids_entities(path)
 
     # Second pass using full compound extension, in case of data files that use a
     # compound extension ending in .json.
