Commit c1d67b2

Merge pull request #67 from childmindresearch/enh/cloud-gs
Add gs support
2 parents 32ed4a5 + f651d79 commit c1d67b2

7 files changed

Lines changed: 935 additions & 1733 deletions

README.md

Lines changed: 6 additions & 3 deletions
@@ -16,16 +16,19 @@ To install the latest release from pypi, you can run
 pip install bids2table
 ```
 
-To install with S3 support, include the `s3` extra
+To install with cloud support, include the `cloud` extra
 
 ```sh
-pip install bids2table[s3]
+pip install bids2table[cloud]
 ```
 
+> [!WARNING]
+> Previous version only supported s3. s3 installation is still supported, but will be deprecated in the next version. Please update any installation scripts.
+
 The latest development version can be installed with
 
 ```sh
-pip install "bids2table[s3] @ git+https://github.com/childmindresearch/bids2table.git"
+pip install "bids2table[cloud] @ git+https://github.com/childmindresearch/bids2table.git"
 ```
 
 ## Usage

bids2table/__init__.py

Lines changed: 10 additions & 139 deletions
@@ -1,133 +1,4 @@
-# ruff: noqa: I001
-r"""
-[![CI](https://github.com/childmindresearch/bids2table/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/childmindresearch/bids2table/actions/workflows/ci.yaml?query=branch%3Amain)
-[![Docs](https://github.com/childmindresearch/bids2table/actions/workflows/docs.yaml/badge.svg?branch=main)](https://childmindresearch.github.io/bids2table/bids2table)
-[![codecov](https://codecov.io/gh/childmindresearch/bids2table/branch/main/graph/badge.svg?token=22HWWFWPW5)](https://codecov.io/gh/childmindresearch/bids2table)
-[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
-![Python3](https://img.shields.io/badge/python->=3.11-blue.svg)
-[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
-
-Index [BIDS](https://bids-specification.readthedocs.io/en/stable/) datasets fast, locally or in the cloud.
-
-## Installation
-
-To install the latest release from pypi, you can run
-
-```sh
-pip install bids2table
-```
-
-To install with S3 support, include the `s3` extra
-
-```sh
-pip install bids2table[s3]
-```
-
-The latest development version can be installed with
-
-```sh
-pip install "bids2table[s3] @ git+https://github.com/childmindresearch/bids2table.git"
-```
-
-## Usage
-
-To run these examples, you will need to clone the [bids-examples](https://github.com/bids-standard/bids-examples) repo.
-
-```sh
-git clone -b 1.9.0 https://github.com/bids-standard/bids-examples.git
-```
-
-### Finding BIDS datasets
-
-You can search a directory for valid BIDS datasets using `b2t2 find`
-
-```
-(bids2table) clane$ b2t2 find bids-examples | head -n 10
-bids-examples/asl002
-bids-examples/ds002
-bids-examples/ds005
-bids-examples/asl005
-bids-examples/ds051
-bids-examples/eeg_rishikesh
-bids-examples/asl004
-bids-examples/asl003
-bids-examples/ds003
-bids-examples/eeg_cbm
-```
-
-### Indexing datasets from the command line
-
-Indexing datasets is done with `b2t2 index`. Here we index a single example dataset, saving the output as a parquet file.
-
-```
-(bids2table) clane$ b2t2 index -o ds102.parquet bids-examples/ds102
-ds102: 100%|███████████████████████████████████████| 26/26 [00:00<00:00, 154.12it/s, sub=26, N=130]
-```
-
-You can also index a list of datasets. Note that each iteration in the progress bar represents one dataset.
-
-```
-(bids2table) clane$ b2t2 index -o bids-examples.parquet bids-examples/*
-100%|████████████████████████████████████████████| 87/87 [00:00<00:00, 113.59it/s, ds=None, N=9727]
-```
-
-You can pipe the output of `b2t2 find` to `b2t2 index` to create an index of all datasets under a root directory.
-
-```
-(bids2table) clane$ b2t2 find bids-examples | b2t2 index -o bids-examples.parquet
-97it [00:01, 96.05it/s, ds=ieeg_filtered_speech, N=10K]
-```
-
-The resulting index will include both top-level datasets (as in the previous command) as well nested derivatives datasets.
-
-### Indexing datasets hosted on S3
-
-bids2table supports indexing datasets hosted on S3 via [cloudpathlib](https://github.com/drivendataorg/cloudpathlib). To use this functionality, make sure to install bids2table with the `s3` extra. Or you can also just install cloudpathlib directly
-
-```sh
-pip install cloudpathlib[s3]
-```
-
-As an example, here we index all datasets on [OpenNeuro](https://openneuro.org/)
-
-```
-(bids2table) clane$ b2t2 index -o openneuro.parquet \
-    -j 8 --use-threads s3://openneuro.org/ds*
-100%|█████████████████████████████████████| 1408/1408 [12:25<00:00, 1.89it/s, ds=ds006193, N=1.2M]
-```
-
-Using 8 threads, we can index all ~1400 OpenNeuro datasets (1.2M files) in less than 15 minutes.
-
-
-### Indexing datasets from python
-
-You can also index datasets using the Python API.
-
-```python
-import bids2table as b2t2
-import pandas as pd
-import pyarrow as pa
-import pyarrow.parquet as pq
-
-# Index a single dataset.
-tab = b2t2.index_dataset("bids-examples/ds102")
-
-# Find and index a batch of datasets.
-tabs = b2t2.batch_index_dataset(
-    b2t2.find_bids_datasets("bids-examples"),
-)
-tab = pa.concat_tables(tabs)
-
-# Index a dataset on S3.
-tab = b2t2.index_dataset("s3://openneuro.org/ds000224")
-
-# Save as parquet.
-pq.write_table(tab, "ds000224.parquet")
-
-# Convert to a pandas dataframe.
-df = tab.to_pandas(types_mapper=pd.ArrowDtype)
-```
-"""
+""".. include:: ../README.md""" # noqa: D415
 
 __all__ = [
     "index_dataset",
@@ -145,20 +16,20 @@
     "cloudpathlib_is_available",
 ]
 
+from ._entities import (
+    format_bids_path,
+    get_bids_entity_arrow_schema,
+    get_bids_schema,
+    parse_bids_entities,
+    set_bids_schema,
+    validate_bids_entities,
+)
 from ._indexing import (
-    index_dataset,
     batch_index_dataset,
     find_bids_datasets,
     get_arrow_schema,
     get_column_names,
-)
-from ._entities import (
-    parse_bids_entities,
-    validate_bids_entities,
-    set_bids_schema,
-    get_bids_schema,
-    get_bids_entity_arrow_schema,
-    format_bids_path,
+    index_dataset,
 )
 from ._metadata import load_bids_metadata
 from ._pathlib import cloudpathlib_is_available

bids2table/__main__.py

Lines changed: 3 additions & 3 deletions
@@ -160,10 +160,10 @@ def _find_command(args: argparse.Namespace):
 
 
 def _check_path(path: str):
-    if path.startswith("s3://") and not b2t2.cloudpathlib_is_available():
+    if path.startswith(("s3://", "gs://")) and not b2t2.cloudpathlib_is_available():
         _logger.error(
-            "Cloudpathlib is required to use S3 paths. "
-            "Install with e.g. `pip install cloudpathlib[s3]`."
+            "Cloudpathlib is required to use cloud paths. "
+            "Install with e.g. `pip install cloudpathlib[cloud]`."
         )
         sys.exit(1)
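The updated check in `_check_path` relies on `str.startswith` accepting a tuple of prefixes. A minimal standalone sketch of the same idea follows. `CLOUD_SCHEMES`, `is_cloud_path`, and `install_hint` are hypothetical names, not part of bids2table. The hint text points at the `bids2table[cloud]` extra added in this commit, on the assumption that cloudpathlib names its own extras per provider (as in the `cloudpathlib[s3,gs]` requirement in `pyproject.toml`) rather than `cloud`.

```python
from typing import Optional

# Hypothetical names; only the "s3://" and "gs://" prefixes come from the diff.
CLOUD_SCHEMES = ("s3://", "gs://")


def is_cloud_path(path: str) -> bool:
    """Return True if the path uses one of the supported cloud URI schemes."""
    # str.startswith accepts a tuple and matches if any one prefix matches.
    return path.startswith(CLOUD_SCHEMES)


def install_hint(path: str) -> Optional[str]:
    """Return an install hint for cloud paths, or None for local paths."""
    if not is_cloud_path(path):
        return None
    # Assumption: pointing users at the package's own extra is the safest hint.
    return "Install with e.g. `pip install bids2table[cloud]`."
```

Keeping the schemes in one tuple means a further provider only needs one new entry rather than another `or` clause.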

bids2table/_pathlib.py

Lines changed: 3 additions & 2 deletions
@@ -1,12 +1,13 @@
 from pathlib import Path
 
 try:
-    from cloudpathlib import AnyPath, CloudPath, S3Client
+    from cloudpathlib import AnyPath, CloudPath, GSClient, S3Client
 
     _CLOUDPATHLIB_AVAILABLE = True
 
-    # Set unsigned client as default for s3:// paths
+    # Set default clients for cloud paths
     S3Client(no_sign_request=True).set_as_default_client()
+    GSClient().set_as_default_client()
 
 except ImportError:
     AnyPath = CloudPath = Path
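The `try`/`except ImportError` block treats cloudpathlib as an optional dependency: when it is missing, `AnyPath` and `CloudPath` fall back to `pathlib.Path`, so local paths keep working. The sketch below shows the same pattern in isolation using only the standard library. `optional_module_available` and `resolve_path_class` are hypothetical helpers, and the real module may track availability differently.

```python
import importlib
import importlib.util
from pathlib import Path


def optional_module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def resolve_path_class(module_name: str = "cloudpathlib"):
    """Return the module's AnyPath if the module is installed, else pathlib.Path."""
    if optional_module_available(module_name):
        module = importlib.import_module(module_name)
        # Fall back to Path if the module has no AnyPath attribute.
        return getattr(module, "AnyPath", Path)
    return Path
```

For example, `resolve_path_class("module_that_is_not_installed")` returns `pathlib.Path`, which mirrors the `except ImportError` branch.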

pyproject.toml

Lines changed: 11 additions & 10 deletions
@@ -25,21 +25,22 @@ classifiers = [
     "Operating System :: Microsoft :: Windows",
 ]
 
-dependencies = ["bidsschematools>=1.0", "pyarrow>=20.0.0", "tqdm>=4.67.1"]
+dependencies = ["bidsschematools>=1.0", "pyarrow>=24.0.0", "tqdm>=4.67.3"]
 
 [project.optional-dependencies]
-s3 = ["cloudpathlib[s3]>=0.21.0"]
+cloud = ["cloudpathlib[s3,gs]>=0.21.0"]
+s3 = [
+    "cloudpathlib[s3]>=0.21.0",
+] # Include s3 to not break backwards compatibility
 
 [dependency-groups]
 dev = [
-    "ipython>=9.2.0",
-    "jupyter>=1.1.1",
-    "pandas==2.2.3",
-    "pdoc>=15.0.3",
-    "pre-commit>=4.1.0",
-    "pytest>=8.3.5",
-    "pytest-cov>=6.0.0",
-    "ruff>=0.11.9",
+    "pandas==3.0.2",
+    "pdoc>=16.0.0",
+    "pre-commit>=4.6.0",
+    "pytest>=9.0.3",
+    "pytest-cov>=7.1.0",
+    "ruff>=0.15.12",
 ]
 
 [project.urls]

tests/test_indexing.py

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
 def test_get_arrow_schema():
     schema = indexing.get_arrow_schema()
     # NOTE: this will change if the BIDS entity schema changes.
-    assert len(schema) == 38
+    assert len(schema) == 42
 
 
 def test_get_column_names():
