bids2table 2.0 #48

Merged
clane9 merged 25 commits into main from develop/b2t2
May 6, 2025
Conversation

@clane9 clane9 commented May 5, 2025

This is a full rewrite of bids2table aiming to make the project much simpler while also adding a few useful new features. The major changes are:

  • Reduce lines of code from 2150 (across bids2table and elbow) to 720 lines.
  • Reduce dependencies to only bidschematools and pyarrow (and tqdm).
  • Improve runtime performance. Indexing bids-examples on my laptop went from 2.5s to 650ms (roughly a 4x speedup).
  • Add support for indexing datasets hosted in the cloud via cloudpathlib. This lets us, for example, index all of OpenNeuro (~1400 datasets, 1.2M files) in less than 15 minutes:
(bids2table) clane$ b2t2 index -v -o openneuro.parquet -j 8 --use-threads s3://openneuro.org/ds*
100%|█████████████████████████████████████| 1408/1408 [12:25<00:00,  1.89it/s, ds=ds006193, N=1.2M]
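
The `N=1.2M` postfix in the progress bar is a file count in human-readable units. A minimal sketch of that kind of formatting, assuming a hypothetical helper named `format_count` (not necessarily how bids2table does it):

```python
def format_count(n: int) -> str:
    """Format a non-negative count with a K/M/B unit and one decimal place.

    Hypothetical helper for illustration only.
    """
    for threshold, unit in ((1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K")):
        if n >= threshold:
            return f"{n / threshold:.1f}{unit}"
    # Counts below 1000 are printed as plain integers.
    return str(n)


print(format_count(1_200_000))  # 1.2M
print(format_count(999))        # 999
```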

Credit to @nx10 for suggesting these directions and providing useful feedback.

cc: @effigies, @adelavega

clane9 added 17 commits April 30, 2025 16:48
- BIDS schema loaded from `bidschematools` following previous impl (thx
  @nx10).
- Add Arrow schema construction
- Otherwise, trying to make much simpler. No dataclass complexity.
New indexing uses just `Path` operations and string processing.
Generates arrow tables directly with no other dependencies. ~400 LOC.
Supports indexing cloud-hosted datasets via `cloudpathlib`.

Thanks to @nx10 for these suggestions.
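
To illustrate what "just `Path` operations and string processing" can look like, here is a rough stdlib-only sketch. The function names `parse_bids_entities` and `to_columns` are hypothetical, the parsing rules are simplified, and the result is a plain column dict rather than the Arrow table the real code builds with pyarrow.

```python
from pathlib import PurePosixPath


def parse_bids_entities(path: str) -> dict[str, str]:
    """Split a BIDS-style filename into entities, suffix, and extension.

    Simplified, hypothetical parsing; the real implementation follows the
    BIDS schema loaded from bidschematools.
    """
    name = PurePosixPath(path).name
    # Everything after the first dot is treated as the extension (e.g. ".nii.gz").
    stem, dot, ext = name.partition(".")
    parts = stem.split("_")
    entities: dict[str, str] = {}
    # A trailing part without a hyphen is the suffix (e.g. "bold").
    if parts and "-" not in parts[-1]:
        entities["suffix"] = parts.pop()
    # Remaining parts are "key-value" pairs such as "sub-01".
    for part in parts:
        key, sep, value = part.partition("-")
        if sep:
            entities[key] = value
    entities["ext"] = dot + ext
    return entities


def to_columns(paths: list[str]) -> dict[str, list]:
    """Collect per-file entities into columns, filling gaps with None."""
    rows = [parse_bids_entities(p) for p in paths]
    keys = sorted({k for row in rows for k in row})
    return {k: [row.get(k) for row in rows] for k in keys}
```

A column dict in this shape could then be handed to an Arrow table constructor; that step is left out here to keep the sketch dependency-free.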
- Don't raise error, just warn and return empty table if a dataset is empty.
- Add package-level exports.
TODO: test
Accepting unexpanded glob patterns in CLI is useful for cloud sources.
May as well accept a list of them. A few other misc cleanups (simplify log
messages, progress bar).
- Format large file counts in human readable units.
- Change verbosity order: progress bar -> warnings.
- Filter repeated logging messages.
- Initialize logger state in each process when using
  `ProcessPoolExecutor`.
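
One way to filter repeated logging messages is a `logging.Filter` that remembers what it has already let through. This is a sketch under that assumption; the class name `RepeatFilter` is hypothetical and the actual filter in `_logging.py` may work differently.

```python
import logging


class RepeatFilter(logging.Filter):
    """Drop records whose (level, message) pair has already been emitted."""

    def __init__(self) -> None:
        super().__init__()
        self._seen: set[tuple[int, str]] = set()

    def filter(self, record: logging.LogRecord) -> bool:
        key = (record.levelno, record.getMessage())
        if key in self._seen:
            return False  # Suppress the duplicate.
        self._seen.add(key)
        return True


def setup_logging() -> None:
    """Attach a handler with the repeat filter to a hypothetical logger name."""
    handler = logging.StreamHandler()
    handler.addFilter(RepeatFilter())
    logging.getLogger("b2t_sketch").addHandler(handler)
```

Since each worker process has its own logger state, a setup function like this can be passed as the `initializer` argument of `ProcessPoolExecutor` so that every worker configures its logger once on startup.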
A dynamically defined global enum for the column names is a bit awkward
and not that useful. The column names are not available statically in
the editor for example. Remove and replace with a `get_column_names`
function which dynamically defines the column name enum.
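
For reference, building an enum on demand inside a function can be done with the functional `Enum` API. The sketch below uses a hypothetical `make_column_enum` with a made-up argument list; it is not the signature of `get_column_names`.

```python
from enum import Enum


def make_column_enum(names: list[str]) -> type[Enum]:
    """Build a string-valued enum whose members mirror the given column names.

    Hypothetical stand-in for illustration; the column list here is passed in
    by the caller rather than derived from the BIDS schema.
    """
    return Enum("ColumnNames", {name: name for name in names}, type=str)


columns = make_column_enum(["dataset", "sub", "ses"])
print([c.name for c in columns])  # ['dataset', 'sub', 'ses']
```

Because the enum is created at call time, nothing is defined at import, which is the trade-off the commit message describes: members are still not visible to static tooling, but there is no global to maintain.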
codecov Bot commented May 5, 2025

Codecov Report

Attention: Patch coverage is 94.76190% with 22 lines in your changes missing coverage. Please review.

Project coverage is 94.61%. Comparing base (9322fcf) to head (db55a00).
Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| bids2table/_indexing.py | 95.87% | 8 Missing ⚠️ |
| bids2table/__main__.py | 89.06% | 7 Missing ⚠️ |
| bids2table/_logging.py | 87.09% | 4 Missing ⚠️ |
| bids2table/_pathlib.py | 84.61% | 2 Missing ⚠️ |
| bids2table/_entities.py | 99.12% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #48      +/-   ##
==========================================
+ Coverage   93.18%   94.61%   +1.42%     
==========================================
  Files          10        6       -4     
  Lines         558      427     -131     
==========================================
- Hits          520      404     -116     
+ Misses         38       23      -15     

clane9 added 8 commits May 5, 2025 14:39
Make informational, don't fail CI jobs
Disable progress bar by default in python index functions. Enable
progress bar by default in CLI. Add a `--no-progress` flag.
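
The split between library and CLI defaults can be expressed with an `argparse` `store_false` flag. This is only a sketch: apart from `--no-progress`, the parser layout, the `prog` name, and the `index_paths` function are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="index-sketch")
    # One or more roots; glob patterns are kept as plain strings, unexpanded.
    parser.add_argument("roots", nargs="+", help="dataset roots or glob patterns")
    # store_false makes `progress` default to True on the command line.
    parser.add_argument(
        "--no-progress",
        dest="progress",
        action="store_false",
        help="disable the progress bar",
    )
    return parser


def index_paths(roots: list[str], show_progress: bool = False) -> list[str]:
    """Placeholder for a Python-level index function: progress is off by default."""
    if show_progress:
        print(f"indexing {len(roots)} root(s)")
    return list(roots)


args = build_parser().parse_args(["s3://openneuro.org/ds*"])
index_paths(args.roots, show_progress=args.progress)
```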
@clane9 clane9 merged commit ecb8a57 into main May 6, 2025
3 checks passed