Improve reading from tar archives #178

Merged
schlegelp merged 5 commits into master from read_tar_fix on Jan 3, 2025
@schlegelp (Collaborator) commented on Jan 2, 2025

Addresses #173 by reading tar archives as a sequential file stream instead of using random access to individual members.
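For illustration, here is a minimal sketch of what stream-style reading can look like with Python's standard `tarfile` module. This is not the code from this PR: the helper names (`build_demo_tar`, `stream_tar_members`) and the in-memory demo archive are made up for the example, and the actual implementation may differ.

```python
import io
import tarfile


def build_demo_tar(files):
    """Pack a {name: text} mapping into an in-memory tar archive (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def stream_tar_members(fileobj, suffix=".swc"):
    """Yield (name, text) for matching members, reading the archive front to back."""
    # Mode "r|*" opens the archive as a forward-only stream, so members are
    # visited in archive order and each one must be read before moving on.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            handle = tar.extractfile(member)
            yield member.name, handle.read().decode("utf-8")


# Hypothetical demo archive with two SWC files and one unrelated file.
demo = build_demo_tar({
    "a.swc": "1 1 0 0 0 1 -1\n",
    "notes.txt": "ignore me\n",
    "b.swc": "1 1 0 0 0 1 -1\n2 0 1 0 0 1 1\n",
})
contents = dict(stream_tar_members(demo))
```

Because the stream only moves forward, each member has to be consumed before the next one is requested, which is one reason a single sequential pass does not combine easily with a pool of worker processes.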

Due to the way this is implemented now, reading from tar archives does not use parallel processes (i.e. the parallel parameter is ignored). We could create chunks of files that are adjacent in the archive and split the chunks across multiple processes. However, that in turn would cause issues with the progress bar.
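The chunking idea mentioned above is not implemented in this PR. As a rough sketch of the first step only, the archive's ordered member list could be split into contiguous runs so that each worker would read a block of adjacent files. The function name `contiguous_chunks` and the member names are hypothetical.

```python
def contiguous_chunks(items, n_chunks):
    """Split `items` into up to `n_chunks` contiguous runs, preserving order."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be >= 1")
    # Spread the remainder over the first `extra` chunks so sizes differ by at most 1.
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        if stop > start:  # drop empty chunks when there are fewer items than chunks
            chunks.append(items[start:stop])
        start = stop
    return chunks


# Hypothetical member names, in the order they appear in the archive.
members = [f"neuron_{i}.swc" for i in range(7)]
chunks = contiguous_chunks(members, 3)
```

Each worker would still have to open the archive itself and skip ahead to its own chunk, and progress reporting across workers would need to be aggregated separately.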

Ultimately, the file streaming seems to be very performant (possibly because we're not having to open/close individual files?) and I'm not too worried about performance. On my machine, I can read the tar archive with 97k hemibrain skeletons in around 3 minutes, which isn't too shabby.

In addition to the above, this PR contains:

  • making read_swc more robust against an unexpected number of columns
  • updating two of the tutorials to follow a URL change (download.brainlib.org:8811 -> download.brainimagelibrary.org)
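
How read_swc handles unexpected column counts is not spelled out here. As a minimal sketch of one possible policy, assuming the standard seven SWC columns, a parser could ignore extra trailing columns and skip rows that are too short or not numeric. The function `parse_swc_text` and the column names are chosen for the example and are not the library's API.

```python
SWC_COLUMNS = ("node_id", "label", "x", "y", "z", "radius", "parent_id")


def parse_swc_text(text):
    """Parse SWC text into a list of row dicts plus a count of skipped rows."""
    rows, skipped = [], 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and header comments carry no node data
        fields = line.split()
        if len(fields) < len(SWC_COLUMNS):
            skipped += 1  # assumed policy: rows with too few columns are dropped
            continue
        fields = fields[:len(SWC_COLUMNS)]  # assumed policy: extra columns are ignored
        try:
            node_id, label = int(fields[0]), int(fields[1])
            x, y, z, radius = (float(v) for v in fields[2:6])
            parent_id = int(fields[6])
        except ValueError:
            skipped += 1  # non-numeric values are treated like malformed rows
            continue
        rows.append(dict(zip(SWC_COLUMNS, (node_id, label, x, y, z, radius, parent_id))))
    return rows, skipped


sample = (
    "# demo skeleton\n"
    "1 1 0.0 0.0 0.0 1.0 -1\n"
    "2 3 1.0 0.0 0.0 0.5 1 extra\n"
    "3 3 2.0 0.0\n"
)
rows, skipped = parse_swc_text(sample)
```

Returning the number of skipped rows lets the caller decide whether to warn or raise instead of silently dropping data.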

@schlegelp changed the title from "Fix reading from tar archives" to "Improve reading from tar archives" on Jan 3, 2025
@schlegelp merged commit ab4de9e into master on Jan 3, 2025
20 of 21 checks passed
@schlegelp deleted the read_tar_fix branch on January 3, 2025 10:58