perf: add parallelization by gabyx · Pull Request #34 · sdsc-ordes/tripsu

gabyx · 2024-07-12T08:33:14Z

Proposed Changes

Add parallelization over rayon which lets you convert an iterator to a ParallelIterator which
is parallelized over rayon's thread pool.
Note: rio's into_iterator is abit stupid written, dont see the intention to store a full Vec<T>
Add a CLI flag --parallel which will use parallelization
If the parallelization pays out has yet to be tested. The Mutex on the Writer is not particularly good I guess. -> use slogs async writter which is buffered and is Send + Sync...

Types of Changes

What types of changes does your code introduce? Put an x in the boxes that
apply

A bug fix (non-breaking change which fixes an issue).
A new feature (non-breaking change which adds functionality).
A breaking change (fix or feature that would cause existing
functionality to not work as expected).
A non-productive update (documentation, tooling, etc. if none of the
other choices apply).

Checklist

Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask. We're here to
help! This is simply a reminder of what we are going to look for before merging
your code.

I have read the
CONTRIBUTING
guidelines.
I have added tests that prove my fix is effective or that my feature
works.
I have added the necessary documentation (if appropriate).

Further Comments

cmdoret

The clippy settings and code cleanup are welcome additions, but the parallelization of tripsu index degrades performance on my side (see comments).

gabyx · 2024-07-12T15:59:01Z

@cmdoret , @supermaxiste : the write call on Writer was wrong, only clippy mentioned this: -> write returns how many bytes have been written, which was ignored lol, I thought is a general error =), not all bytes being written is a possibility. So the correct function here is write_all. 👍

fix: correct all clippy errors fix: add CLI argument to setup the thread pool fix: CLI handling fix: wrong argument docs(readme): add copyright notice (#35) fix: parallelize also second pass fix: correct output in `write` -> `write_all` - `write` may fail an return the number of bytes written, which is the wrong function. fix: remove from first-pass since no benefit

gabyx · 2024-07-15T14:33:41Z

@gabyx : This PR is draft, and should be revisited once parallelization is needed with longer work on the pseudonomization function

supermaxiste · 2025-05-21T12:15:47Z

We can revisit this idea with a different approach. We can break down tripsu's pipeline as follows:

Reader Thread – reads input line-by-line and sends it to a worker pool.
Worker Pool – multiple threads process/pseudonymize triples concurrently.
Writer Thread – receives processed data and writes sequentially to output.

The crate crossbeam can help implement the approach above and basically multithread on the pseudonymization process for all triples that require it.

This approach seems scalable and it also keeps copy-on-write performance gains by simply skipping the worker pool when not needed 👍

gabyx marked this pull request as draft July 12, 2024 08:49

gabyx requested a review from cmdoret July 12, 2024 09:05

gabyx marked this pull request as ready for review July 12, 2024 09:05

gabyx force-pushed the feat/add-parallelization branch from 96868c0 to 0658cc6 Compare July 12, 2024 09:06

cmdoret changed the title ~~feat: add parallelization for first-pass~~ perf: add parallelization for first-pass Jul 12, 2024

cmdoret reviewed Jul 12, 2024

View reviewed changes

Comment thread src/pass_first.rs

Comment thread src/rules.rs

gabyx changed the title ~~perf: add parallelization for first-pass~~ perf: add parallelization Jul 12, 2024

gabyx force-pushed the feat/add-parallelization branch from 8c15fb0 to 51cd62f Compare July 15, 2024 14:32

gabyx marked this pull request as draft July 15, 2024 14:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: add parallelization#34

perf: add parallelization#34
gabyx wants to merge 1 commit into
mainfrom
feat/add-parallelization

gabyx commented Jul 12, 2024 •

edited

Loading

Uh oh!

cmdoret left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

gabyx commented Jul 12, 2024 •

edited

Loading

Uh oh!

gabyx commented Jul 15, 2024

Uh oh!

supermaxiste commented May 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

gabyx commented Jul 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed Changes

Types of Changes

Checklist

Further Comments

Uh oh!

cmdoret left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gabyx commented Jul 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gabyx commented Jul 15, 2024

Uh oh!

supermaxiste commented May 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

gabyx commented Jul 12, 2024 •

edited

Loading

cmdoret left a comment •

edited

Loading

gabyx commented Jul 12, 2024 •

edited

Loading