
Cleanup opensearch output #947

Draft

Pablu23 wants to merge 1 commit into poc-mainloop from feat-async-output

Conversation


@Pablu23 Pablu23 commented Mar 17, 2026

Description

Cleanup and optimize Opensearch async output

Assignee

  • The changes adhere to the contribution guidelines
  • I have performed a self-review of my code
  • My changes generate no new warnings (e.g. flake8/mypy/pytest/...) other than deprecations
  • Change base branch from poc-mainloop to main, after merge from mainloop

Documentation

Code Quality

  • Patch test coverage > 95% and does not decrease
  • New code uses correct & specific type hints

How did you verify that the changes work in practice?

  • List of (preferably easy reproducible) tests including OS

Reviewer


The rendered docs for this PR can be found here.

@Pablu23 Pablu23 self-assigned this Mar 17, 2026
Comment on lines 322 to +328

     actions = (event.data for event in events)

     index = 0
-    async for success, item in helpers.async_streaming_bulk(client, actions, **kwargs):  # type: ignore
-        if index >= len(events):
-            break
+    async for success, item in helpers.async_streaming_bulk(client, actions, **kwargs):
         event = events[index]
         event.state.current_state = EventStateType.STORING_IN_OUTPUT
+        # This should not be possible!
+        assert index < len(events)
Pablu23 (Collaborator, Author) commented:

bulk_id = uuid.uuid4()
actions = (
    {**event.data, "_id": f"{bulk_id}_{index}"}
    for index, event in enumerate(events)
)
index = 0
async for success, item in helpers.async_streaming_bulk(client, actions, **kwargs):

    # This should not be possible!
    assert index < len(events)

    # a uuid4 string is 36 characters, plus "_", so the index starts at offset 37
    assert index == int(item["create"]["_id"][37:])
    index += 1

This proves that helpers.async_streaming_bulk yields results in the same order as the actions iterable.
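The ordering claim can also be checked in isolation with a stub in place of helpers.async_streaming_bulk; fake_streaming_bulk, main, and the plain-dict events below are made up for this sketch:

```python
import asyncio
import uuid

# Hypothetical stand-in for helpers.async_streaming_bulk: yields one
# (success, item) pair per action, in the order the actions arrive.
async def fake_streaming_bulk(actions):
    for action in actions:
        yield True, {"create": {"_id": action["_id"]}}

async def main():
    events = [{"msg": i} for i in range(5)]
    bulk_id = uuid.uuid4()
    actions = (
        {**event, "_id": f"{bulk_id}_{index}"}
        for index, event in enumerate(events)
    )
    index = 0
    async for success, item in fake_streaming_bulk(actions):
        assert index < len(events)
        # a uuid4 string is 36 characters, plus "_", so the index starts at 37
        assert index == int(item["create"]["_id"][37:])
        index += 1
    return index

print(asyncio.run(main()))  # 5 if every result arrived in order
```

Against the real client the stub would be replaced by the actual helper, but the index bookkeeping stays identical.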

Pablu23 (Collaborator, Author) commented:

This could possibly stay in the code, but I don't like generating a UUID here and setting it as an ID if OpenSearch can probably do that more performantly itself.


@mhoff mhoff left a comment


Many thanks for your work. Here are the few comments we already discussed.

-    async for success, item in helpers.async_streaming_bulk(client, actions, **kwargs):  # type: ignore
-        if index >= len(events):
-            break
+    async for success, item in helpers.async_streaming_bulk(client, actions, **kwargs):
mhoff (Collaborator) commented:

            # "queue_size": self.config.queue_size,
            # "thread_count": self.config.thread_count,

Please remove these and note in the docs that they are not used.

Comment on lines 339 to 345

-    # parallel_bulk often returned item that allowed item.get("_op_type")
-    # streaming_bulk usually returns {"index": {...}} / {"create": {...}}
-    op_type = item.get("_op_type") if isinstance(item, dict) else None
-    if not op_type and isinstance(item, dict) and item:
+    op_type = self.config.default_op_type
+    if "_op_type" in item:
+        op_type = item["_op_type"]
+    elif isinstance(item, dict):
+        op_type = next(iter(item.keys()))
mhoff (Collaborator) commented:

Please simplify this code, as we are only using the async_streaming_bulk interface right now and don't need the backwards compatibility.
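A possible shape of that simplification, assuming every item yielded by async_streaming_bulk is a dict wrapping the result under its op type, e.g. {"create": {"_id": "...", "status": 201}} (the helper name get_op_type is made up for this sketch):

```python
# Hypothetical simplification: with only async_streaming_bulk in play,
# the single top-level key of each result item is the op type itself.
def get_op_type(item: dict) -> str:
    # e.g. {"create": {"_id": "abc", "status": 201}} -> "create"
    return next(iter(item))

print(get_op_type({"create": {"_id": "abc", "status": 201}}))  # create
```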

-    error_info = (
-        item.get(op_type, {}) if isinstance(item.get(op_type), dict) else {}
-    )
+    if op_type in item and isinstance(item[op_type], dict):
mhoff (Collaborator) commented:

We can statically assume item to be a dict, which simplifies this code quite a bit.
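Under that static assumption the extraction collapses to a plain dict lookup; get_error_info below is a hypothetical helper name for this sketch:

```python
# Sketch assuming item is always a dict of the form {op_type: details},
# as yielded by async_streaming_bulk, so no isinstance checks are needed.
def get_error_info(item: dict, op_type: str) -> dict:
    # missing op_type simply yields an empty error payload
    return item.get(op_type, {})

print(get_error_info({"create": {"status": 400}}, "create"))  # {'status': 400}
```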

-    async for success, item in helpers.async_streaming_bulk(client, actions, **kwargs):  # type: ignore
-        if index >= len(events):
-            break
+    async for success, item in helpers.async_streaming_bulk(client, actions, **kwargs):
mhoff (Collaborator) commented:

Please add a follow-up ticket for us: we might want to send the chunks concurrently in the future, depending on where we identify actual performance bottlenecks.
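For the follow-up ticket, a rough sketch of what concurrent chunking could look like; send_chunk and send_concurrently are hypothetical names, and the sleep stands in for the actual bulk request:

```python
import asyncio

# Hypothetical stand-in for one bulk request over a chunk of actions.
async def send_chunk(chunk: list[dict]) -> int:
    await asyncio.sleep(0)  # placeholder for network I/O
    return len(chunk)

# Split the actions into fixed-size chunks and send them concurrently
# instead of streaming them through one sequential bulk call.
async def send_concurrently(actions: list[dict], chunk_size: int) -> int:
    chunks = [
        actions[i : i + chunk_size]
        for i in range(0, len(actions), chunk_size)
    ]
    results = await asyncio.gather(*(send_chunk(c) for c in chunks))
    return sum(results)

print(asyncio.run(send_concurrently([{"n": i} for i in range(10)], 4)))  # 10
```

Note that concurrent chunks give up the per-index ordering guarantee the current loop relies on, so the event/result correlation would need the explicit `_id` scheme discussed above.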
