Description
For background, see #5910
Jaeger all-in-one typically runs with in-memory or Badger storage, both of which have a special implementation of the Dependencies Storage API: instead of pre-computing and storing the dependencies, they brute-force re-calculate them on demand each time (see `jaeger/plugin/storage/memory/memory.go`, line 85 at 9a30dfc).
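For reference, the on-demand computation is conceptually just a full scan over all stored traces. A simplified sketch with hypothetical types (not the actual memory-store code):

```go
package dependencies

// Span is a hypothetical, minimal view of a stored span; the real memory
// store works with the Jaeger model types.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for root spans
	Service  string
}

// DependencyLink mirrors the shape of the Dependencies Storage API output:
// a parent->child service edge with a call count.
type DependencyLink struct {
	Parent    string
	Child     string
	CallCount uint64
}

// BruteForceDependencies recomputes the full service graph from scratch on
// every call, which is what the in-memory and Badger stores effectively do today.
func BruteForceDependencies(traces map[string][]Span) []DependencyLink {
	counts := map[[2]string]uint64{}
	for _, spans := range traces {
		// Index the services within one trace by span ID.
		serviceByID := make(map[string]string, len(spans))
		for _, s := range spans {
			serviceByID[s.SpanID] = s.Service
		}
		// Every child span whose parent belongs to a different service
		// contributes one call to the parent->child edge.
		for _, s := range spans {
			if parent, ok := serviceByID[s.ParentID]; ok && parent != s.Service {
				counts[[2]string{parent, s.Service}]++
			}
		}
	}
	links := make([]DependencyLink, 0, len(counts))
	for edge, n := range counts {
		links = append(links, DependencyLink{Parent: edge[0], Child: edge[1], CallCount: n})
	}
	return links
}
```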
It's ok for small demos, but:
- if all-in-one is run on a large machine and is allowed to store a lot of traces, this brute-force recalculation could be very slow
- the two implementations are completely independent, and different from the Spark implementation, so we have 3 different copies of the code to maintain
Following up on the proposal from RFC #5910, we could re-implement this logic as an in-process streaming component using Apache Beam with the direct runner. This would allow us to consolidate the graph-building logic across the memory and Badger storages (in fact, extract it from them into an independent component), and in the future we could find a way to run it in a distributed manner on big data runners without actually changing the business logic.
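As a rough illustration, the same aggregation expressed as a Beam pipeline on the Go SDK's direct runner might look like the following. The `SpanRef` type and the overall pipeline shape are assumptions for this sketch, not a committed design:

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"log"
	"reflect"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/transforms/stats"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
)

// SpanRef is a hypothetical lightweight span record: just enough information
// to build the service dependency graph.
type SpanRef struct {
	TraceID, SpanID, ParentID, Service string
}

// keyByTrace keys every span by its trace ID so that a trace can be processed
// as a single unit.
func keyByTrace(s SpanRef) (string, SpanRef) { return s.TraceID, s }

// traceToEdges walks one complete trace and emits a "parent -> child" service
// edge (with count 1) for every cross-service parent/child span relationship.
func traceToEdges(_ string, spans func(*SpanRef) bool, emit func(string, int)) {
	var all []SpanRef
	serviceByID := map[string]string{}
	var sp SpanRef
	for spans(&sp) {
		serviceByID[sp.SpanID] = sp.Service
		all = append(all, sp)
	}
	for _, s := range all {
		if parent, ok := serviceByID[s.ParentID]; ok && parent != s.Service {
			emit(parent+" -> "+s.Service, 1)
		}
	}
}

// printEdge stands in for writing the aggregated links to dependencies storage.
func printEdge(edge string, calls int) { fmt.Printf("%s: %d call(s)\n", edge, calls) }

func init() {
	beam.RegisterType(reflect.TypeOf((*SpanRef)(nil)).Elem())
	beam.RegisterFunction(keyByTrace)
	beam.RegisterFunction(traceToEdges)
	beam.RegisterFunction(printEdge)
}

func main() {
	flag.Parse()
	beam.Init()
	p, s := beam.NewPipelineWithRoot()

	// In the real processor the input would be the stream of completed traces;
	// here it is a tiny hard-coded batch.
	spans := beam.CreateList(s, []SpanRef{
		{TraceID: "t1", SpanID: "a", Service: "frontend"},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Service: "backend"},
		{TraceID: "t1", SpanID: "c", ParentID: "b", Service: "db"},
	})

	keyed := beam.ParDo(s, keyByTrace, spans)     // KV<traceID, SpanRef>
	grouped := beam.GroupByKey(s, keyed)          // one group per trace
	edges := beam.ParDo(s, traceToEdges, grouped) // KV<edge, 1>
	counts := stats.SumPerKey(s, edges)           // KV<edge, total call count>
	beam.ParDo0(s, printEdge, counts)

	// beamx.Run uses the direct runner unless another runner is selected via flags.
	if err := beamx.Run(context.Background(), p); err != nil {
		log.Fatal(err)
	}
}
```

The same transforms should be portable to a distributed runner later, which is the main appeal of expressing the logic in Beam rather than directly inside each storage backend.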
Some implementation details:
- The logic will run as a dedicated processor in the OTEL pipeline (similar to the adaptive sampling processor).
- The processor will accumulate a dependency graph from the stream of traces it receives and periodically write it to dependencies storage (it can obtain the storage from the `jaeger_storage` extension).
- The processor will need to perform trace aggregation similar to the tail sampling processor, grouping spans by trace ID and waiting for a period of inactivity after which the trace can be declared "complete" (i.e. fully received, assembled, and ready for processing). It would be interesting to see if such functionality can be abstracted out of the tail sampling processor. Unlike the tail sampler, the dependencies aggregator does not need to keep full spans in memory, because (a) if it really needs them it can get them from the SpanReader, and (b) it can keep a more lightweight structure of only span IDs, span kinds, and their parent-child relationships, which will be useful once this logic starts running on a big data pipeline. A rough sketch of such an aggregator follows this list.
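A minimal sketch of the lightweight per-trace aggregation state described above (all type and field names are illustrative only):

```go
package depsprocessor

import (
	"sync"
	"time"
)

// spanRef keeps only what the dependency aggregation needs: no tags, no logs,
// no full span payloads.
type spanRef struct {
	spanID   string
	parentID string
	service  string
	kind     string // client/server/producer/consumer/internal
}

// traceState accumulates the lightweight span records for one trace and tracks
// when the trace was last touched.
type traceState struct {
	spans    []spanRef
	lastSeen time.Time
}

// aggregator groups incoming spans by trace ID and declares a trace "complete"
// once it has seen no new spans for the configured inactivity period.
type aggregator struct {
	mu         sync.Mutex
	traces     map[string]*traceState
	inactivity time.Duration
	onComplete func(traceID string, spans []spanRef) // e.g. emit dependency links
}

// Add records one span; in the OTEL processor this would be called for each
// span received from the pipeline.
func (a *aggregator) Add(traceID string, s spanRef) {
	a.mu.Lock()
	defer a.mu.Unlock()
	st := a.traces[traceID]
	if st == nil {
		st = &traceState{}
		a.traces[traceID] = st
	}
	st.spans = append(st.spans, s)
	st.lastSeen = time.Now()
}

// Flush is called periodically (e.g. from a ticker) and releases traces that
// have been inactive long enough to be considered fully received.
func (a *aggregator) Flush(now time.Time) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for id, st := range a.traces {
		if now.Sub(st.lastSeen) >= a.inactivity {
			a.onComplete(id, st.spans)
			delete(a.traces, id)
		}
	}
}
```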
Steps:
- build a processor performing streaming aggregation for the basic service map (using the logic that already exists in the memory store)
- implement dependencies storage in the memory store (GetDependencies could be controlled by a feature flag to switch between the current brute-force behavior and the new behavior that just reads the stored data); a sketch of this switch follows the list
- hook up the streaming processor to write to storage
- verify the behavior via the existing integration test for dependencies
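For the second step, the feature-flag switch could look roughly like this. The flag name, the stand-in types, and the surrounding structure are made up for illustration; the real change would live in the memory store and use the Jaeger model types:

```go
package memory

import (
	"context"
	"sync"
	"time"
)

// DependencyLink stands in for the Jaeger model type in this sketch.
type DependencyLink struct {
	Parent    string
	Child     string
	CallCount uint64
}

// Store is a trimmed-down stand-in for the memory store; only the pieces
// relevant to the feature flag are shown.
type Store struct {
	mu          sync.RWMutex
	storedLinks []DependencyLink // links written by the new streaming processor
	// useStoredDependencies is the hypothetical feature flag: when false the
	// store keeps the current brute-force behavior, when true it just returns
	// what the processor wrote.
	useStoredDependencies bool
}

// GetDependencies either recomputes the graph on demand (current behavior) or
// reads the links previously written by the streaming processor (new behavior).
func (s *Store) GetDependencies(ctx context.Context, endTs time.Time, lookback time.Duration) ([]DependencyLink, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.useStoredDependencies {
		out := make([]DependencyLink, len(s.storedLinks))
		copy(out, s.storedLinks)
		return out, nil
	}
	return s.recomputeDependencies(endTs, lookback), nil // existing brute-force path
}

// WriteDependencies is the new write path used by the streaming processor.
func (s *Store) WriteDependencies(ts time.Time, links []DependencyLink) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.storedLinks = append(s.storedLinks, links...)
	return nil
}

// recomputeDependencies represents the current on-demand calculation; elided here.
func (s *Store) recomputeDependencies(endTs time.Time, lookback time.Duration) []DependencyLink {
	return nil
}
```

Keeping both paths behind a flag lets the existing integration test for dependencies run against either behavior, which makes the final verification step straightforward.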