[Feature Request]: Properly incorporate bulk multimap side input reads into caching. #34149

Open
@robertwb

What would you like to happen?

Currently, the bulk read simply stores the first 100 values, which optimizes for avoiding point lookups on very small maps. The proper way to do this would probably be to issue the state request directly from MultimapSideInput.get() in the bulk side input read block, iterate over the returned key-value-iterables, and add each key and its corresponding (weighted) value iterable to the cache one at a time.
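To make the proposal concrete, here is a rough sketch in Python of adding bulk-read results to a weighted cache one key at a time. Everything in it is hypothetical: `WeightedLruCache`, `populate_from_bulk_read`, and the weight formula are illustrative stand-ins, not Beam APIs, and the bulk response is modeled as a plain iterable of (key, values) pairs.

```python
from collections import OrderedDict
from typing import Hashable, Iterable, List, Optional, Tuple


class WeightedLruCache:
    """Hypothetical stand-in for the harness cache: LRU with a weight budget."""

    def __init__(self, max_weight: int) -> None:
        self.max_weight = max_weight
        self.total_weight = 0
        self._entries: "OrderedDict[Hashable, Tuple[List[object], int]]" = OrderedDict()

    def put(self, key: Hashable, values: List[object], weight: int) -> None:
        # Replace any existing entry so its weight is not counted twice.
        if key in self._entries:
            self.total_weight -= self._entries.pop(key)[1]
        self._entries[key] = (values, weight)
        self.total_weight += weight
        # Evict least recently used entries until we are back within budget.
        while self.total_weight > self.max_weight and self._entries:
            _, (_, evicted_weight) = self._entries.popitem(last=False)
            self.total_weight -= evicted_weight

    def get(self, key: Hashable) -> Optional[List[object]]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return entry[0]


def populate_from_bulk_read(
    cache: WeightedLruCache,
    bulk_response: Iterable[Tuple[Hashable, Iterable[object]]],
) -> List[Hashable]:
    """Cache each key with its values individually; return the keys seen."""
    seen = []
    for key, values in bulk_response:
        materialized = list(values)
        # Assumed weight: one unit per value plus one for the key itself.
        cache.put(key, materialized, weight=len(materialized) + 1)
        seen.append(key)
    return seen


# Usage: a bulk response with three keys, one of which has no values.
cache = WeightedLruCache(max_weight=10)
keys = populate_from_bulk_read(cache, [("a", [1, 2]), ("b", [3]), ("c", [])])
```

Because each key is its own cache entry, a large map degrades gracefully: older keys are evicted individually instead of the whole result being dropped or truncated at a fixed count.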

We would probably also want to store some state indicating whether the bulk read was already attempted. If the returned values covered the entire map, we would also want to store the set of keys (or at least a Bloom filter) so that we can quickly return the empty iterable for keys we have discovered are not actually in the map, distinguishing this from the case where a key has merely been evicted from the cache.
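A minimal sketch of that bookkeeping, again in Python with hypothetical names (`BulkReadState`, `record_bulk_result`, `lookup`) and a plain dict standing in for the cache. It assumes the bulk read can report whether it returned the entire map.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Set, Tuple


class BulkReadState:
    """Hypothetical per-side-input state kept next to the cache."""

    def __init__(self) -> None:
        self.attempted = False
        # None means "key set unknown" (no bulk read yet, or only a partial one).
        self.complete_keys: Optional[Set[Hashable]] = None


def record_bulk_result(
    state: BulkReadState,
    cache: Dict[Hashable, List[object]],
    pairs: Iterable[Tuple[Hashable, Iterable[object]]],
    is_entire_map: bool,
) -> None:
    """Cache the bulk-read pairs and remember the key set if it is complete."""
    state.attempted = True
    keys: Set[Hashable] = set()
    for key, values in pairs:
        cache[key] = list(values)
        keys.add(key)
    state.complete_keys = keys if is_entire_map else None


def lookup(
    key: Hashable,
    cache: Dict[Hashable, List[object]],
    state: BulkReadState,
    point_lookup: Callable[[Hashable], Iterable[object]],
) -> List[object]:
    if key in cache:
        return cache[key]
    # Key set is complete and the key is not in it: known to be absent.
    if state.complete_keys is not None and key not in state.complete_keys:
        return []
    # Either the key was evicted or the key set is unknown: ask the runner.
    values = list(point_lookup(key))
    cache[key] = values
    return values
```

Swapping the exact key set for a Bloom filter would keep this logic intact: a false positive only costs one extra point lookup, while a negative answer can still be trusted.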

Alternatively, we could store the map (possibly only its first page) in the cache as a single entry.
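For comparison, the single-entry alternative might look roughly like the following Python sketch. The names (`CachedMap`, `store_first_page`, `get_values`) and the `has_more_pages` flag are assumptions for illustration, and the cache is again a plain dict keyed by a side input id.

```python
from typing import Callable, Dict, Hashable, Iterable, List, NamedTuple, Tuple


class CachedMap(NamedTuple):
    """Hypothetical single cache entry holding a whole map or its first page."""
    entries: Dict[Hashable, List[object]]
    is_entire_map: bool


def store_first_page(
    cache: Dict[str, CachedMap],
    side_input_id: str,
    page: Iterable[Tuple[Hashable, Iterable[object]]],
    has_more_pages: bool,
) -> None:
    # The entry is authoritative for missing keys only if nothing was left unread.
    entries = {key: list(values) for key, values in page}
    cache[side_input_id] = CachedMap(entries, is_entire_map=not has_more_pages)


def get_values(
    cache: Dict[str, CachedMap],
    side_input_id: str,
    key: Hashable,
    point_lookup: Callable[[Hashable], Iterable[object]],
) -> List[object]:
    cached = cache.get(side_input_id)
    if cached is not None:
        if key in cached.entries:
            return cached.entries[key]
        if cached.is_entire_map:
            return []  # the whole map is cached, so the key is absent
    # No cached map, or only a first page that does not contain the key.
    return list(point_lookup(key))
```

The trade-off is that the map is cached or evicted as a unit, so there is no need for separate key-set state, but a single large entry is weighed and dropped all at once.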

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
