Start designing shuffling algorithm

When a stage sends its output, we want to use that output to start shuffling data to downstream stages.

https://github.com/ray-project/ray_beam_runner/blob/86bfcdd5705b2e689d0aff0f02b6cf46535c88d0/ray_beam_runner/portability/execution.py#L94-L108

Example of shuffle implementation for Ray Datasets 1.13: https://github.com/ray-project/ray/pull/23758
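
As a starting point for discussion, the map side of a shuffle could be sketched as hash partitioning: each stage output is split into a fixed number of buckets by key, so every downstream worker only reads the bucket it owns. This is a minimal sketch, not the runner's actual design. It assumes elements have already been decoded into `(key, value)` byte pairs, and the names `partition_index` and `partition_output` are hypothetical. `zlib.crc32` is used instead of the built-in `hash()` because string and bytes hashing is randomized per process, which would send the same key to different partitions on different workers.

```python
import zlib
from typing import Iterable, List, Tuple

KV = Tuple[bytes, bytes]


def partition_index(key: bytes, num_partitions: int) -> int:
    """Map an encoded key to a partition using a hash that is stable across processes."""
    return zlib.crc32(key) % num_partitions


def partition_output(elements: Iterable[KV], num_partitions: int) -> List[List[KV]]:
    """Split (key, value) pairs into num_partitions buckets by key hash.

    Assumption: elements are already decoded into key/value byte pairs.
    """
    buckets: List[List[KV]] = [[] for _ in range(num_partitions)]
    for key, value in elements:
        buckets[partition_index(key, num_partitions)].append((key, value))
    return buckets


if __name__ == "__main__":
    stage_output = [(b"a", b"1"), (b"b", b"2"), (b"a", b"3")]
    for index, bucket in enumerate(partition_output(stage_output, 2)):
        print(index, bucket)
```

All values for a given key land in the same bucket, which is the property a downstream GroupByKey-style stage would rely on.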


	for output in worker_handler.data_conn.input_elements(
	        process_bundle_id,
	        expect_reads,
	        abort_callback=lambda: (result_future.is_done() and bool(result_future.get().error))):
	    if isinstance(output, beam_fn_api_pb2.Elements.Timers) and not dry_run:
	        output_buffers[expected_outputs[(output.transform_id, output.timer_family_id)]].append(output.data)
	    if isinstance(output, beam_fn_api_pb2.Elements.Data) and not dry_run:
	        output_buffers[expected_outputs[output.transform_id]].append(output.data)

	for pcoll, buffer in output_buffers.items():
	    objrefs = [ray.put(buffer)]
	    runner_context.pcollection_buffers.put.remote(pcoll, objrefs)
	    output_buffers[pcoll] = objrefs
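
The snippet above stores each PCollection's whole buffer behind a single object reference. One direction a shuffle design might take is to store one entry per `(pcoll, partition)` instead, so that a downstream worker can fetch only its partition from every upstream bundle. The sketch below illustrates that shape with a plain in-memory class. `PartitionedBuffers` and its methods are hypothetical, not part of the runner. In a Ray-based version the stored chunks would presumably be object references returned by `ray.put`, and the store itself an actor, but that is an assumption and not shown here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class PartitionedBuffers:
    """Hypothetical in-memory stand-in for a per-partition PCollection buffer store."""

    def __init__(self) -> None:
        # Maps (pcoll id, partition index) to the chunks written by upstream bundles.
        self._chunks: Dict[Tuple[str, int], List[bytes]] = defaultdict(list)

    def put(self, pcoll: str, partition: int, chunk: bytes) -> None:
        """Append one upstream chunk to the given partition of a PCollection."""
        self._chunks[(pcoll, partition)].append(chunk)

    def get(self, pcoll: str, partition: int) -> List[bytes]:
        """Return a copy of all chunks written so far for one partition."""
        return list(self._chunks.get((pcoll, partition), []))


if __name__ == "__main__":
    store = PartitionedBuffers()
    # Two upstream bundles each contribute a chunk to partition 0.
    store.put("pcoll_1", 0, b"bundle-A-part-0")
    store.put("pcoll_1", 1, b"bundle-A-part-1")
    store.put("pcoll_1", 0, b"bundle-B-part-0")
    print(store.get("pcoll_1", 0))
```

Keeping chunks in arrival order per partition means a reader sees the contributions of every upstream bundle without needing to know how many bundles there were.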

Start designing shuffling algorithm #26