This is the implementation of the Engine contract of Open Data Fabric using the RisingWave open-source streaming database. It is currently in use in the kamu data management tool.
This repository is a fork of the RisingWave repo since it currently cannot be used as a library or extended in a modular way with ODF-specific sources and sinks.
This engine is experimental and has limited functionality.
We currently recommend it for queries like:
- Streaming aggregations using window functions
- Streaming aggregations using tumbling windows with GROUP BY
- Top-N aggregations via materialized views
More information and engine comparisons are available here.
See the RisingWave examples and SQL reference for inspiration.
Have a look at integration tests in this repo for examples of different transform categories.
Pay special attention to EMIT ON WINDOW CLOSE: unlike Flink, RisingWave does not operate in event-time processing mode by default. In the future we may hide this under the hood, but currently you need to specify this clause explicitly in your queries.
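To make the effect of emitting only on window close more concrete, here is a toy in-memory model in Rust. It is only a sketch of the general idea, not RisingWave's implementation: the `TumblingCount` type, its methods, and the rule "a window closes once the watermark reaches its end" are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical model of a tumbling-window COUNT(*) that emits a window's
/// result only once, after the window has closed.
struct TumblingCount {
    window_size: u64,         // assumed to be > 0
    watermark: u64,           // highest watermark seen so far
    open: BTreeMap<u64, u64>, // window start -> number of events
}

impl TumblingCount {
    fn new(window_size: u64) -> Self {
        Self { window_size, watermark: 0, open: BTreeMap::new() }
    }

    /// Adds one event to its window. Nothing is emitted at this point.
    fn insert(&mut self, event_time: u64) {
        let start = event_time - event_time % self.window_size;
        *self.open.entry(start).or_insert(0) += 1;
    }

    /// Advances the watermark and returns the final (window_start, count)
    /// pairs for every window whose end is at or below the watermark.
    fn advance_watermark(&mut self, watermark: u64) -> Vec<(u64, u64)> {
        self.watermark = self.watermark.max(watermark);
        let (size, wm) = (self.window_size, self.watermark);
        let closed: Vec<(u64, u64)> = self
            .open
            .iter()
            .filter(|(start, _)| **start + size <= wm)
            .map(|(start, count)| (*start, *count))
            .collect();
        // Closed windows are removed so they are never emitted twice.
        for (start, _) in &closed {
            self.open.remove(start);
        }
        closed
    }
}

fn main() {
    let mut agg = TumblingCount::new(10);
    for t in [1, 5, 12] {
        agg.insert(t);
    }
    // Window [0, 10) is still open at watermark 9, so nothing is emitted.
    println!("{:?}", agg.advance_watermark(9));
    // At watermark 10 the window [0, 10) closes and its count is emitted.
    println!("{:?}", agg.advance_watermark(10));
}
```

In this model, results appear only as the watermark advances, which is the behavior the clause asks for. Without it, a streaming aggregation would typically keep updating its result as new events arrive.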
- No support for event-time JOINs. Only processing-time joins are supported. There might be some workarounds depending on the use case.
See: https://docs.risingwave.com/docs/current/query-syntax-join-clause/#process-time-temporal-joins
- No support for allowed lateness intervals. Unlike Flink where lateness and watermark are separate concepts, RisingWave will drop all events below the current watermark.
See: https://docs.risingwave.com/docs/current/watermarks/
- A SOURCE accepts append-only data. A TABLE with a connector accepts both append-only and updatable data, but it persists all data that flows in, meaning that ODF checkpoints would grow significantly.
See: https://docs.risingwave.com/docs/current/data-ingestion/
- Requires row ID assignment for delete/update operations.
Currently ODF does not preserve a link between a retraction and the row being retracted. This is easy to implement for ingest merge strategies via an additional column, but we would need to assess whether this is a reasonable demand for all transform engines that emit retractions/corrections.
An alternative would be to specify a primary key for the RW source and ensure it uses that key to associate a delete with the record being deleted.
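The lateness limitation listed above can be illustrated with a small Rust model. This is a simplified sketch rather than the engine's code: the `Watermark` type is hypothetical, and it assumes the watermark is derived as the maximum event time seen minus a fixed delay. The point is that the delay is the only tolerance for out-of-order data; there is no separate lateness interval after the watermark has passed.

```rust
/// Hypothetical watermark tracker: watermark = max event time seen - delay.
struct Watermark {
    delay: u64,
    max_seen: u64,
}

impl Watermark {
    fn new(delay: u64) -> Self {
        Self { delay, max_seen: 0 }
    }

    /// Current watermark value (never goes below zero).
    fn current(&self) -> u64 {
        self.max_seen.saturating_sub(self.delay)
    }

    /// Returns true if the event is accepted, false if it is dropped
    /// because its timestamp is below the current watermark.
    fn observe(&mut self, event_time: u64) -> bool {
        if event_time < self.current() {
            return false;
        }
        self.max_seen = self.max_seen.max(event_time);
        true
    }
}

fn main() {
    let mut wm = Watermark::new(5);
    let events = [10u64, 20, 16, 14, 30];
    // Events 16 and 14 both arrive out of order; only 14 falls below the
    // watermark (20 - 5 = 15) and is dropped.
    let accepted: Vec<u64> = events.iter().copied().filter(|&t| wm.observe(t)).collect();
    println!("accepted: {:?}", accepted);
}
```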
This repository is a fork of the RisingWave engine. Please refer to the upstream repo for developer documentation.
To reduce the very long build times, we add this to Cargo.toml:
[profile.dev]
debug = "line-tables-only"

To build the engine run:
# Use CARGO_BUILD_JOBS=N env var if this drains your RAM
./risedev b

Run the tests (note that tests use fixed ports and thus must run one at a time):
cd odf
make test

Key directories:
- odf - build scripts, container image, and other stuff
- src/odf_adapter - gRPC server that communicates with ODF coordinator and spawns RW as a subprocess
- src/connectors/src/source/odf - custom ODF source implementation
- src/connectors/src/sink/odf - custom ODF sink implementation

Environment variables:
- RUST_LOG=debug - standard Rust logging
- RW_ODF_SOURCE_DEBUG=1 - enables debug data logging in the source
- RW_ODF_SINK_DEBUG=1 - enables debug data logging in the sink
- RW_ODF_SOURCE_MAX_RECORDS_PER_CHUNK=10 - makes the source split Arrow record batches into smaller chunks
- RW_ODF_SOURCE_SLEEP_CHUNK_MS=500 - makes the source sleep for some time before yielding a chunk
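
As a rough illustration of what the two chunking variables do, the sketch below models them in plain Rust. It is not the source's actual code: `parse_positive` and `split_batch` are hypothetical helpers, a plain slice stands in for an Arrow record batch, and treating an unset or invalid variable as "no splitting, no sleep" is an assumption.

```rust
use std::{env, thread, time::Duration};

/// Parses a positive integer from an optional raw value.
/// Returns None when the value is missing, invalid, or zero.
fn parse_positive(raw: Option<&str>) -> Option<usize> {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
}

/// Splits a batch into chunks of at most `max_records` records.
/// With no limit, the whole batch is returned as a single chunk.
fn split_batch<T: Clone>(batch: &[T], max_records: Option<usize>) -> Vec<Vec<T>> {
    match max_records {
        Some(n) => batch.chunks(n).map(|c| c.to_vec()).collect(),
        None => vec![batch.to_vec()],
    }
}

fn main() {
    // Variable names come from the list above; how the engine reads them
    // internally is not shown here.
    let max_records = parse_positive(
        env::var("RW_ODF_SOURCE_MAX_RECORDS_PER_CHUNK").ok().as_deref(),
    );
    let sleep_ms = env::var("RW_ODF_SOURCE_SLEEP_CHUNK_MS")
        .ok()
        .and_then(|s| s.trim().parse::<u64>().ok());

    let batch = [1, 2, 3, 4, 5];
    for chunk in split_batch(&batch, max_records) {
        // Sleep before yielding each chunk, if a delay was configured.
        if let Some(ms) = sleep_ms {
            thread::sleep(Duration::from_millis(ms));
        }
        println!("yielding chunk: {:?}", chunk);
    }
}
```

Small chunks combined with a delay slow ingestion down, which can make it easier to follow how data moves through a query while debugging.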