Flush insert buffer based on time #1519

@ilongin

Description

Users can set `batch_size` in `DataChain.settings()`, which affects how rows are inserted, e.g. during UDF compute. Before checkpoints this was not very important: in the event of a failure, everything would be re-computed from scratch on the job re-run anyway.
With UDF checkpoints, a user can now resume a UDF calculation from where the error happened, so it is important to insert as many rows as possible BEFORE the failure, i.e. we don't want to lose results computed before the error just because of the batch size.
There are two issues:

  1. Inconsistent behavior between the ClickHouse and SQLite implementations regarding batch size. We should consider implementing `InsertBuffer` for SQLite as well: that gives a consistent codebase for both DBs and the ability to add new flushing strategies, such as time-based flushing, to both.
  2. We should probably introduce time-based flushing, since for some use cases the calculation for a single UDF input can take a long time. With our default batch size of 10_000, a failure can happen when the buffer is almost full (e.g. holding 9,999 results), which means the user loses a lot of processing time (if processing one row takes 1 second, in the worst case almost 3 hours of computation are lost!). A user can set `batch_size` to a smaller number, but we should still make out-of-the-box improvements.
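To make the idea concrete, here is a minimal sketch of a buffer that flushes on either condition, whichever is hit first. All names (`InsertBuffer`, `insert_fn`, `flush_interval`) are illustrative assumptions, not DataChain's actual API:

```python
import time


class InsertBuffer:
    """Sketch: buffer rows and flush when either the batch-size or the
    time threshold is exceeded (hypothetical, not DataChain's real class)."""

    def __init__(self, insert_fn, batch_size=10_000, flush_interval=5.0):
        self.insert_fn = insert_fn            # callable that writes a list of rows
        self.batch_size = batch_size          # flush after this many buffered rows
        self.flush_interval = flush_interval  # flush after this many seconds
        self.rows = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.rows.append(row)
        if (len(self.rows) >= self.batch_size
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        # Insert everything buffered so far and reset the timer, so a
        # failure loses at most flush_interval seconds of results.
        if self.rows:
            self.insert_fn(self.rows)
            self.rows = []
        self.last_flush = time.monotonic()
```

The time check here only runs inside `add()`; a real implementation for long-running UDFs might instead use a background timer so a nearly-full buffer is flushed even while one slow input is being computed.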
