Flush insert buffer based on time #1519

@ilongin

Description

Users can set `batch_size` in `DataChain.settings()`, which affects how rows are inserted, e.g. during UDF compute. Before checkpoints this was not very important: in the event of a failure, everything would be re-computed from scratch on the job re-run anyway.
With UDF checkpoints, a user can now resume a UDF calculation from where the error happened, so it is important to insert as many rows as possible BEFORE the failure, i.e. we don't want to lose results computed before the error just because of the batch size.
There are two issues:

  1. Inconsistent behavior between the ClickHouse and SQLite implementations regarding batch size. We should consider implementing `InsertBuffer` for SQLite as well: that gives a consistent codebase for both DBs and the ability to add new flushing strategies, such as time-based flushing, to both.
  2. We should probably introduce time-based flushing, since for some use cases the calculation for a single UDF input can take a long time. With our default batch size of 10_000, a failure can happen when the buffer is almost full (e.g. holding 9,999 results), which means the user loses a lot of processing time (if processing one row takes 1 second, in the worst case almost 3 hours of computation are lost!). A user can set `batch_size` to a smaller number, but we should still make out-of-the-box improvements.
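To make the idea concrete, here is a minimal sketch of a buffer that flushes on either condition, whichever is hit first. All names (`InsertBuffer`, `insert_fn`, `flush_interval`) are illustrative assumptions, not DataChain's actual API:

```python
import time


class InsertBuffer:
    """Sketch: buffer rows and flush when either the batch-size or the
    time threshold is exceeded (hypothetical, not DataChain's real class)."""

    def __init__(self, insert_fn, batch_size=10_000, flush_interval=5.0):
        self.insert_fn = insert_fn            # callable that writes a list of rows
        self.batch_size = batch_size          # flush after this many buffered rows
        self.flush_interval = flush_interval  # flush after this many seconds
        self.rows = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.rows.append(row)
        if (len(self.rows) >= self.batch_size
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        # Insert everything buffered so far and reset the timer, so a
        # failure loses at most flush_interval seconds of results.
        if self.rows:
            self.insert_fn(self.rows)
            self.rows = []
        self.last_flush = time.monotonic()
```

The time check here only runs inside `add()`; a real implementation for long-running UDFs might instead use a background timer so a nearly-full buffer is flushed even while one slow input is being computed.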
