Users can set `batch_size` in `DataChain.settings()`, which affects how rows are inserted, e.g. on UDF compute. Before checkpoints this was not that important, since in the event of a failure everything would be re-computed from scratch anyway on the job re-run.
With UDF checkpoints, users can now resume UDF calculation from where the error happened, so it's important to insert as many rows as possible BEFORE the failure, i.e. we don't want to lose results computed before the error just because of the batch size.
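For context, a minimal sketch of how a user sets this today (the storage path, column name, and the exact value are illustrative, and the chain-building API may differ slightly between DataChain versions):

```python
from datachain import DataChain

# Lower batch_size so computed results are flushed to the warehouse more
# often; 1_000 here is illustrative, the current default is 10_000.
chain = (
    DataChain.from_storage("s3://my-bucket/images/")  # path is illustrative
    .settings(batch_size=1_000)
    .map(size=lambda file: file.size, output=int)
)
chain.save("images_with_size")
```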
There are two issues:
- Different behavior between the ClickHouse and SQLite implementations regarding batch size. We should consider implementing `InsertBuffer` for SQLite as well, since that gives a consistent codebase for both DBs and the ability to add new strategies, like time-based flushing, for both.
- We should probably introduce time-based flushing, since for some use cases the calculation for one UDF input can take a long time. With our default batch size of 10_000, the job can fail while the buffer is almost full (e.g. holds 9_999 results), which means the user loses a lot of processing time (e.g. if processing one row takes 1 second, at worst they lose almost 3 hours of computation!). Users can set `batch_size` to a smaller number, but we should still make out-of-the-box improvements. A rough sketch of such a buffer follows below.
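A minimal sketch of what a time-aware insert buffer could look like. The class, method names, and flush callback are hypothetical, not DataChain's actual `InsertBuffer` API; the point is only that a buffer can flush on either row count or elapsed time, whichever comes first:

```python
import time
from typing import Any, Callable


class TimedInsertBuffer:
    """Hypothetical buffer that flushes on row count OR elapsed time."""

    def __init__(
        self,
        flush_cb: Callable[[list[dict[str, Any]]], None],
        max_rows: int = 10_000,
        max_seconds: float = 30.0,
    ) -> None:
        self.flush_cb = flush_cb        # e.g. a bulk INSERT into the warehouse
        self.max_rows = max_rows
        self.max_seconds = max_seconds
        self.rows: list[dict[str, Any]] = []
        self.last_flush = time.monotonic()

    def insert(self, row: dict[str, Any]) -> None:
        self.rows.append(row)
        # Flush when the buffer is full OR when it has been sitting too long,
        # so a failure loses at most ~max_seconds worth of computed results
        # instead of up to max_rows of them.
        if (
            len(self.rows) >= self.max_rows
            or time.monotonic() - self.last_flush >= self.max_seconds
        ):
            self.flush()

    def flush(self) -> None:
        if self.rows:
            self.flush_cb(self.rows)
            self.rows = []
        self.last_flush = time.monotonic()
```

The UDF runner would call `flush()` one final time after the last input (and before recording the checkpoint), so a resumed job only re-processes rows that were never flushed.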