-
Notifications
You must be signed in to change notification settings - Fork 138
feat: Enable Stats on Resumed Sync #558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: staging
Are you sure you want to change the base?
Conversation
|
Hi @vaibhav-datazip , hope y are having a great day.... so I'm a bit unsure how to test the e2e resume behaviour locally (I can run the |
|
Also... I didn't implement the change for the Oracle driver yet....cuz it had a TODO for counting rows/find chunks and i wanted to confirm expected approach before touching Oracle-specific SQL. |
|
Hi @Itz-Agasta , Also, for this you need to have good amount of rows in your table so that sync takes few minutes to get finished. And just wanted to remind, please raise your pr keeping staging branch as base. |
Sorry, my bad... I’ve updated it. (i should have rebase it tough 😢)
cool i will let you know after testing it. And what about the oracle driver....do you think I’m heading in the right direction? |
|
No problem, @Itz-Agasta |
|
Hi @Itz-Agasta , |
By tomorrow, i have implemented it and just have to perform a e2e test and check for any unexpected bhv. |
2025-10-07.19-31-55.mp4Hi @vaibhav-datazip, i have tried the e2e test as y told me. I followed the docs https://olake.io/docs/community/setting-up-a-dev-env and, you can see:
Can y pls check it out now? Thanks |
|
Hi @Itz-Agasta , maybe my explanation was not that clear.
|
Okkkk I seee ... I get it now i will test it by tommorow and let y know |
|
@vaibhav-datazip 1 more question ...if i stoped the debugger. How am I suppose to restart the sync using the state file... Any cmds that I can run with the '-- state' flag ? I mean
|
|
@Itz-Agasta
to further help yourself out , you can use the debugger file mentioned earlier or you can go through this doc as well , which mentions all the flags and commands which can be used in OLake. If you still have some doubt, please feel free to ask. |
Before2025-10-08.20-04-39.mp4Now2025-10-08.20-08-59.mp4 |
Hi @vaibhav-datazip , you can check this out now. One question though.... in both the previous and current cases, when I start the sync with --state, the |
|
records are written in parquet files in chunks. so if possible, for testing , you can decrease the chunk size to 1000 and try it out . currently the batch size is set such that size of parquet file generated becomes 256 mb , you can find this value in constants.go, if you are unable to set batch size to 1k, try decreasing this size to 1mb and you will see some changes in stats as well |
Yep its resuming form where it left now 2025-10-09.12-53-53.mp4But why with --state, its exceding the total record limit :( |
|
So, the stats of total records synced in stats file is representing records in batch of 10k, These are yet to be committed. once they will be committed the user can see them in their destination. But if before this whole chunk (the parent unit of batch) fails to sync, whole chunk will be synced again . That's the reason why you are seeing records exceeding total number of records. An easy way to solve this is, don't update the state file based on this total records synced from stats, instead update that only when one whole chunk gets committed. In that way we will have correct numbers when the sync is resumed again |
|
@Itz-Agasta can you check my previous comment and do the necessary changes |
…nd drivers/postgres/internal/backfill.go
Yep.. i have resolved them,,, you can review it now :) |
|
@vaibhav-datazip i have refactor the pr as y asked.... i think its good to go now .. pls check it. |
|
@vaibhav-datazip done.. y can check it now |
|
@Itz-Agasta I tried testing this using postgres , ran the sync by changing chunk size and stopped it. When trying to re run sync with state I am unable to see estimated remaining time. I have attached video for your reference. Screen.Recording.2025-10-28.at.4.06.10.PM.movAlso, instead of using 2 variables total records and synced records, you can just remaining records variable. Because that is used to check remaining time. |
ok you mean,,,,,instead of storing
|
|
Hi @vaibhav-datazip , I implemented your suggestion....So, Instead of tracking I added three methods:
Here is the final result- 2025-10-30.01-41-03.mp4You can test it now ... |
| } | ||
|
|
||
| connector.SetupState(state) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After implementing the prv changes, I tested it and... still got "Not Determined" :(
So I started debugging. I added log statements everywhere to trace what was happening. That's when I discovered the timing issue in sync.go (probably):
The stats logger was starting BEFORE the remaining records were loaded from state
Ig-
- Logger starts -> checks pool stats -> finds
TotalRecordsToSync = 0-> shows "Not Determined" - Then backfill loads state and adds records to pool stats
- But logger already decided there's no data
Thats why I added pre-load logic in sync.go that runs BEFORE starting the logger. It iterates through all streams (FullLoad, CDC, Incremental), loads their remaining counts from state, and adds them to pool stats.
| } | ||
| if countFloat64, ok := count.(float64); ok { | ||
| return int64(countFloat64) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one took me a while to figure out. I added more logging and noticed something weird - even though the state file had "remaining_record_count": 377888, when I read it back, it was coming out as 0!
I traced through the code and found the issue in GetRemainingRecordCount():
if count, loaded := s.Streams[index].State.Load(RemainingRecordCountKey); loaded {
if countInt64, ok := count.(int64); ok {
return countInt64 // This was failing!
}
}
return 0Ig Go's JSON unmarshaling converts ALL numbers to float64 by default, not their original types....So when we saved remaining_record_count as int64, it came back from JSON as float64. The type assertion count.(int64) was failing silently, and we were returning 0.
|
@Itz-Agasta , I tried testing it again. when I resumed the sync after stopping it, the remaining records count didn't reduce , as you can see in the video attached. Screen.Recording.2025-10-30.at.4.11.31.PM.mov |
Ahhhh. it was the same JSON type mismatch bug, but in |
|
@vaibhav-datazip please check this e2e test... i hope its perfect now. 2025-11-01.19-58-59.mp4 |
|
@Itz-Agasta , will be testing and reviewing soon |
Description
This PR implements stat persistence in the state object, enabling accurate progress tracking and time estimation when resuming syncs. Previously, when a sync was interrupted and resumed, all progress metrics (total records, synced count, estimated time) were lost, leading to poor user experience and observability.
Closes #110
Type of change
Changes
1. Added New State Tracking (types/state.go)
I added 4 functions that let it save and load these statistics:
SetTotalRecordCount()- Saves the total number of recordsGetTotalRecordCount()- Loads the total number of recordsSetSyncedRecordCount()- Saves how many records are doneGetSyncedRecordCount()- Loads how many records are done2. Updated Database Drivers
update Mongo, MySQL, and PostgreSQL so it:
3. Updated Resume Logic
Now when soemone resume a sync, the system:
How Has This Been Tested?