fix: some replicas never start manual compact after zero o'clock #1556

ninsmiracle · 2023-07-04T03:09:37Z

What problem does this PR solve?

What is changed and how does it work?

In the existing logic, if the number of replicas for a particular table undergoing manual_compact on the current node exceeds the limit, the LPC_MANUAL_COMPACT task will be discarded once it enters the task queue. In the new logic, the task will be delayed by 60 seconds before being reinserted into the queue.

The existing check that decides whether a replica should be added to the queue already prevents a replica that has been compacted on the same day from entering the queue again. A small change to the task dequeue logic is therefore enough to ensure that replicas due for compaction are not unexpectedly skipped because of the trigger-time calculation at zero o'clock.
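The retry behaviour described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the code changed in this PR: `sim_task` and `simulate_compaction` are hypothetical names, time is an integer number of seconds, and the queue is a plain `std::priority_queue` rather than the rDSN task queue.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a delayed LPC_MANUAL_COMPACT task.
struct sim_task
{
    int64_t run_at_sec;
    int replica_id;
    // Order by time, then by replica id, so the simulation is deterministic.
    bool operator>(const sim_task &other) const
    {
        if (run_at_sec != other.run_at_sec) {
            return run_at_sec > other.run_at_sec;
        }
        return replica_id > other.replica_id;
    }
};

// Simulates replicas competing for `max_concurrent` compaction slots.
// A compaction holds its slot for `compact_cost_sec`. A task that finds no
// free slot is re-enqueued `retry_delay_sec` later instead of being dropped.
// Returns (replica_id, start time) for every compaction, in start order.
inline std::vector<std::pair<int, int64_t>> simulate_compaction(int replica_count,
                                                                int max_concurrent,
                                                                int64_t compact_cost_sec,
                                                                int64_t retry_delay_sec,
                                                                int64_t start_sec)
{
    std::vector<std::pair<int, int64_t>> started;
    if (max_concurrent < 1 || retry_delay_sec < 1) {
        return started; // guard: these inputs would never make progress
    }

    std::priority_queue<sim_task, std::vector<sim_task>, std::greater<sim_task>> pending;
    for (int id = 0; id < replica_count; ++id) {
        pending.push({start_sec, id});
    }

    std::vector<int64_t> busy_until; // end time of every compaction started so far
    while (!pending.empty()) {
        sim_task task = pending.top();
        pending.pop();

        // Count compactions still running at the moment this task is dequeued.
        int running = 0;
        for (int64_t end_sec : busy_until) {
            if (end_sec > task.run_at_sec) {
                ++running;
            }
        }

        if (running >= max_concurrent) {
            // New logic: delay and re-enqueue rather than discard.
            pending.push({task.run_at_sec + retry_delay_sec, task.replica_id});
            continue;
        }
        busy_until.push_back(task.run_at_sec + compact_cost_sec);
        started.emplace_back(task.replica_id, task.run_at_sec);
    }
    return started;
}
```

With one slot, a 60-second retry delay, and compactions that finish well within a minute, replicas that arrive together start one per minute in this model, and each replica is compacted exactly once.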

Checklist

Tests

Manual test (add detailed scripts or steps below)
I created a Pegasus app and set the manual compact time to 23:59, so that some replicas have to do manual compaction after zero o'clock.

I got results like this:

33872:D2023-07-03 23:59:15.411 (1688399955411071211 15114) replica.compact0.04040001000001b4: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34071:D2023-07-04 00:00:15.411 (1688400015411226976 15118) replica.compact3.0407000100000001: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34249:D2023-07-04 00:01:15.411 (1688400075411318261 15117) replica.compact2.0407000400000002: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34459:D2023-07-04 00:02:15.411 (1688400135411507719 15120) replica.compact5.0407000500000003: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34609:D2023-07-04 00:03:15.411 (1688400195411521826 15117) replica.compact2.0407000600000004: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34803:D2023-07-04 00:04:15.411 (1688400255411667156 15118) replica.compact3.0407000000000004: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34982:D2023-07-04 00:05:15.411 (1688400315411802093 15122) replica.compact7.0407000500000004: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
35157:D2023-07-04 00:06:15.411 (1688400375411913107 15116) replica.compact1.0407000200000003: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
35362:D2023-07-04 00:07:15.412 (1688400435412059719 15118) replica.compact3.0407000000000006: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
35515:D2023-07-04 00:08:15.412 (1688400495412150909 15119) replica.compact4.0407000200000004: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
-bash-4.4$ grep -rn "end_manual_compact" *
33896:D2023-07-03 23:59:15.481 (1688399955481189237 15114) replica.compact0.04040001000001b4: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 45ms
34094:D2023-07-04 00:00:15.498 (1688400015498819014 15118) replica.compact3.0407000100000001: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 60ms
34272:D2023-07-04 00:01:15.488 (1688400075488255418 15117) replica.compact2.0407000400000002: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 45ms
34479:D2023-07-04 00:02:15.503 (1688400135503458132 15120) replica.compact5.0407000500000003: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 57ms
34631:D2023-07-04 00:03:15.483 (1688400195483022456 15117) replica.compact2.0407000600000004: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 45ms
34824:D2023-07-04 00:04:15.494 (1688400255494586453 15118) replica.compact3.0407000000000004: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 56ms
35002:D2023-07-04 00:05:15.515 (1688400315515958178 15122) replica.compact7.0407000500000004: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 76ms
35176:D2023-07-04 00:06:15.486 (1688400375486139793 15116) replica.compact1.0407000200000003: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 50ms
35380:D2023-07-04 00:07:15.487 (1688400435487399087 15118) replica.compact3.0407000000000006: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 48ms
35532:D2023-07-04 00:08:15.508 (1688400495508893878 15119) replica.compact4.0407000200000004: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 67ms

As shown in the above log, the adjusted compaction results are as expected. Replicas that had compact operations scheduled after midnight were successfully processed. Furthermore, there were no replicas that underwent compaction multiple times, ensuring a non-repetitive compaction process.

acelyc111 · 2023-07-04T07:22:17Z

the LPC_MANUAL_COMPACT task will be discarded once it enters the task queue. In the new logic, the task will be delayed by 60 seconds before being reinserted into the queue.

If the current task has been discarded, it will be triggered again in the next config_sync, so the task will not be lost IMO. The invocation chain is replica::on_config_sync() -> replica::update_app_envs() -> pegasus_server_impl::update_app_envs() -> pegasus_manual_compact_service::start_manual_compact_if_needed() -> dsn::tasking::enqueue of the compact task.

Does replica::on_config_sync work as expected?

ninsmiracle · 2023-07-05T03:03:25Z

This is exactly the problem. pegasus_manual_compact_service::start_manual_compact_if_needed() uses check_periodic_compact() to evaluate trigger_time, and there is a small problem in how trigger_time is calculated this way. As the following code shows, the preset hhmm time is added to unix_sec_today_midnight.

inline int64_t hh_mm_today_to_unix_sec(string_view hhmm_of_day)
{
    int sec_of_day = hh_mm_to_seconds(hhmm_of_day);
    if (sec_of_day == -1) {
        return -1;
    }
    return get_unix_sec_today_midnight() + sec_of_day;
}

If this function runs after 0 o'clock, trigger_time is calculated for the day after the expected day.
In my opinion, even if replica::on_config_sync runs as expected, when manual_compact tasks have to run across 0 o'clock, some replicas will not be able to get a compact_rule from the check_periodic_compact() function. So once a task is discarded after being popped from the task queue before 0 o'clock, it will not enter the task queue again.
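To make the rollover concrete, here is a minimal sketch of the calculation. It is a simplified model, not the Pegasus implementation: `parse_hh_mm`, `midnight_of`, `trigger_time_at`, and `is_due` are hypothetical helpers, "now" is passed in explicitly, day boundaries are assumed to be UTC, and the due check is reduced to a single comparison.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

constexpr int64_t kSecPerDay = 86400;

// Hypothetical simplified parser for "HH:MM"; returns -1 on malformed input.
inline int parse_hh_mm(const std::string &hhmm)
{
    int hh = 0;
    int mm = 0;
    char extra = 0;
    // Exactly two fields must match; a trailing character makes it three.
    if (std::sscanf(hhmm.c_str(), "%d:%d%c", &hh, &mm, &extra) != 2) {
        return -1;
    }
    if (hh < 0 || hh > 23 || mm < 0 || mm > 59) {
        return -1;
    }
    return hh * 3600 + mm * 60;
}

// Midnight of the day containing `now_sec` (assumption: UTC day boundaries).
inline int64_t midnight_of(int64_t now_sec) { return now_sec - now_sec % kSecPerDay; }

// Same shape as hh_mm_today_to_unix_sec(): "today's midnight" + seconds of day,
// where "today" is whatever day `now_sec` falls in.
inline int64_t trigger_time_at(int64_t now_sec, const std::string &hhmm)
{
    int sec_of_day = parse_hh_mm(hhmm);
    if (sec_of_day == -1) {
        return -1;
    }
    return midnight_of(now_sec) + sec_of_day;
}

// Simplified due check (assumption): due once `now_sec` reaches the trigger time.
inline bool is_due(int64_t now_sec, const std::string &hhmm)
{
    int64_t trigger = trigger_time_at(now_sec, hhmm);
    return trigger != -1 && now_sec >= trigger;
}
```

In this model a 23:59 rule evaluated at 23:59:30 is due, but the same rule evaluated one minute after midnight resolves to 23:59 of the new day, which is still in the future, so a replica whose task was dropped before midnight is not picked up again.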

acelyc111 · 2023-07-05T05:54:29Z

start_manual_compact_if_needed

I see, thanks for the clarification!