fix: some replicas never start manual compact after zero o'clock #1556

Open: wants to merge 7 commits into base: master

ninsmiracle
Contributor

@ninsmiracle ninsmiracle commented Jul 4, 2023

What problem does this PR solve?

#1479

What is changed and how does it work?

In the existing logic, if the number of replicas for a particular table undergoing manual_compact on the current node exceeds the limit, the LPC_MANUAL_COMPACT task will be discarded once it enters the task queue. In the new logic, the task will be delayed by 60 seconds before being reinserted into the queue.

The existing filter that decides whether a replica should be added to the queue already prevents replicas that have compacted on the same day from entering the queue again. A small change to the task dequeue logic is therefore enough to ensure that replicas due for compaction are not unexpectedly skipped because of the trigger-time calculation at zero o'clock.
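
The difference between the two behaviors can be sketched as a small simulation. This is not the Pegasus code: `CompactTask`, `simulate`, `max_concurrent`, and the discrete-time queue are hypothetical stand-ins for the task queue and the per-node limit. The only detail taken from the description above is the 60-second delay before a rejected task is reinserted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for one LPC_MANUAL_COMPACT task.
struct CompactTask {
    int replica_id;
    int64_t run_at_sec; // simulated time at which the task is dequeued
};

// Orders the queue so the earliest task (lowest replica id on ties) is on top.
struct RunsLater {
    bool operator()(const CompactTask &a, const CompactTask &b) const {
        if (a.run_at_sec != b.run_at_sec) {
            return a.run_at_sec > b.run_at_sec;
        }
        return a.replica_id > b.replica_id;
    }
};

// Simulates dequeueing under a concurrency limit.
// retry_on_limit == false models discarding a task that hits the limit;
// retry_on_limit == true models re-enqueueing it 60 seconds later.
// Returns the ids of the replicas that actually compacted, in start order.
std::vector<int> simulate(int replica_count,
                          int max_concurrent,
                          int64_t compact_cost_sec,
                          bool retry_on_limit)
{
    assert(max_concurrent > 0 && compact_cost_sec >= 0);
    std::priority_queue<CompactTask, std::vector<CompactTask>, RunsLater> queue;
    for (int id = 0; id < replica_count; ++id) {
        queue.push(CompactTask{id, 0}); // every replica is enqueued at t = 0
    }

    std::vector<int64_t> running_until; // end times of in-flight compactions
    std::vector<int> compacted;
    while (!queue.empty()) {
        const CompactTask task = queue.top();
        queue.pop();
        const int64_t now = task.run_at_sec;

        // Forget compactions that have finished by now.
        running_until.erase(std::remove_if(running_until.begin(),
                                           running_until.end(),
                                           [now](int64_t end) { return end <= now; }),
                            running_until.end());

        if (running_until.size() < static_cast<std::size_t>(max_concurrent)) {
            running_until.push_back(now + compact_cost_sec);
            compacted.push_back(task.replica_id);
        } else if (retry_on_limit) {
            queue.push(CompactTask{task.replica_id, now + 60});
        }
        // Otherwise the task is dropped and nothing re-adds it in this model.
    }
    return compacted;
}
```

With three replicas, a limit of one, and a 45-second compaction, `simulate(3, 1, 45, false)` compacts only replica 0, while `simulate(3, 1, 45, true)` compacts replicas 0, 1, and 2 at one-minute intervals.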

Checklist

Tests
  • Manual test (add detailed scripts or steps below)
    I created a Pegasus app and set the manual compact time to 23:59, so that some replicas have to do manual compaction after zero o'clock.

I got results like this:

33872:D2023-07-03 23:59:15.411 (1688399955411071211 15114) replica.compact0.04040001000001b4: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34071:D2023-07-04 00:00:15.411 (1688400015411226976 15118) replica.compact3.0407000100000001: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34249:D2023-07-04 00:01:15.411 (1688400075411318261 15117) replica.compact2.0407000400000002: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34459:D2023-07-04 00:02:15.411 (1688400135411507719 15120) replica.compact5.0407000500000003: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34609:D2023-07-04 00:03:15.411 (1688400195411521826 15117) replica.compact2.0407000600000004: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34803:D2023-07-04 00:04:15.411 (1688400255411667156 15118) replica.compact3.0407000000000004: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
34982:D2023-07-04 00:05:15.411 (1688400315411802093 15122) replica.compact7.0407000500000004: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
35157:D2023-07-04 00:06:15.411 (1688400375411913107 15116) replica.compact1.0407000200000003: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
35362:D2023-07-04 00:07:15.412 (1688400435412059719 15118) replica.compact3.0407000000000006: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
35515:D2023-07-04 00:08:15.412 (1688400495412150909 15119) replica.compact4.0407000200000004: pegasus_manual_compact_service.cpp:320:begin_manual_compact(): [[email protected]:37801] start to execute manual compaction
-bash-4.4$ grep -rn "end_manual_compact" *
33896:D2023-07-03 23:59:15.481 (1688399955481189237 15114) replica.compact0.04040001000001b4: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 45ms
34094:D2023-07-04 00:00:15.498 (1688400015498819014 15118) replica.compact3.0407000100000001: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 60ms
34272:D2023-07-04 00:01:15.488 (1688400075488255418 15117) replica.compact2.0407000400000002: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 45ms
34479:D2023-07-04 00:02:15.503 (1688400135503458132 15120) replica.compact5.0407000500000003: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 57ms
34631:D2023-07-04 00:03:15.483 (1688400195483022456 15117) replica.compact2.0407000600000004: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 45ms
34824:D2023-07-04 00:04:15.494 (1688400255494586453 15118) replica.compact3.0407000000000004: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 56ms
35002:D2023-07-04 00:05:15.515 (1688400315515958178 15122) replica.compact7.0407000500000004: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 76ms
35176:D2023-07-04 00:06:15.486 (1688400375486139793 15116) replica.compact1.0407000200000003: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 50ms
35380:D2023-07-04 00:07:15.487 (1688400435487399087 15118) replica.compact3.0407000000000006: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 48ms
35532:D2023-07-04 00:08:15.508 (1688400495508893878 15119) replica.compact4.0407000200000004: pegasus_manual_compact_service.cpp:328:end_manual_compact(): [[email protected]:37801] finish to execute manual compaction, time_used = 67ms

As shown in the above log, the adjusted compaction results are as expected. Replicas that had compact operations scheduled after midnight were successfully processed. Furthermore, there were no replicas that underwent compaction multiple times, ensuring a non-repetitive compaction process.

@github-actions github-actions bot added the cpp label Jul 4, 2023
@acelyc111 acelyc111 changed the title fix:some replica never start manaul compact after zero o'clock fix: some replicas never start manaul compact after zero o'clock Jul 4, 2023
@acelyc111
Member

acelyc111 commented Jul 4, 2023

the LPC_MANUAL_COMPACT task will be discarded once it enters the task queue. In the new logic, the task will be delayed by 60 seconds before being reinserted into the queue.

If the current task has been discarded, it will be triggered again in the next config_sync (the call chain is replica::on_config_sync()->replica::update_app_envs()->pegasus_server_impl::update_app_envs()->pegasus_manual_compact_service::start_manual_compact_if_needed()->dsn::tasking::enqueue the compact task), so the task will not be lost IMO.

Does replica::on_config_sync work as expected?

@ninsmiracle
Contributor Author

ninsmiracle commented Jul 5, 2023

This is exactly the problem. pegasus_manual_compact_service::start_manual_compact_if_needed() uses check_periodic_compact() to evaluate the trigger_time, and there is a small problem in how trigger_time is calculated this way. As the following code shows, the preset hhmm time is added to unix_sec_today_midnight.

inline int64_t hh_mm_today_to_unix_sec(string_view hhmm_of_day)
{
    int sec_of_day = hh_mm_to_seconds(hhmm_of_day);
    if (sec_of_day == -1) {
        return -1;
    }
    return get_unix_sec_today_midnight() + sec_of_day;
}

If this function runs after 0 o'clock, trigger_time is computed for the day after the expected day.
In my opinion, even if replica::on_config_sync runs as expected, when manual_compact tasks have to run across 0 o'clock, some replicas will not be able to get a compact_rule through the check_periodic_compact() function. So once a task has been popped from the task queue and discarded before 0 o'clock, it will never enter the task queue again.
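
A worked example of the rollover, as a rough sketch. The helper below mirrors the shape of hh_mm_today_to_unix_sec but takes the current time as a parameter and treats days as fixed 86400-second UTC days. Both are simplifications, and `midnight_of` and `trigger_time_seen_at` are hypothetical names, not Pegasus functions.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kSecondsPerDay = 86400;

// Midnight of the (UTC) day that contains now_sec. Time zones are ignored here.
inline int64_t midnight_of(int64_t now_sec)
{
    return now_sec - now_sec % kSecondsPerDay;
}

// The trigger time as it would be computed by code running at now_sec:
// "today's midnight" plus the configured hh:mm offset.
inline int64_t trigger_time_seen_at(int64_t now_sec, int hour, int minute)
{
    assert(0 <= hour && hour < 24 && 0 <= minute && minute < 60);
    return midnight_of(now_sec) + hour * 3600 + minute * 60;
}
```

For a configured time of 23:59, a call made at 23:59:15 yields a trigger time 15 seconds in the past, so the rule can match. The same call made at 00:00:15 the next day yields a trigger time exactly 86400 seconds later, which is still almost a full day in the future at that moment.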

@acelyc111
Member

start_manual_compact_if_needed

I see, thanks for the clarification!

dsn::tasking::enqueue(LPC_MANUAL_COMPACT,
&_app->_tracker,
[this, options]() {
_pfc_manual_compact_enqueue_count->decrement();
Member

Need to increment before enqueue too?


// bug fix : https://github.com/apache/incubator-pegasus/issues/1479
// now_timestamp return dsn_now_ms()
int loop_enqueue_time = now_timestamp() + 60 * 1000;
Member

The execution time is 60 seconds later, but the enqueue time is now, right?

ninsmiracle and others added 2 commits July 6, 2023 12:02
},
0,
std::chrono::seconds(60));
LOG_INFO_PREFIX("retry 60 seconds later,now task enqueue time({})ms",
Member

Another question: when should retrying give up?
Suppose a rare case: there are 24 replicas of a table on a server, and each one takes more than 1 hour to manual compact. Will the queue grow indefinitely?
