[fix](replica num) Fix the decrease in the number of replicas and uneven distribution of replicas among bes #48704

Merged
merged 6 commits into from
Mar 12, 2025

Conversation

deardeng
Contributor

@deardeng deardeng commented Mar 5, 2025

…ven distribution of replicas among bes

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

When the number of replicas of a table is reduced, the replicas to drop are currently chosen based on the load of the BE nodes. As a result, a heavily loaded BE can drop many replicas at the same time, which leads to severe CPU imbalance across the BE cluster.

The fix counts how many of the table's tablets are placed on each BE when the replica number is altered. This per-BE mapping then determines which BEs replicas are dropped from.
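
The idea can be illustrated with a minimal sketch. It is not the code from this PR: the function name `choose_replicas_to_drop`, the input layout (tablet id mapped to a list of integer BE ids), and the tie-breaking rule are all assumptions made for illustration. The sketch counts replicas per BE, then, for each tablet with surplus replicas, drops the replica on the BE that currently holds the most, updating the count after every drop.

```python
from collections import Counter


def choose_replicas_to_drop(tablet_replicas, target_replica_num):
    """Pick (tablet_id, be_id) pairs to drop so each tablet keeps
    target_replica_num replicas.

    tablet_replicas: dict mapping tablet id -> list of integer BE ids
    (hypothetical layout, assumed for this sketch).
    """
    # Per-BE count of replicas across all tablets of the table.
    be_tablet_count = Counter(
        be for bes in tablet_replicas.values() for be in bes
    )
    to_drop = []
    for tablet_id in sorted(tablet_replicas):
        remaining = list(tablet_replicas[tablet_id])
        while len(remaining) > target_replica_num:
            # Drop from the BE holding the most replicas right now;
            # ties go to the smaller BE id (an arbitrary assumption).
            victim = max(remaining, key=lambda be: (be_tablet_count[be], -be))
            remaining.remove(victim)
            be_tablet_count[victim] -= 1
            to_drop.append((tablet_id, victim))
    return to_drop, dict(be_tablet_count)


# Three tablets, each with replicas on BEs 101, 102 and 103, reduced to 2 replicas.
drops, counts = choose_replicas_to_drop(
    {1: [101, 102, 103], 2: [101, 102, 103], 3: [101, 102, 103]}, 2
)
```

Because the count is decremented after each drop, successive drops tend to spread across BEs instead of concentrating on one node, which is the imbalance the problem summary describes.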

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Mar 5, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@deardeng deardeng marked this pull request as draft March 5, 2025 11:43
@deardeng deardeng marked this pull request as ready for review March 9, 2025 17:08
@deardeng
Contributor Author

deardeng commented Mar 9, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 32472 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit fb243de499ae8095a530b6ae44d9e9da9b20e9f6, data reload: false

------ Round 1 ----------------------------------
q1	17604	5161	5097	5097
q2	2054	287	167	167
q3	10449	1252	765	765
q4	10226	1037	534	534
q5	7609	2377	2327	2327
q6	196	168	143	143
q7	894	747	598	598
q8	9309	1230	1062	1062
q9	4921	4840	4795	4795
q10	6814	2285	1898	1898
q11	473	278	258	258
q12	340	357	214	214
q13	17760	3671	3073	3073
q14	228	229	221	221
q15	539	494	478	478
q16	618	617	587	587
q17	556	848	342	342
q18	6739	6444	6309	6309
q19	1358	959	543	543
q20	331	317	189	189
q21	2788	2098	1915	1915
q22	1071	984	957	957
Total cold run time: 102877 ms
Total hot run time: 32472 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5134	5102	5138	5102
q2	233	331	227	227
q3	2155	2648	2281	2281
q4	1482	1913	1460	1460
q5	4257	4107	4143	4107
q6	206	164	122	122
q7	1904	1922	1751	1751
q8	2606	2634	2554	2554
q9	7315	7447	7298	7298
q10	3089	3281	2848	2848
q11	570	516	487	487
q12	703	815	647	647
q13	3530	3971	3479	3479
q14	289	291	268	268
q15	538	476	482	476
q16	671	703	642	642
q17	1160	1605	1371	1371
q18	8114	7523	7549	7523
q19	821	750	808	750
q20	1966	2089	1882	1882
q21	5468	4998	4822	4822
q22	1113	1046	1037	1037
Total cold run time: 53324 ms
Total hot run time: 51134 ms

@doris-robot

TPC-DS: Total hot run time: 191447 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fb243de499ae8095a530b6ae44d9e9da9b20e9f6, data reload: false

query1	1400	1017	973	973
query2	6143	1929	1907	1907
query3	11153	4676	4601	4601
query4	54291	24909	22937	22937
query5	5293	525	487	487
query6	410	202	192	192
query7	5333	502	297	297
query8	326	250	247	247
query9	7051	2511	2508	2508
query10	431	305	253	253
query11	15651	14981	14884	14884
query12	154	112	110	110
query13	1278	512	427	427
query14	10161	7449	6252	6252
query15	206	194	188	188
query16	6473	673	494	494
query17	1064	702	550	550
query18	852	375	312	312
query19	206	189	177	177
query20	129	127	130	127
query21	207	125	103	103
query22	4565	4481	4256	4256
query23	34081	33203	33355	33203
query24	5642	2395	2446	2395
query25	472	469	405	405
query26	644	281	156	156
query27	1652	490	338	338
query28	2807	2417	2432	2417
query29	577	577	415	415
query30	284	227	197	197
query31	907	871	819	819
query32	84	66	63	63
query33	453	378	305	305
query34	748	847	537	537
query35	815	836	747	747
query36	924	999	893	893
query37	114	102	81	81
query38	4298	4225	4230	4225
query39	1478	1448	1478	1448
query40	235	114	103	103
query41	51	53	48	48
query42	122	117	103	103
query43	501	516	491	491
query44	1302	783	796	783
query45	185	173	169	169
query46	851	1039	647	647
query47	1831	1917	1791	1791
query48	385	434	304	304
query49	707	516	441	441
query50	717	753	421	421
query51	4308	4270	4211	4211
query52	101	99	92	92
query53	230	257	185	185
query54	494	485	419	419
query55	93	84	84	84
query56	279	282	268	268
query57	1140	1174	1124	1124
query58	247	237	233	233
query59	2883	2802	2727	2727
query60	293	274	264	264
query61	149	145	149	145
query62	749	780	680	680
query63	227	199	212	199
query64	1524	1193	701	701
query65	4604	4455	4461	4455
query66	736	392	294	294
query67	15596	15591	15352	15352
query68	8407	916	499	499
query69	551	299	274	274
query70	1230	1148	1108	1108
query71	496	288	277	277
query72	5570	3585	3952	3585
query73	1120	749	345	345
query74	9410	9145	8971	8971
query75	3771	3180	2718	2718
query76	4455	1178	744	744
query77	579	364	278	278
query78	10016	10202	9283	9283
query79	1988	817	579	579
query80	860	534	508	508
query81	479	250	227	227
query82	624	123	91	91
query83	288	170	152	152
query84	279	99	68	68
query85	780	351	317	317
query86	375	297	290	290
query87	4364	4572	4433	4433
query88	2799	2210	2235	2210
query89	428	385	286	286
query90	1951	215	219	215
query91	134	146	108	108
query92	80	60	64	60
query93	1218	1030	580	580
query94	690	413	293	293
query95	349	270	269	269
query96	484	565	277	277
query97	3358	3374	3235	3235
query98	224	209	203	203
query99	1427	1381	1304	1304
Total cold run time: 298081 ms
Total hot run time: 191447 ms

@doris-robot

ClickBench: Total hot run time: 30.3 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit fb243de499ae8095a530b6ae44d9e9da9b20e9f6, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.04	0.05
query3	0.24	0.06	0.06
query4	1.61	0.10	0.10
query5	0.57	0.55	0.57
query6	1.21	0.70	0.73
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.62	0.52	0.51
query10	0.58	0.62	0.58
query11	0.15	0.10	0.10
query12	0.14	0.11	0.11
query13	0.62	0.60	0.59
query14	2.81	2.70	2.69
query15	0.93	0.84	0.85
query16	0.40	0.38	0.38
query17	1.04	1.03	1.07
query18	0.21	0.20	0.20
query19	1.88	1.80	2.01
query20	0.01	0.01	0.01
query21	15.41	0.92	0.55
query22	0.76	1.22	0.68
query23	14.91	1.37	0.60
query24	9.12	0.65	0.29
query25	0.29	0.14	0.09
query26	0.37	0.16	0.14
query27	0.06	0.05	0.05
query28	8.12	0.86	0.43
query29	12.87	3.93	3.29
query30	0.25	0.09	0.07
query31	2.83	0.60	0.38
query32	3.23	0.54	0.47
query33	2.98	2.99	3.02
query34	15.93	5.13	4.55
query35	4.58	4.58	4.52
query36	0.67	0.49	0.49
query37	0.09	0.07	0.06
query38	0.05	0.04	0.04
query39	0.03	0.03	0.03
query40	0.17	0.14	0.13
query41	0.08	0.03	0.03
query42	0.04	0.03	0.02
query43	0.03	0.03	0.03
Total cold run time: 106.06 s
Total hot run time: 30.3 s

Contributor

@yujun777 yujun777 left a comment

need update

@deardeng
Contributor Author

run buildall

@deardeng
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32381 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d472b7b2cfeb0530151d9baeaaee8966ac733224, data reload: false

------ Round 1 ----------------------------------
q1	17636	5128	5042	5042
q2	2040	286	158	158
q3	10425	1288	698	698
q4	10224	1018	516	516
q5	7526	2351	2330	2330
q6	186	165	131	131
q7	902	740	582	582
q8	9287	1274	1032	1032
q9	5071	4814	4703	4703
q10	6802	2309	1894	1894
q11	474	272	259	259
q12	345	348	213	213
q13	17758	3699	3106	3106
q14	227	233	209	209
q15	529	494	491	491
q16	620	614	594	594
q17	558	854	343	343
q18	6783	6583	6449	6449
q19	1216	943	539	539
q20	327	325	195	195
q21	2818	2094	1894	1894
q22	1056	1003	1010	1003
Total cold run time: 102810 ms
Total hot run time: 32381 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5152	5122	5140	5122
q2	243	322	240	240
q3	2153	2685	2295	2295
q4	1500	1817	1351	1351
q5	4219	4131	4144	4131
q6	203	165	123	123
q7	1859	1853	1842	1842
q8	2604	2611	2499	2499
q9	7328	7121	7124	7121
q10	2951	3227	2743	2743
q11	558	496	489	489
q12	693	778	618	618
q13	3435	3917	3233	3233
q14	298	287	268	268
q15	511	481	467	467
q16	676	698	649	649
q17	1125	1600	1318	1318
q18	7772	7551	7377	7377
q19	780	764	847	764
q20	1958	1993	1877	1877
q21	5450	5040	4806	4806
q22	1125	1057	1007	1007
Total cold run time: 52593 ms
Total hot run time: 50340 ms

Contributor

@yujun777 yujun777 left a comment

LGTM

Contributor

PR approved by anyone and no changes requested.

@doris-robot

TPC-DS: Total hot run time: 185112 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d472b7b2cfeb0530151d9baeaaee8966ac733224, data reload: false

query1	1005	391	367	367
query2	6537	1962	1950	1950
query3	6800	216	219	216
query4	26803	23439	23106	23106
query5	4391	697	491	491
query6	305	214	178	178
query7	4608	497	282	282
query8	273	225	222	222
query9	8568	2622	2617	2617
query10	483	306	249	249
query11	15540	15371	14904	14904
query12	162	111	109	109
query13	1658	531	402	402
query14	8852	7245	6407	6407
query15	210	197	166	166
query16	7426	677	494	494
query17	1221	731	568	568
query18	1956	407	329	329
query19	191	189	160	160
query20	167	112	115	112
query21	207	126	104	104
query22	4008	4178	4075	4075
query23	34029	33194	32842	32842
query24	7720	2401	2365	2365
query25	526	449	391	391
query26	1231	280	156	156
query27	2097	491	336	336
query28	3914	2418	2422	2418
query29	770	560	427	427
query30	281	216	194	194
query31	947	871	779	779
query32	71	61	64	61
query33	560	356	309	309
query34	777	854	500	500
query35	804	812	762	762
query36	962	1001	911	911
query37	111	106	76	76
query38	4142	4238	4085	4085
query39	1442	1415	1367	1367
query40	215	118	104	104
query41	54	53	50	50
query42	114	107	102	102
query43	493	523	477	477
query44	1302	802	783	783
query45	174	172	168	168
query46	834	1030	623	623
query47	1740	1769	1704	1704
query48	375	418	302	302
query49	764	517	444	444
query50	670	732	417	417
query51	4226	4246	4119	4119
query52	104	109	98	98
query53	233	253	208	208
query54	488	499	400	400
query55	84	82	82	82
query56	270	270	260	260
query57	1151	1151	1063	1063
query58	249	254	236	236
query59	2486	2683	2574	2574
query60	271	273	262	262
query61	122	120	123	120
query62	786	756	662	662
query63	228	188	187	187
query64	4331	1012	689	689
query65	4493	4338	4359	4338
query66	1138	425	316	316
query67	15702	15499	15286	15286
query68	7107	863	502	502
query69	475	348	256	256
query70	1153	1120	1135	1120
query71	411	301	273	273
query72	5579	3577	3723	3577
query73	754	734	344	344
query74	9017	9272	8707	8707
query75	3165	3183	2712	2712
query76	3183	1196	763	763
query77	478	393	298	298
query78	9885	10163	9322	9322
query79	1877	900	583	583
query80	734	523	448	448
query81	513	262	222	222
query82	408	128	102	102
query83	185	178	165	165
query84	245	97	80	80
query85	836	418	379	379
query86	420	299	285	285
query87	4445	4631	4320	4320
query88	2932	2232	2252	2232
query89	391	315	283	283
query90	1908	211	215	211
query91	138	131	111	111
query92	76	61	57	57
query93	1673	1067	579	579
query94	713	414	306	306
query95	360	269	263	263
query96	483	554	278	278
query97	3273	3388	3284	3284
query98	218	204	195	195
query99	1355	1380	1294	1294
Total cold run time: 269242 ms
Total hot run time: 185112 ms

@doris-robot

ClickBench: Total hot run time: 30.75 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit d472b7b2cfeb0530151d9baeaaee8966ac733224, data reload: false

query1	0.04	0.04	0.03
query2	0.07	0.04	0.03
query3	0.23	0.07	0.07
query4	1.65	0.10	0.10
query5	0.57	0.55	0.56
query6	1.17	0.71	0.72
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.58	0.52	0.52
query10	0.57	0.61	0.59
query11	0.16	0.11	0.11
query12	0.14	0.12	0.11
query13	0.60	0.62	0.60
query14	2.67	2.69	2.81
query15	0.92	0.84	0.85
query16	0.38	0.38	0.38
query17	1.02	1.03	1.02
query18	0.21	0.20	0.20
query19	1.92	1.81	2.00
query20	0.01	0.01	0.01
query21	15.35	0.94	0.55
query22	0.76	1.28	0.66
query23	14.82	1.39	0.62
query24	7.16	1.54	0.71
query25	0.57	0.19	0.16
query26	0.58	0.17	0.14
query27	0.05	0.04	0.04
query28	9.05	0.90	0.42
query29	12.55	4.02	3.30
query30	0.25	0.08	0.06
query31	2.84	0.58	0.39
query32	3.22	0.55	0.46
query33	2.96	3.09	2.99
query34	15.82	5.17	4.47
query35	4.56	4.58	4.56
query36	0.66	0.50	0.48
query37	0.09	0.07	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.16	0.14	0.13
query41	0.08	0.02	0.03
query42	0.03	0.02	0.02
query43	0.03	0.04	0.03
Total cold run time: 104.64 s
Total hot run time: 30.75 s

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 11, 2025
Contributor

PR approved by at least one committer and no changes requested.

@dataroaring dataroaring merged commit 74c8eed into apache:master Mar 12, 2025
25 of 26 checks passed
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
Labels
approved Indicates a PR has been approved by one committer. reviewed

5 participants