[fix](replica num) Fix the decrease in the number of replicas and uneven distribution of replicas among bes #48704

Open
wants to merge 4 commits into base: master

Conversation

@deardeng
Contributor

deardeng commented Mar 5, 2025

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

When the number of replicas of a table is reduced, the replicas to drop are currently chosen according to the load of the BE nodes. As a result, a heavily loaded BE can drop many replicas at the same time, which leaves the remaining replicas unevenly distributed and causes severe CPU imbalance across the BE cluster.

The fix is to count the tablets placed on each BE when the number of replicas is altered, and to use that per-BE mapping to decide which BEs to drop replicas from.
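
The idea can be illustrated with a minimal Python sketch. This is not the code in this PR (the Doris FE is written in Java); the function name `pick_replicas_to_drop`, the input shape, and the tie-breaking rule are assumptions made only for illustration. For each tablet that has too many replicas, it drops the replica on the BE that currently holds the most replicas and updates the per-BE count before moving on.

```python
from collections import Counter


def pick_replicas_to_drop(placement, target_replicas):
    """Choose (tablet_id, backend_id) pairs to drop.

    placement: dict mapping tablet id -> list of BE ids holding a replica
               (hypothetical input shape, assumed for this sketch).
    target_replicas: desired number of replicas per tablet.
    """
    # Count how many replicas each BE currently hosts.
    per_be = Counter(be for bes in placement.values() for be in bes)
    drops = []
    for tablet_id in sorted(placement):
        remaining = list(placement[tablet_id])  # copy; the input is not mutated
        while len(remaining) > target_replicas:
            # Drop from the BE with the highest count; ties go to the larger BE id
            # (an arbitrary but deterministic choice for this sketch).
            victim = max(remaining, key=lambda be: (per_be[be], be))
            remaining.remove(victim)
            per_be[victim] -= 1
            drops.append((tablet_id, victim))
    return drops


# Three tablets, each with a replica on BEs 1, 2 and 3; reduce to 2 replicas.
placement = {101: [1, 2, 3], 102: [1, 2, 3], 103: [1, 2, 3]}
print(pick_replicas_to_drop(placement, target_replicas=2))
# [(101, 3), (102, 2), (103, 1)]
```

In this example each BE gives up exactly one replica, so every BE still holds two afterwards. A rule that always picks the same "most loaded" BE would instead remove all three replicas from one node.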

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Mar 5, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@deardeng deardeng marked this pull request as draft March 5, 2025 11:43
@deardeng deardeng marked this pull request as ready for review March 9, 2025 17:08
@deardeng
Contributor Author

deardeng commented Mar 9, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 32472 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit fb243de499ae8095a530b6ae44d9e9da9b20e9f6, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17604	5161	5097	5097
q2	2054	287	167	167
q3	10449	1252	765	765
q4	10226	1037	534	534
q5	7609	2377	2327	2327
q6	196	168	143	143
q7	894	747	598	598
q8	9309	1230	1062	1062
q9	4921	4840	4795	4795
q10	6814	2285	1898	1898
q11	473	278	258	258
q12	340	357	214	214
q13	17760	3671	3073	3073
q14	228	229	221	221
q15	539	494	478	478
q16	618	617	587	587
q17	556	848	342	342
q18	6739	6444	6309	6309
q19	1358	959	543	543
q20	331	317	189	189
q21	2788	2098	1915	1915
q22	1071	984	957	957
Total cold run time: 102877 ms
Total hot run time: 32472 ms
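
The two totals can be reproduced from the Round 1 rows. The sketch below assumes that the first timing column is the cold run, the next two are hot runs, and the last is the smaller of the two hot runs. The bot output does not state this; it is inferred from the numbers, and the helper name `totals` is hypothetical.

```python
# Round 1 rows copied from the report above: query, then four timings in ms.
ROUND1 = """\
q1 17604 5161 5097 5097
q2 2054 287 167 167
q3 10449 1252 765 765
q4 10226 1037 534 534
q5 7609 2377 2327 2327
q6 196 168 143 143
q7 894 747 598 598
q8 9309 1230 1062 1062
q9 4921 4840 4795 4795
q10 6814 2285 1898 1898
q11 473 278 258 258
q12 340 357 214 214
q13 17760 3671 3073 3073
q14 228 229 221 221
q15 539 494 478 478
q16 618 617 587 587
q17 556 848 342 342
q18 6739 6444 6309 6309
q19 1358 959 543 543
q20 331 317 189 189
q21 2788 2098 1915 1915
q22 1071 984 957 957
"""


def totals(text):
    """Return (sum of first timing column, sum of last timing column)."""
    cold_total = hot_total = 0
    for line in text.splitlines():
        name, cold, hot1, hot2, best = line.split()
        # Assumption being checked: the last column is the faster hot run.
        assert int(best) == min(int(hot1), int(hot2)), name
        cold_total += int(cold)
        hot_total += int(best)
    return cold_total, hot_total


print(totals(ROUND1))  # (102877, 32472)
```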

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5134	5102	5138	5102
q2	233	331	227	227
q3	2155	2648	2281	2281
q4	1482	1913	1460	1460
q5	4257	4107	4143	4107
q6	206	164	122	122
q7	1904	1922	1751	1751
q8	2606	2634	2554	2554
q9	7315	7447	7298	7298
q10	3089	3281	2848	2848
q11	570	516	487	487
q12	703	815	647	647
q13	3530	3971	3479	3479
q14	289	291	268	268
q15	538	476	482	476
q16	671	703	642	642
q17	1160	1605	1371	1371
q18	8114	7523	7549	7523
q19	821	750	808	750
q20	1966	2089	1882	1882
q21	5468	4998	4822	4822
q22	1113	1046	1037	1037
Total cold run time: 53324 ms
Total hot run time: 51134 ms

@doris-robot

TPC-DS: Total hot run time: 191447 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fb243de499ae8095a530b6ae44d9e9da9b20e9f6, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query1	1400	1017	973	973
query2	6143	1929	1907	1907
query3	11153	4676	4601	4601
query4	54291	24909	22937	22937
query5	5293	525	487	487
query6	410	202	192	192
query7	5333	502	297	297
query8	326	250	247	247
query9	7051	2511	2508	2508
query10	431	305	253	253
query11	15651	14981	14884	14884
query12	154	112	110	110
query13	1278	512	427	427
query14	10161	7449	6252	6252
query15	206	194	188	188
query16	6473	673	494	494
query17	1064	702	550	550
query18	852	375	312	312
query19	206	189	177	177
query20	129	127	130	127
query21	207	125	103	103
query22	4565	4481	4256	4256
query23	34081	33203	33355	33203
query24	5642	2395	2446	2395
query25	472	469	405	405
query26	644	281	156	156
query27	1652	490	338	338
query28	2807	2417	2432	2417
query29	577	577	415	415
query30	284	227	197	197
query31	907	871	819	819
query32	84	66	63	63
query33	453	378	305	305
query34	748	847	537	537
query35	815	836	747	747
query36	924	999	893	893
query37	114	102	81	81
query38	4298	4225	4230	4225
query39	1478	1448	1478	1448
query40	235	114	103	103
query41	51	53	48	48
query42	122	117	103	103
query43	501	516	491	491
query44	1302	783	796	783
query45	185	173	169	169
query46	851	1039	647	647
query47	1831	1917	1791	1791
query48	385	434	304	304
query49	707	516	441	441
query50	717	753	421	421
query51	4308	4270	4211	4211
query52	101	99	92	92
query53	230	257	185	185
query54	494	485	419	419
query55	93	84	84	84
query56	279	282	268	268
query57	1140	1174	1124	1124
query58	247	237	233	233
query59	2883	2802	2727	2727
query60	293	274	264	264
query61	149	145	149	145
query62	749	780	680	680
query63	227	199	212	199
query64	1524	1193	701	701
query65	4604	4455	4461	4455
query66	736	392	294	294
query67	15596	15591	15352	15352
query68	8407	916	499	499
query69	551	299	274	274
query70	1230	1148	1108	1108
query71	496	288	277	277
query72	5570	3585	3952	3585
query73	1120	749	345	345
query74	9410	9145	8971	8971
query75	3771	3180	2718	2718
query76	4455	1178	744	744
query77	579	364	278	278
query78	10016	10202	9283	9283
query79	1988	817	579	579
query80	860	534	508	508
query81	479	250	227	227
query82	624	123	91	91
query83	288	170	152	152
query84	279	99	68	68
query85	780	351	317	317
query86	375	297	290	290
query87	4364	4572	4433	4433
query88	2799	2210	2235	2210
query89	428	385	286	286
query90	1951	215	219	215
query91	134	146	108	108
query92	80	60	64	60
query93	1218	1030	580	580
query94	690	413	293	293
query95	349	270	269	269
query96	484	565	277	277
query97	3358	3374	3235	3235
query98	224	209	203	203
query99	1427	1381	1304	1304
Total cold run time: 298081 ms
Total hot run time: 191447 ms

@doris-robot

ClickBench: Total hot run time: 30.3 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit fb243de499ae8095a530b6ae44d9e9da9b20e9f6, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.04	0.03	0.03
query2	0.07	0.04	0.05
query3	0.24	0.06	0.06
query4	1.61	0.10	0.10
query5	0.57	0.55	0.57
query6	1.21	0.70	0.73
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.62	0.52	0.51
query10	0.58	0.62	0.58
query11	0.15	0.10	0.10
query12	0.14	0.11	0.11
query13	0.62	0.60	0.59
query14	2.81	2.70	2.69
query15	0.93	0.84	0.85
query16	0.40	0.38	0.38
query17	1.04	1.03	1.07
query18	0.21	0.20	0.20
query19	1.88	1.80	2.01
query20	0.01	0.01	0.01
query21	15.41	0.92	0.55
query22	0.76	1.22	0.68
query23	14.91	1.37	0.60
query24	9.12	0.65	0.29
query25	0.29	0.14	0.09
query26	0.37	0.16	0.14
query27	0.06	0.05	0.05
query28	8.12	0.86	0.43
query29	12.87	3.93	3.29
query30	0.25	0.09	0.07
query31	2.83	0.60	0.38
query32	3.23	0.54	0.47
query33	2.98	2.99	3.02
query34	15.93	5.13	4.55
query35	4.58	4.58	4.52
query36	0.67	0.49	0.49
query37	0.09	0.07	0.06
query38	0.05	0.04	0.04
query39	0.03	0.03	0.03
query40	0.17	0.14	0.13
query41	0.08	0.03	0.03
query42	0.04	0.03	0.02
query43	0.03	0.03	0.03
Total cold run time: 106.06 s
Total hot run time: 30.3 s
