[feature](inverted index) Add a basic tokenizer #48716

Open
wants to merge 1 commit into
base: master

Conversation

zzzxl1993
Contributor

@zzzxl1993 zzzxl1993 commented Mar 5, 2025

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

  1. Implement a basic tokenizer that efficiently performs simple segmentation of both Chinese and English text.
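
For readers unfamiliar with this kind of segmentation, here is a minimal sketch of what a "basic" tokenizer for mixed Chinese and English text could look like. It is not the code in this PR: the function names `decode_utf8` and `tokenize`, the choice to lowercase ASCII words, and the restriction to the CJK Unified Ideographs block (U+4E00 to U+9FFF) are all assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode one UTF-8 code point starting at text[pos]
// and advance pos past it. Malformed input yields U+FFFD and consumes one byte.
char32_t decode_utf8(const std::string& text, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(text[pos]);
    if (lead < 0x80) {  // plain ASCII
        ++pos;
        return lead;
    }
    std::size_t len = 0;
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0) {
        len = 2; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        len = 3; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        len = 4; cp = lead & 0x07;
    } else {  // stray continuation byte or invalid lead byte
        ++pos;
        return 0xFFFD;
    }
    if (pos + len > text.size()) {  // truncated sequence
        ++pos;
        return 0xFFFD;
    }
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(text[pos + i]);
        if ((c & 0xC0) != 0x80) {  // not a continuation byte
            ++pos;
            return 0xFFFD;
        }
        cp = (cp << 6) | (c & 0x3F);
    }
    pos += len;
    return cp;
}

// Hypothetical tokenizer: runs of ASCII letters/digits become one lowercased
// token, each CJK ideograph becomes its own token, everything else separates.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string word;
    auto flush = [&]() {
        if (!word.empty()) {
            tokens.push_back(word);
            word.clear();
        }
    };
    std::size_t pos = 0;
    while (pos < text.size()) {
        const std::size_t start = pos;
        const char32_t cp = decode_utf8(text, pos);
        if (cp < 0x80 && std::isalnum(static_cast<unsigned char>(cp))) {
            word.push_back(static_cast<char>(
                std::tolower(static_cast<unsigned char>(cp))));
        } else if (cp >= 0x4E00 && cp <= 0x9FFF) {
            flush();
            tokens.push_back(text.substr(start, pos - start));
        } else {
            flush();
        }
    }
    flush();
    return tokens;
}
```

Under these assumptions, `tokenize("Hello, 世界 2025!")` yields the four tokens `hello`, `世`, `界`, and `2025`. A production tokenizer would likely cover more scripts and handle case folding beyond ASCII.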

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@zzzxl1993
Contributor Author

run buildall

@zzzxl1993 zzzxl1993 force-pushed the 202503052301 branch 2 times, most recently from 47045de to 4aa64a1 Compare March 5, 2025 15:12
@zzzxl1993
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32726 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4aa64a1b4e55b5cd5581b958c36e61306ee5c9b0, data reload: false

------ Round 1 ----------------------------------
q1	17578	5313	5137	5137
q2	2049	312	169	169
q3	10390	1359	709	709
q4	10221	1043	557	557
q5	7536	2459	2345	2345
q6	192	170	139	139
q7	899	733	613	613
q8	9301	1293	1064	1064
q9	4951	4953	4752	4752
q10	6815	2314	1879	1879
q11	472	277	260	260
q12	344	356	219	219
q13	17763	3748	3125	3125
q14	234	231	208	208
q15	554	493	476	476
q16	621	634	604	604
q17	577	864	348	348
q18	6903	6529	6416	6416
q19	1766	956	546	546
q20	309	316	192	192
q21	2790	2311	1995	1995
q22	1060	993	973	973
Total cold run time: 103325 ms
Total hot run time: 32726 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5218	5196	5122	5122
q2	235	324	232	232
q3	2182	2682	2319	2319
q4	1441	1806	1370	1370
q5	4265	4121	4191	4121
q6	207	163	125	125
q7	1991	1936	1761	1761
q8	2652	2691	2606	2606
q9	7184	7082	7139	7082
q10	2991	3179	2750	2750
q11	578	503	490	490
q12	728	764	633	633
q13	3500	3876	3250	3250
q14	287	279	271	271
q15	539	469	474	469
q16	687	683	644	644
q17	1124	1545	1388	1388
q18	7792	7490	7361	7361
q19	815	872	1282	872
q20	2008	1993	1874	1874
q21	5500	4939	4948	4939
q22	1120	1047	979	979
Total cold run time: 53044 ms
Total hot run time: 50658 ms
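
The bot output does not label its columns. Judging from the numbers alone, each row appears to hold a cold run, two hot runs, and the smaller of the two hot runs. Summing the first column of Round 1 gives the reported 103325 ms, and summing the last column gives 32726 ms. The sketch below reproduces that arithmetic on three rows copied from Round 1. The struct and function names are hypothetical, and the column interpretation is inferred from the data, not taken from the benchmark scripts.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One row of the timing table (column meaning is an inference, see above).
struct QueryTiming {
    std::string name;
    long cold_ms;  // first (cold) run
    long hot1_ms;  // first hot run
    long hot2_ms;  // second hot run
};

struct Totals {
    long cold_ms;
    long hot_ms;
};

// Cold total is the sum of the cold runs; hot total is the sum of the
// per-query minimum of the two hot runs.
Totals summarize(const std::vector<QueryTiming>& rows) {
    Totals totals{0, 0};
    for (const auto& row : rows) {
        totals.cold_ms += row.cold_ms;
        totals.hot_ms += std::min(row.hot1_ms, row.hot2_ms);
    }
    return totals;
}

// Three rows copied from Round 1 above (q1, q2, q6), used only as sample data.
std::vector<QueryTiming> sample_rows() {
    return {
        {"q1", 17578, 5313, 5137},
        {"q2", 2049, 312, 169},
        {"q6", 192, 170, 139},
    };
}
```

For the three sample rows, `summarize(sample_rows())` gives a cold total of 19819 ms and a hot total of 5445 ms. Applying the same function to all 22 rows of a round should match the totals the bot prints.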

@doris-robot

TPC-DS: Total hot run time: 183751 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4aa64a1b4e55b5cd5581b958c36e61306ee5c9b0, data reload: false

query1	990	379	381	379
query2	6536	1869	1851	1851
query3	6799	215	206	206
query4	27216	23518	22949	22949
query5	4372	669	477	477
query6	307	209	204	204
query7	4636	500	294	294
query8	294	251	233	233
query9	8638	2622	2604	2604
query10	445	315	267	267
query11	15407	15259	14859	14859
query12	164	112	108	108
query13	1666	533	414	414
query14	9769	6107	6111	6107
query15	206	196	169	169
query16	7420	639	467	467
query17	1176	703	559	559
query18	1949	399	297	297
query19	190	193	158	158
query20	127	114	116	114
query21	212	120	103	103
query22	4193	4238	4007	4007
query23	34098	33155	33026	33026
query24	7865	2371	2382	2371
query25	537	472	390	390
query26	1237	263	154	154
query27	2654	469	333	333
query28	4338	2430	2415	2415
query29	756	549	438	438
query30	286	212	193	193
query31	959	868	759	759
query32	75	66	65	65
query33	557	353	308	308
query34	779	823	517	517
query35	795	812	733	733
query36	961	966	897	897
query37	114	94	76	76
query38	4159	4207	4095	4095
query39	1454	1384	1404	1384
query40	208	117	107	107
query41	56	59	50	50
query42	125	105	103	103
query43	494	484	461	461
query44	1278	773	789	773
query45	174	171	193	171
query46	814	1012	632	632
query47	1755	1842	1731	1731
query48	371	404	307	307
query49	787	488	414	414
query50	678	745	407	407
query51	4169	4184	4180	4180
query52	109	108	98	98
query53	230	256	185	185
query54	495	504	426	426
query55	85	82	81	81
query56	263	257	266	257
query57	1108	1117	1068	1068
query58	273	238	237	237
query59	2409	2672	2453	2453
query60	273	264	259	259
query61	125	118	119	118
query62	803	719	687	687
query63	228	191	193	191
query64	4371	1011	667	667
query65	4375	4316	4353	4316
query66	1113	406	307	307
query67	16087	15553	15341	15341
query68	6904	865	513	513
query69	480	311	265	265
query70	1226	1140	1099	1099
query71	403	286	260	260
query72	5574	2725	3813	2725
query73	734	759	349	349
query74	9180	9189	8748	8748
query75	3170	3178	2751	2751
query76	3227	1169	772	772
query77	475	378	295	295
query78	9796	10014	9182	9182
query79	1525	829	575	575
query80	610	525	478	478
query81	495	250	222	222
query82	193	127	99	99
query83	185	176	167	167
query84	257	100	80	80
query85	831	416	374	374
query86	388	269	291	269
query87	4628	4487	4467	4467
query88	2984	2247	2255	2247
query89	390	324	297	297
query90	1963	202	195	195
query91	221	138	117	117
query92	77	66	59	59
query93	1809	1040	578	578
query94	681	419	288	288
query95	356	270	262	262
query96	482	606	271	271
query97	3263	3455	3286	3286
query98	243	200	195	195
query99	1319	1363	1259	1259
Total cold run time: 271661 ms
Total hot run time: 183751 ms

@doris-robot

ClickBench: Total hot run time: 30.9 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 4aa64a1b4e55b5cd5581b958c36e61306ee5c9b0, data reload: false

query1	0.03	0.03	0.03
query2	0.08	0.04	0.03
query3	0.23	0.07	0.06
query4	1.62	0.11	0.11
query5	0.57	0.55	0.56
query6	1.21	0.72	0.73
query7	0.02	0.02	0.02
query8	0.04	0.03	0.04
query9	0.60	0.54	0.51
query10	0.57	0.58	0.56
query11	0.16	0.10	0.11
query12	0.14	0.11	0.12
query13	0.61	0.60	0.61
query14	2.66	2.84	2.68
query15	0.92	0.84	0.85
query16	0.37	0.37	0.37
query17	1.02	1.02	1.04
query18	0.22	0.20	0.20
query19	2.01	1.80	1.96
query20	0.02	0.01	0.01
query21	15.36	0.91	0.56
query22	0.76	1.17	0.96
query23	14.69	1.38	0.62
query24	7.07	1.65	0.63
query25	0.48	0.33	0.13
query26	0.44	0.16	0.15
query27	0.06	0.05	0.05
query28	10.06	0.87	0.44
query29	12.52	3.96	3.33
query30	0.25	0.08	0.06
query31	2.83	0.59	0.38
query32	3.23	0.56	0.47
query33	3.00	3.09	3.03
query34	15.78	5.10	4.45
query35	4.52	4.51	4.53
query36	0.66	0.49	0.47
query37	0.09	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.16	0.13	0.13
query41	0.09	0.03	0.02
query42	0.04	0.03	0.02
query43	0.04	0.03	0.02
Total cold run time: 105.31 s
Total hot run time: 30.9 s

@zzzxl1993
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32754 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ecac441a3201d12261ef87bb282f4600d18f009b, data reload: false

------ Round 1 ----------------------------------
q1	17658	5190	5138	5138
q2	2052	306	185	185
q3	10378	1242	775	775
q4	10208	1046	557	557
q5	7528	2403	2332	2332
q6	195	165	134	134
q7	888	750	615	615
q8	9298	1294	1108	1108
q9	4885	4824	4752	4752
q10	6905	2315	1898	1898
q11	470	281	273	273
q12	345	353	228	228
q13	17800	3684	3080	3080
q14	237	240	217	217
q15	529	491	480	480
q16	640	618	589	589
q17	574	855	343	343
q18	7016	6378	6345	6345
q19	1233	950	536	536
q20	319	325	192	192
q21	2835	2208	1990	1990
q22	1078	1001	987	987
Total cold run time: 103071 ms
Total hot run time: 32754 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5140	5141	5144	5141
q2	238	328	234	234
q3	2176	2642	2283	2283
q4	1405	1851	1399	1399
q5	4233	4132	4146	4132
q6	212	163	127	127
q7	1859	1888	1803	1803
q8	2603	2643	2539	2539
q9	7272	7120	7145	7120
q10	3002	3233	2815	2815
q11	572	510	486	486
q12	699	773	647	647
q13	3478	3936	3282	3282
q14	265	299	280	280
q15	549	499	493	493
q16	654	673	632	632
q17	1128	1556	1363	1363
q18	7810	7453	7512	7453
q19	847	864	1018	864
q20	1952	2070	1885	1885
q21	5414	5043	4654	4654
q22	1063	1119	1019	1019
Total cold run time: 52571 ms
Total hot run time: 50651 ms

qidaye
qidaye previously approved these changes Mar 6, 2025
Contributor

@qidaye qidaye left a comment

LGTM

Contributor

github-actions bot commented Mar 6, 2025

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved and reviewed labels Mar 6, 2025
Contributor

github-actions bot commented Mar 6, 2025

PR approved by anyone and no changes requested.

@doris-robot

TPC-DS: Total hot run time: 192133 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ecac441a3201d12261ef87bb282f4600d18f009b, data reload: false

query1	1392	1049	1003	1003
query2	6189	1881	1821	1821
query3	11038	4475	4508	4475
query4	53392	25593	23118	23118
query5	5166	596	492	492
query6	345	202	193	193
query7	4894	497	293	293
query8	319	251	242	242
query9	5758	2616	2614	2614
query10	439	325	268	268
query11	15057	15121	14878	14878
query12	154	111	108	108
query13	1061	522	400	400
query14	10550	6905	6893	6893
query15	215	205	195	195
query16	7033	691	494	494
query17	1075	724	594	594
query18	1525	423	351	351
query19	213	207	175	175
query20	136	128	131	128
query21	212	172	102	102
query22	4479	4549	4364	4364
query23	33976	33289	33075	33075
query24	5743	2454	2463	2454
query25	451	467	403	403
query26	738	278	160	160
query27	1909	512	338	338
query28	2863	2480	2459	2459
query29	559	553	424	424
query30	269	223	188	188
query31	883	849	807	807
query32	72	67	63	63
query33	451	361	311	311
query34	798	901	514	514
query35	828	832	739	739
query36	936	1020	921	921
query37	121	100	73	73
query38	4339	4275	4213	4213
query39	1489	1433	1498	1433
query40	216	121	106	106
query41	51	54	50	50
query42	133	112	109	109
query43	507	518	485	485
query44	1341	800	798	798
query45	179	175	168	168
query46	851	1015	641	641
query47	1820	1864	1816	1816
query48	398	418	308	308
query49	712	503	438	438
query50	704	769	427	427
query51	4262	4240	4218	4218
query52	107	99	97	97
query53	236	268	185	185
query54	483	514	430	430
query55	90	84	83	83
query56	279	276	282	276
query57	1164	1150	1100	1100
query58	262	254	241	241
query59	2860	2662	2680	2662
query60	285	302	265	265
query61	124	119	117	117
query62	748	763	685	685
query63	226	191	192	191
query64	1900	1016	664	664
query65	4532	4409	4381	4381
query66	716	389	287	287
query67	15753	15735	15413	15413
query68	7085	875	507	507
query69	538	301	268	268
query70	1196	1147	1066	1066
query71	485	291	267	267
query72	5966	3664	3756	3664
query73	1171	747	351	351
query74	9024	9192	8869	8869
query75	3866	3179	2719	2719
query76	4173	1159	739	739
query77	653	364	356	356
query78	10052	10034	9257	9257
query79	2663	823	599	599
query80	596	511	439	439
query81	485	259	222	222
query82	656	122	91	91
query83	174	174	167	167
query84	287	96	74	74
query85	763	344	343	343
query86	375	291	284	284
query87	4352	4500	4415	4415
query88	3496	2199	2215	2199
query89	411	312	278	278
query90	1796	189	189	189
query91	137	142	105	105
query92	69	61	53	53
query93	1966	1051	579	579
query94	652	418	304	304
query95	343	261	255	255
query96	484	558	271	271
query97	3290	3399	3289	3289
query98	231	206	210	206
query99	1371	1397	1291	1291
Total cold run time: 296812 ms
Total hot run time: 192133 ms

@doris-robot

ClickBench: Total hot run time: 30.63 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ecac441a3201d12261ef87bb282f4600d18f009b, data reload: false

query1	0.03	0.03	0.03
query2	0.07	0.04	0.03
query3	0.24	0.07	0.06
query4	1.61	0.10	0.10
query5	0.56	0.53	0.55
query6	1.20	0.72	0.72
query7	0.03	0.03	0.02
query8	0.04	0.03	0.03
query9	0.59	0.52	0.52
query10	0.58	0.58	0.58
query11	0.16	0.11	0.11
query12	0.15	0.12	0.11
query13	0.62	0.60	0.59
query14	2.80	2.69	2.82
query15	0.92	0.86	0.84
query16	0.38	0.39	0.39
query17	1.05	1.06	1.04
query18	0.21	0.20	0.20
query19	1.91	1.80	1.92
query20	0.01	0.00	0.02
query21	15.36	0.90	0.54
query22	0.76	1.09	0.65
query23	15.01	1.39	0.60
query24	7.03	1.57	0.64
query25	0.53	0.23	0.06
query26	0.49	0.18	0.14
query27	0.05	0.05	0.05
query28	9.02	0.85	0.43
query29	12.53	3.98	3.41
query30	0.26	0.09	0.07
query31	2.82	0.57	0.39
query32	3.22	0.56	0.47
query33	3.01	3.00	3.05
query34	15.72	5.13	4.49
query35	4.57	4.49	4.57
query36	0.67	0.50	0.48
query37	0.09	0.07	0.06
query38	0.05	0.03	0.03
query39	0.02	0.02	0.03
query40	0.16	0.15	0.14
query41	0.08	0.03	0.03
query42	0.03	0.02	0.02
query43	0.03	0.03	0.04
Total cold run time: 104.67 s
Total hot run time: 30.63 s

@zzzxl1993
Contributor Author

run buildall

@github-actions github-actions bot removed the approved label Mar 6, 2025
@doris-robot

TPC-H: Total hot run time: 32783 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ded9ea96cab50fe25544065d5b5264a06456208a, data reload: false

------ Round 1 ----------------------------------
q1	17620	5152	5043	5043
q2	2047	306	177	177
q3	10390	1274	705	705
q4	10222	1029	524	524
q5	8070	2477	2363	2363
q6	191	185	135	135
q7	949	720	614	614
q8	9303	1333	1122	1122
q9	5011	4847	4982	4847
q10	6820	2313	1918	1918
q11	468	274	272	272
q12	342	360	218	218
q13	17778	3717	3091	3091
q14	227	243	219	219
q15	561	521	503	503
q16	653	641	582	582
q17	588	885	337	337
q18	6892	6485	6461	6461
q19	1422	959	570	570
q20	318	327	195	195
q21	2820	2125	1941	1941
q22	1062	990	946	946
Total cold run time: 103754 ms
Total hot run time: 32783 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5211	5154	5098	5098
q2	239	327	239	239
q3	2185	2664	2300	2300
q4	1427	1840	1366	1366
q5	4277	4184	4143	4143
q6	205	165	124	124
q7	1965	1979	1800	1800
q8	2652	2597	2607	2597
q9	7354	7224	7227	7224
q10	2986	3187	2779	2779
q11	572	511	495	495
q12	708	808	637	637
q13	3589	3836	3301	3301
q14	275	303	286	286
q15	551	505	500	500
q16	648	686	670	670
q17	1169	1638	1298	1298
q18	7766	7605	7352	7352
q19	810	818	919	818
q20	1994	2047	1895	1895
q21	5405	5018	4850	4850
q22	1094	1074	1024	1024
Total cold run time: 53082 ms
Total hot run time: 50796 ms

@doris-robot

TPC-DS: Total hot run time: 185819 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ded9ea96cab50fe25544065d5b5264a06456208a, data reload: false

query1	974	384	380	380
query2	6531	1876	1862	1862
query3	6796	218	211	211
query4	26647	23572	23483	23483
query5	4339	653	489	489
query6	321	205	193	193
query7	4603	510	296	296
query8	306	262	239	239
query9	8619	2518	2547	2518
query10	491	322	251	251
query11	15589	15154	15017	15017
query12	162	115	105	105
query13	1666	520	415	415
query14	8984	6556	6430	6430
query15	211	188	178	178
query16	7275	638	431	431
query17	1236	736	590	590
query18	1949	404	307	307
query19	199	193	158	158
query20	120	113	120	113
query21	216	135	102	102
query22	4375	4269	4281	4269
query23	33802	33030	33187	33030
query24	7755	2392	2390	2390
query25	557	468	411	411
query26	1231	272	156	156
query27	2326	494	341	341
query28	4076	2414	2379	2379
query29	769	594	452	452
query30	282	217	192	192
query31	938	835	748	748
query32	78	64	66	64
query33	553	352	294	294
query34	791	833	512	512
query35	803	798	726	726
query36	953	993	902	902
query37	119	97	84	84
query38	4086	4051	4212	4051
query39	1442	1406	1403	1403
query40	203	116	105	105
query41	55	55	54	54
query42	121	107	103	103
query43	489	505	492	492
query44	1279	793	792	792
query45	175	168	160	160
query46	828	1028	632	632
query47	1714	1799	1690	1690
query48	379	403	299	299
query49	796	516	419	419
query50	676	756	401	401
query51	4133	4209	4176	4176
query52	114	101	93	93
query53	231	251	188	188
query54	494	508	404	404
query55	80	85	87	85
query56	284	266	262	262
query57	1117	1152	1027	1027
query58	253	239	240	239
query59	2550	2509	2437	2437
query60	294	261	262	261
query61	122	120	123	120
query62	804	720	682	682
query63	233	187	193	187
query64	4392	1002	697	697
query65	4407	4323	4314	4314
query66	1145	418	313	313
query67	15913	15604	15306	15306
query68	8241	930	508	508
query69	475	294	267	267
query70	1156	1099	1093	1093
query71	464	289	273	273
query72	5573	3566	3742	3566
query73	791	767	352	352
query74	9095	9127	8814	8814
query75	3856	3159	2702	2702
query76	3723	1169	753	753
query77	779	384	289	289
query78	10209	10004	9381	9381
query79	3081	839	577	577
query80	677	572	480	480
query81	474	260	231	231
query82	668	129	100	100
query83	214	173	158	158
query84	290	104	71	71
query85	784	358	308	308
query86	388	292	293	292
query87	4476	4442	4431	4431
query88	3726	2289	2287	2287
query89	398	319	290	290
query90	1822	219	241	219
query91	137	137	109	109
query92	82	59	64	59
query93	1964	1078	588	588
query94	666	414	298	298
query95	356	271	252	252
query96	480	572	273	273
query97	3293	3382	3235	3235
query98	248	211	204	204
query99	1452	1362	1249	1249
Total cold run time: 275489 ms
Total hot run time: 185819 ms

@doris-robot

ClickBench: Total hot run time: 31.36 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ded9ea96cab50fe25544065d5b5264a06456208a, data reload: false

query1	0.03	0.03	0.03
query2	0.07	0.03	0.03
query3	0.24	0.07	0.07
query4	1.60	0.10	0.10
query5	0.56	0.55	0.55
query6	1.18	0.72	0.72
query7	0.03	0.02	0.02
query8	0.04	0.04	0.03
query9	0.60	0.52	0.52
query10	0.57	0.57	0.57
query11	0.15	0.10	0.10
query12	0.15	0.12	0.11
query13	0.62	0.60	0.60
query14	2.68	2.66	2.74
query15	0.93	0.86	0.86
query16	0.36	0.39	0.38
query17	1.03	1.02	1.04
query18	0.21	0.20	0.19
query19	1.89	1.83	2.00
query20	0.02	0.02	0.02
query21	15.36	0.89	0.55
query22	0.76	1.14	0.73
query23	14.84	1.37	0.64
query24	7.13	1.79	1.12
query25	0.50	0.19	0.18
query26	0.55	0.16	0.14
query27	0.05	0.05	0.05
query28	9.36	0.83	0.43
query29	12.57	3.98	3.32
query30	0.25	0.09	0.06
query31	2.81	0.58	0.39
query32	3.23	0.54	0.46
query33	3.08	3.06	3.03
query34	15.78	5.14	4.52
query35	4.54	4.51	4.54
query36	0.66	0.50	0.49
query37	0.10	0.06	0.07
query38	0.05	0.04	0.04
query39	0.03	0.03	0.03
query40	0.17	0.13	0.13
query41	0.09	0.03	0.02
query42	0.03	0.02	0.03
query43	0.04	0.03	0.03
Total cold run time: 104.94 s
Total hot run time: 31.36 s

@doris-robot

BE UT Coverage Report

Increment line coverage 34.56% (75/217) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 46.76% (12511/26756)
Line Coverage 36.38% (106632/293128)
Region Coverage 35.44% (54488/153762)
Branch Coverage 30.81% (27406/88960)

@@ -69,6 +70,7 @@ const std::string INVERTED_INDEX_PARSER_UNICODE = "unicode";
 const std::string INVERTED_INDEX_PARSER_ENGLISH = "english";
 const std::string INVERTED_INDEX_PARSER_CHINESE = "chinese";
 const std::string INVERTED_INDEX_PARSER_ICU = "icu";
+const std::string INVERTED_INDEX_PARSER_SIMPLE = "simple";
Member

"simple" is not a name users will readily understand; maybe basic unicode or something else.

Contributor Author

Changed to "basic".
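
To show why the user-facing name matters, here is a hypothetical sketch of how a parser name string might be resolved to a parser kind. The enum `ParserKind` and the function `parser_from_string` are invented for this illustration and are not taken from the Doris code base. Only the string values `unicode`, `english`, `chinese`, and `icu` come from the diff above, and `basic` is the name agreed on in this thread.

```cpp
#include <string>

// Hypothetical enum; the real code base may model parser types differently.
enum class ParserKind { Unknown, Unicode, English, Chinese, Icu, Basic };

// Map the user-supplied parser name to a parser kind.
// Unrecognized names, including the discarded "simple", map to Unknown.
ParserKind parser_from_string(const std::string& name) {
    if (name == "unicode") return ParserKind::Unicode;
    if (name == "english") return ParserKind::English;
    if (name == "chinese") return ParserKind::Chinese;
    if (name == "icu") return ParserKind::Icu;
    if (name == "basic") return ParserKind::Basic;
    return ParserKind::Unknown;
}
```

Because the string is what users type when configuring an index, renaming it before the first release avoids having to keep an alias for the old name later.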

@zzzxl1993 zzzxl1993 changed the title from "[feature](inverted index) Add a simple tokenizer" to "[feature](inverted index) Add a basic tokenizer" Mar 7, 2025
@zzzxl1993
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32753 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f8e8cb19d62248cebf249998f2285bc7472fad28, data reload: false

------ Round 1 ----------------------------------
q1	17601	5201	5099	5099
q2	2051	308	175	175
q3	10385	1703	729	729
q4	10217	1031	549	549
q5	7521	2437	2360	2360
q6	185	171	136	136
q7	904	766	612	612
q8	9304	1278	1173	1173
q9	4949	4867	4769	4769
q10	6818	2339	1904	1904
q11	473	279	258	258
q12	356	358	225	225
q13	17770	3717	3098	3098
q14	226	229	209	209
q15	542	502	492	492
q16	616	639	583	583
q17	570	866	342	342
q18	7046	6507	6357	6357
q19	1305	945	532	532
q20	322	332	195	195
q21	2807	2184	1993	1993
q22	1083	1017	963	963
Total cold run time: 103051 ms
Total hot run time: 32753 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5184	5137	5156	5137
q2	243	327	235	235
q3	2186	2704	2325	2325
q4	1476	1853	1377	1377
q5	4306	4189	4169	4169
q6	207	160	124	124
q7	1940	1953	1783	1783
q8	2614	2534	2487	2487
q9	7272	7169	7228	7169
q10	3015	3192	2768	2768
q11	590	540	493	493
q12	668	754	632	632
q13	3511	3909	3285	3285
q14	275	305	277	277
q15	522	470	476	470
q16	647	718	646	646
q17	1198	1622	1363	1363
q18	7849	7679	7479	7479
q19	854	814	818	814
q20	1990	2048	1870	1870
q21	5392	5038	4768	4768
q22	1078	1028	1006	1006
Total cold run time: 53017 ms
Total hot run time: 50677 ms

@doris-robot

TPC-DS: Total hot run time: 185850 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f8e8cb19d62248cebf249998f2285bc7472fad28, data reload: false

query1	960	383	374	374
query2	6536	1961	1940	1940
query3	6794	214	214	214
query4	26294	23383	23279	23279
query5	4351	675	515	515
query6	318	202	194	194
query7	4614	509	298	298
query8	297	246	233	233
query9	8607	2568	2575	2568
query10	488	324	258	258
query11	15426	15300	14880	14880
query12	153	109	105	105
query13	1648	524	383	383
query14	8791	6308	6543	6308
query15	213	191	169	169
query16	7290	632	450	450
query17	1185	703	543	543
query18	1923	394	297	297
query19	195	185	159	159
query20	116	113	116	113
query21	204	123	102	102
query22	4048	4346	4142	4142
query23	34030	32896	33143	32896
query24	7694	2390	2440	2390
query25	582	479	418	418
query26	1242	280	164	164
query27	2462	482	348	348
query28	4227	2453	2397	2397
query29	777	578	461	461
query30	285	225	192	192
query31	956	839	735	735
query32	77	71	68	68
query33	578	360	310	310
query34	797	869	520	520
query35	800	824	734	734
query36	962	985	894	894
query37	128	104	85	85
query38	4232	4268	4080	4080
query39	1501	1412	1562	1412
query40	210	119	103	103
query41	55	51	50	50
query42	124	106	104	104
query43	509	512	486	486
query44	1320	810	812	810
query45	180	174	166	166
query46	835	1019	627	627
query47	1742	1786	1677	1677
query48	372	414	308	308
query49	799	495	438	438
query50	697	735	413	413
query51	4213	4248	4156	4156
query52	111	114	95	95
query53	232	258	190	190
query54	486	519	459	459
query55	84	77	88	77
query56	256	273	263	263
query57	1103	1115	1064	1064
query58	253	240	239	239
query59	2647	2638	2610	2610
query60	304	265	272	265
query61	128	122	128	122
query62	786	769	652	652
query63	235	191	194	191
query64	4375	1053	710	710
query65	4406	4327	4380	4327
query66	1147	412	306	306
query67	15620	15468	15301	15301
query68	9342	875	511	511
query69	463	304	270	270
query70	1226	1151	1132	1132
query71	450	311	268	268
query72	5701	3642	3677	3642
query73	817	750	353	353
query74	9194	8950	9113	8950
query75	4138	3135	2707	2707
query76	3596	1174	754	754
query77	984	395	298	298
query78	10027	10173	9303	9303
query79	1963	833	589	589
query80	674	517	468	468
query81	465	270	230	230
query82	346	133	100	100
query83	212	180	158	158
query84	293	97	77	77
query85	746	360	392	360
query86	338	293	287	287
query87	4378	4538	4351	4351
query88	2908	2257	2283	2257
query89	393	319	286	286
query90	2055	217	212	212
query91	145	142	111	111
query92	80	60	61	60
query93	1190	1074	638	638
query94	670	407	307	307
query95	352	267	258	258
query96	478	567	282	282
query97	3321	3388	3328	3328
query98	221	208	201	201
query99	1416	1391	1265	1265
Total cold run time: 273592 ms
Total hot run time: 185850 ms

@doris-robot

ClickBench: Total hot run time: 30.85 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f8e8cb19d62248cebf249998f2285bc7472fad28, data reload: false

query1	0.03	0.03	0.03
query2	0.08	0.03	0.03
query3	0.23	0.07	0.07
query4	1.62	0.10	0.10
query5	0.55	0.54	0.54
query6	1.19	0.71	0.71
query7	0.02	0.02	0.01
query8	0.04	0.04	0.03
query9	0.59	0.53	0.52
query10	0.58	0.62	0.60
query11	0.15	0.11	0.11
query12	0.15	0.11	0.11
query13	0.61	0.61	0.60
query14	2.80	2.74	2.85
query15	0.91	0.86	0.86
query16	0.38	0.37	0.37
query17	1.02	1.04	1.06
query18	0.21	0.20	0.19
query19	1.89	1.85	2.00
query20	0.01	0.02	0.01
query21	15.38	0.88	0.53
query22	0.76	1.28	0.69
query23	14.83	1.38	0.65
query24	7.34	1.67	0.75
query25	0.49	0.21	0.09
query26	0.47	0.16	0.15
query27	0.05	0.05	0.05
query28	9.72	0.84	0.42
query29	12.55	3.94	3.25
query30	0.25	0.09	0.06
query31	2.81	0.58	0.40
query32	3.22	0.55	0.47
query33	3.00	2.98	3.05
query34	15.78	5.13	4.54
query35	4.47	4.50	4.50
query36	0.66	0.50	0.48
query37	0.10	0.06	0.07
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.17	0.15	0.13
query41	0.08	0.03	0.02
query42	0.04	0.03	0.02
query43	0.04	0.03	0.03
Total cold run time: 105.35 s
Total hot run time: 30.85 s

@doris-robot

BE UT Coverage Report

Increment line coverage 34.98% (78/223) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 46.87% (12540/26755)
Line Coverage 36.46% (106909/293203)
Region Coverage 35.50% (54597/153811)
Branch Coverage 30.84% (27447/89004)

@zzzxl1993
Contributor Author

run buildall

5 participants