[Bug] GetTokens function returns incorrect results for Chinese markdown files #1047

Open
@hty579

Context / Scenario

When the input is a Chinese markdown file, the GetTokens function returns incorrect results.

// Repro: tokenize a Chinese markdown passage with the CL100K tokenizer.
var tokenizer = new CL100KTokenizer();
var result = tokenizer.GetTokens("交通运输部关于发布《公路桥涵设计通用规范》的公告\r\n现发布《公路桥涵设计通用规范》(JTG D60-2015),作为公路工程行业标准,自 2015 年 12 月 1 日起施行,原《公路桥涵设计通用规范》(JTG D60-2004)同时废止。");

What happened?

I expected GetTokens to return the correct result. The incorrect output affects how MarkDownChunker splits the data.

Importance

a fix would make my life easier

Platform, Language, Versions

C#
Microsoft.KernelMemory.Core
V0.98.250324.1

Relevant log output

Metadata

Assignees

No one assigned

Labels

bug, triage
