The new Chunkers "does not work" - Excel files #1001

Open
@aropb

Description

Context / Scenario

This looks like a problem with Cyrillic text.

MaxTokensPerParagraph=1000
OverlappingTokens=200

I checked the full text right after the decoder and it is correct, for example (xlsx, but the same thing happens with other formats):

"| 3856 | ГАРАКОЛОБСКИЙ ДЕТСКИЙ | ..."

The same text in a chunk from km-default (note the doubled "Й" at the end of each word):

"| 3856 | ГАРАКОЛОБСКИЙЙ ДЕТСКИЙЙ | ..."

I also see that words are sometimes cut off!

The problem may be in MarkDownChunker and PlainTextChunker, perhaps with the encoding of these separators (! ? or ⁉):

private static readonly SeparatorTrie s_explicitSeparators = new([
    // Symbol + space
    ". ", ".\t", ".\n", "\n\n", // note: covers also the case of multiple '.' like "....\n"
    "? ", "?\t", "?\n", // note: covers also the case of multiple '?' and '!?' like "?????\n" and "?!?\n"
    "! ", "!\t", "!\n", // note: covers also the case of multiple '!' and '?!' like "!!!\n" and "!?!\n"
    "⁉ ", "⁉\t", "⁉\n",
    "⁈ ", "⁈\t", "⁈\n",
    "⁇ ", "⁇\t", "⁇\n",
    "… ", "…\t", "…\n",
    // Multi-char separators without space, ordered by length
    "!!!!", "????", "!!!", "???", "?!?", "!?!", "!?", "?!", "!!", "??", "....", "...", "..",
    // 1 char separators without space
    ".", "?", "!", "⁉", "⁈", "⁇", "…",
]);

Another important point: for Excel, a chunk should, whenever possible, end at the end of a row ("\n"), and this boundary should have high priority.
Ideally, chunks should always consist of complete sentences and complete lines (rows in Excel).
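
To illustrate the row-boundary behavior requested above, here is a minimal sketch. It is not Kernel Memory code: `RowChunkerSketch` and `ChunkByRows` are hypothetical names, `maxChars` is a character count standing in for MaxTokensPerParagraph (a real chunker would count tokens), and overlap is ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RowChunkerSketch
{
    // Hypothetical helper, not part of Kernel Memory: packs whole rows into chunks.
    // maxChars is an assumed stand-in for a token limit.
    public static List<string> ChunkByRows(string text, int maxChars)
    {
        var chunks = new List<string>();
        var current = new StringBuilder();

        foreach (string row in text.Split('\n'))
        {
            // Empty rows (including the one after a trailing "\n") are skipped.
            if (row.Length == 0) { continue; }

            // Close the current chunk at a row boundary before it would overflow.
            if (current.Length > 0 && current.Length + row.Length + 1 > maxChars)
            {
                chunks.Add(current.ToString());
                current.Clear();
            }

            // A single row longer than maxChars is kept whole in this sketch;
            // a real chunker would need to fall back to weaker separators.
            current.Append(row).Append('\n');
        }

        if (current.Length > 0) { chunks.Add(current.ToString()); }
        return chunks;
    }
}
```

With three 11-character rows and `maxChars` set to 25, this yields two chunks: the first holds rows 1 and 2, the second holds row 3, and no row is cut in the middle.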

For example, "Version 1.1.1" is also split into parts by mistake: the algorithm treats "Version 1." as the end of a sentence.
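
One possible guard for the "Version 1.1.1" case, as a sketch only: skip a "." separator when it sits between two digits. `SeparatorGuardSketch` and `IsDotInsideNumber` are hypothetical names and do not exist in Kernel Memory. The rule is an assumption about what would help, not a confirmed fix.

```csharp
using System;

public static class SeparatorGuardSketch
{
    // Hypothetical check: returns true when text[index] is a '.' with a digit
    // on both sides, as in "1.1.1", so it should not be used as a sentence end.
    public static bool IsDotInsideNumber(string text, int index)
    {
        if (index <= 0 || index >= text.Length - 1) { return false; }
        if (text[index] != '.') { return false; }

        return char.IsDigit(text[index - 1]) && char.IsDigit(text[index + 1]);
    }
}
```

For "Version 1.1.1 is out." both dots inside "1.1.1" are reported as part of a number, while the final dot is not. This rule does not cover numbered paragraphs such as "227. The policies", where the dot is followed by a space.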

For English text (sentences are cut):


Chunk 1:

" an emerging
local plan that has either been submitted for examination or has reached
Regulation 18 or Regulation 19 (Town and Country Planning (Local Planning)
(England) Regulations 2012) stage, including both a policies map and proposed
allocations towards meeting housing need. This provision does not apply to
authorities who are not required to demonstrate a housing land supply, as set out
These arrangements will apply for a period of two years from the
in paragraph 76.
publication date of this revision of the Framework.
For the purposes of plan-making
227. The policies in the original National Planning Policy Framework published in March
2012 will apply for the purpose of examining plans, where those plans were
submitted on or before 24 January 2019. Where such plans are withdrawn or
otherwise do not proceed to become part of the development plan, the policies
contained in this Framework will apply to any subsequent plan produced for the
area concerned.
228. For the purposes of the policy on larger-scale development in paragraph 22, this
applies only to plans that have not reached Regulation 19 of the Town and
Country Planning (Local Planning) (England) Regulations 2012 (pre-submission)
of this Framework was published on 20
stage at the point the previous version
79
As an exception to this, the policy contained in paragraph 76 and the related reference in footnote 8 of this
Framework should only be taken into account as a material consideration when dealing with applications made
on or after the date of publication of this version of the Framework.
80
Unless these strategic policies have been reviewed and found not to require updating. Where local housing
need is used as the basis for assessing whether a four year supply of specific deliverable sites exists, it should
be calculated using the standard method set out in national planning guidance.
65July 2021 (for Spatial Development Strategies this would refer to consultation
under section 335(2) of the Greater London Authority Act 1999).
229. For the purposes of the policy on renewable and low carbon energy and heat in
plans in paragraph 160, this policy does not apply to plans that have reached
Regulation 19 of the Town and Country Planning (Local Planning) (England)
Regulations 2012 (pre-submission) stage, or that reach this stage within three
months of the date of publication of
the previous version of this Framework
published on 5 September 2023. For Spatial Development Strategies, paragraph
160 does not apply to strategies that have reached consultation under section
335(2) of the Greater London Authority Act 1999 or that reach this stage within
three months of the date of publication of the previous version of this Framework
published on 5 September 2023.
230. The policies in this Framework (published on 19 December 2023) will apply for
the purpose of examining plans, where those plans reach regulation 19 of the
Town and Country Planning (Local Planning) (England) Regulations 2012 (pre-
submission) stage after 19 March 2024. Plans that reach pre-submission
consultation on or before this date will be examined under the relevant previous
version of the Framework in accordance with the above arrangements. For
Spatial Development Strategies, this Framework applies to strategies that have
reached consultation under section 335(2) of the Greater London Authority Act
1999 after 19 March 2024. Strategies that reach this stage on or before this date
will be examined under the relevant previous version of the Framework in
accordance with the above arrangements. Where plans or strategies are
withdrawn or otherwise do not proceed to become part of the development plan,
the policies contained in this Framework will apply to any subsequent plan or
strategy produced for the area concerned.
231. The Government will continue to explore with individual areas the potential for
planning freedoms and flexibilities, for example where this would facilitate an
increase in the amount of housing that can be delivered.
"

Chunk 2:

"submission) stage after 19 March 2024. Plans that reach pre-submission
consultation on or before this date will be examined under the relevant previous
version of the Framework in accordance with the above arrangements. For
Spatial Development Strategies, this Framework applies to strategies that have
reached consultation under section 335(2) of the Greater London Authority Act
1999 after 19 March 2024. Strategies that reach this stage on or before this date
will be examined under the relevant previous version of the Framework in
accordance with the above arrangements. Where plans or strategies are
withdrawn or otherwise do not proceed to become part of the development plan,
the policies contained in this Framework will apply to any subsequent plan or
strategy produced for the area concerned.
231. The Government will continue to explore with individual areas the potential for
planning freedoms and flexibilities, for example where this would facilitate an
increase in the amount of housing that can be delivered.
66Annex 2: Glossary
Affordable housing: housing for sale or rent, for those whose needs are not met by the
market (including housing that provides a subsidised route to home ownership and/or is
for essential local workers); and which complies with one or more of the following
81
definitions
:
a) Affordable housing for rent: meets all of the following conditions: (a) the rent is set in
accordance with the Government’s rent policy for Social Rent or Affordable Rent, or is
at least 20% below local market rents (including service charges where applicable); (b)
the landlord is a registered provider, except where it is included as part of a Build to
Rent scheme (in which case the landlord need not be a registered provider); and (c) it
includes provisions to remain at an affordable price for future eligible households, or
for the subsidy to be recycled for alternative affordable housing provision. For Build to
Rent schemes affordable housing for rent is expected to be the normal form of
affordable housing provision (and, in this context, is known as Affordable Private Rent).
b) Starter homes: is as specified in Sections 2 and 3 of the Housing and Planning Act
2016 and any secondary legislation made under these sections. The definition of a
starter home should reflect the meaning set out in statute and any such secondary
legislation at the time of plan-preparation or decision-making. Where secondary
legislation has the effect of limiting a household’s eligibility to purchase a starter home
to those with a particular maximum level of household income, those restrictions
should be used.
c) Discounted market sales housing: is that sold at a discount of at least 20% below
local market value. Eligibility is determined with regard to local incomes and local
house prices. Provisions should be in place to ensure housing remains at a discount
for future eligible households.
d) Other affordable routes to home ownership: is housing provided for sale that
provides a route to ownership for those who could not achieve home ownership
through the market. It includes shared ownership, relevant equity loans, other low cost
homes for sale (at a price equivalent to at least 20% below local market value) and
rent to buy (which includes a period of intermediate rent). Where public grant funding is
provided, there should be provisions for the homes to remain at an affordable price for
future eligible households, or for any receipts to be recycled for alternative affordable
housing provision, or refunded to Government or the relevant authority specified in the
funding agreement.
Air quality management areas: Areas designated by local authorities because they are
not likely to achieve national air quality objectives by the relevant deadlines.
Ancient or veteran tree: A tree which, because of its age, size and condition, is of
exceptional biodiversity, cultural or heritage value. All ancient trees are veteran trees. Not
all veteran trees are old enough to be ancient, but are old relative to other trees of the
same species. Very few trees of any species reach the ancient life-stage.
81
This definition should be read in conjunction with relevant policy contained in the Affordable Homes Update
Written Ministerial Statement published on 24 May 2021.
67Ancient woodland: An area that has been wooded continuously since at least 1600 AD.
It includes ancient semi-natural woodland and plantations on ancient woodland sites
(PAWS).
Annual position statement: A document setting out the 5 year housing land supply
position on 1st April each year, prepared by the local planning authority in consultation
with developers and others who have an impact on delivery.
Archaeological interest: There will be archaeological interest in a heritage asset if it
holds, or potentially holds, evidence of past human activity worthy of expert investigation
at some point.
"

Chunk 3:

" or heritage value. All ancient trees are veteran trees. Not
all veteran trees are old enough to be ancient, but are old relative to other trees of the
same species. Very few trees of any species reach the ancient life-stage.
81
This definition should be read in conjunction with relevant policy contained in the Affordable Homes Update
Written Ministerial Statement published on 24 May 2021.
67Ancient woodland: An area that has been wooded continuously since at least 1600 AD.
It includes ancient semi-natural woodland and plantations on ancient woodland sites
(PAWS).
Annual position statement: A document setting out the 5 year housing land supply
position on 1st April each year, prepared by the local planning authority in consultation
with developers and others who have an impact on delivery.
Archaeological interest: There will be archaeological interest in a heritage asset if it
holds, or potentially holds, evidence of past human activity worthy of expert investigation
at some point.
Article 4 direction: A direction made under Article 4 of the Town and Country Planning
(General Permitted Development) (England) Order 2015 which withdraws permitted
development rights granted by that Order.
Best and most versatile agricultural land: Land in grades 1, 2 and 3a of the Agricultural
Land Classification.
Brownfield land: See Previously developed land.
Brownfield land registers: Registers of previously developed land that local planning
authorities consider to be appropriate for residential development, having regard to criteria
in the Town and Country Planning (Brownfield Land Registers) Regulations 2017. Local
planning authorities will be able to trigger a grant of permission in principle for residential
development on suitable sites in their registers where they follow the required procedures.
Build to Rent: Purpose built housing that is typically 100% rented out. It can form part of
a wider multi-tenure development comprising either flats or houses, but should be on the
same site and/or contiguous with the main development. Schemes will usually offer longer
tenancy agreements of three years or more, and will typically be professionally managed
stock in single ownership and management control.
Climate change adaptation: Adjustments made to natural or human systems in response
to the actual or anticipated impacts of climate change, to mitigate harm or exploit
beneficial opportunities.
Climate change mitigation: Action to reduce the impact of human activity on the climate
system, primarily through reducing greenhouse gas emissions.
Coastal change management area: An area identified in plans as likely to be affected by
physical change to the shoreline through erosion, coastal landslip, permanent inundation
or coastal accretion.
Community forest: An area identified through the England Community Forest
Programme to revitalise countryside and green space in and around major conurbations.
Community Right to Build Order: An Order made by the local planning authority (under
the Town and Country Planning Act 1990) that grants planning permission for a site-
specific development proposal or classes of development.
68Community-led developments: A development instigated and taken forward by a not-
for-profit organisation set up and run primarily for the purpose of meeting the housing
needs of its members and the wider local community, rather than being a primarily
commercial enterprise. The organisation is created, managed and democratically
controlled by its members. It may take any one of various legal forms including a
community land trust, housing co-operative and community benefit society. Membership
of the organisation is open to all beneficiaries and prospective beneficiaries of that
organisation. The organisation should own, manage or steward the homes in a manner
consistent with its purpose, for example through a mutually supported arrangement with a
Registered Provider of Social Housing. The benefits of the development to the specified
community should be clearly defined and consideration given to how these benefits can
be protected over time, including in the event of the organisation being wound up.
Competent person (to prepare site investigation information): A person with a
recognised relevant qualification, sufficient experience in dealing with the type(s) of
pollution or land instability, and membership of a relevant professional organisation.
Conservation (for heritage policy): The process of maintaining and managing change to
a heritage asset in a way that sustains and, where appropriate, enhances its significance.
Decentralised energy: Local renewable and local low carbon energy sources.
Deliverable: To be considered deliverable, sites for housing should be available now,
offer a suitable location for development now, and be achievable with a realistic prospect
that housing will be delivered on the site within five years. "


Thanks.

What happened?

Critical error!

Importance

edge case

Platform, Language, Versions

KM 0.97.250211.1
LLamaSharp 0.21.0
.NET 9.0.1
