@macchiati macchiati commented Nov 10, 2024

https://github.com/unicode-org/properties/issues/507
See also https://github.com/unicode-org/unicode-reports/pull/247

Code to generate the property files and test data for UTS 58

  • Incorporates changes from the PAG
  • Also some wording updates
  • The Link_Email property is also modified to be narrower. The reasons for this are:
    • If we have to make changes later, it is less disruptive to broaden the character set than to narrow it.
    • Non-ASCII characters are currently less commonly supported.
    • I went with the identifiers from UAX 31, modified by what is valid in the ASCII ranges for the local-part:
      • \p{XID_Continue}
      • [\p{block=basic_latin}-\p{Cc}] // ASCII
      • -[\u0020 ; : " ( ) [ ] @ \ < >] // email exclusions from ASCII
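For reviewers, the set described by the three lines above can be approximated in Python. This is a sketch under assumptions, not the generator code in this PR. It reads the lines as \p{XID_Continue} plus printable ASCII, minus the listed exclusions, and it treats the backslash as one of the excluded characters. It relies on CPython's str.isidentifier(), which checks XID_Continue for every character after the first, using whatever Unicode version the interpreter ships with. The function names are hypothetical.

```python
# Assumed reading of the exclusion list: space ; : " ( ) [ ] @ \ < >
EXCLUDED_ASCII = set(' ;:"()[]@\\<>')


def is_xid_continue(ch: str) -> bool:
    """Approximate \\p{XID_Continue} for a single character.

    CPython's str.isidentifier() requires XID_Start (or '_') for the first
    character and XID_Continue for the rest, so prefixing a known-good
    start character isolates the XID_Continue check.
    """
    return ("a" + ch).isidentifier()


def in_link_email_sketch(ch: str) -> bool:
    """Return True if ch falls in the narrowed set as interpreted above."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    cp = ord(ch)
    if cp < 0x80:
        # [\p{block=basic_latin}-\p{Cc}] leaves U+0020..U+007E;
        # then subtract the email exclusions.
        return 0x20 <= cp <= 0x7E and ch not in EXCLUDED_ASCII
    # Outside ASCII, only identifier characters are allowed.
    return is_xid_continue(ch)
```

Under this reading, ASCII punctuation such as "+" and "." stays in the set, "@" and the space are out, and a non-ASCII symbol such as "€" is out because it is not XID_Continue.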

See also the related spec changes in https://github.com/unicode-org/unicode-reports/pull/247

@macchiati macchiati marked this pull request as draft November 10, 2024 18:20

macchiati commented Dec 2, 2025

TODO:

  1. Update data file & test data generator to match rev 1 draft 5.
  2. Add newer test data from ICANN.
  3. Change the data file folder to /Public/<version>/linkification/
  4. Change the filename SerializationTest.txt to FormattingTest.txt
  5. Create a unicodetools PR with the code, get it reviewed & merged.

@macchiati macchiati changed the title Linkification testing Linkification Data files and tooling Dec 4, 2025
@macchiati macchiati force-pushed the Linkification-testig branch from eebcf83 to 992ff07 Compare December 4, 2025 23:34

@markusicu markusicu left a comment

Review of only the data files so far.

@macchiati macchiati requested a review from markusicu December 16, 2025 21:30

@markusicu markusicu left a comment

I only skimmed chunks of the Java code.

@macchiati

Added a note about the + in queries in minimal escaping:
https://unicode-org.github.io/unicode-reports/pull/247/tr58/tr58.html#minimal-escaping-algorithm
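As background on why a + in a query deserves its own note: under form encoding (application/x-www-form-urlencoded), a literal + in a query decodes to a space, so a real plus sign has to travel as %2B. The sketch below shows that convention with Python's urllib.parse. It illustrates the general behavior only and does not reproduce the wording or the algorithm of the UTS 58 note.

```python
from urllib.parse import parse_qs, quote, unquote, unquote_plus

raw = "a+b"

# Plain percent-decoding leaves '+' untouched.
plain = unquote(raw)

# Form decoding turns '+' into a space.
form = unquote_plus(raw)

# Query-string parsing applies form decoding to values.
parsed_plus = parse_qs("q=a+b")
parsed_escaped = parse_qs("q=a%2Bb")

# Escaping the plus sign keeps it a plus sign after form decoding.
escaped = quote("a+b", safe="")

print(plain, repr(form), parsed_plus, parsed_escaped, escaped)
```

The practical consequence is that an escaper that leaves + unescaped in a query changes the meaning for any consumer that applies form decoding.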


@markusicu markusicu left a comment

One more comment for now. At least one of the previous comments has not been discussed yet. I am also going over your discussion doc.

@macchiati macchiati requested a review from markusicu December 21, 2025 00:05
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
#
#

For later: these files now have trailing spaces on empty comment lines, introduced by JOIN_N_HASH in the generator. We should clean that up at some point.
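One way the trailing spaces could arise, and a possible cleanup, sketched in Python and not in the generator's Java. It assumes JOIN_N_HASH joins header lines with a newline followed by "# ", which is a guess from the name and is not confirmed in this thread. The helper names are hypothetical.

```python
# Assumption: the joiner is newline + hash + space.
JOIN_N_HASH = "\n# "


def header_naive(lines):
    """Join header lines the assumed way; empty lines become '# ' with a trailing space."""
    return "# " + JOIN_N_HASH.join(lines)


def header_clean(lines):
    """Same header, with trailing whitespace stripped from every line."""
    return "\n".join(line.rstrip() for line in header_naive(lines).split("\n"))


example = ["For terms of use, see the license.", "", ""]
print(repr(header_naive(example)))
print(repr(header_clean(example)))
```

Stripping per line after the join keeps the joiner untouched, so the change stays local to the point where the header is written out.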

@markusicu markusicu merged commit ac0201d into unicode-org:main Dec 23, 2025
15 of 16 checks passed