Skip to content

test: add unicode pattern and patternProperties tests for draft2020-12#837

Merged
jdesrosiers merged 7 commits intojson-schema-org:mainfrom
Shristibot:pattern-unicode-draft2020-12
Mar 9, 2026
Merged

test: add unicode pattern and patternProperties tests for draft2020-12#837
jdesrosiers merged 7 commits intojson-schema-org:mainfrom
Shristibot:pattern-unicode-draft2020-12

Conversation

@Shristibot
Copy link
Contributor

Adds minimal Unicode literal tests for pattern and patternProperties in draft‑2020‑12.
These tests use a simple non‑ASCII character (π) to confirm literal Unicode matching in the Unicode‑aware regex mode required by draft‑2020‑12, avoiding advanced regex features and focusing on basic literal behavior.

@Shristibot Shristibot requested a review from a team as a code owner February 11, 2026 09:14
Copy link
Member

@jdesrosiers jdesrosiers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern used here works the same with or without unicode mode enabled. You need to find something where /some-pattern/ works differently than /some-pattern/u. (Where the latter produces the correct behavior.

@karenetheridge
Copy link
Member

What are you trying to test for that we don't already have coverage for? I see tests for pattern ^\d+$ against the string "৪২" (BENGALI DIGIT FOUR, BENGALI DIGIT TWO) in tests/*/optional/ecmascript-regex.json .

@jdesrosiers
Copy link
Member

What are you trying to test for that we don't already have coverage for?

In 2020-12, we added the requirement that implementations should use regex in unicode mode. We currently don't have any required tests that cover that requirement.

I see tests for pattern ^\d+$ against the string "৪২" (BENGALI DIGIT FOUR, BENGALI DIGIT TWO) in tests/*/optional/ecmascript-regex.json .

Ah, good catch. Maybe we don't need new tests. Maybe we just need to move some optional tests to required?

@karenetheridge
Copy link
Member

I see tests for pattern ^\d+$ against the string "৪২" (BENGALI DIGIT FOUR, BENGALI DIGIT TWO) in tests/*/optional/ecmascript-regex.json .

Ah, good catch. Maybe we don't need new tests. Maybe we just need to move some optional tests to required?

These tests test the inverse -- that \d only matches the ascii digits, not other digits in unicode.

In 2020-12, we added the requirement that implementations should use regex in unicode mode. We currently don't have any required tests that cover that requirement.

Where is that coming from? draft2020-12 says we use ECMA-262's regular expression semantics, and at https://tc39.es/ecma262/#sec-compiletocharset it seems to be saying the opposite, that \d only matches ascii [0-9].

I'd love it to be the other way though -- I use unicode semantics in my implementation by default and these optional tests are marked as "expected failure", but I think we'd have to make that change in the next version.

@jdesrosiers
Copy link
Member

Ha, yeah, sorry, I've never looked closely at the optional tests, so I'm not really familiar with them.

Regular expressions SHOULD be built with the "u" flag (or equivalent) to provide Unicode support, or processed in such a way which provides Unicode support as defined by ECMA-262.

That's the part of the spec this PR should be testing. If there are optional tests that cover that, let's move them to required. If not, let's add new ones. If there are optional tests that need to be removed, let's do that too. I'm generally in favor of deleting the whole optional directory, so I have no problem with removing anything there that we don't think makes sense anymore.

@karenetheridge
Copy link
Member

Regular expressions SHOULD be built with the "u" flag (or equivalent)

Ah super, so the current optional tests are wrong then :) We should definitely fix that!

@karenetheridge
Copy link
Member

karenetheridge commented Feb 19, 2026

So, there's some interesting history here: #505 and #498. I've read these both again thoroughly and I don't see how #505 can be correct given what the spec says. The clear intent is to say: regexes should use ECMA-262 semantics, with the unicode flag. So all the tests that assert that non-ascii characters shouldn't match \w, or non-ascii digits should not match \d (edit: not \s), are wrong.

Also, there is a test that asserts that 0xFEFF should match \s, which is also wrong, because despite the character being named "ZERO WIDTH NO-BREAK SPACE", it's not actually in the Space character class. This can be verified by going directly to the Unicode properties database, and I found several threads in different languages and forums (e.g. golang, Stack Overflow) discussing this peculiarity as well.

@Shristibot since the scale of this issue has just grown a lot more from the original, feel free to punt it and it can be reassigned.

@Shristibot
Copy link
Contributor Author

Thanks for all the extra context and the pointers to the spec and existing regex tests.

It sounds like this PR has uncovered a larger cleanup around Unicode regex semantics and the optional/ecmascript-regex tests. I’d like to keep working on this rather than dropping it, but I’m not sure what the best next step is.

Would you prefer that I:
•Narrow this PR to a minimal change (for example, just adding/moving clearly correct Unicode-mode tests into the required draft-2020-12 tests), and leave the broader cleanup for follow-up PRs, or
•Expand this PR to also update/remove the incorrect optional tests now?

I can start reviewing the optional tests against the draft-2020-12 requirements as well—happy to adjust the scope based on what would be most useful.

@jdesrosiers
Copy link
Member

Would you prefer that I:
•Narrow this PR to a minimal change (for example, just adding/moving clearly correct Unicode-mode tests into the required draft-2020-12 tests), and leave the broader cleanup for follow-up PRs, or
•Expand this PR to also update/remove the incorrect optional tests now?

I'm fine with whatever you want to do. What's important for this PR is to test that we are testing that regex evaluation uses the "u" flag. If you want to ignore optional tests and leave that to be cleaned up in another issue, I'm ok with that.

Add tests using \p{Letter} pattern which requires the ECMA-262 u flag:
- pattern.json: verify Unicode-mode regex with ASCII letters, non-ASCII letters, and digits
- patternProperties.json: verify Unicode-mode pattern matching for property names
Copy link
Member

@jdesrosiers jdesrosiers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're on the right track now.

Please add these tests to the v1 tests as well.

Please revert the whitespace changes in patternProperties.json. The only changes should be the test that you added.

@Shristibot
Copy link
Contributor Author

We're on the right track now.

Please add these tests to the v1 tests as well.

Please revert the whitespace changes in patternProperties.json. The only changes should be the test that you added.
Hi @jdesrosiers ,
I have added tests to v1.

@jdesrosiers
Copy link
Member

@Shristibot There's one more thing remaining from my last review.

Please revert the whitespace changes in patternProperties.json. The only changes should be the test that you added.

Please do that. Then we're ready to merge.

@jdesrosiers jdesrosiers merged commit 06481b1 into json-schema-org:main Mar 9, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants