Skip to content

CLDR-17851 SupplementalCalendarData parser, update ExampleGenerator #4642

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 29, 2025

Conversation

srl295
Copy link
Member

@srl295 srl295 commented Apr 25, 2025

  • minimal, modern parser
  • minimal test case
  • update example generator

CLDR-17851

ALLOW_MANY_COMMITS=true

srl295 added 2 commits April 25, 2025 16:00
- minimal, modern parser
- minimal test case
- commented-out 'dump' function
- update example generator to use
@srl295 srl295 self-assigned this Apr 25, 2025
- move TestEraMap into TestSDI
@srl295 srl295 mentioned this pull request Apr 25, 2025
1 task
@srl295
Copy link
Member Author

srl295 commented Apr 25, 2025

@sffc this gives us a data driven approach that should be less crashy on era data changes.
I haven't put in unit tests for the data but if you have suggestions please add them to this ticket.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand enough about how CLDR works to be able to give this a thorough review

@srl295
Copy link
Member Author

srl295 commented Apr 25, 2025

I don't understand enough about how CLDR works to be able to give this a thorough review

mostly just a FYI that CLDR could now have data quality checks on <calendarData> (nothing, sadly, parsed it before)

@srl295 srl295 marked this pull request as draft April 25, 2025 21:29
@srl295
Copy link
Member Author

srl295 commented Apr 25, 2025

example generator issue, will bring back soon

srl295 added 2 commits April 25, 2025 17:32
- use new supplemental data info
- BH, added in CLDR-18464, is out of order… attempt to find the prev and next era by date.
@srl295
Copy link
Member Author

srl295 commented Apr 25, 2025

alright, BH now shows AH 0 as its example. Which is better than AH 1 i guess.

anyway, everyone gets examples now

image

@srl295 srl295 marked this pull request as ready for review April 25, 2025 23:00
@srl295
Copy link
Member Author

srl295 commented Apr 25, 2025

i think this is ready for review - maybe the BH is a formatter issue

@AEApple
Copy link
Contributor

AEApple commented Apr 25, 2025

Thank you for doing this! I'm not familiar enough either to review this unfortunately, although I could try applying it in my branch and check if it fixed the error if that works?

@srl295
Copy link
Member Author

srl295 commented Apr 26, 2025 via email

@AEApple AEApple requested a review from conradarcturus April 26, 2025 20:38
Copy link
Member

@macchiati macchiati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few items

type = Integer.parseInt(xpath.getAttributeValue(INDEX, LDMLConstants.TYPE));

start = xpath.getAttributeValue(INDEX, LDMLConstants.START);
startCalendar = forDateString(start);
Copy link
Member

@macchiati macchiati Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Calendar is pretty heavy-weight for this. Would be simplest to just extract the millis, or use an Instant if we want type-safety

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, an even better reason is that then you (and your clients) can avoid complications with null. The earliest start date is Integer.MIN_VALUE and the latest end date is Integer.MAX_VALUE.

There is one further complication. Normally missing start/end dates get the above values. However, for the Japanese calendar, the end date is set based on the next line's start data - 1.
So

              <era type="0" start="645-6-19"/>
              <era type="1" start="650-2-15"/>
...
              <era type="235" start="1989-1-8" code="heisei"/>
              <era type="236" start="2019-5-1" code="reiwa"/>

Is equivalent to

              <era type="0" start="645-6-19" end="650-2-14" />
              <era type="1" start="650-2-15" end="671-12-31"/>
...
              <era type="235" start="1989-1-8" end="2019-4-30" code="heisei"/>
              <era type="236" start="2019-5-1" code="reiwa"/>

So that means you don't know the end data for an era until the next one gets added. And that's assuming that you read them in file order. Otherwise you need to read them in, and sort by start date.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need more clarity in the spec here?

before this, i couldn't find any code in CLDR that read these lines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that means you don't know the end data for an era until the next one gets added. And that's assuming that you read them in file order. Otherwise you need to read them in, and sort by start date.

That's why i had the end time be null there.

import org.unicode.cldr.icu.LDMLConstants;

public class SupplementalCalendarData implements Iterable<String> {
/** an <era> element */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/** an <era> element */
/** an <era> element, which represents the era information from a calendar such as:
* <pre>
* <calendarData><supplementalData>
* <calendar type="islamic">
* <calendarSystem type="lunar" />
* <eras>
* <era type="0" start="622-7-15" code="islamic" aliases="ah"/>
* <era type="1" end="622-7-14" code="islamic-inverse" aliases="bh"/>
* </eras>
* </calendar>
* </pre>
*/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or just link to the spec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of start/end is documented in https://www.unicode.org/reports/tr35/tr35-69/tr35-dates.html#Calendar_Data

(It could use some wordsmithing to clarify that "open-ended" applies when there isn't a series; probably start that first. I'll file a ticket for that.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type = Integer.parseInt(xpath.getAttributeValue(INDEX, LDMLConstants.TYPE));

start = xpath.getAttributeValue(INDEX, LDMLConstants.START);
startCalendar = forDateString(start);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, an even better reason is that then you (and your clients) can avoid complications with null. The earliest start date is Integer.MIN_VALUE and the latest end date is Integer.MAX_VALUE.

There is one further complication. Normally missing start/end dates get the above values. However, for the Japanese calendar, the end date is set based on the next line's start data - 1.
So

              <era type="0" start="645-6-19"/>
              <era type="1" start="650-2-15"/>
...
              <era type="235" start="1989-1-8" code="heisei"/>
              <era type="236" start="2019-5-1" code="reiwa"/>

Is equivalent to

              <era type="0" start="645-6-19" end="650-2-14" />
              <era type="1" start="650-2-15" end="671-12-31"/>
...
              <era type="235" start="1989-1-8" end="2019-4-30" code="heisei"/>
              <era type="236" start="2019-5-1" code="reiwa"/>

So that means you don't know the end data for an era until the next one gets added. And that's assuming that you read them in file order. Otherwise you need to read them in, and sort by start date.

}

/** only works within the same cal system */
@Override
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@Override
@Override
/* Normally comparisons require all of the fields to be listed, to ensure that no
* two unequal values a and b will never have a.compareTo(b) = 0.
* However, the structure of the survey tool guaranties that no two eras (for different calendarTypes)
* will overlap in start/end dates.
*/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the survey tool doesn't guarantee this - do you mean the spec? the unit tests ought to validate this. But I'm not sure what all of the constraints are.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we don't make that clear. Filed https://unicode-org.atlassian.net/browse/CLDR-18576

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It is the spec not the Survey Tool.

Comment on lines 107 to 108
.compare(getLatest(), o.getLatest())
.compare(getType(), o.getType())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.compare(getLatest(), o.getLatest())
.compare(getType(), o.getType())
.compare(getEnd(), o.getEnd())

Because of the above feature, you will never need the type for comparison; just the calendarType and end.

@srl295
Copy link
Member Author

srl295 commented Apr 29, 2025

@macchiati this is getting pretty heavy weight. -and not working. Could I go back to the lightweight parser that succeeds for purposes of examples and split out the heavier api questions?

@srl295
Copy link
Member Author

srl295 commented Apr 29, 2025

@macchiati this is getting pretty heavy weight. -and not working. Could I go back to the lightweight parser that succeeds for purposes of examples and split out the heavier api questions?

Specifically, back to Friday's version (but memoize it so the object is only created once). And split out a PR for the more heavyweight processor that tries to sort out the situation for Japanese, etc.

I was trying to make a parser that just reflected what was in the XML, to enable unit testing, but it is kind of turning into a general purpose implementation of the algorithm.

 - address some concerns
 - split other issues out to a separate PR- unicode-org#4648
@srl295 srl295 force-pushed the cldr-17851/sdi-era-metadata branch from 3339539 to 0efdaee Compare April 29, 2025 21:07
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • tools/cldr-code/src/main/java/org/unicode/cldr/test/ExampleGenerator.java is different
  • tools/cldr-code/src/main/java/org/unicode/cldr/util/DateConstants.java is no longer changed in the branch
  • tools/cldr-code/src/main/java/org/unicode/cldr/util/SupplementalCalendarData.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@srl295 srl295 requested a review from macchiati April 29, 2025 21:11
@srl295
Copy link
Member Author

srl295 commented Apr 29, 2025

@macchiati PTAL. This is basically what you saw Friday, plus fixed the bad import statement. As discussed, I moved the original branch over to #4648 to address the concerns.

@srl295 srl295 merged commit 12cd0ca into unicode-org:main Apr 29, 2025
12 checks passed
@srl295 srl295 deleted the cldr-17851/sdi-era-metadata branch April 29, 2025 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants