Skip to content

Implement IANA normalizer baked data provider#251

Merged
nekevss merged 19 commits intomainfrom
impl-iana-baked-provider
Mar 31, 2025
Merged

Implement IANA normalizer baked data provider#251
nekevss merged 19 commits intomainfrom
impl-iana-baked-provider

Conversation

@nekevss
Copy link
Copy Markdown
Member

@nekevss nekevss commented Mar 22, 2025

This PR begins the initial work for implementing baked data providers.

There is probably going to be a lot more work to be done on this topic overall (especially for tzif support). This is meant as a minimum proof of concept that adds support for IANA identifier normalization, and adds that support to FsTzdbProvider.

List of changes:

  • Adds bakeddata internal tool (this will need to be iterated on and further built out probably).
  • Adds a temporal_provider crate for sourcing zoneinfo data and bakeddata struct definitions.
  • Adds the newly baked SINGLETON_IANA_NORMALIZER to FsTzdbProvider

General points of consideration:

  • I'm not entirely sure how much of an infrastructure we want to build out to begin with, so the goal here was to be fairly minimal in the approach.
  • My assumption is that the large diff here is from adding in the zoneinfo files into the repository. We can long term request the data from IANA as they are publicly hosted, but I do think that the locality of having the files in the project for reference is nice. There is a good argument for removing the makefile, newsfile, and a couple things like that.
  • For bakeddata's output, I've added both a debug and file path. I think this is probably going to be a good approach for maintenance long term. The thought approach here is that we should ideally not be committing generated binary without having a debug view of it.

nekevss added 6 commits March 22, 2025 17:03
This removes the iana-datagen idea (at least for now, who knows),
and adds a new internal tool, bakeddata, for generating baked
data. Implements a baseline proof of concept, and introduces it
into temporal_rs.
@nekevss nekevss requested a review from jedel1043 March 22, 2025 22:29
@nekevss nekevss force-pushed the impl-iana-baked-provider branch from 76b4c82 to 8c87948 Compare March 23, 2025 01:36
@nekevss nekevss added C-enhancement New feature or request C-internal Internal library improvements labels Mar 23, 2025
@jedel1043
Copy link
Copy Markdown
Member

Do we really need to commit the whole tzdata to the repo? I would expect our datagen tool to fetch that from its repo instead.

@nekevss
Copy link
Copy Markdown
Member Author

nekevss commented Mar 23, 2025

I was more thinking of adding that as a feature after some iterations. At the very least, I would prefer to have the baseline files so that this can be built without having to go over the network. Plus, it's good reference data.

Although, I will admit. We can probably trim down from what is currently there.

@jedel1043
Copy link
Copy Markdown
Member

jedel1043 commented Mar 23, 2025

Even then, the tool can just point to a certain directory and ask for downloading the data manually, without having to fetch it from the network. The problem with vendoring the whole data is that we can always forget to delete it if we don't need it anymore, or forget to exclude it from the Cargo.toml when publishing the crate, etc.

@nekevss
Copy link
Copy Markdown
Member Author

nekevss commented Mar 23, 2025

I don't think it would be excluded on publish in that scenario...

I can make the change. But we will lose the tests in provider crate as a result, but maybe those can be added back in as we build out.

@jedel1043
Copy link
Copy Markdown
Member

I can make the change. But we will lose the tests in provider crate as a result, but maybe those can be added back in as we build out.

Wait, why do we lose tests if we don't bundle the data?

In my head, the process of implementing a provider would be:

  1. Baked tool takes tzdata and generates Rust files with the data according to IanaIdentifierNormalizer.
  2. TzdbDataProvider takes the data generated by the baked tool and wraps it for easy consumption, without having any knowledge on how the data was generated.

Is this roughly what the PR does? Or am I misunderstanding something?

@nekevss
Copy link
Copy Markdown
Member Author

nekevss commented Mar 23, 2025

No, that's exactly what the PR does.

I was able to preserve the test by using the singleton to test the generated file.

Beforehand, since temporal_provider was able to fully build the data with the cached files, we were able to build locally and test IanaIdentifierNormalizer. We may be able to replicate the tests directly on the struct in the future, but we'll probably have to feature flag it.

@jedel1043
Copy link
Copy Markdown
Member

Beforehand, since temporal_provider was able to fully build the data with the cached files, we were able to build locally and test IanaIdentifierNormalizer.

That's what I'm not getting. If the data pipeline is clearly delimited in two steps, I don't understand why we would need to "build the data" again, if we already have the data built by the baked tool. Can't we just pull the data built to do tests?

@nekevss
Copy link
Copy Markdown
Member Author

nekevss commented Mar 23, 2025

Currently temporal_provider contains both the sourcing logic and the data. To use ICU4X as a reference, this would be both the icu_provider_source and icu_provider_data wrapped into one crate. bakeddata mainly handles taking the source and struct definitions from "source" and generating the baked data. Because of this, some of the baked export logic currently lives in bakeddata because it didn't entirely make much sense to have it implemented and live in temporal_provider.

@nekevss
Copy link
Copy Markdown
Member Author

nekevss commented Mar 23, 2025

Can't we just pull the data built to do tests?

We can. Testing against the baked singleton is the current approach. I'm mostly being nitpicky that we are testing against the singleton and not explicitly against the struct itself.

@jedel1043
Copy link
Copy Markdown
Member

Ahhhh, got it.

I'm mostly being nitpicky that we are testing against the singleton and not explicitly against the struct itself.

If we don't want to depend on the data itself for tests, we can do the same as ICU4X and create a struct that "consumes" the data instead of using the data directly. That way, we can use some custom testing data instead of the baked data for testing.

@nekevss
Copy link
Copy Markdown
Member Author

nekevss commented Mar 23, 2025

Yeah, that's probably a good approach that can be iterated on in follow ups

@nekevss
Copy link
Copy Markdown
Member Author

nekevss commented Mar 23, 2025

Just realized that the debug companion file was missing due to the gitignore. I changed the directory to test. Some adjustments were made to the .gitignore to allow the files to be committed.

@nekevss
Copy link
Copy Markdown
Member Author

nekevss commented Mar 24, 2025

Actually this doesn't totally close #232, but it is a step closer. I'm still fine with merging this as and fixing it with subsequent follow ups.

EDIT: Actually supporting #232 is definitely going to require more PRs, and a bit of a scope creep as we will probably have to implement our own zoneinfo parser. So moving ahead with this in the short term is most likely the best option (although, I'm going to see how ICU4X is supporting the zone.tab file).

Copy link
Copy Markdown
Member

@jedel1043 jedel1043 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great starting point! After this, we can start adding more functionality like wrapper structs or a tzif provider.

@nekevss
Copy link
Copy Markdown
Member Author

nekevss commented Mar 31, 2025

we can start adding more functionality like wrapper structs or a tzif provider.

Already in the works 😉

@nekevss nekevss merged commit abf429c into main Mar 31, 2025
8 checks passed
@nekevss nekevss deleted the impl-iana-baked-provider branch March 31, 2025 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C-enhancement New feature or request C-internal Internal library improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants