Clarifying the point of the CVs #211
Replies: 6 comments
-
Overall, I 100% agree with this. With the IPSL team, we are developing a library, esgvoc, to interact with the CV structure across all esgvoc branches in Universe and in projects like CMIP6 and CMIP6Plus.
Exactly! It will be addressed soon in the CV TT.
-
Perfect, very happy to move things around!
-
@znichollscr it's useful to have this discussion, and I will likely tweak the quick reply below, as there is more to it than this. I note that the input4MIPs project is an anomaly: only two of the numerous datasets use CMOR to write these data, so the connection between the project and other project CVs is far looser. A key point here is that, in essence, the CVs aim to reduce duplication and allow template/software/nomenclature reuse, which means that for any science project/MIP that wants to start coordinating activities and producing data, these templates/tools/nomenclature are immediately available. For CMIP6, which is the template we're building on, the CVs played the following roles:
1. Capture project information as it evolves
2. Format machine-readable information, enabling CMIP6/project-compliant data writing
3. Once data are written, facilitate ESGF publication
4. Once data are published, facilitate data discovery and interpretation
5. Provide reusable templates that allow projects to 1. use CMOR to write data, 2. use ESGF to publish data, and 3. generate a community vocabulary that threads connections between climate data across projects (e.g. standard variable names)
-
|
Thanks for the reply @durack1. That all makes sense. It also makes clear to me that the thing missing from previous iterations was the schema. The JSON-LD work will close that gap and make it much easier to build tools on top of the CVs (for example, right now I wouldn't want to be building on top of our CVs, because we've broken from our implied schema twice in the last two days).
-
This is true; it also abstracts away information that should not be part of the project, simplifying things markedly, e.g. we have a single definitive source. The core point of our November 2023 CVs, variables, and infrastructure drop-in (slides here) was: let's stop recreating things over and over again (CMIP3, CMIP5, CMIP6; CORDEX CMIP5/CMIP6/CMIP7; input4MIPs 6.0/6.5-7.0; obs4MIPs; etc.), and instead stabilize around the reused components, standardize them, and then let projects leverage them without having to come up with their own duplicated nomenclature that is not centrally reusable.
-
@znichollscr migrating this to a discussion to clean things up a bit

-
Related to #177 and #110
The problem
I think it is still not clear what exactly the CVs are, particularly where in the information chain they sit.
What I hope to get out of this
I would like to get clarity about how the CVs are meant to evolve and be used. Not knowing this has sent @durack1 and I round in circles on multiple occasions. I think it is also the key barrier to building better tooling around the CVs (it has caused issues for me building https://github.com/PCMDI/input4MIPs_CVs, I think it is also causing issues with the tooling efforts in https://github.com/WCRP-CMIP/WCRP-universe/).
The use modes that I think cause confusion
Use mode implied by the name: defining allowed values
The name 'controlled vocabularies' suggests that these are the allowed set of values. In other words, the CVs define the values people can use, and it is then up to users to combine them as they wish.
Use mode in practice: an information source
This is what I think actually happens in practice. For example, in #177, the request was to add all information to the CVs so they could be used as the source from which to create citation entries.
In this second use mode, the CVs are seen as a source of information. In other words, the CVs don't define allowed values, they define the values.
Why I think this ambiguity causes confusion
Put very simply, it isn't clear whether the CVs define the schema for our data, or whether they are the data. This is obviously a problem. You can use a schema to define data structures, allowing tools to build on the structure it provides. You can use the data as an information source. However, you can't mix and match the two, because doing so defeats the point of each (e.g. if you're constantly adding new fields to your schemas, then you're constantly breaking downstream use; if you can never add new information to your data, then the data source isn't very useful).
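To make the schema/data distinction above concrete, here is a minimal, hypothetical sketch (the field names and the schema shape are illustrative, not the actual input4MIPs or WCRP-universe definitions). The "schema" says what a valid entry looks like; the "data" is one registered entry. Tools can build on the first and read from the second, but a file that plays both roles at once breaks whenever either evolves.

```python
# Hypothetical, simplified schema for a source-ID-like record:
# required fields plus their expected types. Not the real CV schema.
SOURCE_ID_SCHEMA = {
    "required": ["source_id", "institution_id", "license"],
    "types": {"source_id": str, "institution_id": str, "license": str},
}

# One "data" record, i.e. the CVs acting as an information source.
entry = {
    "source_id": "EXAMPLE-1-0-0",
    "institution_id": "EXAMPLE-INST",
    "license": "CC BY 4.0",
}

def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = [f"missing: {key}" for key in schema["required"] if key not in record]
    problems += [
        f"wrong type: {key}"
        for key, expected in schema["types"].items()
        if key in record and not isinstance(record[key], expected)
    ]
    return problems

print(validate(entry, SOURCE_ID_SCHEMA))  # → []
```

The point of separating the two dictionaries is exactly the point made above: `SOURCE_ID_SCHEMA` can stay stable while new `entry` records are added freely, whereas a single mixed file would force every addition to look like a schema change.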
A way out of this
I think @wolfiex has basically already shown the way out of this with the direction that https://github.com/WCRP-CMIP/WCRP-universe is heading: use something like JSON-LD consistently throughout. That provides both the schema and the data, in a way that clearly separates the two.
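As a rough illustration of how JSON-LD keeps the two roles separate (the vocabulary URLs and field names below are made up for the example, not the actual WCRP-universe terms): the `@context` carries the schema-like role, defining what each key means, while the rest of the document carries the data values.

```python
import json

# Illustrative JSON-LD fragment with hypothetical terms.
doc = json.loads("""
{
  "@context": {
    "source_id": "https://example.org/vocab/source_id",
    "institution_id": "https://example.org/vocab/institution_id"
  },
  "@id": "https://example.org/entries/EXAMPLE-1-0-0",
  "source_id": "EXAMPLE-1-0-0",
  "institution_id": "EXAMPLE-INST"
}
""")

# The two roles separate cleanly: term definitions vs. values.
schema_part = doc["@context"]  # what the keys mean
data_part = {k: v for k, v in doc.items() if not k.startswith("@")}
print(data_part)
```

Because the context can be shared (or referenced by URL) across many documents, many projects can reuse one set of term definitions while contributing their own data entries, which is exactly the reuse goal described earlier in the thread.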
As part of this effort, it would be great to clarify what the schema for source ID is. #177 has shown that the current understanding in this repo is not sufficient for creating citations (hence the consideration of #200). However: is the proposed structure something that is going to be rolled out across all CVs? Is that structure already in place everywhere and we have just missed it in input4MIPs? Or are we building something custom right now, without any clear understanding of how this will scale beyond input4MIPs? (@jitendra-kumar, maybe you already have an idea about this.)
An implication
One implication of having the CVs define both the schema and the values is that data providers have to register everything in advance. They can't just make data compliant with the schema and turn up; they also have to pre-register their metadata values. That's OK, but it is an extra step (one that might take getting used to), which makes validation tools really important: people need clear, consistent feedback about what they have to fix, and the ability to validate data themselves, rather than going through heaps of iterations, which can be very slow and frustrating for everyone.
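The pre-registration check described above can be sketched as follows (a hypothetical tool, with a made-up registry and message wording, not an existing validator): the value must already exist in the CV registry, not merely match a schema, and a failure should tell the provider exactly what to do next.

```python
# Hypothetical registry of already-registered values, as pulled from the CVs.
REGISTERED_SOURCE_IDS = {"EXAMPLE-1-0-0", "OTHER-2-0-1"}

def check_registration(source_id):
    """Return 'ok', or a clear, actionable message explaining the fix."""
    if source_id in REGISTERED_SOURCE_IDS:
        return "ok"
    return (
        f"source_id {source_id!r} is not registered in the CVs; "
        "register it before publishing data"
    )

print(check_registration("EXAMPLE-1-0-0"))  # → ok
print(check_registration("NEW-3-0-0"))      # actionable error message
```

A check like this, run locally by the data provider, is what turns "heaps of iterations" with the CV maintainers into a single self-service step.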
Conclusion
I think there are basically two key next steps that come from this:
Interested in others' views, pinging @durack1 and @jitendra-kumar, also @taylor13 as this may be relevant for/solved by the CV TT. Please add/tag anyone else that might be interested