Clarifying the point of the CVs #211
Replies: 6 comments
-
Overall, I 100% agree with this. With the IPSL team, we are developing a library, esgvoc, to interact with the CV structure across all esgvoc branches in Universe and in projects like CMIP6 and CMIP6Plus.
Exactly! It will be addressed soon in the CV TT.
-
Perfect, very happy to move things around!
-
@znichollscr it's useful to have this discussion, and I will likely tweak the quick reply below, as there is more to it than this. I note that the input4MIPs project is an anomaly: only two of the numerous datasets use CMOR to write these data, so the connection between the project and other project CVs is far looser. A key point here is that, in essence, the CVs aim to reduce duplication and allow template/software/nomenclature reuse, which means that for any science project/MIP that wants to start coordinating activities and producing data, these templates/tools/nomenclature are immediately available. For CMIP6, which is the template we're building on, the CVs played the following roles:
1. Capture project information as it evolves
2. Format machine-readable information, enabling CMIP6/project-compliant data writing
3. Once data are written, facilitate ESGF publication
4. Once data are published, facilitate data discovery and interpretation
5. Provide reusable templates that allow projects to 1. use CMOR to write data, 2. use ESGF to publish data, and 3. generate a community vocabulary that threads connections between climate data across projects (e.g. standard variable names)
-
|
Thanks for the reply @durack1. That all makes sense. It also makes clear to me that the thing missing from previous iterations was the schema. The JSON-LD work will close that gap and make it much easier to build tools on top of the CVs (for example, right now I wouldn't want to be building on top of our CVs, because we've broken from our implied schema twice in the last two days).
-
This is true; it also abstracts away information that should not be part of the project, simplifying things markedly, e.g. we have a single definitive source. The core point of our November 2023 CVs, variables, and infrastructure drop-in (slides here) was: let's stop recreating things over and over again (CMIP3, CMIP5, CMIP6; CORDEX CMIP5/CMIP6/CMIP7; input4MIPs 6.0/6.5-7.0; obs4MIPs; etc.), and instead stabilize around the reused components, standardize them, and then let projects leverage them without having to come up with their own duplicated nomenclature that is not centrally reusable.
-
@znichollscr migrating this to a discussion to clean things up a bit

-
Related to #177 and #110
The problem
I think it is still not clear what exactly the CVs are, particularly where in the information chain they sit.
What I hope to get out of this
I would like to get clarity about how the CVs are meant to evolve and be used. Not knowing this has sent @durack1 and I round in circles on multiple occasions. I think it is also the key barrier to building better tooling around the CVs (it has caused issues for me building https://github.com/PCMDI/input4MIPs_CVs, I think it is also causing issues with the tooling efforts in https://github.com/WCRP-CMIP/WCRP-universe/).
The use modes that I think cause confusion
Use mode implied by the name: defining allowed values
The name 'controlled vocabularies' suggests that these are the allowed set of values. In other words, the CVs define the values people can use, and it is then up to users to combine them as they wish.
Use mode in practice: an information source
This is what I think actually happens in practice. For example, in #177, the request was to add all information to the CVs so they could be used as the source from which to create citation entries.
In this second use mode, the CVs are seen as a source of information. In other words, the CVs don't define allowed values, they define the values.
Why I think this ambiguity causes confusion
Put very simply, it isn't clear whether the CVs define the schema for our data, or whether they are the data. This is obviously a problem. You can use a schema to define data structures, allowing tools to build on the structure it provides. You can use the data as an information source. However, you can't mix and match the two, because doing so defeats the point of each (e.g. if you're constantly adding new fields to your schemas, then you're constantly breaking downstream use; if you can never add new information to your data, then the data source isn't very useful).
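To make the schema/data distinction above concrete, here is a minimal, hypothetical sketch (the field names and the schema shape are illustrative, not the actual input4MIPs or WCRP-universe definitions). The "schema" says what a valid entry looks like; the "data" is one registered entry. Tools can build on the first and read from the second, but a file that plays both roles at once breaks whenever either evolves.

```python
# Hypothetical, simplified schema for a source-ID-like record:
# required fields plus their expected types. Not the real CV schema.
SOURCE_ID_SCHEMA = {
    "required": ["source_id", "institution_id", "license"],
    "types": {"source_id": str, "institution_id": str, "license": str},
}

# One "data" record, i.e. the CVs acting as an information source.
entry = {
    "source_id": "EXAMPLE-1-0-0",
    "institution_id": "EXAMPLE-INST",
    "license": "CC BY 4.0",
}

def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = [f"missing: {key}" for key in schema["required"] if key not in record]
    problems += [
        f"wrong type: {key}"
        for key, expected in schema["types"].items()
        if key in record and not isinstance(record[key], expected)
    ]
    return problems

print(validate(entry, SOURCE_ID_SCHEMA))  # → []
```

The point of separating the two dictionaries is exactly the point made above: `SOURCE_ID_SCHEMA` can stay stable while new `entry` records are added freely, whereas a single mixed file would force every addition to look like a schema change.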
A way out of this
I think @wolfiex has basically already shown the way out of this with the direction that https://github.com/WCRP-CMIP/WCRP-universe is heading: use something like JSON-LD consistently throughout. That provides both the schema and the data, in a way that clearly separates the two.
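As a rough illustration of how JSON-LD keeps the two roles separate (the vocabulary URLs and field names below are made up for the example, not the actual WCRP-universe terms): the `@context` carries the schema-like role, defining what each key means, while the rest of the document carries the data values.

```python
import json

# Illustrative JSON-LD fragment with hypothetical terms.
doc = json.loads("""
{
  "@context": {
    "source_id": "https://example.org/vocab/source_id",
    "institution_id": "https://example.org/vocab/institution_id"
  },
  "@id": "https://example.org/entries/EXAMPLE-1-0-0",
  "source_id": "EXAMPLE-1-0-0",
  "institution_id": "EXAMPLE-INST"
}
""")

# The two roles separate cleanly: term definitions vs. values.
schema_part = doc["@context"]  # what the keys mean
data_part = {k: v for k, v in doc.items() if not k.startswith("@")}
print(data_part)
```

Because the context can be shared (or referenced by URL) across many documents, many projects can reuse one set of term definitions while contributing their own data entries, which is exactly the reuse goal described earlier in the thread.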
As part of this effort, it would be great to clarify what the schema for source ID is. #177 has shown that the current understanding in this repo is not sufficient for creating citations (hence the consideration of #200). However: is the proposed structure something that is going to be rolled out across all CVs? Is that structure already in place everywhere and we have just missed it in input4MIPs? Or are we building something custom right now, without any clear understanding of how this will scale beyond input4MIPs? (@jitendra-kumar, maybe you already have an idea about this.)
An implication
One implication of having the CVs define both the schema and the values is that data providers have to register everything in advance. They can't just make data compliant with the schema and turn up; they also have to pre-register their metadata values. That's OK, but it is an extra step (one that might take getting used to), which makes validation tools really important: people need clear, consistent feedback about what they have to fix, and the ability to validate data themselves, rather than going through heaps of iterations, which can be very slow and frustrating for everyone.
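The pre-registration check described above can be sketched as follows (a hypothetical tool, with a made-up registry and message wording, not an existing validator): the value must already exist in the CV registry, not merely match a schema, and a failure should tell the provider exactly what to do next.

```python
# Hypothetical registry of already-registered values, as pulled from the CVs.
REGISTERED_SOURCE_IDS = {"EXAMPLE-1-0-0", "OTHER-2-0-1"}

def check_registration(source_id):
    """Return 'ok', or a clear, actionable message explaining the fix."""
    if source_id in REGISTERED_SOURCE_IDS:
        return "ok"
    return (
        f"source_id {source_id!r} is not registered in the CVs; "
        "register it before publishing data"
    )

print(check_registration("EXAMPLE-1-0-0"))  # → ok
print(check_registration("NEW-3-0-0"))      # actionable error message
```

A check like this, run locally by the data provider, is what turns "heaps of iterations" with the CV maintainers into a single self-service step.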
Conclusion
I think there are basically two key next steps that come from this:
Interested in others' views, pinging @durack1 and @jitendra-kumar, also @taylor13 as this may be relevant for/solved by the CV TT. Please add/tag anyone else that might be interested