Enhancement: Preserve Methodology/Protocol Version Information
Background
Hi CarbonPlan team! 👋
We're building a carbon intelligence platform and rely heavily on OffsetsDB. Thank you for making this open source!
While working with the data, we noticed that methodology/protocol version information is present in raw registry data but gets dropped during transformation. This information would be valuable for:
- Tracking methodology evolution over time
- Identifying projects using outdated or problematic methodology versions
- Analyzing credit quality correlations with protocol versions
- Providing more granular filtering capabilities
- Tracking which projects have which labels (e.g. CCP label)
Current Behavior
Raw Data Contains Versions
Looking at the raw registry data, methodology versions are present in the source columns:
- Verra: Methodology column contains strings like "VM0047: ARR - Version 1.0"
- ACR: Project Methodology/Protocol contains version info
- Gold Standard: Methodology field includes versions
- CDM protocols: Multiple version variations (e.g., "ACM0002 version 21.0", "ACM0001 v19.0")
Versions Get Dropped During Transformation
The all-protocol-mapping.json maps all versioned strings to a single normalized protocol:
{
"acm0002": {
"known-strings": [
"ACM0002 version 21.0",
"ACM0002 version 20.0",
"ACM0002 v21.0",
...
]
}
}
All these map to just "acm0002" - the version information is lost.
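To make the loss concrete, here is a minimal sketch of a reverse lookup built from the mapping excerpt above. It is not the actual OffsetsDB code: `build_reverse_lookup` is a hypothetical helper, and the mapping contains only the three strings quoted in this issue.

```python
# Hypothetical sketch, not the actual OffsetsDB implementation.
# Only the three known strings quoted in this issue are included.
protocol_mapping = {
    "acm0002": {
        "known-strings": [
            "ACM0002 version 21.0",
            "ACM0002 version 20.0",
            "ACM0002 v21.0",
        ]
    }
}


def build_reverse_lookup(mapping):
    """Invert the mapping: raw registry string -> normalized protocol key."""
    lookup = {}
    for key, entry in mapping.items():
        for raw in entry["known-strings"]:
            lookup[raw] = key
    return lookup


lookup = build_reverse_lookup(protocol_mapping)

# Every versioned string collapses to the same normalized key,
# so the version is no longer recoverable from the output.
normalized = {lookup[raw] for raw in protocol_mapping["acm0002"]["known-strings"]}
print(normalized)
```

Once the raw string is replaced by the key, versions 20.0 and 21.0 are indistinguishable downstream.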
Proposed Solution
Phase 1: Preserve Original Protocol String (Simplest)
Add one field to the project schema:
# In models.py
'original_protocol': pa.Column(pa.String, nullable=True)
The original_protocol column already exists in the transformation pipeline via projects-raw-columns-mapping.json, but currently gets dropped before final output. We just need to keep it through to the final schema.
Pros:
- Minimal code changes
- Preserves complete raw data for downstream parsing
- Maintains traceability to registry source
Cons:
- Raw strings are messy and inconsistent
- Downstream users must parse versions themselves
Phase 2: Parse and Extract Versions (Enhancement)
If Phase 1 is accepted, we could add:
'protocol_version': pa.Column(pa.Object, nullable=True)  # Array of version strings
This would come with extraction logic that parses versions from the original strings using regex patterns.
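As a rough sketch of what that extraction could look like, assuming versions appear as "Version 1.0", "version 21.0", or "v19.0" as in the examples above. The pattern and the `extract_versions` helper are hypothetical and would need checking against real registry strings:

```python
import re

# Hypothetical pattern covering the spellings seen in this issue:
# "Version 1.0", "version 21.0", "v19.0" (plus "ver. 2" as a guess).
VERSION_RE = re.compile(
    r"\b(?:version|ver\.?|v\.?)\s*(\d+(?:\.\d+)*)",
    re.IGNORECASE,
)


def extract_versions(original_protocol):
    """Return all version numbers found in a raw protocol string, in order."""
    if not original_protocol:
        return []
    return VERSION_RE.findall(original_protocol)


print(extract_versions("VM0047: ARR - Version 1.0"))
print(extract_versions("ACM0002 version 21.0"))
print(extract_versions("ACM0001 v19.0"))
```

The pattern requires a digit right after the "v" prefix, so the leading "V" in identifiers like "VM0047" is not mistaken for a version marker.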
Challenge noted: As mentioned in your email response, linking versions to specific credit issuances (not just projects) could be tricky depending on registry data availability. We're proposing project-level version tracking first, which should be straightforward.
Multi-Protocol Handling
I noticed the current approach handles multi-protocol projects (e.g., "ACM0001; ACM0022") by manually adding them to multiple protocol mappings. This works but:
- Requires exhaustive manual curation of all combinations
- Doesn't preserve which version of each protocol applies
A structured approach could handle this better:
'protocol_details': pa.Column(pa.Object, nullable=True)
# Example: [
# {"protocol": "acm0001", "version": "19.0"},
# {"protocol": "acm0022", "version": "3.0"}
# ]
Questions for Maintainers
- Would you accept a PR that preserves the original_protocol field in the final schema?
- Do you have preferences on how version information should be structured?
- For multi-protocol projects, would you prefer:
  - Keep current manual mapping + add raw string preservation?
  - Implement automatic parsing/splitting of protocol strings?
- Are there any other considerations we should be aware of?
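For the automatic parsing/splitting option, here is a rough sketch of what we have in mind. It assumes ";" is the separator (as in "ACM0001; ACM0022") and simply lowercases the protocol name. A real implementation would presumably resolve the name through all-protocol-mapping.json instead. `parse_protocol_details` and both patterns are hypothetical:

```python
import re

# Assumption: multi-protocol strings are separated by ';'.
PROTOCOL_SEP = re.compile(r"\s*;\s*")
# Hypothetical version pattern: "version 21.0", "v19.0", etc.
VERSION_RE = re.compile(
    r"\b(?:version|ver\.?|v\.?)\s*(\d+(?:\.\d+)*)",
    re.IGNORECASE,
)


def parse_protocol_details(original_protocol):
    """Split a raw protocol string into per-protocol name/version records."""
    details = []
    for part in PROTOCOL_SEP.split(original_protocol.strip()):
        if not part:
            continue
        match = VERSION_RE.search(part)
        version = match.group(1) if match else None
        # Everything before the version marker is treated as the name.
        name = part[: match.start()] if match else part
        details.append(
            {"protocol": name.strip(" -:").lower(), "version": version}
        )
    return details


print(parse_protocol_details("ACM0001 v19.0; ACM0022 version 3.0"))
print(parse_protocol_details("ACM0001; ACM0022"))
```

When no version is present the record keeps "version" as None, so the output shape matches the protocol_details example either way.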
Implementation Offer
We're happy to implement this if you're interested! We could:
- Start with Phase 1 (preserve original strings) as a quick win
- Follow up with Phase 2 (version parsing) if desired
- Write tests to ensure data quality
- Update documentation
Looking forward to your feedback!
Related: This would complement the existing protocol harmonization work in all-protocol-mapping.json rather than replace it. The normalized protocol field would remain unchanged, with version info as an additional field.