Skip to content

Enhancement: Preserve Methodology/Protocol Version Information #129

@MMobir

Description

@MMobir

Enhancement: Preserve Methodology/Protocol Version Information

Background

Hi CarbonPlan team! 👋

We're building a carbon intelligence platform and rely heavily on OffsetsDB. Thank you for making this open source!

While working with the data, we noticed that methodology/protocol version information is present in raw registry data but gets dropped during transformation. This information would be valuable for:

  • Tracking methodology evolution over time
  • Identifying projects using outdated or problematic methodology versions
  • Analyzing credit quality correlations with protocol versions
  • Providing more granular filtering capabilities
  • Tracking which projects have which labels (e.g. CCP label)

Current Behavior

Raw Data Contains Versions

Looking at the raw registry data, methodology versions are present in the source columns:

  • Verra: Methodology column contains strings like "VM0047: ARR - Version 1.0"
  • ACR: Project Methodology/Protocol contains version info
  • Gold Standard: Methodology field includes versions
  • CDM protocols: Multiple version variations (e.g., "ACM0002 version 21.0", "ACM0001 v19.0")

Versions Get Dropped During Transformation

The all-protocol-mapping.json maps all versioned strings to a single normalized protocol:

{
  "acm0002": {
    "known-strings": [
      "ACM0002 version 21.0",
      "ACM0002 version 20.0",
      "ACM0002 v21.0",
      ...
    ]
  }
}

All these map to just "acm0002" - the version information is lost.

Proposed Solution

Phase 1: Preserve Original Protocol String (Simplest)

Add one field to the project schema:

# In models.py
'original_protocol': pa.Column(pa.String, nullable=True)

The original_protocol column already exists in the transformation pipeline via projects-raw-columns-mapping.json, but currently gets dropped before final output. We just need to keep it through to the final schema.

Pros:

  • Minimal code changes
  • Preserves complete raw data for downstream parsing
  • Maintains traceability to registry source

Cons:

  • Raw strings are messy and inconsistent
  • Downstream users must parse versions themselves

Phase 2: Parse and Extract Versions (Enhancement)

If Phase 1 is accepted, we could add:

'protocol_version': pa.Column(pa.Object, nullable=True)  # Array of version strings

With extraction logic to parse versions from the original strings using regex patterns.

Challenge noted: As mentioned in your email response, linking versions to specific credit issuances (not just projects) could be tricky depending on registry data availability. We're proposing project-level version tracking first, which should be straightforward.

Multi-Protocol Handling

I noticed the current approach handles multi-protocol projects (e.g., "ACM0001; ACM0022") by manually adding them to multiple protocol mappings. This works but:

  1. Requires exhaustive manual curation of all combinations
  2. Doesn't preserve which version of each protocol

A structured approach could handle this better:

'protocol_details': pa.Column(pa.Object, nullable=True)
# Example: [
#   {"protocol": "acm0001", "version": "19.0"},
#   {"protocol": "acm0022", "version": "3.0"}
# ]

Questions for Maintainers

  1. Would you accept a PR that preserves the original_protocol field in the final schema?
  2. Do you have preferences on how version information should be structured?
  3. For multi-protocol projects, would you prefer:
    • Keep current manual mapping + add raw string preservation?
    • Implement automatic parsing/splitting of protocol strings?
  4. Are there any other considerations we should be aware of?

Implementation Offer

We're happy to implement this if you're interested! We could:

  1. Start with Phase 1 (preserve original strings) as a quick win
  2. Follow up with Phase 2 (version parsing) if desired
  3. Write tests to ensure data quality
  4. Update documentation

Looking forward to your feedback!


Related: This would complement the existing protocol harmonization work in all-protocol-mapping.json rather than replace it. The normalized protocol field would remain unchanged, with version info as an additional field.

``

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions