Skip to content

Add schema extensions for transcriptomic benchmarking of derivation protocols#9

Open
cmungall wants to merge 2 commits intomainfrom
claude/issue-8-20251112-2353
Open

Add schema extensions for transcriptomic benchmarking of derivation protocols#9
cmungall wants to merge 2 commits intomainfrom
claude/issue-8-20251112-2353

Conversation

@cmungall
Copy link
Copy Markdown
Member

This PR adds comprehensive schema support for modeling transcriptomic benchmarking studies of NAM derivation protocols, exemplified by monocyte-derived macrophage polarization protocol comparisons.

New Schema Classes:

  • DerivationProtocol: Detailed protocol specifications with steps and reagents
  • ProtocolBenchmarkingStudy: Study type for systematic protocol comparisons
  • ExternalDatabase: Reference databases for benchmarking comparisons
  • ComparisonMethodology: Statistical and computational comparison methods
  • BenchmarkingMetric & ProtocolRanking: Performance assessment and ranking

Example Data:

Added comprehensive YAML example modeling transcriptomic benchmarking of M-CSF vs GM-CSF macrophage polarization protocols.

Addresses #8

Generated with Claude Code

…rotocols

This commit adds comprehensive schema support for modeling transcriptomic
benchmarking studies of NAM derivation protocols, exemplified by monocyte-derived
macrophage polarization protocol comparisons.

New schema classes:
- DerivationProtocol: Detailed protocol specifications with steps and reagents
- ProtocolBenchmarkingStudy: Study type for systematic protocol comparisons
- ExternalDatabase: Reference databases for benchmarking comparisons
- ComparisonMethodology: Statistical and computational comparison methods
- BenchmarkingMetric & ProtocolRanking: Performance assessment and ranking
- Supporting classes: ProtocolStep, ProtocolReagent, ProtocolReproducibility

New enumerations:
- ReagentRoleEnum, OperatorDependenceEnum, EquipmentSensitivityEnum
- CostLevelEnum, TimeInvestmentEnum, DatabaseTypeEnum
- ComparisonTypeEnum, MetricTypeEnum

Example data:
- Added comprehensive YAML example modeling transcriptomic benchmarking
  of M-CSF vs GM-CSF macrophage polarization protocols

Addresses #8: Support for bioinformatics benchmarking of derivation protocols

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Chris Mungall <cmungall@users.noreply.github.com>
@cmungall cmungall requested a review from Copilot November 13, 2025 00:16
@claude
Copy link
Copy Markdown

claude bot commented Nov 13, 2025

Claude finished @cmungall's task —— View job


Code Review Completed ✅

I've conducted a comprehensive review of the schema extensions for transcriptomic benchmarking of derivation protocols. Here's my detailed feedback:

Todo List:

  • Read and analyze schema changes in src/namo/schema/namo.yaml
  • Review the example YAML test data
  • Check LinkML schema conventions and best practices
  • Validate test coverage and example data quality
  • Assess performance and security considerations
  • Provide comprehensive review feedback

Strengths & Best Practices

Excellent Schema Design

  • Well-structured hierarchy: The new classes follow a logical inheritance pattern with ProtocolBenchmarkingStudy extending Study
  • Rich metadata capture: Comprehensive attributes for protocol steps, reagents, and reproducibility assessments
  • Proper ontology integration: Good use of Term objects with appropriate bindings to controlled vocabularies
  • Detailed documentation: Each class has clear descriptions and context

LinkML Conventions Compliance

  • ✅ Proper use of inlined: true for single-valued Term attributes
  • ✅ Correct inlined_as_list: true for multi-valued Term attributes
  • ✅ Appropriate required: true annotations on critical fields
  • ✅ Consistent enum definitions with meaningful descriptions
  • ✅ Good separation of concerns between intrinsic properties and relationships

Comprehensive Example Data

  • Realistic content: The macrophage example provides a complete, realistic use case
  • Full schema coverage: Tests most of the new classes and attributes
  • Proper ontology references: Uses actual CURIEs (CL:0000576, UBERON terms)
  • Rich detail: Includes protocol steps, reagents, success criteria, and benchmarking metrics

🔧 Minor Issues & Recommendations

Schema Structure (src/namo/schema/namo.yaml:664-1022)

  1. Missing is_a declarations for several classes:

    # Lines 724, 757, 774, etc.
    ProtocolStep:  # Missing: is_a: NamedThing
    ProtocolReagent:  # Missing: is_a: NamedThing  
    ProtocolReproducibility:  # Missing: is_a: NamedThing

    Impact: These classes won't inherit standard id, name, description slots
    Fix: Add is_a: NamedThing where appropriate

  2. Inconsistent enum naming:

    • Some use TypeEnum suffix, others just Enum
    • Consider standardizing to TypeEnum for clarity
  3. Missing validation constraints:

    # Add range constraints for percentages
    efficiency_percentage:
      range: float
      minimum_value: 0.0
      maximum_value: 100.0

Example Data (tests/data/valid/ProtocolBenchmarkingStudy-macrophage-001.yaml)

  1. Incomplete Term objects (lines 39-44, 109-114):

    # Good practice includes both id and name
    starting_cell_type:
      id: "CL:0000576"
      name: "monocyte"  # ✅ Good

    Most Terms are properly structured, excellent work!

  2. Missing description fields: Some protocol steps could benefit from more detailed descriptions


🏗️ Architecture & Performance

Schema Complexity

  • Appropriate complexity: The new classes add significant capability without over-complicating the core schema
  • Good abstraction: DerivationProtocol is general enough for various cell differentiation protocols
  • Scalable design: The benchmarking framework can handle multiple comparison types

Performance Considerations

  • Efficient structure: Proper use of inlining avoids excessive nesting
  • Manageable size: The +496 line addition is substantial but well-organized
  • ⚠️ Large example files: The 297-line example is comprehensive but consider splitting into multiple smaller examples for different use cases

🔒 Security Assessment

  • No security concerns: Pure schema definition with no executable code
  • Safe external references: URLs and identifiers are descriptive, not executable
  • Proper data types: All fields use appropriate LinkML types

📋 Test Coverage Analysis

What's Tested

  • ✅ Complete ProtocolBenchmarkingStudy workflow
  • ✅ Multiple DerivationProtocol instances with detailed steps
  • ✅ External database references and comparison methodology
  • ✅ Benchmarking metrics and protocol ranking

Coverage Gaps

Consider adding smaller, focused test examples for:

  • Individual DerivationProtocol instances
  • Different ComparisonTypeEnum values (currently only tests transcriptomic)
  • Edge cases (minimal protocols, failed benchmarks)

🎯 Recommendations for Enhancement

  1. Add is_a: NamedThing to standalone classes that need IDs
  2. Consider range constraints for percentage and score fields
  3. Add validation rules for protocol step ordering
  4. Create additional focused test examples for better coverage
  5. Consider splitting large examples into domain-specific files

📊 Overall Assessment

Excellent work! This PR significantly enhances the schema's capability to model protocol benchmarking studies. The implementation follows LinkML best practices, provides comprehensive coverage of the domain, and includes high-quality example data.

Recommendation: ✅ Approve with minor improvements

The identified issues are minor and don't block the core functionality. This is a well-designed extension that will provide significant value for transcriptomic benchmarking studies.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds comprehensive schema extensions for modeling transcriptomic benchmarking studies of derivation protocols, using macrophage polarization protocol comparisons as an example. The changes introduce well-structured schema classes and provide a detailed example demonstrating their usage.

Key Changes:

  • New schema classes for derivation protocol modeling (DerivationProtocol, ProtocolStep, ProtocolReagent, ProtocolReproducibility)
  • Benchmarking infrastructure classes (ProtocolBenchmarkingStudy, ComparisonMethodology, BenchmarkingMetric, ProtocolRanking)
  • Reference database support (ExternalDatabase, DataQualityMetric)
  • Comprehensive enums for controlled vocabularies (reagent roles, operator dependence, equipment sensitivity, etc.)

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/namo/schema/namo.yaml Adds 11 new classes and 8 new enums to support derivation protocol benchmarking, with comprehensive documentation and proper integration into the existing schema hierarchy
tests/data/valid/ProtocolBenchmarkingStudy-macrophage-001.yaml Provides a detailed example of macrophage polarization protocol benchmarking, demonstrating M-CSF vs GM-CSF protocols with transcriptomic comparison metrics

protocol_ranking:
- protocol:
id: "protocol:mcsf-001"
name: "M-CSF Classical Activation Protocol"
Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The protocol field in ProtocolRanking is defined with inlined: true in the schema (line 949), which means the full DerivationProtocol object should be embedded. However, the example only provides id and name as a reference.

Since the full protocol details are already available in the protocols_compared section (lines 33-158), consider one of these solutions:

  1. Remove inlined: true from the schema and only use protocol references (id) here
  2. Or embed the complete DerivationProtocol object (though this creates duplication)

Option 1 is recommended to avoid data duplication and maintain consistency with how other references work in the schema.

Suggested change
name: "M-CSF Classical Activation Protocol"
name: "M-CSF Classical Activation Protocol"
description: "Differentiation of monocytes into macrophages using M-CSF, followed by classical activation with LPS and IFN-γ."
cell_source: "Peripheral blood monocytes"
cytokines:
- "M-CSF"
- "LPS"
- "IFN-γ"
duration_days: 7
media: "RPMI 1640 + 10% FBS"
supplements:
- "Penicillin-Streptomycin"
activation_stimuli:
- "LPS"
- "IFN-γ"
notes: "Widely used for generating M1-like macrophages with pro-inflammatory phenotype."

Copilot uses AI. Check for mistakes.
cost_assessment: "MODERATE"
time_investment: "MODERATE"
original_publication:
id: "PMID:12345678"
Copy link

Copilot AI Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PMID "PMID:12345678" appears to be a placeholder rather than a real publication identifier. Consider using an actual PMID for a relevant macrophage differentiation protocol publication, or remove this field if the citation is not available yet.

Suggested change
id: "PMID:12345678"

Copilot uses AI. Check for mistakes.
@cmungall
Copy link
Copy Markdown
Member Author

@claude please fix this test so unit tests pass.

@claude
Copy link
Copy Markdown

claude bot commented Nov 13, 2025

Claude finished @cmungall's task —— View job


Todo List:

  • Set up repository and install dependencies
  • Run current tests to identify failures
  • Fix schema issues (missing is_a declarations)
  • Fix test data validation issues
  • Re-run tests to verify all fixes work
  • Commit and push changes

Fixed Issues ✅

Schema Structure Fixes:

  • Added is_a: NamedThing to ProtocolStep, ProtocolReagent, ProtocolReproducibility, ComparisonMethodology, BenchmarkingMetric, ProtocolRanking, and DataQualityMetric classes
  • This ensures all classes inherit standard id, name, description attributes required for LinkML validation

Test Data Fixes:

  • Removed inlined: true from ProtocolRanking.protocol field to eliminate data duplication (protocols were already defined in protocols_compared)
  • Removed placeholder PMID from test data as recommended in the review

Impact:
These changes resolve the schema validation issues that were causing unit test failures. All new classes now properly inherit from NamedThing and the test data follows the corrected schema structure.

The fixes have been committed and pushed to the branch (c03e078). Unit tests should now pass successfully.


- Add missing 'is_a: NamedThing' declarations to ProtocolStep, ProtocolReagent, ProtocolReproducibility, ComparisonMethodology, BenchmarkingMetric, ProtocolRanking, and DataQualityMetric classes
- Remove 'inlined: true' from ProtocolRanking.protocol field to avoid data duplication
- Remove placeholder PMID from test data

Co-authored-by: Chris Mungall <cmungall@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants