ImmuneSpace to AKC conversion#20

Draft
LonnekeScheffer wants to merge 16 commits into main from immunespace_integration

@LonnekeScheffer
Contributor

@LonnekeScheffer LonnekeScheffer commented Feb 20, 2025

I implemented the script immunespace_to_akc.py, which takes data from ImmuneSpace and converts it to an AKC output JSON file.

  • Input: two SQLite database files, representing data from ImmuneSpace and ImmPort.
    • I mainly use the ImmuneSpace tables; additional data can be retrieved from ImmPort and may be added to ImmuneSpace in the future. The data represents one investigation: SDY460 (https://immunespace.org/query/study/SDY460)
  • I also added unit tests, primarily focusing on ensuring akc_ids are linked between objects as expected.
  • For now, the crucial parts still missing from this conversion are 'metadata' specimen_collections/specimen_processings/assays and 'data' chains/tcell_receptors/epitopes. These are not (completely) in ImmuneSpace yet (as far as I know) -> I think this would be nice to discuss in the next Schema meeting, ideally Bjoern should be present for the big picture overview.

Points I'd like input on

If you agree with my current/suggested solution feel free to just check the box without commenting.

  • @schristley @jamesaoverton @bcorrie AKC has inclusion_criteria both on the StudyArm level and on the full Investigation level. The descriptions are the same in both cases (StudyArm). I presume that on the StudyArm level these criteria relate to inclusion/exclusion in the arm, not the whole study? While writing this I found the issue "Investigation should not have inclusion_criteria and exclusion_criteria" (issues#62) and commented there; feel free to continue the discussion there. Here is what I did for now:

    • Investigation level: retrieve from ImmPort study level (examples: inclusion: [“No acute illness at time of vaccination”], exclusion: [“Autoimmune disease”])
    • StudyArm level: retrieve from ImmuneSpace description (example: inclusion: [“18-30 year old adults”], exclusion: here always None)
  • @schristley For Reference: source_uri is the PMID, sources includes both PMID and doi (let me know if any of that should change). Just wanted to note: the type for source_uri somehow gets converted to a ReferenceSourceUri object instead of Curie, without ‘resolving’ the Curie into an actual URI. Will that be an issue? I’m not sure where this behavior originates since the source_uri slot has range: uriorcurie in the ak_schema.yaml

  • @schristley / @jamesaoverton: identifiers are still a little messy and I'd like your input. For now, they follow the format ImmuneSpace:{name of the item}-{identifier}. For instance: "ImmuneSpace:investigation-SDY460", "ImmuneSpace:arm-ARM2480", "ImmuneSpace:participant-SUB134239". These look like CURIEs but they are not. Interested to hear your take on what the IDs should be. Some thoughts:

    • For the Investigation, it is possible to make a CURIE (ImmuneSpace:SDY460).
    • For all other objects, you'd probably want to use a combination of "ImmuneSpace" (?) + study id + object id (+ object type? -> probably not necessary if we can presume that ImmuneSpace identifiers always follow a different formatting for different types of objects. The only reason I care about this is some overly precautious way of avoiding duplicate IDs @jamesaoverton).
  • study_type is now always OBI:0000066 (investigation)

    • @jamesaoverton is it sufficient to always use OBI:0000066 for ImmuneSpace studies, or should a subclass sometimes be used? and in the latter case, what criteria should I use to determine the most appropriate subclass?
    • @schristley @bcorrie the range for study_type is currently not decided manually, but determined by the auto-converted AIRR LinkML. The issue here is that AIRR defines the source node as NCIT:C63536. James brought up that OBO ontologies are preferential, and that new terms can be added if needed for the AKC. Being able to use AIRR-based LinkML is great (e.g., for everything Rearrangement-related), but being forced to use a range defined by the AIRR schema seems like a limitation here (I'm not worried about this one field, but thinking ahead a bit in case this ends up happening regularly).
  • Investigation.archival_id @schristley would it suffice here to have a CURIE pointing towards ImmuneSpace: https://immunespace.org/query/study/SDY460 ? Or would you specifically want the BioProject ID? The former can be done very easily, the latter is a small workaround but still possible.

  • I'm not sure yet how to convert to ontologies with a flexible vocabulary (permissible values tbd based on ontology root node), or if some functionality still needs to be implemented for that? @schristley

  • Dealing with enum case: IEDB has value "Documented exposure without evidence for disease", I now imported a bunch of enum values from ImmuneSpace, where this one happened to be overwritten by "documented exposure without evidence for disease". Sure, can be fixed in many easy ways, but maybe we should decide on a specific way of dealing with synonymous enums? @schristley @bcorrie

  • minor point: for references, while in theory multiple studies/investigations can relate to 1 reference, in this conversion the number of investigations per reference is always 1 (the study that is being converted). In theory, the same reference (PMID) could come up in multiple studies, and those may need to be merged somewhere, somehow. I believe this is not something to resolve at the level of this script though.

  • minor point: I didn't find phenotypic_sex in ImmuneSpace, only biological_sex. I think it's not in ImmPort either (I only see 'gender') @jamesaoverton can you confirm? I think we can live without it.

  • minor point: ImmuneSpace has 'race' and 'race specify'. As far as I understand 'race' is a closed vocabulary and in 'race specify' people can fill in whatever they want if they selected race: other. I omitted 'race specify' information for now, keeping only 'race'.

  • minor point: @jamesaoverton can you confirm that the author list is always formatted the same in ImmuneSpace? (i.e., I can safely split on ", " to get a list of author names?)
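To make the identifier question from earlier in this list concrete, here is a minimal sketch of the three formats under discussion. The function names are hypothetical and are not taken from immunespace_to_akc.py; the study-scoped variant is only one possible reading of the "ImmuneSpace + study id + object id" idea.

```python
def legible_id(item_type: str, identifier: str) -> str:
    """Current format described above: ImmuneSpace:{name of the item}-{identifier}."""
    return f"ImmuneSpace:{item_type}-{identifier}"


def investigation_curie(study_id: str) -> str:
    """CURIE form that is possible for the Investigation itself."""
    return f"ImmuneSpace:{study_id}"


def study_scoped_id(study_id: str, object_id: str) -> str:
    """Hypothetical alternative: prefix + study id + object id, without the object type."""
    return f"ImmuneSpace:{study_id}-{object_id}"


print(legible_id("arm", "ARM2480"))
print(investigation_curie("SDY460"))
print(study_scoped_id("SDY460", "SUB134239"))
```

Dropping the object type is only safe if ImmuneSpace identifiers really do use a distinct prefix per object type (SDY, ARM, SUB), which is the open question above.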

@github-actions

github-actions Bot commented Feb 20, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://airr-knowledge.github.io/ak-schema/pr-preview/pr-20/

Built to branch gh-pages at 2025-08-14 00:31 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@bcorrie
Collaborator

bcorrie commented Feb 20, 2025

  • For Reference: source_uri is the PMID, sources includes both PMID and doi (let me know if any of that should change)

In the ADC the pub_ids field is an array of CURIEs that can be either PMID or DOI - see https://github.com/airr-community/airr-standards/blob/70f7a00e3f9adfb19c7eaa1a0213ece2cf049f50/specs/airr-schema-openapi3.yaml#L2081

Currently the ADC to AKC conversion just copies the value. We need to decide what this should be for the AKC. The AKC Reference.sources is an array of uriorcurie type, which transforms nicely from the ADC.
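As a rough illustration of one option, the sketch below copies an ADC-style pub_ids array into sources and picks the first PMID as source_uri, which matches what the ImmuneSpace conversion describes. It uses plain dicts rather than the generated ak_schema classes, the function names are hypothetical, and the identifiers in the example are placeholders, not real publications.

```python
from typing import Dict, List, Optional


def group_by_prefix(pub_ids: List[str]) -> Dict[str, List[str]]:
    """Group CURIE-style ids such as 'PMID:...' or 'DOI:...' by their prefix."""
    grouped: Dict[str, List[str]] = {}
    for curie in pub_ids:
        prefix, sep, local = curie.partition(":")
        if not sep or not local:
            raise ValueError(f"not a CURIE: {curie!r}")
        grouped.setdefault(prefix.upper(), []).append(local)
    return grouped


def to_reference(pub_ids: List[str]) -> Dict[str, object]:
    """Assumed policy: source_uri is the first PMID if any; sources keeps everything."""
    pmids = group_by_prefix(pub_ids).get("PMID", [])
    source_uri: Optional[str] = f"PMID:{pmids[0]}" if pmids else None
    return {"source_uri": source_uri, "sources": list(pub_ids)}


# Placeholder identifiers for illustration only.
print(to_reference(["PMID:0000000", "DOI:10.0000/example"]))
```

If the AKC decides that source_uri should prefer the DOI, only the lookup key in to_reference would change.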

@bcorrie
Collaborator

bcorrie commented Feb 20, 2025

  • the range for study_type is currently not decided manually, but determined by the auto-converted AIRR LinkML. The issue here is that AIRR defines the source node as NCIT:C63536. James brought up that OBO ontologies are preferential, and that new terms can be added if needed for the AKC. Being able to use AIRR-based LinkML is great (e.g., for everything Rearrangement-related), but being forced to use a range defined by the AIRR schema seems like a limitation here (I'm not worried about this one field, but thinking ahead a bit in case this ends up happening regularly).

I agree, I don't think the AKC should be limited by the limitations of any of the component repositories. I think that we should determine the "right" definition for the AKC. The challenge then becomes mapping the data in the component repositories (e.g. the ADC study_type) to the appropriate definition in the AKC. If the AKC defines its own ontology (or an enum populated with allowable ontology terms) with additions like those from NCIT, then this is easy. If we want to use a predefined study ontology, then we have to map terms between repositories (NCIT->AKC Ontology).

With that said, I believe NCIT is an OBO ontology, no? This would be the normal term used for a Case-Control Study in the ADC:

https://www.ebi.ac.uk/ols4/ontologies/ncit/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCIT_C15197

@schristley
Member

Hi @LonnekeScheffer, that's great progress! Before we dig too deep into the code and implementation, we should decide where the code will ultimately reside. While it's fine to have an example in ak-schema, we want the actual production programs to be outside the schema in their own repository. There's ak-etvl, where I've been working on the main integration scripts, so I think that might be the appropriate place for your transform script. We could also consider a separate repository if that makes sense.

In ak-etvl, I started putting together common shared functions in ak_schema_utils.py and you can take a look at iedb_transform.py which is where I turned James' initial example into a more complete script. There's a bunch of extraneous stuff at the end where I've been working through generating CSV files for easy import into the database, that's also stuff that needs to be generalized into common functions at some point.

@schristley
Member

  • For now, the crucial parts still missing from this conversion are 'metadata' specimen_collections/specimen_processings/assays and 'data' chains/tcell_receptors/epitopes. These are not (completely) in ImmuneSpace yet (as far as I know) -> I think this would be nice to discuss in the next Schema meeting, ideally Bjoern should be present for the big picture overview.

Yes, in particular the sequence data because just the study data isn't that useful without it. Here's my concept of how the process will work:

  • Somehow we know a new study is available with AIRR-seq data and we have an ID for it
  • Study data gets transformed: ImmuneSpace -> AK Schema -> AIRR JSON
  • The BioProject ID is used to search SRA (or other archive) to find AIRR-seq FASTQ files
  • Those FASTQ files are downloaded from SRA and uploaded to a VDJServer project
  • AIRR JSON is imported into VDJServer project: study, subject, sample, repertoires and repertoire groups are defined and associated with their appropriate sequence files
  • pre-processing of the FASTQ files
  • VDJ annotation of the sequences
  • Load study into the AIRR Data Commons
  • Load study into AKC
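The ordering of these steps can be sketched as a simple staged pipeline. This is only an outline under stated assumptions: every stage name below is a hypothetical label paraphrasing a step in the list, and the placeholder stages do no real work (no SRA, VDJServer, ADC or AKC calls are made).

```python
from typing import Callable, Dict, List

# Hypothetical stage names paraphrasing the steps listed above, in order.
STAGE_NAMES: List[str] = [
    "transform_study_metadata",      # ImmuneSpace -> AK Schema -> AIRR JSON
    "find_fastq_by_bioproject",      # search SRA (or other archive)
    "upload_fastq_to_vdjserver",
    "import_airr_json_to_vdjserver",
    "preprocess_fastq",
    "annotate_vdj",
    "load_into_adc",
    "load_into_akc",
]


def make_placeholder_stage(name: str) -> Callable[[Dict], Dict]:
    """Build a stand-in stage that only records that it ran."""
    def stage(state: Dict) -> Dict:
        return {**state, "completed": state["completed"] + [name]}
    return stage


def run_pipeline(study_id: str, stages: List[Callable[[Dict], Dict]]) -> Dict:
    """Run the stages in order, threading a state dict through them."""
    state: Dict = {"study_id": study_id, "completed": []}
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline("SDY460", [make_placeholder_stage(n) for n in STAGE_NAMES])
print(result["completed"])
```

Keeping each step as a separate function over a shared state makes it easy to rerun from a failed stage, which seems useful given that the download and annotation steps are long-running.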

@bcorrie
Collaborator

bcorrie commented Feb 21, 2025

With that said, I believe NCIT is an OBO ontology, no? This would be the normal term used for a Case-Control Study in the ADC:

https://www.ebi.ac.uk/ols4/ontologies/ncit/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCIT_C15197

@LonnekeScheffer I see that the OBO base node for an Investigation that you mentioned (OBI:0000066) is a high-level description with no subclasses to describe the type of investigation. So if we want the OBI investigation to be part of the AKC enum, because we want to be able to use the logic that describes that ontology term, then we would still want to use the NCIT ontology terms as well to provide a more detailed description of the type of study. So as @jamesaoverton suggests, "... that OBO ontologies are preferential, and that new terms can be added if needed for the AKC" seems like the way to go.

@jamesaoverton would you envision this as adding new OBI terms under Investigation (OBI:0000066) through the formal OBO process (so OBI:0000066 would have children that described the type of investigation) or would we create a LinkML enum that has NCIT terms in it and has a class_uri: OBI:0000066 attribute?

@jamesaoverton
Collaborator

jamesaoverton commented Feb 21, 2025

NCIT is not an OBO ontology, despite those OBO PURLs. There's a translation of NCIT into OBO-compatible annotations ("NCI Thesaurus OBO Edition"), but it does not follow OBO principles or best practices, does not (re)use terms from other OBO projects, and it never will. Yes, this is very confusing, and after many years I'm still upset about the situation.

We can easily add subclasses of investigation to OBI. We tend to distinguish investigations using other properties, such as study designs and assays, but we've been having recent conversations about just adding more investigation subclasses for ImmuneSpace and others to use.

@bcorrie
Collaborator

bcorrie commented Feb 22, 2025

In the AIRR Standard, the field study_type is used to capture the study design, so this would be more equivalent to OBI study_design as defined here: http://purl.obolibrary.org/obo/OBI_0500000. In the AIRR Standard we use NCIT:C63536 as the root node, and it has similar child nodes to OBI:0500000 (e.g. Case-Control Study). Although NCIT has a large number of child nodes, in practice there are only 6 child nodes used out of all of the current studies in the ADC. So it doesn't seem to me like we need to add more subclasses to OBI investigation; instead we should be considering using OBI study_design and maybe (but maybe not) adding more subclasses to that.

So for the AKC, I think we need to decide whether we want to use an OBI ontology tree (starting at study_design, OBI:0500000 as the "base node") for the AKC study_type field. If so, then we need to define a mapping from an NCIT study_type to an OBI study_design for the translation from the ADC to the AKC. I don't think this would be too hard. Currently the ADC has these:

  • Study
  • Case-Control Study
  • Observational Study
  • Longitudinal Study
  • Nested Case-Control Study
  • Phase II Trial
  • Animal Study

For the most part there is a one-to-one mapping.
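A minimal sketch of what such a translation table could look like is below. Only the two term pairs whose identifiers appear in this thread are filled in; pairing the NCIT root (NCIT:C63536) with the OBI root (OBI:0500000) is an assumption, and the remaining ADC terms are deliberately left out rather than guessed. The function name and the fallback-to-root policy are hypothetical.

```python
from typing import Dict, Tuple

OBI_STUDY_DESIGN_ROOT = "OBI:0500000"  # OBI 'study design'

# Partial table: only identifiers mentioned in this thread are included.
NCIT_TO_OBI: Dict[str, str] = {
    "NCIT:C63536": OBI_STUDY_DESIGN_ROOT,  # Study (AIRR root node); assumed root-to-root
    "NCIT:C15197": "OBI:0002624",          # Case-Control Study -> case-control study
}


def map_study_type(ncit_curie: str) -> Tuple[str, bool]:
    """Return (obi_curie, mapped); unmapped terms fall back to the OBI root."""
    if ncit_curie in NCIT_TO_OBI:
        return NCIT_TO_OBI[ncit_curie], True
    return OBI_STUDY_DESIGN_ROOT, False


print(map_study_type("NCIT:C15197"))
```

Returning a flag alongside the term keeps unmapped NCIT values visible, so a new ADC study with an unexpected study_type shows up as a gap instead of silently becoming a generic study design.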

If we did that and ImmuneSpace also used OBI study_type and its children to denote study design, then that would work. @LonnekeScheffer it sounds to me currently like there is no use of study_design in ImmuneSpace since you said that everything was an investigation, is that correct?

@schristley
Member

schristley commented Feb 22, 2025

in practice there are only 6 child nodes used out of all of the current studies in the ADC.

Thanks for checking that, I thought it was a low number too.

For the most part there is a one-to-one mapping.

Ok, I'm good with the idea of using OBI so long as we can map to the same semantics. I see some of the mappings but not sure about these:

  • Nested Case-Control Study
  • Phase II Trial
  • Animal Study

If yes, then we need to change Investigation to use a new slot instead of study_type, maybe study_design?

@jamesaoverton Is this where we supposedly use SSSOM to describe the mapping from the NCIT to OBI ontologies? Ah, I saw too late your previous comment about a translation being available, so I guess we should try to use that.

@LonnekeScheffer
Contributor Author

If we did that and ImmuneSpace also used OBI study_type and its children to denote study design, then that would work. @LonnekeScheffer it sounds to me currently like there is no use of study_design in ImmuneSpace since you said that everything was an investigation, is that correct?

Apologies for the late response. I didn't find anything in ImmuneSpace that maps to something that could be considered a study type or design. James recommended that I use the OBI 'Investigation', but since this is now hardcoded for all ImmuneSpace studies it's not super meaningful. I looked through ImmPort and didn't find anything I'd confidently say describes the same thing (the closest I found was 'research focus', which can be things like 'vaccine response', 'epidemiology', 'computational modeling', to name a few). @jamesaoverton do you know of anything in ImmuneSpace or ImmPort that would fit under the study design as described by Brian?

@bcorrie
Collaborator

bcorrie commented Feb 26, 2025

NCIT is not an OBO ontology, despite those OBO PURLs. There's a translation of NCIT into OBO-compatible annotations ("NCI Thesaurus OBO Edition"), but it does not follow OBO principles or best practices, does not (re)use terms from other OBO projects, and it never will. Yes, this is very confusing, and after many years I'm still upset about the situation.

@jamesaoverton naive ontology person questions/comments:

The OBO study_design (OBI:0500000) doesn't seem very complete to me, at least compared to the richness of NCIT. The AIRR Community ontology group explored ontologies pretty extensively when they were chosen, and NCIT seemed to be the most appropriate to describe what the AIRR Community seemed to feel was important (See https://docs.airr-community.org/en/latest/ontovoc/introduction_ontovoc.html).

So my question would be why isn't the OBO study_design more complete? 8-)

In particular, if there are other study design ontologies that are more complete (e.g. NCIT) but don't adhere to the OBO principles etc., why hasn't the OBO community built on the work of others like NCIT and added the features/terms that are missing? I am not sure what would be required for NCIT to follow OBO best practices/principles and reuse terms from other OBO projects. I realize that this may be a lot of work, and you would hope that the NCIT folks would do this (you suggest they haven't and never will, which seems unfortunate), but now it seems like we have the worst of both worlds: an NCIT study design ontology that is complete but can't be integrated with other OBO ontologies, and an incomplete OBO study design ontology that doesn't meet the needs of the community. At least it didn't meet the needs of the AIRR Community, based on the OntoVoc working group's assessment at the time.

In our current case, the fact that only a small number of terms from NCIT are used currently in the ADC, and that there is a reasonable mapping of those terms into OBO study_design fields, makes it possible to do this transformation. But I would argue that it would be desirable to have a more rich OBO study_design. It is not hard for me to envision cases where a study design would be more accurately described currently with NCIT than it could be with OBO study_design. If it is easy to add ontology terms, why don't we just add the subtrees from NCIT that we think are appropriate to OBO study_design. And if this is easy, why hasn't someone already done this 8-)

@schristley
Member

@LonnekeScheffer Follow-up from our 1-on-1 discussion: I looked at my current development tree, and for ak-etvl the ak-schema submodule is set to the integration-prototype branch. I can merge it, but I expect that there will be more schema modifications as the integration work progresses.

My suggestion is to "wrap up" this integration example then move your integration script over to ak-etvl. Some of your ontology errors may resolve themselves if your ak-schema is on that integration branch.

@jamesaoverton
Collaborator

@bcorrie NCIT is a thesaurus. It's a bunch of labels and textual definitions, roughly collected by topic into a tree. It has almost 30,000 terms, which seems impressive, until you realize that it spans a couple dozen branches from genes to organisms to diseases to food.

OBO is a community of ontologies. We have a couple hundred ontologies (not all of them great), covering millions of terms. Because they're ontologies, they have logical definitions that automated reasoners and query engines can use -- not just string searches. Because they follow a shared set of principles and best practices, the logical definitions can link many ontologies together. Gene stuff goes in the Gene Ontology, organisms are in the NCBI Taxonomy, diseases are in various species-specific disease ontologies, and there's a whole Food Ontology, and hundreds more.

What NCIT has that OBI does not have is funding. OBI, like most OBO ontologies, relies on the funded projects that use it to donate some time and submit new terms. Adding more study designs is as easy as making a PR adding rows to this table, and passing review: https://github.com/obi-ontology/obi/blob/master/src/ontology/templates/study-designs.tsv. I don't know how you get a fix or a new term into NCIT.

@bcorrie
Collaborator

bcorrie commented Feb 27, 2025

What NCIT has that OBI does not have is funding. OBI, like most OBO ontologies, relies on the funded projects that use it to donate some time and submit new terms. Adding more study designs is as easy as making a PR adding rows to this table, and passing review: https://github.com/obi-ontology/obi/blob/master/src/ontology/templates/study-designs.tsv.

I think this is my point. I am surprised that no one in the OBI world has added more complete terms and structure to OBI study_design since it seems foundational to OBI's mandate - in particular when you have a "nice" subtree like the NCIT study tree to build from. Because NCIT is a tree it at least has an is-a relationship that can be used.

Maybe we do this through AKC, as one of those funded projects that contributes, but as I say I am surprised that the OBI study_design is not more rich already...

@schristley do we take the high road and try to make OBI study_design more complete?

@jamesaoverton would it be considered bad form to just dump the hierarchy for a set of NCIT subtrees into the OBI table and make the pull request 8-) I could see adding NCIT trees like Observational Study, Case Series, Clinical Study, Case-Control Study and it seems like they could be added fairly easily given the table - but that would probably double the size of OBI study_design...

The fact that this hasn't been done before makes me a bit skeptical about how easy you say it is 8-)

@jamesaoverton
Collaborator

Sorry, I don't understand. NCIT has NCIT:C15320 'Study Design' with 166 descendant terms, but those descendants include terms such as 'Study Arm' with 61 children. It is simply false to say that "'Study Arm' is a 'Study Design'".

https://www.ebi.ac.uk/ols4/ontologies/ncit/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCIT_C15320

On the other hand we have OBI, with OBI:0500000 'study design' which has 109 descendants in OLS that are either actual subclasses or parts, clearly distinguished with logical axioms. I don't see how NCIT is better than OBI on this measure.

https://www.ebi.ac.uk/ols4/ontologies/obi/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FOBI_0500000

In practice, OBI users use more detailed terms and more specific axioms, and not the broad high-level terms that NCIT uses. But there's a clear path to add these broad terms to OBI.

OBO projects all have open licenses. I'm not clear on the license for NCIT, so I wouldn't be comfortable copy-pasting from it into OBI until that's clarified.

@jamesaoverton
Collaborator

Now I see that NCIT's 'Study' term has a couple hundred subclasses that have convenient labels https://www.ebi.ac.uk/ols4/ontologies/ncit/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCIT_C63536?lang=en which we would model in OBI by logically expressing the parts and their relations. But I admit that sometimes broad terms are helpful.

@bcorrie
Collaborator

bcorrie commented Feb 27, 2025

Now I see that NCIT's 'Study' term has a couple hundred subclasses that have convenient labels https://www.ebi.ac.uk/ols4/ontologies/ncit/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCIT_C63536?lang=en which we would model in OBI by logically expressing the parts and their relations. But I admit that sometimes broad terms are helpful.

Yes, that is the hierarchy I was talking about, but when you pointed to the other one I admit I got very confused - thought I was looking at the same thing but they were different. Sheesh, this is a messy world. 8-)

It looks like NCIT is CC-BY-4.0 (https://bioportal.bioontology.org/ontologies/NCIT) - presumably the approach one would take in using NCIT as a source for the creation of OBI terms would be to create an OBI term, define any required axioms for that term, and reference the NCIT term as a synonym (or the "definition source" in the table)???

Correct me if I am wrong, but one of the nice things about the NCIT Study is that it has quite a bit more structure to it than OBI, in that its tree is more "complex" and therefore there are a lot of is-a relationships built in (don't ask me if the tree encapsulates a good set of relationships). The OBI study_design is relatively flat, so there are not a lot of inferences that you can do between the entities within study_design.

Again, correct me if I am wrong... For case-control study (which we use in the ADC) https://ontobee.org/ontology/OBI?iri=http://purl.obolibrary.org/obo/OBI_0002624 it has a superclass of study_design but no other axioms related to it that I can see. So the OBI case-control study definition is not really any more complex (from a rule/axiom perspective) than the NCIT one, is it? For OBI case-control study you inherit all of the axioms/relationships from its superclass, but there are no axioms involving the term case-control study itself that infer other relationships to other OBI entities. Or am I missing something?

Bottom line, I am not sure what the correct approach is here, but I think I am slowly wrapping my head around the differences 8-)

@bcorrie
Collaborator

bcorrie commented Mar 5, 2025

In today's meeting we discussed this and agreed that using OBI is best. We will add the terms used by the ADC (listed here: #20 (comment)) to OBI; James will make the request to the LJI team for this. Each OBI term would reference its source NCIT term, so mapping would be possible.

The only potential caveat is that if an ADC study is added with a different NCIT term that is not in OBI, we would have a gap...

@schristley any thoughts/concerns?

@schristley
Member

@LonnekeScheffer

  • I'm not sure yet how to convert to ontologies with a flexible vocabulary (permissible values tbd based on ontology root node), or if some functionality still needs to be implemented for that?
  • minor point: I didn't find phenotypic_sex in ImmuneSpace, only biological_sex. I think it's not in ImmPort either (I only see 'gender')

Values need to be ontology IDs, I've changed the code for a simple example for biological sex, using PATO IDs. You'll need to do this for LifeEventProcessOntology and ExposureMaterialOntology values as those are still throwing errors.
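As a rough sketch of the kind of lookup this implies for biological sex (not the actual change made in the code), the mapping below uses the PATO terms for female and male. The function name is hypothetical, and the spelling of the incoming ImmuneSpace values is an assumption; anything unrecognised is returned as None so that it can be reviewed rather than silently passed through.

```python
from typing import Optional

# PATO terms for biological sex; keys are lower-cased source values (assumed spellings).
BIOLOGICAL_SEX_TO_PATO = {
    "female": "PATO:0000383",
    "male": "PATO:0000384",
}


def biological_sex_to_ontology_id(value: Optional[str]) -> Optional[str]:
    """Map a free-text sex value to a PATO CURIE, or None if missing or unmapped."""
    if value is None:
        return None
    return BIOLOGICAL_SEX_TO_PATO.get(value.strip().lower())


print(biological_sex_to_ontology_id("Female"))
```

The same pattern (a small table from source strings to ontology IDs, compared case-insensitively) would presumably also apply to the LifeEventProcessOntology and ExposureMaterialOntology values, and it sidesteps the enum-case issue raised earlier in the thread.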

schristley and others added 7 commits March 12, 2025 19:54
# Conflicts:
#	project/jsonld/ak_schema.jsonld
#	project/owl/ak_schema.owl.ttl
#	project/sqlddl/ak_schema.sql
#	src/ak_schema/datamodel/ak_schema.py
making a meaningless change (added space) to test if pushing via GitHub works
- update according to new data model
- for now, not use ImmPort db (simplified)
- make random UUID identifiers (and export mapping from legible to random)
- started working on script to export VDJServer metadata files

4 participants