ImmuneSpace to AKC conversion by LonnekeScheffer · Pull Request #20 · airr-knowledge/ak-schema

LonnekeScheffer · 2025-02-20T03:41:29Z

I implemented the script immunespace_to_akc.py, which takes data from ImmuneSpace and converts it to an AKC output JSON file.

Input: two SQLite database files, representing data from ImmuneSpace and ImmPort.
- I mainly use the ImmuneSpace tables; additional data can be retrieved from ImmPort and may be added to ImmuneSpace in the future. The data represents one investigation: SDY460 (https://immunespace.org/query/study/SDY460)
I also added unit tests, primarily focusing on ensuring akc_ids are linked between objects as expected.
For now, the crucial parts still missing from this conversion are 'metadata' specimen_collections/specimen_processings/assays and 'data' chains/tcell_receptors/epitopes. These are not (completely) in ImmuneSpace yet (as far as I know) -> I think this would be nice to discuss in the next Schema meeting, ideally Bjoern should be present for the big picture overview.
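
For readers who have not opened the script, here is a minimal sketch of the general shape of such a conversion. It is not the actual immunespace_to_akc.py: the table and column names (`participants`, `participant_id`, `study_accession`), the output keys, and the `AKC:` identifier prefix are all assumptions made for illustration. It reads rows from a SQLite database, builds objects linked by akc_id, and includes the kind of link check the unit tests are described as focusing on.

```python
import json
import sqlite3
import uuid


def convert(conn: sqlite3.Connection, study_accession: str) -> dict:
    """Build a tiny AKC-like container from a hypothetical ImmuneSpace table."""
    # Hypothetical identifier scheme: a random UUID behind an "AKC:" prefix.
    investigation_id = f"AKC:{uuid.uuid4()}"
    rows = conn.execute(
        "SELECT participant_id FROM participants "
        "WHERE study_accession = ? ORDER BY participant_id",
        (study_accession,),
    )
    participants = [
        {
            "akc_id": f"AKC:{uuid.uuid4()}",
            "name": participant_id,
            # Link back to the investigation by its akc_id.
            "investigation": investigation_id,
        }
        for (participant_id,) in rows
    ]
    return {
        "investigations": [{"akc_id": investigation_id, "name": study_accession}],
        "participants": participants,
    }


def unresolved_links(container: dict) -> list:
    """Return akc_ids of participants whose investigation link does not resolve."""
    known = {inv["akc_id"] for inv in container["investigations"]}
    return [
        p["akc_id"]
        for p in container["participants"]
        if p["investigation"] not in known
    ]


# Demo on an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participants (participant_id TEXT, study_accession TEXT)")
conn.executemany(
    "INSERT INTO participants VALUES (?, ?)",
    [("SUB1.460", "SDY460"), ("SUB2.460", "SDY460"), ("SUB9.1", "SDY1")],
)
conn.commit()

container = convert(conn, "SDY460")
output_json = json.dumps(container, indent=2)
```

A test along these lines would assert that `unresolved_links(container)` is empty for every object type that carries a link.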

Points I'd like input on

If you agree with my current/suggested solution feel free to just check the box without commenting.


github-actions · 2025-02-20T03:42:22Z

PR Preview Action v1.6.2
🚀 View preview at https://airr-knowledge.github.io/ak-schema/pr-preview/pr-20/
Built to branch `gh-pages` at 2025-08-14 00:31 UTC. Preview will be ready when the GitHub Pages deployment is complete.

bcorrie · 2025-02-20T18:02:16Z

For Reference: source_uri is the PMID, sources includes both PMID and doi (let me know if any of that should change)

In the ADC the pub_ids field is an array of CURIEs that can be either PMID or DOI - see https://github.com/airr-community/airr-standards/blob/70f7a00e3f9adfb19c7eaa1a0213ece2cf049f50/specs/airr-schema-openapi3.yaml#L2081

Currently the ADC to AKC conversion just copies the value. We need to decide what this should be for the AKC. The AKC Reference.sources is an array of uriorcurie type, which transforms nicely from the ADC.
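
To make the options concrete, here is a small sketch of one way the transformation could look if source_uri stays the PMID and sources carries every identifier. The function names, the accepted input spellings, and the upper-case `PMID:`/`DOI:` prefixes are assumptions for illustration; the identifiers in the usage lines are placeholders, not real publications.

```python
DOI_URL_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def to_curie(pub_id: str) -> str:
    """Normalize one ADC pub_ids entry to a PMID: or DOI: CURIE."""
    value = pub_id.strip()
    lowered = value.lower()
    if lowered.startswith("pmid:"):
        return "PMID:" + value[5:].strip()
    if lowered.startswith("doi:"):
        return "DOI:" + value[4:].strip()
    for prefix in DOI_URL_PREFIXES:
        if lowered.startswith(prefix):
            return "DOI:" + value[len(prefix):]
    raise ValueError(f"Unrecognized publication identifier: {pub_id!r}")


def build_reference(pub_ids: list) -> dict:
    """Build a Reference-like dict: sources keeps every CURIE, source_uri prefers the PMID."""
    sources = [to_curie(p) for p in pub_ids]
    pmids = [s for s in sources if s.startswith("PMID:")]
    if pmids:
        source_uri = pmids[0]
    else:
        # Assumption: fall back to the first identifier when no PMID is present.
        source_uri = sources[0] if sources else None
    return {"source_uri": source_uri, "sources": sources}


# Placeholder identifiers, for illustration only.
reference = build_reference(["https://doi.org/10.1000/example", "PMID: 12345678"])
doi_only = build_reference(["doi:10.1000/example"])
```

Whether a DOI is acceptable as source_uri when no PMID exists is exactly the kind of thing that still needs a decision; the fallback here is just one possible choice.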

bcorrie · 2025-02-20T18:12:55Z

The range for study_type is no longer decided manually but is determined by the auto-converted AIRR LinkML. The issue here is that the source node is defined by AIRR as NCIT:C63536. James brought up that OBO ontologies are preferential, and that new terms can be added if needed for the AKC. While being able to use AIRR-based LinkML is great (e.g., all Rearrangement-related stuff), being forced to use a range defined by the AIRR schema seems like a limitation here (I'm not worried about this one field, but I'm thinking ahead a bit in case this ends up happening regularly).

I agree, I don't think the AKC should be limited by the limitations of any of the component repositories. I think that we should determine the "right" definition for the AKC. The challenge then becomes mapping the data in the component repositories (e.g. the ADC study_type) to the appropriate definition in the AKC. If the AKC defines its own ontology (or an enum populated with allowable ontology terms) with additions like those from NCIT, then this is easy. If we want to use a predefined study ontology, then we have to map terms between repositories (NCIT->AKC Ontology).

With that said, I believe NCIT is an OBO ontology, no? This would be the normal term used for a Case-Control Study in the ADC:

https://www.ebi.ac.uk/ols4/ontologies/ncit/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCIT_C15197

schristley · 2025-02-20T18:56:46Z

Hi @LonnekeScheffer, that's great progress! Before we dig too deep into the code and implementation, we should decide where the code will ultimately reside. While it's fine to have an example in ak-schema, we want the actual production programs to live outside the schema, in their own repository. There's ak-etvl, where I've been working on the main integration scripts, so I think that might be an appropriate place for your transform script. We could also consider a separate repository if that makes sense.

In ak-etvl, I started putting together common shared functions in ak_schema_utils.py and you can take a look at iedb_transform.py which is where I turned James' initial example into a more complete script. There's a bunch of extraneous stuff at the end where I've been working through generating CSV files for easy import into the database, that's also stuff that needs to be generalized into common functions at some point.

schristley · 2025-02-20T19:11:18Z

For now, the crucial parts still missing from this conversion are 'metadata' specimen_collections/specimen_processings/assays and 'data' chains/tcell_receptors/epitopes. These are not (completely) in ImmuneSpace yet (as far as I know) -> I think this would be nice to discuss in the next Schema meeting, ideally Bjoern should be present for the big picture overview.

Yes, in particular the sequence data because just the study data isn't that useful without it. Here's my concept of how the process will work:

1. Somehow we know a new study is available with AIRR-seq data and we have an ID for it
2. Study data gets transformed: ImmuneSpace -> AK Schema -> AIRR JSON
3. The BioProject ID is used to search SRA (or other archive) to find AIRR-seq FASTQ files
4. Those FASTQ files are downloaded from SRA and uploaded to a VDJServer project
5. AIRR JSON is imported into VDJServer project: study, subject, sample, repertoires and repertoire groups are defined and associated with their appropriate sequence files
6. Pre-processing of the FASTQ files
7. VDJ annotation of the sequences
8. Load study into the AIRR Data Commons
9. Load study into AKC

bcorrie · 2025-02-21T17:16:15Z

With that said, I believe NCIT is an OBO ontology, no? This would be the normal term used for a Case-Control Study in the ADC:

https://www.ebi.ac.uk/ols4/ontologies/ncit/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCIT_C15197

@LonnekeScheffer I see that the OBO base node for an Investigation that you mentioned (OBI:0000066) is a high-level description with no subclasses to describe the type of investigation. So if we want the OBI investigation to be part of the AKC enum, because we want to be able to use the logic that describes that ontology term, then we would still want to use the NCIT ontology terms as well to provide a more detailed description of the type of study. So as @jamesaoverton suggests, "... that OBO ontologies are preferential, and that new terms can be added if needed for the AKC" seems like the way to go.

@jamesaoverton would you envision this as adding new OBI terms under Investigation (OBI:0000066) through the formal OBO process (so OBI:0000066 would have children that described the type of investigation) or would we create a LinkML enum that has NCIT terms in it and has a class_uri: OBI:0000066 attribute?

jamesaoverton · 2025-02-21T17:27:56Z

NCIT is not an OBO ontology, despite those OBO PURLs. There's a translation of NCIT into OBO-compatible annotations ("NCI Thesaurus OBO Edition"), but it does not follow OBO principles or best practices, does not (re)use terms from other OBO projects, and it never will. Yes, this is very confusing, and after many years I'm still upset about the situation.

We can easily add subclasses of investigation to OBI. We tend to distinguish investigations using other properties, such as study designs and assays, but we've been having recent conversations about just adding more investigation subclasses for ImmuneSpace and others to use.

bcorrie · 2025-02-22T00:48:55Z

In the AIRR Standard, the field study_type is used to capture the study design, so this would be more equivalent to OBI study_design as defined here: http://purl.obolibrary.org/obo/OBI_0500000. In the AIRR Standard we use NCIT:C63536 as the root node, and it has child nodes similar to those of OBI:0500000 (e.g. Case-Control Study). Although NCIT has a large number of child nodes, in practice there are only 6 child nodes used out of all of the current studies in the ADC. So it doesn't seem to me like we need to add more subclasses to OBI investigation; instead we should be considering using OBI study_design and maybe (but maybe not) adding more subclasses to that.

So for the AKC, I think we need to decide whether we want to use an OBI ontology tree (starting at study_design, OBI:0500000 as the "base node") for the AKC study_type field. If so, then we need to define a mapping from an NCIT study_type to an OBI study_design for the translation from the ADC to the AKC. I don't think this would be too hard. Currently the ADC has these:

- Study
- Case-Control Study
- Observational Study
- Longitudinal Study
- Nested Case-Control Study
- Phase II Trial
- Animal Study

For the most part there is a one-to-one mapping.
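
A rough sketch of what such a translation table could look like in the ADC to AKC conversion. Only identifiers that appear in this thread are filled in (NCIT:C63536, NCIT:C15197, OBI:0500000, OBI:0002624); treating the NCIT root as equivalent to the OBI root, the function names, and the fallback behavior are assumptions. The remaining ADC terms are deliberately left out until their OBI counterparts are agreed on, and `find_gaps` reports any term that has no mapping yet.

```python
OBI_STUDY_DESIGN_ROOT = "OBI:0500000"  # OBI 'study design'

# NCIT study_type -> OBI study_design, limited to IDs cited in this thread.
NCIT_TO_OBI = {
    "NCIT:C63536": OBI_STUDY_DESIGN_ROOT,  # Study -> study design (assumed root-to-root)
    "NCIT:C15197": "OBI:0002624",          # Case-Control Study -> case-control study
}


def map_study_type(ncit_id: str) -> tuple:
    """Return (obi_id, is_exact); unmapped terms fall back to the OBI root."""
    if ncit_id in NCIT_TO_OBI:
        return NCIT_TO_OBI[ncit_id], True
    return OBI_STUDY_DESIGN_ROOT, False


def find_gaps(ncit_ids) -> list:
    """List the NCIT terms in use that have no OBI mapping yet."""
    return sorted({term for term in ncit_ids if term not in NCIT_TO_OBI})


# "NCIT:PLACEHOLDER" stands in for a study_type that has not been mapped.
in_use = ["NCIT:C15197", "NCIT:C63536", "NCIT:PLACEHOLDER", "NCIT:C15197"]
mapped = [map_study_type(term) for term in in_use]
gaps = find_gaps(in_use)
```

Running `find_gaps` over the study_type values currently in the ADC would give a concrete list of terms that still need an OBI equivalent.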

If we did that and ImmuneSpace also used OBI study_type and its children to denote study design, then that would work. @LonnekeScheffer it sounds to me currently like there is no use of study_design in ImmuneSpace since you said that everything was an investigation, is that correct?

schristley · 2025-02-22T02:09:05Z

in practice there are only 6 child nodes used out of all of the current studies in the ADC.

Thanks for checking that, I thought it was a low number too.

For the most part there is a one-to-one mapping.

Ok, I'm good with the idea of using OBI so long as we can map to the same semantics. I see some of the mappings but I'm not sure about these:

- Nested Case-Control Study
- Phase II Trial
- Animal Study

If yes, then we need to change Investigation to use a new slot instead of study_type, maybe study_design?

@jamesaoverton Is this where we supposedly use SSSOM to describe the mapping from the NCIT to OBI ontologies? Ah, I saw too late your previous comment about a translation being available, so I guess we should try to use that.

LonnekeScheffer · 2025-02-26T00:05:38Z

If we did that and ImmuneSpace also used OBI study_type and its children to denote study design, then that would work. @LonnekeScheffer it sounds to me currently like there is no use of study_design in ImmuneSpace since you said that everything was an investigation, is that correct?

Apologies for the late response. I didn't find anything in ImmuneSpace mapping to something that could be considered a study type or design. James recommended that I use the OBI 'Investigation', but since this is now hardcoded for all ImmuneSpace studies it's not very meaningful. I looked through ImmPort and didn't find anything I'd confidently say describes the same thing (the closest I found was 'research focus', which can be things like 'vaccine response', 'epidemiology', 'computational modeling', to name a few). @jamesaoverton do you know of anything in ImmuneSpace or ImmPort that would fit under the study design as described by Brian?

bcorrie · 2025-02-26T18:17:07Z

NCIT is not an OBO ontology, despite those OBO PURLs. There's a translation of NCIT into OBO-compatible annotations ("NCI Thesaurus OBO Edition"), but it does not follow OBO principles or best practices, does not (re)use terms from other OBO projects, and it never will. Yes, this is very confusing, and after many years I'm still upset about the situation.

@jamesaoverton naive ontology person questions/comments:

The OBO study_design (OBI:0500000) doesn't seem very complete to me, at least compared to the richness of NCIT. The AIRR Community ontology group explored ontologies pretty extensively when they were chosen, and NCIT seemed to be the most appropriate to describe what the AIRR Community seemed to feel was important (See https://docs.airr-community.org/en/latest/ontovoc/introduction_ontovoc.html).

So my question would be why isn't the OBO study_design more complete? 8-)

In particular, if there are other study design ontologies that are more complete (e.g. NCIT) but don't adhere to the OBO principles etc., why hasn't the OBO community built on the work of others like NCIT and added the features/terms that are missing? I am not sure what would be required for NCIT to follow OBO best practices/principles and reuse terms from other OBO projects. I realize that this may be a lot of work, and you would hope that the NCIT folks would do this (you suggest they haven't and never will, which seems unfortunate), but now it seems like we have the worst of both worlds: an NCIT study design ontology that is complete but can't be integrated with other OBO ontologies, and an incomplete OBO study design ontology that doesn't meet the needs of the community. At least it didn't meet the needs of the AIRR Community, based on the OntoVoc working group's assessment at the time.

In our current case, the fact that only a small number of terms from NCIT are currently used in the ADC, and that there is a reasonable mapping of those terms into OBO study_design fields, makes it possible to do this transformation. But I would argue that it would be desirable to have a richer OBO study_design. It is not hard for me to envision cases where a study design would be more accurately described with NCIT than it currently could be with OBO study_design. If it is easy to add ontology terms, why don't we just add the subtrees from NCIT that we think are appropriate to OBO study_design? And if this is easy, why hasn't someone already done this? 8-)

schristley · 2025-02-26T21:37:28Z

@LonnekeScheffer Following up on our 1-on-1 discussion: I looked at my current development tree, and for ak-etvl the ak-schema submodule is set to the integration-prototype branch. I can merge it, but I expect that there will be more schema modifications as the integration work progresses.

My suggestion is to "wrap up" this integration example then move your integration script over to ak-etvl. Some of your ontology errors may resolve themselves if your ak-schema is on that integration branch.

jamesaoverton · 2025-02-27T15:04:18Z

@bcorrie NCIT is a thesaurus. It's a bunch of labels and textual definitions, roughly collected by topic into a tree. It has almost 30,000 terms, which seems impressive, until you realize that it spans a couple dozen branches from genes to organisms to diseases to food.

OBO is a community of ontologies. We have a couple hundred ontologies (not all of them great), covering millions of terms. Because they're ontologies, they have logical definitions that automated reasoners and query engines can use -- not just string searches. Because they follow a shared set of principles and best practices, the logical definitions can link many ontologies together. Gene stuff goes in the Gene Ontology, organisms are in the NCBI Taxonomy, diseases are in various species-specific disease ontologies, and there's a whole Food Ontology, and hundreds more.

What NCIT has that OBI does not have is funding. OBI, like most OBO ontologies, relies on the funded projects that use it to donate some time and submit new terms. Adding more study designs is as easy as making a PR adding rows to this table, and passing review: https://github.com/obi-ontology/obi/blob/master/src/ontology/templates/study-designs.tsv. I don't know how you get a fix or a new term into NCIT.

bcorrie · 2025-02-27T18:06:44Z

What NCIT has that OBI does not have is funding. OBI, like most OBO ontologies, relies on the funded projects that use it to donate some time and submit new terms. Adding more study designs is as easy as making a PR adding rows to this table, and passing review: https://github.com/obi-ontology/obi/blob/master/src/ontology/templates/study-designs.tsv.

I think this is my point. I am surprised that no one in the OBI world has added more complete terms and structure to OBI study_design since it seems foundational to OBI's mandate - in particular when you have a "nice" subtree like the NCIT study tree to build from. Because NCIT is a tree it at least has an is-a relationship that can be used.

Maybe we do this through AKC, as one of those funded projects that contributes, but as I say I am surprised that the OBI study_design is not more rich already...

@schristley do we take the high road and try to make OBI study_design more complete?

@jamesaoverton would it be considered bad form to just dump the hierarchy for a set of NCIT subtrees into the OBI table and make the pull request 8-) I could see adding NCIT trees like Observational Study, Case Series, Clinical Study, Case-Control Study and it seems like they could be added fairly easily given the table - but that would probably double the size of OBI study_design...

The fact that this hasn't been done before makes me a bit skeptical about how easy you say it is 8-)

jamesaoverton · 2025-02-27T18:28:39Z

Sorry, I don't understand. NCIT has NCIT:C15320 'Study Design' with 166 descendant terms, but those descendants include terms such as 'Study Arm' with 61 children. It is simply false to say that "'Study Arm' is a 'Study Design'".

https://www.ebi.ac.uk/ols4/ontologies/ncit/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCIT_C15320

On the other hand we have OBI, with OBI:0500000 'study design' which has 109 descendants in OLS that are either actual subclasses or parts, clearly distinguished with logical axioms. I don't see how NCIT is better than OBI on this measure.

https://www.ebi.ac.uk/ols4/ontologies/obi/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FOBI_0500000

In practice, OBI users use more detailed terms and more specific axioms, and not the broad high-level terms that NCIT uses. But there's a clear path to add these broad terms to OBI.

OBO projects all have open licenses. I'm not clear on the license for NCIT, so I wouldn't be comfortable copy-pasting from it into OBI until that's clarified.

jamesaoverton · 2025-02-27T18:31:45Z

Now I see that NCIT's 'Study' term has a couple hundred subclasses that have convenient labels https://www.ebi.ac.uk/ols4/ontologies/ncit/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCIT_C63536?lang=en which we would model in OBI by logically expressing the parts and their relations. But I admit that sometimes broad terms are helpful.

bcorrie · 2025-02-27T20:33:18Z

Now I see that NCIT's 'Study' term has a couple hundred subclasses that have convenient labels https://www.ebi.ac.uk/ols4/ontologies/ncit/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCIT_C63536?lang=en which we would model in OBI by logically expressing the parts and their relations. But I admit that sometimes broad terms are helpful.

Yes, that is the hierarchy I was talking about, but when you pointed to the other one I admit I got very confused - thought I was looking at the same thing but they were different. Sheesh, this is a messy world. 8-)

It looks like NCIT is CC-BY-4.0 (https://bioportal.bioontology.org/ontologies/NCIT) - presumably the approach one would take in using NCIT as a source for the creation of OBI terms would be to create an OBI term, define any required axioms for that term, and reference the NCIT term as a synonym (or the "definition source" in the table)???

Correct me if I am wrong, but one of the nice things about the NCIT Study is that it has quite a bit more structure to it than OBI, in that its tree is more "complex" and therefore there are a lot of is-a relationships built in (don't ask me if the tree encapsulates a good set of relationships). The OBI study_design is relatively flat, so there are not a lot of inferences that you can do between the entities within study_design.

Again, correct me if I am wrong... For case-control study (which we use in the ADC) https://ontobee.org/ontology/OBI?iri=http://purl.obolibrary.org/obo/OBI_0002624 it has a super class of study_design but no other axioms related to it that I can see. So the OBI case-control study definition is not really any more complex (from a rule/axiom perspective) than the NCIT one is it? So for OBI case-control study you inherit all of the axioms/relationships from its super class, but there are no axioms involving the term case-control study itself that infer other relationships to other OBI entities. Or am I missing something?

Bottom line, I am not sure what the correct approach is here, but I think I am slowly wrapping my head around the differences 8-)

bcorrie · 2025-03-05T17:18:00Z

In today's meeting we agreed that using OBI is best. We will add the terms used by the ADC (listed here: #20 (comment)) to OBI; James will make a request to the LJI team for this. Each OBI term would reference its source NCIT term so that mapping would be possible.

The only potential caveat is that if an ADC study is added with a different NCIT term that is not in OBI, we would have a gap...

@schristley any thoughts/concerns?

schristley · 2025-03-12T23:32:11Z

@LonnekeScheffer

I'm not sure yet how to convert to ontologies with a flexible vocabulary (permissible values tbd based on ontology root node), or if some functionality still needs to be implemented for that?

minor point: I didn't find phenotypic_sex in ImmuneSpace, only biological_sex. I think it's not in ImmPort either (I only see 'gender')

Values need to be ontology IDs. I've changed the code to give a simple example for biological sex, using PATO IDs. You'll need to do the same for LifeEventProcessOntology and ExposureMaterialOntology values, as those are still throwing errors.
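
As an illustration of the pattern (a sketch, not the code from the commit), a lookup from source strings to ontology IDs could look like the following. PATO:0000383 and PATO:0000384 are the PATO terms for female and male; the raw strings assumed to come from ImmuneSpace ("Female", "Male") and the choice to return None for anything unmapped are assumptions.

```python
from typing import Optional

# Source label (lower-cased) -> PATO CURIE.
SEX_TO_PATO = {
    "female": "PATO:0000383",
    "male": "PATO:0000384",
}


def sex_to_ontology_id(raw: Optional[str]) -> Optional[str]:
    """Map a free-text sex value to a PATO ID, or None when there is no mapping."""
    if raw is None:
        return None
    return SEX_TO_PATO.get(raw.strip().lower())


converted = [sex_to_ontology_id(v) for v in ["Female", " male ", "Unknown", None]]
```

The same shape (a small lookup keyed on the source label, with unmapped values surfaced rather than passed through as strings) would apply to the LifeEventProcessOntology and ExposureMaterialOntology slots.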


LonnekeScheffer added 7 commits February 18, 2025 13:28

first draft: immunespace-to-akc functionality

e63b161

add extra enum values from immuneSpace + make inclusion & exclusion criteria multivalued

7d17fb1

finished main immunespace-to-akc conversion functionalities

e0e2819

immunespace to akc: implemented tests, fix bugs, write json output file

0dcdf5b

update readme

9405112

update readme

a2869c6

minor cleanup of comments for discussion on Github

0022ddf

schristley added 2 commits March 12, 2025 18:20

Merge branch 'main' into immunespace_integration

6bf93af

values should be ontology IDs

b68cd56

This was referenced Mar 13, 2025

Use OBI for study/investigation type instead of NCIT used by AIRR airr-knowledge/issues#81

Closed

The study types (NCIT) currently in use by AIRR needed to be added to OBI airr-knowledge/issues#82

Closed

schristley and others added 7 commits March 12, 2025 19:54

Merge branch 'main' into immunespace_integration

5711a12

slots have changed

e402339

Merge branch 'main' into immunespace_integration

c70b4da

# Conflicts:
# project/jsonld/ak_schema.jsonld
# project/owl/ak_schema.owl.ttl
# project/sqlddl/ak_schema.sql
# src/ak_schema/datamodel/ak_schema.py

updated immunespace_to_akc + tests to work with the new data model

14a5e32

test push

1b00c45

making a meaningless change (added space) to test if pushing via GitHub works

Merge remote-tracking branch 'origin/immunespace_integration' into immunespace_integration

720b361

updated immunespace_to_akc:
- update according to new data model
- for now, not use ImmPort db (simplified)
- make random UUID identifiers (and export mapping from legible to random)
- started working on script to export VDJServer metadata files

65dc8f1
