ImmuneSpace to AKC conversion#20
…riteria multivalued
|
In the ADC, the pub_ids field is an array of CURIEs that can be either PMID or DOI (see https://github.com/airr-community/airr-standards/blob/70f7a00e3f9adfb19c7eaa1a0213ece2cf049f50/specs/airr-schema-openapi3.yaml#L2081 ). Currently the ADC to AKC conversion just copies the value. We need to decide what this should be for the AKC. The AKC |
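A minimal sketch of how a conversion step could inspect pub_ids instead of copying them unchanged, assuming each entry is a CURIE with a PMID or DOI prefix as the linked AIRR schema describes. The function names are hypothetical and not part of the existing conversion code, and the sample identifiers in the usage note are placeholders.

```python
from __future__ import annotations

# Prefixes the ADC pub_ids field is described as allowing.
ALLOWED_PREFIXES = ("PMID", "DOI")


def split_curie(curie: str) -> tuple[str, str]:
    """Split 'PREFIX:local_id' on the first colon only (DOIs may contain more)."""
    prefix, sep, local_id = curie.strip().partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix.upper(), local_id


def group_pub_ids(pub_ids: list[str]) -> dict[str, list[str]]:
    """Group publication CURIEs by prefix and reject unexpected prefixes."""
    grouped: dict[str, list[str]] = {prefix: [] for prefix in ALLOWED_PREFIXES}
    for curie in pub_ids:
        prefix, local_id = split_curie(curie)
        if prefix not in grouped:
            raise ValueError(f"unexpected prefix in {curie!r}")
        grouped[prefix].append(local_id)
    return grouped
```

For example, `group_pub_ids(["PMID:12345", "doi:10.1000/example"])` separates the two kinds of identifier, which would let the AKC side decide per prefix how to store them.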
I agree, I don't think the AKC should be limited by the limitations of any of the component repositories. I think that we should determine the "right" definition for the AKC. The challenge then becomes mapping the data in the component repositories (e.g. the ADC With that said, I believe NCIT is an OBO ontology, no? This would be the normal term used for a Case-Control Study in the ADC: |
|
Hi @LonnekeScheffer, that's great progress! Before we dig too deep into the code and implementation, we should decide where the code will ultimately reside. While it's fine to have an example in ak-schema, we want the actual production programs to live outside the schema, in their own repository. There's ak-etvl, where I've been working on the main integration scripts, so I think that might be an appropriate place for your transform script. We could also consider a separate repository if that makes sense. In ak-etvl, I started putting together common shared functions in |
Yes, in particular the sequence data because just the study data isn't that useful without it. Here's my concept of how the process will work:
|
@LonnekeScheffer I see that the OBO base node for an Investigation that you mentioned (OBI:0000066) is a high-level description with no subclasses to describe the type of investigation. So if we want the OBI investigation to be part of the AKC enum, because we want to be able to use the logic that describes that ontology term, then we would still want to use the NCIT ontology terms as well to provide a more detailed description of the type of study. So as @jamesaoverton suggests, "... that OBO ontologies are preferential, and that new terms can be added if needed for the AKC" seems like the way to go. @jamesaoverton would you envision this as adding new OBI terms under |
|
NCIT is not an OBO ontology, despite those OBO PURLs. There's a translation of NCIT into OBO-compatible annotations ("NCI Thesaurus OBO Edition"), but it does not follow OBO principles or best practices, does not (re)use terms from other OBO projects, and it never will. Yes, this is very confusing, and after many years I'm still upset about the situation. We can easily add subclasses of investigation to OBI. We tend to distinguish investigations using other properties, such as study designs and assays, but we've been having recent conversations about just adding more investigation subclasses for ImmuneSpace and others to use. |
|
In the AIRR Standard, the field So for the AKC, I think we need to decide whether we want to use an OBI ontology tree (starting at
For the most part there is a one-to-one mapping. If we did that and ImmuneSpace also used OBI |
Thanks for checking that, I thought it was a low number too.
Ok, I'm good with the idea of using OBI so long as we can map to the same semantics. I see some of the mappings but not sure about these:
If yes, then need to change @jamesaoverton Is this where we supposedly use SSSOM to describe the mapping from the NCIT to OBI ontologies? Ah, I saw too late your previous comment about a translation being available, so I guess we should try to use that. |
Apologies for the late response. I didn't find anything in ImmuneSpace mapping to something that could be considered a study type or design. James recommended that I use the OBI 'Investigation', but since this is now hardcoded for all ImmuneSpace studies, it's not super meaningful. I looked through ImmPort and didn't find anything I'd confidently say describes the same thing (the closest I found was 'research focus', which can be things like 'vaccine response', 'epidemiology', or 'computational modeling', to name a few). @jamesaoverton do you know of anything in ImmuneSpace or ImmPort that would fit under the study design as described by Brian? |
@jamesaoverton naive ontology person questions/comments: The OBO study_design (OBI:0500000) doesn't seem very complete to me, at least compared to the richness of NCIT. The AIRR Community ontology group explored ontologies pretty extensively when they were chosen, and NCIT seemed to be the most appropriate to describe what the AIRR Community seemed to feel was important (see https://docs.airr-community.org/en/latest/ontovoc/introduction_ontovoc.html). So my question would be: why isn't the OBO study_design more complete? 8-) In particular, if there are other study design ontologies that are more complete (e.g. NCIT) but don't adhere to the OBO principles etc., why hasn't the OBO community built on the work of others like NCIT and added the features/terms that are missing? I am not sure what would be required for NCIT to follow OBO best practices/principles and reuse terms from other OBO projects. I realize that this may be a lot of work, and you would hope that the NCIT folks would do this (you suggest they haven't and never will, which seems unfortunate), but now it seems like we have the worst of both worlds: an NCIT study design ontology that is complete but can't be integrated with other OBO ontologies, and an incomplete OBO study design ontology that doesn't meet the needs of the community. At least it didn't meet the needs of the AIRR Community, based on the OntoVoc working group's assessment at the time. In our current case, the fact that only a small number of terms from NCIT are used currently in the ADC, and that there is a reasonable mapping of those terms into OBO |
|
@LonnekeScheffer Followup from our 1-on-1 discussion, I looked at my current development tree, for My suggestion is to "wrap up" this integration example then move your integration script over to ak-etvl. Some of your ontology errors may resolve themselves if your ak-schema is on that integration branch. |
|
@bcorrie NCIT is a thesaurus. It's a bunch of labels and textual definitions, roughly collected by topic into a tree. It has almost 30,000 terms, which seems impressive, until you realize that it spans a couple dozen branches from genes to organisms to diseases to food. OBO is a community of ontologies. We have a couple hundred ontologies (not all of them great), covering millions of terms. Because they're ontologies, they have logical definitions that automated reasoners and query engines can use -- not just string searches. Because they follow a shared set of principles and best practices, the logical definitions can link many ontologies together. Gene stuff goes in the Gene Ontology, organisms are in the NCBI Taxonomy, diseases are in various species-specific disease ontologies, and there's a whole Food Ontology, and hundreds more. What NCIT has that OBI does not have is funding. OBI, like most OBO ontologies, relies on the funded projects that use it to donate some time and submit new terms. Adding more study designs is as easy as making a PR adding rows to this table, and passing review: https://github.com/obi-ontology/obi/blob/master/src/ontology/templates/study-designs.tsv. I don't know how you get a fix or a new term into NCIT. |
I think this is my point. I am surprised that no one in the OBI world has added more complete terms and structure to OBI Maybe we do this through AKC, as one of those funded projects that contributes, but as I say I am surprised that the OBI @schristley do we take the high road and try to make OBI @jamesaoverton would it be considered bad form to just dump the hierarchy for a set of NCIT subtrees into the OBI table and make the pull request 8-) I could see adding NCIT trees like The fact that this hasn't been done before makes me a bit skeptical about how easy you say it is 8-) |
|
Sorry, I don't understand. NCIT has NCIT:C15320 'Study Design' with 166 descendant terms, but those descendants include terms such as 'Study Arm' with 61 children. It is simply false to say that "'Study Arm' is a 'Study Design'". On the other hand we have OBI, with OBI:0500000 'study design' which has 109 descendants in OLS that are either actual subclasses or parts, clearly distinguished with logical axioms. I don't see how NCIT is better than OBI on this measure. In practice, OBI users use more detailed terms and more specific axioms, and not the broad high-level terms that NCIT uses. But there's a clear path to add these broad terms to OBI. OBO projects all have open licenses. I'm not clear on the license for NCIT, so I wouldn't be comfortable copy-pasting from it into OBI until that's clarified. |
|
Now I see that NCIT's 'Study' term has a couple hundred subclasses that have convenient labels https://www.ebi.ac.uk/ols4/ontologies/ncit/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FNCIT_C63536?lang=en which we would model in OBI by logically expressing the parts and their relations. But I admit that sometimes broad terms are helpful. |
Yes, that is the hierarchy I was talking about, but when you pointed to the other one I admit I got very confused - thought I was looking at the same thing but they were different. Sheesh, this is a messy world. 8-) It looks like NCIT is CC-BY-4.0 (https://bioportal.bioontology.org/ontologies/NCIT) - presumably the approach one would take in using NCIT as a source for the creation of OBI terms would be to create an OBI term, define any required axioms for that term, and reference the NCIT term as a synonym (or the "definition source" in the table)??? Correct me if I am wrong, but one of the nice things about the NCIT Again, correct me if I am wrong... For Bottom line, I am not sure what the correct approach is here, but I think I am slowly wrapping my head around the differences 8-) |
|
In the meeting today, we discussed that using OBI was best. We will add the terms used by the ADC (listed here: #20 (comment)) to OBI; James will make a request to the LJI team for this. Each OBI term would have the source NCIT term as a reference, so mapping would be possible. The only potential caveat is that if an ADC study is added with a different NCIT term that is not in OBI, we would have a gap... @schristley any thoughts/concerns?
Values need to be ontology IDs. I've changed the code for a simple example for biological sex, using PATO IDs. You'll need to do this for |
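As an illustration of "values need to be ontology IDs", here is a small sketch of mapping free-text sex labels to PATO IDs. PATO:0000383 (female) and PATO:0000384 (male) are the PATO terms; the source labels used as keys, the function name, and the choice to return None for unmapped values are assumptions and may not match the change actually made in the code.

```python
from typing import Optional

# PATO IDs for biological sex. The keys are assumed source labels,
# compared after stripping whitespace and lowercasing.
SEX_TO_PATO = {
    "female": "PATO:0000383",
    "male": "PATO:0000384",
}


def sex_to_ontology_id(value: Optional[str]) -> Optional[str]:
    """Return the PATO CURIE for a sex label, or None if it cannot be mapped."""
    if value is None:
        return None
    return SEX_TO_PATO.get(value.strip().lower())
```

Returning None (rather than raising) leaves it to the caller to decide how values such as "Unknown" should be represented.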
# Conflicts:
#   project/jsonld/ak_schema.jsonld
#   project/owl/ak_schema.owl.ttl
#   project/sqlddl/ak_schema.sql
#   src/ak_schema/datamodel/ak_schema.py
…munespace_integration
- update according to new data model
- for now, do not use ImmPort db (simplified)
- make random UUID identifiers (and export mapping from legible to random)
- started working on a script to export VDJServer metadata files
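The "random UUID identifiers plus exported mapping" idea from this commit could look roughly like the sketch below. The class and method names are hypothetical and are not taken from immunespace_to_akc.py; the only assumptions are that each legible ID should always map to the same random ID within one run, and that the mapping is exported as JSON.

```python
import json
import uuid


class IdRegistry:
    """Hand out one random UUID per legible identifier and remember the mapping."""

    def __init__(self) -> None:
        self._by_legible: dict = {}

    def get(self, legible_id: str) -> str:
        """Return the UUID for legible_id, creating it on first use."""
        if legible_id not in self._by_legible:
            self._by_legible[legible_id] = str(uuid.uuid4())
        return self._by_legible[legible_id]

    def mapping(self) -> dict:
        """Return a copy of the legible-to-random mapping."""
        return dict(self._by_legible)

    def to_json(self) -> str:
        """Serialize the mapping so it can be written next to the AKC output."""
        return json.dumps(self._by_legible, indent=2, sort_keys=True)
```

Calling `registry.get("ImmuneSpace:investigation-SDY460")` twice returns the same UUID, so cross-references between objects stay consistent, and `registry.to_json()` gives the export that lets the random IDs be traced back.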
I implemented the script immunespace_to_akc.py, which takes data from ImmuneSpace and converts it to an AKC output JSON file.
Points I'd like input on
If you agree with my current/suggested solution feel free to just check the box without commenting.
@schristley @jamesaoverton @bcorrie AKC has inclusion_criteria both on the StudyArm level and on the full Investigation level. The descriptions are the same in both cases (StudyArm). I presume that on the StudyArm level, these criteria relate to inclusion/exclusion in the arm, not the whole study? As I wrote this I found "Investigation should not have inclusion_criteria and exclusion_criteria" (issues#62) and commented there; feel free to continue the discussion there. Here is what I did for now:
@schristley For Reference: source_uri is the PMID, sources includes both PMID and DOI (let me know if any of that should change). Just wanted to note: the type for source_uri somehow gets converted to a ReferenceSourceUri object instead of Curie, without ‘resolving’ the Curie into an actual URI. Will that be an issue? I’m not sure where this behavior originates, since the source_uri slot has range: uriorcurie in the ak_schema.yaml.
@schristley / @jamesaoverton: identifiers are still a little messy and I'd like your input. For now, they follow the format ImmuneSpace:{name of the item}-{identifier}. For instance: "ImmuneSpace:investigation-SDY460", "ImmuneSpace:arm-ARM2480", "ImmuneSpace:participant-SUB134239". These look like CURIEs but they are not. Interested to hear your take on what the IDs should be. Some thoughts:
study_type is now always OBI:0000066 (investigation)
Investigation.archival_id @schristley would it suffice here to have a CURIE pointing towards ImmuneSpace: https://immunespace.org/query/study/SDY460 ? Or would you specifically want the BioProject ID? The former can be done very easily, the latter is a small workaround but still possible.
I'm not sure yet how to convert to ontologies with a flexible vocabulary (permissible values tbd based on ontology root node), or if some functionality still needs to be implemented for that? @schristley
Dealing with enum case: IEDB has the value "Documented exposure without evidence for disease". I imported a bunch of enum values from ImmuneSpace, and this one happened to be overwritten by "documented exposure without evidence for disease". Sure, this can be fixed in many easy ways, but maybe we should decide on a specific way of dealing with synonymous enums? @schristley @bcorrie
minor point: for references, while in theory multiple studies/investigations can relate to 1 reference, in this conversion the number of investigations per reference is always 1 (the study that is being converted). In theory, the same reference (PMID) could come up in multiple studies, and those may need to be merged somewhere, somehow. I believe this is not something to resolve at the level of this script though.
minor point: I didn't find phenotypic_sex in ImmuneSpace, only biological_sex. I think it's not in ImmPort either (I only see 'gender') @jamesaoverton can you confirm? I think we can live without it.
minor point: ImmuneSpace has 'race' and 'race specify'. As far as I understand 'race' is a closed vocabulary and in 'race specify' people can fill in whatever they want if they selected race: other. I omitted 'race specify' information for now, keeping only 'race'.
minor point: @jamesaoverton can you confirm that the author list is always formatted the same in ImmuneSpace? (i.e., I can safely split on ", " to get a list of author names?)
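On the enum-case point above, one possible convention (not something decided in this thread) is to compare permissible values case-insensitively and keep the first spelling seen, so that an IEDB value is not overwritten by an ImmuneSpace value that differs only in capitalization. The function name and the "first source wins" rule are assumptions for illustration.

```python
def merge_enum_values(*sources):
    """Merge lists of enum labels, treating labels that differ only in
    case or whitespace as the same value. The first spelling seen wins."""
    merged = {}
    for values in sources:
        for label in values:
            # Collapse runs of whitespace, then case-fold for comparison.
            key = " ".join(label.split()).casefold()
            merged.setdefault(key, label)
    return list(merged.values())
```

Passing the IEDB values first and the ImmuneSpace values second keeps "Documented exposure without evidence for disease" in its IEDB spelling while still adding any ImmuneSpace values that are new.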