Genotype integration from VDJbase #3
Conversation
separate function so that it can be re-used in vdjbase metadata analysis
|
My initial idea of the transform process goes like this:
AIRRSequencingAssay —sequencing_files—> AIRRSequencingData <—has_specified_input— DataTransformation —has_specified_output—>GenotypeData
While this structure conforms to the data model, I'm not sure how efficient queries will be because there are numerous joins/tables in SQL between the Participant and the actual genotype data. We might need to decide to add some more direct relationships. |
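A minimal sketch of the traversal implied by the structure above, assuming plain dictionaries in place of the real AKC objects. The function name `genotypes_for_assay` and the dictionary layout are hypothetical; only the relationship names (`sequencing_files`, `has_specified_input`, `has_specified_output`) come from the description. It shows the two hops needed to get from an assay to its genotype data, which is the cost that more direct relationships would avoid.

```python
from typing import Dict, Iterable, List


def genotypes_for_assay(assay: Dict, transformations: Iterable[Dict]) -> List[str]:
    """Return the IDs of genotype data derived from one assay's sequencing files.

    Hop 1: assay -> sequencing data (via sequencing_files).
    Hop 2: sequencing data -> transformation -> genotype data
           (via has_specified_input / has_specified_output).
    """
    files = set(assay["sequencing_files"])
    return [
        t["has_specified_output"]
        for t in transformations
        if t["has_specified_input"] in files
    ]
```

In SQL each hop would be at least one join, before the additional joins needed to reach the Participant.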
|
I was looking at the genotype data and realizing that it doesn’t really have links back to the repertoire, so I don’t think my idea of looping through the genotypes will work? I see there is a “subject_name” field which I guess is a custom field for VDJbase, but if I search for that subject_name in the repertoires then I don’t find it as a subject_id or repertoire_id. For the AIRR Standards, I guess we didn’t formulate exactly how to link genotype sets with repertoires/subjects? There is the genotype field within the AIRR Subject, but VDJbase doesn't seem to be storing it there. Is this something lacking in the AIRR Standards? Or we can just resolve it for the AKC. |
|
The request to VDJbase returns repertoire metadata including genotypes:
the genotypes are embedded in the repertoires, so the info is linked.
William
|
|
Hi Scott. Having looked at this further I realised that there was a substantial issue here. The genotypes returned by VDJbase describe the subject and sample that they refer to by the VDJbase name (e.g. P1_I2_S3, where P, I and S refer to the project, individual and sample). The VDJbase name for the repertoire is also returned in the VDJbase repertoire metadata, but it is not recognised by the AKC schema because it is not a defined field in the AIRR schema. Arguably the problem is caused by the API call, which returns all genotypes in a single file with no other repertoire metadata, and also by the AIRR schema, which links genotypes directly to the subject without referring to the sample from which they were derived. I argued hard against this definition at the time.
I have addressed the problem by pre-scanning the VDJbase repertoire metadata in vdjbase_metadata_transform.py, in order to build a dictionary that translates the VDJbase sample name to the AIRR repertoire study_id and subject_id. This is further translated after repertoire processing into a map between the VDJbase name and the akc_ids for investigation and participant. Other akc_ids could be provided in the mapping if necessary. I think this gives us a workable implementation, although one that could be improved by refining the AIRR schema, given the will of those that would be involved. I have checked in the code, which is working up to this point.
The next question is how the genotypes should actually be linked: what objects do you want me to create and what fields should I populate? I can't, for example, find an Assay for genotyping, or a Conclusion. Perhaps there is a field I have overlooked. At any rate, if you could please spell this out for me I think I can finish this code fairly quickly. Thanks William |
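The two-stage mapping described above might look roughly like the following. This is a sketch under stated assumptions, not the code in vdjbase_metadata_transform.py: the field name `vdjbase_name` and the helpers `build_name_map` and `to_akc_ids` are hypothetical, while the nested `study.study_id` and `subject.subject_id` keys follow the AIRR repertoire layout.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class AirrKey(NamedTuple):
    """AIRR identifiers that a VDJbase name resolves to."""
    study_id: str
    subject_id: str


def build_name_map(repertoires: Iterable[dict],
                   name_field: str = "vdjbase_name") -> Dict[str, AirrKey]:
    """Pre-scan repertoire metadata: VDJbase name -> (study_id, subject_id).

    `name_field` is an assumed key; the real field name may differ.
    """
    name_map: Dict[str, AirrKey] = {}
    for rep in repertoires:
        name = rep.get(name_field)
        if not name:
            continue  # repertoire carries no VDJbase name, nothing to map
        key = AirrKey(rep["study"]["study_id"], rep["subject"]["subject_id"])
        # setdefault returns the first key seen for this name, so any later
        # conflicting key is detected rather than silently overwritten.
        if name_map.setdefault(name, key) != key:
            raise ValueError(f"{name} maps to more than one study/subject")
    return name_map


def to_akc_ids(name_map: Dict[str, AirrKey],
               investigation_ids: Dict[str, str],
               participant_ids: Dict[AirrKey, str]) -> Dict[str, Tuple[str, str]]:
    """Second stage: VDJbase name -> (investigation akc_id, participant akc_id)."""
    return {
        name: (investigation_ids[key.study_id], participant_ids[key])
        for name, key in name_map.items()
    }
```

The second stage can only run after repertoire processing, because that is when the akc_ids exist.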
|
Hi William, I'm finally getting a chance to look at this again.
Is that something we should add to the AKC schema? If you've extended AIRR to better support the mapping then we should be able to support it directly in AKC.
Ok, yeah I partially remember this conversation. What was your preference, to put the genotype with the repertoire?
Ok, I hope that wasn't unneeded work. Would it be simplified in some way by updating the AKC schema? I guess the API could also be modified if necessary. |
|
Hi William, I pushed code for an Next is to maybe write a simple function that returns the AK Investigation based upon a I need to send you a TGZ copy of the transformed data though. I'll tar that up and email to you. |
|
Thanks Scott. I don't think we need to look at schema changes right now. The code is working and it's handling the data correctly. We can refine it a bit later but I think things are ok for the time being. What I really need now are answers to these questions:
That will let me write AKC objects from VDJbase data in the same way that adc_repertoire_transform.py writes AKC objects from ADC data. I am a bit confused about load_adc_container and where it fits, and why the data you are planning to zip up is going to be helpful to me. Perhaps we can find some time to talk about that in one of the coming calls. |
If the genotype is for a study that's already in the ADC, we don't want to re-transform VDJbase's repertoire metadata. The reason is that different AKC IDs will be assigned, and it will end up creating duplicate study data. But this isn't that critical at this stage I guess. The code that is calling Your code that creates mappings, |
The assay is Getting the participant for that assay then requires following a few links. The assay should point to a We are going to represent the Genotype as a What's missing then is for these objects to be into the |
|
Hi Scott,
This is implemented in transform_airr_repertoires.py. Specifically at line 76, an Investigation is only created if the existing archival_id is not found in the container already. Likewise at lines 139 and 198, Participant and Specimen are only created if the existing objects are not in the container. Hence if the container is pre-loaded with current studies, they will not be re-processed. The code relies on IDs that should be the same even if the repertoire metadata is re-coded: specifically MiAIRR study_id, subject_id, sample_id. I'm sure we will find corner cases, though.
Are you going to implement these? I'm ready to write the code (at around line 182 in vdjbase_metadata_transform.py) once the object definitions are in place. |
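The "only create if not already in the container" behaviour described above can be sketched as a get-or-create lookup keyed on the stable MiAIRR identifiers. This is an illustration only: `get_or_create` and the dictionary-based index are hypothetical and do not reproduce transform_airr_repertoires.py.

```python
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")


def get_or_create(index: Dict[Tuple[str, ...], T],
                  key: Tuple[str, ...],
                  factory: Callable[[], T]) -> Tuple[T, bool]:
    """Return (object, created) for `key`, building it only when absent.

    `key` holds identifiers that survive re-coding of the metadata, for
    example (study_id,) for an Investigation or (study_id, subject_id)
    for a Participant.
    """
    if key in index:
        return index[key], False  # pre-loaded study: reuse, keep its akc_id
    obj = factory()               # new study: a fresh akc_id is assigned here
    index[key] = obj
    return obj, True
```

If the index is pre-loaded from an existing container, a study that is already present keeps its akc_id and no duplicate is created.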
|
Hi William,
Ok, great. I haven't had a chance to test the integration with your re-factoring, but hopefully I can before the next schema meeting.
Yes, I just pushed changes to the |
|
Hi Scott, A couple of things I'll raise this afternoon:
All the best William |
|
Sorry, one additional thing for this afternoon. In ak_schema.py there is a definition of a GenotypeSet which takes as a member a list of Genotypes (genotype_class_list). However, AIRRGenotypeData takes a list of Genotypes rather than a GenotypeSet. I think it needs to be modified to take the GenotypeSet instead, or else we could just drop GenotypeSet. |
|
|
Hi William, can you try this code?
|
|
Thanks, that works now, once I change the code to pass Is there a reason we're using an out of date version of Python by the way? More recent versions like 3.12 and 3.13 have better stack traces and error reporting. 3.13 is a year old now, so not exactly bleeding edge. |
Only because LinkML was tied to that version, and the newer Python versions have conflicts in the Docker build. This will hopefully be cleaned up when upgrading to the new build system LinkML uses. |
|
Hi William,
I added this and also had to make some changes to the output routines for InputOutputDataMap. It seems to be working now. |
|
Hi William,
I was originally thinking of AIRRGenotypeData as mostly equivalent to AIRR GenotypeSet, but that might be too coarse-grained. Maybe a better idea is for AIRRGenotypeData to be mostly equivalent to AIRR Genotype, so we can be more specific with the mapping from input files to output. InputOutputDataMap allows multiple inputs; they get specified as multiple objects, e.g., with the same data transformation ID and output ID but different input IDs. The three IDs together form a compound identifier. |
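A small sketch of the compound identifier described above, assuming a stand-in tuple type rather than the real InputOutputDataMap class from ak_schema.py. The names `IOMapSketch`, `maps_for` and the three field names are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class IOMapSketch(NamedTuple):
    """Stand-in for one InputOutputDataMap row; the three IDs form the key."""
    data_transformation: str
    input_id: str
    output_id: str


def maps_for(transformation_id: str,
             input_ids: Iterable[str],
             output_id: str) -> List[IOMapSketch]:
    """Emit one row per distinct input file for a single output."""
    # dict.fromkeys drops repeated inputs while preserving their order,
    # so each compound identifier appears only once.
    return [
        IOMapSketch(transformation_id, input_id, output_id)
        for input_id in dict.fromkeys(input_ids)
    ]
```

Several input files feeding one genotype therefore become several rows that share the transformation ID and output ID.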
|
Yes, the genotype code seems to be working now. Personally I think we should keep the input files to output mapping as it is, with potentially multiple InputOutputDataMap objects. I think the granularity is fine, and we may not get a consistent mapping between the input files and the genotype at locus level, because it depends on how the files were organised at the point of SRA submission.
We can split this up into two scenarios:
I think we should focus on this latter one, as it can be used for both. In theory, part 1 can be split into two steps: 1) use the standard ADC integration to bring in the VDJbase data, then 2) use the same part 2 above to bring in genotype data.
The challenge will be to link up the identifiers between the study, subject and genotype. Once we have that worked out, then it will be easier to dig into the details of how the genotype data is represented.
A few studies (not exhaustive) that already exist in ADC that we can use for testing: