Genotype integration from VDJbase by schristley · Pull Request #3 · airr-knowledge/ak-etvl

schristley · 2025-09-24T17:02:16Z

We can split this up into two scenarios:

Genotype data where the AIRR-seq data resides only in VDJbase.
Genotype data where the AIRR-seq data resides in both ADC and VDJbase (though the sequence data has likely been processed differently).

I think we should focus on this latter one as it is can be used for both. In theory, part 1 can split into two steps: 1) use the standard ADC integration to bring in the VDJbase data then 2) use the same part 2 above to bring in genotype data.

The challenge will be to link up the identifiers between the study, subject and genotype. Once we have that worked out, then it will be easier to dig into the details of how the genotype data is represented.

A few studies (not exhaustive) that already exist in ADC that we can use for testing:

PRJNA300878 (cache: 6508961642208563691-242ac113-0001-012)
PRJNA509910 (cache: 8575123754278514195-242ac11b-0001-012)
PRJNA680539 (cache: 6270798281029250580-242ac117-0001-012)

ak_Data_dir

separarate function so that it can be re-used in vdjbase metadata analysis

Genotype

schristley · 2025-09-24T22:52:30Z

My initial idea of the transform process goes like this:

Load up the already transformed ADC metadata (Investigations, Participants, Assays, and maybe others).
Load up the genotype data in AIRR format from VDJbase.
Load up the AIRR repertoire data in AIRR format from VDJbase? (I think this is needed)
For each genotype:
Find the appropriate study/subject in the AIRR repertoire for the genotype.
Search the transformed ADC metadata for the Investigation, based on study_id.
Search the transformed ADC metadata for the Participant, based on subject_id?
Find the AIRRSequencingAssay for the Participant, add the objects for the genotype_inference process and for the AIRRGenotypeData.

AIRRSequencingAssay —sequencing_files—> AIRRSequencingData <—has_specified_input— DataTransformation —has_specified_output—>GenotypeData

AIRRSequencingAssay for the AIRR-seq experiment (this already exists)
AIRRSequencingFiles is raw FASTQ data and is the sequencing_files for that AIRRSequencingAssay (this already exists)
DataTransformation for the genotype inference (new object)
AIRRGenotypeData with genotype enum to represent the genotype data (new object)
Lastly, entry in the InputOutputDataMap that links input (AIRRSequencingFiles) with output (genotype data item) for that DataTransformation

While this structure conforms to the data model, I'm not sure how efficient queries will be because there are numerous joins/tables in SQL between the Participant and the actual genotype data. We might need to decide to add some more direct relationships.

schristley · 2025-09-25T17:05:05Z

I was looking at the genotype data and realizing that it doesn’t really have links back to the repertoire, so I don’t think my idea of looping through the genotypes will work?

I see there is a “subject_name” field which I guess is a custom field for VDJbase, but if I search for that subject_name in the repertoires then I don’t find it as a subject_id or repertoire_id.

For the AIRR Standards, I guess we didn’t formulate exactly how to link genotype sets with repertoires/subjects? There is the genotype field within the AIRR Subject, but that VDJbase doesn't seem to be storing it there. Is this something lacking in the AIRR Standards? Or we can just resolve it for the AKC.

williamdlees · 2025-10-01T09:29:01Z

The request to VDJbase returns repertoire metadata including genotypes - the gentotypes are embedded in the repertoires - so the info is linked. William

…

------ Original Message ------ From "Scott Christley" ***@***.***> To "airr-knowledge/ak-etvl" ***@***.***> Cc "William Lees" ***@***.***>; "Assign" ***@***.***> Date 25/09/2025 18:05:26 Subject Re: [airr-knowledge/ak-etvl] Genotype integration from VDJbase (PR #3)

schristley left a comment (airr-knowledge/ak-etvl#3) <#3 (comment)> I was looking at the genotype data and realizing that it doesn’t really have links back to the repertoire, so I don’t think my idea of looping through the genotypes will work? I see there is a “subject_name” field which I guess is a custom field for VDJbase, but if I search for that subject_name in the repertoires then I don’t find it as a subject_id or repertoire_id. For the AIRR Standards, I guess we didn’t formulate exactly how to link genotype sets with repertoires/subjects? There is the genotype field within the AIRR Subject, but that VDJbase doesn't seem to be storing it there. Is this something lacking in the AIRR Standards? Or we can just resolve it for the AKC. — Reply to this email directly, view it on GitHub <#3 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACXBI7JXRCLURPNAGRN5RUT3UQOFNAVCNFSM6AAAAACHMW4YM2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTGMZVGEYDOMRZGY>. You are receiving this because you were assigned.Message ID: ***@***.***>

williamdlees · 2025-10-04T08:37:07Z

Hi Scott. Having looked at this further I realised that there was a substantial issue here. The genotypes returned by VDJbase describe the subject and sample that they refer to by the VDJbase name (eg P1_I2_S3 where P, I and S refer to the project, Individual and Sample). The VDJbase name for the repoertoire is also returned in the VDJbase repertoire metadata, but it is not recognised by the akc schema because this is not a defined field in the AIRR schema.

Arguably the problem is caused by the API call which returns all genotypes in a single file with no other repertoire metadata, and also by the AIRR schema which links genotypes directly to subject without referring to the sample from which they were derived. I argued hard against this definition at the time.

I have addressed the problem by pre-scanning the VDJbase repertoire metadata in vdjbase_metadata_transform.py, in order to build a dictionary that translates vdjbase sample name to the airr repertoire study_id and subject_id. This is further translated after repertoire processing into a map between vdjbase name and the akc_ids for investigation and participant. Other akc_ids could be provided in the mapping if necessary. I think this gives us a workable implementation, although one that could be improved by refining the AIRR schema, given the will of those that would be involved.

I have checked in the code, which is working up to this point.

The next question is how the genotypes should actually be linked: what objects do you want me to create and what fields should I populate ? I can't for example find an Assay for genotyping, or a Conclusion. Perhaps there is a field I have overlooked. At any rate if you could please spell this out for me I think I can finish this code fairly quickly.

Thanks

William

schristley · 2025-10-06T21:13:41Z

Hi William, I'm finally getting a chance to look at this again.

The VDJbase name for the repoertoire is also returned in the VDJbase repertoire metadata, but it is not recognised by the akc schema because this is not a defined field in the AIRR schema.

Is that something we should add to the AKC schema? If you've extended AIRR to better support the mapping then we should be able to support it directly in AKC.

Arguably the problem is caused by the API call which returns all genotypes in a single file with no other repertoire metadata, and also by the AIRR schema which links genotypes directly to subject without referring to the sample from which they were derived. I argued hard against this definition at the time.

Ok, yeah I partially remember this conversation. What was your preference, to put the genotype with the repertoire?

I have addressed the problem by pre-scanning the VDJbase repertoire metadata in vdjbase_metadata_transform.py, in order to build a dictionary that translates vdjbase sample name to the airr repertoire study_id and subject_id. This is further translated after repertoire processing into a map between vdjbase name and the akc_ids for investigation and participant. Other akc_ids could be provided in the mapping if necessary. I think this gives us a workable implementation, although one that could be improved by refining the AIRR schema, given the will of those that would be involved.

Ok, I hope that wasn't unneeded work. Would it be simplified in some way by updating the AKC schema? I guess the API could also be modified if necessary.

schristley · 2025-10-07T00:05:47Z

Hi William, I pushed code for an load_adc_container() function which will load up most AK metadata for the ADC. Pass it a container object and it will put the data in it:

container = AIRRKnowledgeCommons()
load_adc_container(container)
print(container)

Next is to maybe write a simple function that returns the AK Investigation based upon a study_id

I need to send you a TGZ copy of the transformed data though. I'll tar that up and email to you.

williamdlees · 2025-10-07T15:37:47Z

Thanks Scott. I don't think we need to look at schema changes right now. The code is working and it's handling the data correctly. We can refine it a bit later but I think things are ok for the time being.

What I really need now are answers to these questions:

The next question is how the genotypes should actually be linked: what objects do you want me to create and what fields should I populate ? I can't for example find an Assay for genotyping, or a Conclusion. Perhaps there is a field I have overlooked. At any rate if you could please spell this out for me I think I can finish this code fairly quickly.

That will let me write AKC objects from VDJbase data in the same way that adc_repertoire_transform.py writes AKC objects from ADC data.

I am a bit confused about load_adc_container and where it fits, and why the data you are planning to zip up is going to be helpful to me, perhaps we can find some time to talk about that in one of the coming calls.

schristley · 2025-10-07T22:56:20Z

I am a bit confused about load_adc_container and where it fits, and why the data you are planning to zip up is going to be helpful to me, perhaps we can find some time to talk about that in one of the coming calls.

If the genotype is for a study that's already in the ADC, we don't want to re-transform VDJbase's repertoire metadata. The reason is that different AKC IDs will be assigned, and it will end up creating duplicate study data.

But this isn't that critical at this stage I guess. The code that is calling transform_airr_repertoires on the VDJbase repertoires will need to be separated at some point and/or rewritten in a way so that if the study/repertoire has already been transformed to AKC, it returns those existing objects instead of creating new ones.

Your code that creates mappings, subj_id_to_participant and study_id_to_investigation are likely still needed.

schristley · 2025-10-07T23:12:55Z

What I really need now are answers to these questions:

The next question is how the genotypes should actually be linked: what objects do you want me to create and what fields should I populate ? I can't for example find an Assay for genotyping, or a Conclusion. Perhaps there is a field I have overlooked. At any rate if you could please spell this out for me I think I can finish this code fairly quickly.

The assay is AIRRSequencingAssay. In transform_airr_repertoire, it is created near the end. I'll note that the assay also has the repertoire_id for the ADC repertoire, though I don't know if this will help you because for shared studies (ADC and VDJbase) if the repertoire IDs stay the same.

Getting the participant for that assay then requires following a few links. The assay should point to a Specimen. The Specimen points to a LifeEvent (specimen collection), and that LifeEvent points to the Participant.

We are going to represent the Genotype as a DataSet that comes out of a DataTransformation, where the input is the AIRRSequencingData from the assay and the output is AIRRGenotypeData (AKC representation of the AIRR Genotype) This is rough pseudo-code:

# given 'a' is the AIRRSequencingAssay object
# get the AIRRSequencingData from the assay
asd = a.sequencing_files

d = DataTransformation(
    akc_id(),
    data_transformation_types=['genotype_inference'])

g = AIRRGenotypeData(
    akc_id(),
    data_item_types=['genotype'],
    # other Genotype fields)

m = InputOutputDataMap(
    data_transformation=d.akc_id,
    has_specified_input=asd.akc_id,
    has_specified_output=g.akc_id)

# not sure if this is right
# d.was_generated_by[m.akc_id] = m

What's missing then is for these objects to be into the container so they can be serialized to disk. There is datasets for holding the genotype data, but it looks like we are missing slots for the DataTransformation and for the InputOutputDataMap.

container['datasets'][g.akc_id] = g

williamdlees · 2025-10-08T07:56:19Z

Hi Scott,

If the genotype is for a study that's already in the ADC, we don't want to re-transform VDJbase's repertoire metadata. The reason is that different AKC IDs will be assigned, and it will end up creating duplicate study data.

This is implemented in transform_airr_repertoires.py. Specifically at line 76, an Investigation is only created if the existing archival_id is not found in the container already. Likewise at lines 139 and 198, Participant and Specimen are only created if the existing objects are not in the container. Hence if the container is pre-loaded with current studies, they will not be re-processed. The code relies on IDs that should be the same even if the repertoire metadata is re-coded: specifically MiAIRR study_id, subject_id, sample_id. I'm sure we will find corner cases, though.

There is datasets for holding the genotype data, but it looks like we are missing slots for the DataTransformation and for the InputOutputDataMap.

Are you going to implement these? I'm ready to write the code (at around line 182 in vdjbase_metadata_transform.py) once the object definitions are in place.

schristley · 2025-10-08T14:54:39Z

Hi William,

This is implemented in transform_airr_repertoires.py.

Ok, great. I haven't had a chance to test the integration with your re-factoring, but hopefully I can before the next schema meeting.

Are you going to implement these? I'm ready to write the code (at around line 182 in vdjbase_metadata_transform.py) once the object definitions are in place.

Yes, I just pushed changes to the genotype branch for ak-schema which adds two slots.

williamdlees · 2025-10-15T09:48:41Z

Hi Scott,

A couple of things I'll raise this afternoon:

I think there may need to be a slot in the container for InputOutputDataMap.
The mapping of sequencing_files to GenotypeSet is a little complex, given that GenotypeSet will often contain Genotypes for more than one locus, and these Genotypes may (but not always) be derived from multiple SequencingFiles. I am not sure that we want to get too explicit about the mapping between the specific file and the specific genotype: that could well be difficult to discern from the data anyway, But I think it would be good to allow multiple InputOutputDataMap to take a list of inputs.

All the best

William

williamdlees · 2025-10-15T10:00:18Z

Sorry one additional thing for this afternnoon.

in ak_schema.py there is a definition of a GenotypeSet which takes as a member a list of Genotypes (genotype_class_list).

however AIRRGenotypeData takes a list of Genotypes rather than a GenotypeSet. I think it needs to be modified to take the GenotypeSet instead, or else we could just drop GenotypeSet.

the end of transform_airr_genotypes.py

…otype

williamdlees · 2025-10-15T15:52:26Z

This is another thing I mentioned on the call: creating a DataTransformation object seems to throw an error:

        '''
        This throws an error:
        Exception has occurred: ValueError
            DataTransformation({
            'akc_id': 'AKC:337ce816-a54c-432f-8208-47d63388e5ca',
            'data_transformation_types': [DataTransformationTypeEnum(text='genotype_inference')]
            }) is not a valid URI or CURIE
        '''
        data_transformation = DataTransformation(
            akc_id(),
            data_transformation_types=['genotype_inference'],
        )

schristley · 2025-10-15T17:59:02Z

Hi William, can you try this code?

         data_transformation = DataTransformation(
             akc_id(),
             data_transformation_types=[DataTransformationTypeEnum('genotype_inference')],
         )

This is another thing I mentioned on the call: creating a DataTransformation object seems to throw an error:

        data_transformation = DataTransformation(
            akc_id(),
            data_transformation_types=['genotype_inference'],
        )

williamdlees · 2025-10-16T13:01:38Z

Thanks, that works now, once I change the code to pass datatransformation.akc_id to InputOutputDataMap rather than passing the object. Without that it gives this weird 'is not a CURIE' error still, even though it's expecting a string.

Is there a reason we're using an out of date version of Python by the way? More recent versions like 3.12 and 3.13 have better stack traces and error reporting. 3.13 is a year old now, so not exactly bleeding edge.

schristley · 2025-10-19T22:19:29Z

Is there a reason we're using an out of date version of Python by the way? More recent versions like 3.12 and 3.13 have better stack traces and error reporting. 3.13 is a year old now, so not exactly bleeding edge.

Only because linkml was linked to that version, and the newer pythons have conflicts in the docker build. This will hopefully be cleaned up when upgrading to the new build system linkml uses.

schristley · 2025-10-19T22:54:32Z

Hi William,

I think there may need to be a slot in the container for InputOutputDataMap.

I added this and also had to make some changes to the output routines for InputOutputDataMap. It seems to be working now.

schristley · 2025-10-19T23:13:42Z

Hi William,

The mapping of sequencing_files to GenotypeSet is a little complex, given that GenotypeSet will often contain Genotypes for more than one locus, and these Genotypes may (but not always) be derived from multiple SequencingFiles. I am not sure that we want to get too explicit about the mapping between the specific file and the specific genotype: that could well be difficult to discern from the data anyway, But I think it would be good to allow multiple InputOutputDataMap to take a list of inputs.

I was originally thinking of AIRRGenotypeData to be mostly equivalent to AIRR GenotypeSet, but that might be too coarse-grained. Maybe a better idea is for AIRRGenotypeData to be mostly equivalent to AIRR Genotype, so can be more specific with the input files to output mapping.

InputOutputDataMap allows multiple inputs; they get specified as multiple objects, e.g., with same data transformation ID and output ID but different input ID. The three IDs together is a compound identifier.

williamdlees · 2025-10-22T10:50:33Z

Yes the genotype code seems to be working now.

Personally I think we should keep the input files to output mapping as it is, with potentially multiple InputOutputDataMap objects. I think the granularity is fine and we may not get a consistent mapping between the input files and the genotype at locus level, because it depends how the files were organised at the point of SRA submission

williamdlees and others added 7 commits August 31, 2025 10:33

Allow the environment variable AK_DATA_DIR to override the location of

cb9ada8

ak_Data_dir

Factor out the actual analysis of the miairr repertoire into a

04494f0

separarate function so that it can be re-used in vdjbase metadata analysis

Allow the location of akc-data to be set by an environment variable

3aa38dd

Factor repertoire analysis into a separate function for re-use

b4250aa

FIrst commit of vdjbase metadata analysis

33f3426

Add notes after meeting

b4f9fa2

Merge pull request #2 from williamdlees/genotype

645c267

Genotype

schristley assigned schristley and williamdlees Sep 24, 2025

schristley and others added 5 commits September 24, 2025 14:00

initial extract targets

c284845

add nano

acc1c10

push vdjbase data in own directory

8ec5169

schema updates

de28efb

separate import data area for vdjbase genotypes

5ee78be

williamdlees added 2 commits October 1, 2025 18:21

Handle multiple studies in a file

dd2d1bb

Fixes following tests against live VDJbase data

2a326cb

function to load AK data for ADC

fd48532

schristley and others added 2 commits October 13, 2025 19:34

fix to earlier python

b8694d0

avoid error for empty InputOutputMap

f226d86

Scott Christley added 3 commits October 14, 2025 19:07

bug fix

4a9bea9

add more fine-grained targets for ADC

6106ca8

add messages

dff5246

williamdlees added 3 commits October 15, 2025 11:22

Ensure specimens etc are only created once

ae7ca21

Current state of play for vdjbase takeon: see issues highlighted towards

2be6ef6

the end of transform_airr_genotypes.py

Merge branch 'genotype' of github.com:airr-knowledge/ak-etvl into gen…

7ce4fce

…otype

add fix for extraneous Unicode

33ed5df

add input_output_map

070a6bf

Remove comments on issues as the ussues are now resolved.

ebfde65

schristley marked this pull request as ready for review November 25, 2025 18:14

schristley merged commit 0db10f6 into main Nov 25, 2025

schristley deleted the genotype branch March 2, 2026 22:08

Conversation

schristley commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

schristley commented Sep 24, 2025

Uh oh!

schristley commented Sep 25, 2025

Uh oh!

williamdlees commented Oct 1, 2025 via email

Uh oh!

williamdlees commented Oct 4, 2025

Uh oh!

schristley commented Oct 6, 2025

Uh oh!

schristley commented Oct 7, 2025

Uh oh!

williamdlees commented Oct 7, 2025

Uh oh!

schristley commented Oct 7, 2025

Uh oh!

schristley commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

williamdlees commented Oct 8, 2025

Uh oh!

schristley commented Oct 8, 2025

Uh oh!

williamdlees commented Oct 15, 2025

Uh oh!

williamdlees commented Oct 15, 2025

Uh oh!

williamdlees commented Oct 15, 2025

Uh oh!

schristley commented Oct 15, 2025

Uh oh!

williamdlees commented Oct 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

schristley commented Oct 19, 2025

Uh oh!

schristley commented Oct 19, 2025

Uh oh!

schristley commented Oct 19, 2025

Uh oh!

williamdlees commented Oct 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

schristley commented Sep 24, 2025 •

edited

Loading

schristley commented Oct 7, 2025 •

edited

Loading

williamdlees commented Oct 16, 2025 •

edited

Loading