Skip to content

Genotype integration from VDJbase#3

Merged
schristley merged 26 commits into
mainfrom
genotype
Nov 25, 2025
Merged

Genotype integration from VDJbase#3
schristley merged 26 commits into
mainfrom
genotype

Conversation

@schristley
Copy link
Copy Markdown
Member

@schristley schristley commented Sep 24, 2025

We can split this up into two scenarios:

  1. Genotype data where the AIRR-seq data resides only in VDJbase.
  2. Genotype data where the AIRR-seq data resides in both ADC and VDJbase (though the sequence data has likely been processed differently).

I think we should focus on this latter one as it is can be used for both. In theory, part 1 can split into two steps: 1) use the standard ADC integration to bring in the VDJbase data then 2) use the same part 2 above to bring in genotype data.

The challenge will be to link up the identifiers between the study, subject and genotype. Once we have that worked out, then it will be easier to dig into the details of how the genotype data is represented.

A few studies (not exhaustive) that already exist in ADC that we can use for testing:

  • PRJNA300878 (cache: 6508961642208563691-242ac113-0001-012)
  • PRJNA509910 (cache: 8575123754278514195-242ac11b-0001-012)
  • PRJNA680539 (cache: 6270798281029250580-242ac117-0001-012)

@schristley
Copy link
Copy Markdown
Member Author

My initial idea of the transform process goes like this:

  1. Load up the already transformed ADC metadata (Investigations, Participants, Assays, and maybe others).
  2. Load up the genotype data in AIRR format from VDJbase.
  3. Load up the AIRR repertoire data in AIRR format from VDJbase? (I think this is needed)
  4. For each genotype:
  5. Find the appropriate study/subject in the AIRR repertoire for the genotype.
  6. Search the transformed ADC metadata for the Investigation, based on study_id.
  7. Search the transformed ADC metadata for the Participant, based on subject_id?
  8. Find the AIRRSequencingAssay for the Participant, add the objects for the genotype_inference process and for the AIRRGenotypeData.

AIRRSequencingAssay —sequencing_files—> AIRRSequencingData <—has_specified_input— DataTransformation —has_specified_output—>GenotypeData

  • AIRRSequencingAssay for the AIRR-seq experiment (this already exists)
  • AIRRSequencingFiles is raw FASTQ data and is the sequencing_files for that AIRRSequencingAssay (this already exists)
  • DataTransformation for the genotype inference (new object)
  • AIRRGenotypeData with genotype enum to represent the genotype data (new object)
  • Lastly, entry in the InputOutputDataMap that links input (AIRRSequencingFiles) with output (genotype data item) for that DataTransformation

While this structure conforms to the data model, I'm not sure how efficient queries will be because there are numerous joins/tables in SQL between the Participant and the actual genotype data. We might need to decide to add some more direct relationships.

@schristley
Copy link
Copy Markdown
Member Author

I was looking at the genotype data and realizing that it doesn’t really have links back to the repertoire, so I don’t think my idea of looping through the genotypes will work?

I see there is a “subject_name” field which I guess is a custom field for VDJbase, but if I search for that subject_name in the repertoires then I don’t find it as a subject_id or repertoire_id.

For the AIRR Standards, I guess we didn’t formulate exactly how to link genotype sets with repertoires/subjects? There is the genotype field within the AIRR Subject, but that VDJbase doesn't seem to be storing it there. Is this something lacking in the AIRR Standards? Or we can just resolve it for the AKC.

@williamdlees
Copy link
Copy Markdown
Contributor

williamdlees commented Oct 1, 2025 via email

@williamdlees
Copy link
Copy Markdown
Contributor

Hi Scott. Having looked at this further I realised that there was a substantial issue here. The genotypes returned by VDJbase describe the subject and sample that they refer to by the VDJbase name (eg P1_I2_S3 where P, I and S refer to the project, Individual and Sample). The VDJbase name for the repoertoire is also returned in the VDJbase repertoire metadata, but it is not recognised by the akc schema because this is not a defined field in the AIRR schema.

Arguably the problem is caused by the API call which returns all genotypes in a single file with no other repertoire metadata, and also by the AIRR schema which links genotypes directly to subject without referring to the sample from which they were derived. I argued hard against this definition at the time.

I have addressed the problem by pre-scanning the VDJbase repertoire metadata in vdjbase_metadata_transform.py, in order to build a dictionary that translates vdjbase sample name to the airr repertoire study_id and subject_id. This is further translated after repertoire processing into a map between vdjbase name and the akc_ids for investigation and participant. Other akc_ids could be provided in the mapping if necessary. I think this gives us a workable implementation, although one that could be improved by refining the AIRR schema, given the will of those that would be involved.

I have checked in the code, which is working up to this point.

The next question is how the genotypes should actually be linked: what objects do you want me to create and what fields should I populate ? I can't for example find an Assay for genotyping, or a Conclusion. Perhaps there is a field I have overlooked. At any rate if you could please spell this out for me I think I can finish this code fairly quickly.

Thanks

William

@schristley
Copy link
Copy Markdown
Member Author

Hi William, I'm finally getting a chance to look at this again.

The VDJbase name for the repoertoire is also returned in the VDJbase repertoire metadata, but it is not recognised by the akc schema because this is not a defined field in the AIRR schema.

Is that something we should add to the AKC schema? If you've extended AIRR to better support the mapping then we should be able to support it directly in AKC.

Arguably the problem is caused by the API call which returns all genotypes in a single file with no other repertoire metadata, and also by the AIRR schema which links genotypes directly to subject without referring to the sample from which they were derived. I argued hard against this definition at the time.

Ok, yeah I partially remember this conversation. What was your preference, to put the genotype with the repertoire?

I have addressed the problem by pre-scanning the VDJbase repertoire metadata in vdjbase_metadata_transform.py, in order to build a dictionary that translates vdjbase sample name to the airr repertoire study_id and subject_id. This is further translated after repertoire processing into a map between vdjbase name and the akc_ids for investigation and participant. Other akc_ids could be provided in the mapping if necessary. I think this gives us a workable implementation, although one that could be improved by refining the AIRR schema, given the will of those that would be involved.

Ok, I hope that wasn't unneeded work. Would it be simplified in some way by updating the AKC schema? I guess the API could also be modified if necessary.

@schristley
Copy link
Copy Markdown
Member Author

Hi William, I pushed code for an load_adc_container() function which will load up most AK metadata for the ADC. Pass it a container object and it will put the data in it:

container = AIRRKnowledgeCommons()
load_adc_container(container)
print(container)

Next is to maybe write a simple function that returns the AK Investigation based upon a study_id

I need to send you a TGZ copy of the transformed data though. I'll tar that up and email to you.

@williamdlees
Copy link
Copy Markdown
Contributor

Thanks Scott. I don't think we need to look at schema changes right now. The code is working and it's handling the data correctly. We can refine it a bit later but I think things are ok for the time being.

What I really need now are answers to these questions:

The next question is how the genotypes should actually be linked: what objects do you want me to create and what fields should I populate ? I can't for example find an Assay for genotyping, or a Conclusion. Perhaps there is a field I have overlooked. At any rate if you could please spell this out for me I think I can finish this code fairly quickly.

That will let me write AKC objects from VDJbase data in the same way that adc_repertoire_transform.py writes AKC objects from ADC data.

I am a bit confused about load_adc_container and where it fits, and why the data you are planning to zip up is going to be helpful to me, perhaps we can find some time to talk about that in one of the coming calls.

@schristley
Copy link
Copy Markdown
Member Author

I am a bit confused about load_adc_container and where it fits, and why the data you are planning to zip up is going to be helpful to me, perhaps we can find some time to talk about that in one of the coming calls.

If the genotype is for a study that's already in the ADC, we don't want to re-transform VDJbase's repertoire metadata. The reason is that different AKC IDs will be assigned, and it will end up creating duplicate study data.

But this isn't that critical at this stage I guess. The code that is calling transform_airr_repertoires on the VDJbase repertoires will need to be separated at some point and/or rewritten in a way so that if the study/repertoire has already been transformed to AKC, it returns those existing objects instead of creating new ones.

Your code that creates mappings, subj_id_to_participant and study_id_to_investigation are likely still needed.

@schristley
Copy link
Copy Markdown
Member Author

schristley commented Oct 7, 2025

What I really need now are answers to these questions:

The next question is how the genotypes should actually be linked: what objects do you want me to create and what fields should I populate ? I can't for example find an Assay for genotyping, or a Conclusion. Perhaps there is a field I have overlooked. At any rate if you could please spell this out for me I think I can finish this code fairly quickly.

The assay is AIRRSequencingAssay. In transform_airr_repertoire, it is created near the end. I'll note that the assay also has the repertoire_id for the ADC repertoire, though I don't know if this will help you because for shared studies (ADC and VDJbase) if the repertoire IDs stay the same.

Getting the participant for that assay then requires following a few links. The assay should point to a Specimen. The Specimen points to a LifeEvent (specimen collection), and that LifeEvent points to the Participant.

We are going to represent the Genotype as a DataSet that comes out of a DataTransformation, where the input is the AIRRSequencingData from the assay and the output is AIRRGenotypeData (AKC representation of the AIRR Genotype) This is rough pseudo-code:

# given 'a' is the AIRRSequencingAssay object
# get the AIRRSequencingData from the assay
asd = a.sequencing_files

d = DataTransformation(
    akc_id(),
    data_transformation_types=['genotype_inference'])

g = AIRRGenotypeData(
    akc_id(),
    data_item_types=['genotype'],
    # other Genotype fields)

m = InputOutputDataMap(
    data_transformation=d.akc_id,
    has_specified_input=asd.akc_id,
    has_specified_output=g.akc_id)

# not sure if this is right
# d.was_generated_by[m.akc_id] = m

What's missing then is for these objects to be into the container so they can be serialized to disk. There is datasets for holding the genotype data, but it looks like we are missing slots for the DataTransformation and for the InputOutputDataMap.

container['datasets'][g.akc_id] = g

@williamdlees
Copy link
Copy Markdown
Contributor

Hi Scott,

If the genotype is for a study that's already in the ADC, we don't want to re-transform VDJbase's repertoire metadata. The reason is that different AKC IDs will be assigned, and it will end up creating duplicate study data.

This is implemented in transform_airr_repertoires.py. Specifically at line 76, an Investigation is only created if the existing archival_id is not found in the container already. Likewise at lines 139 and 198, Participant and Specimen are only created if the existing objects are not in the container. Hence if the container is pre-loaded with current studies, they will not be re-processed. The code relies on IDs that should be the same even if the repertoire metadata is re-coded: specifically MiAIRR study_id, subject_id, sample_id. I'm sure we will find corner cases, though.

There is datasets for holding the genotype data, but it looks like we are missing slots for the DataTransformation and for the InputOutputDataMap.

Are you going to implement these? I'm ready to write the code (at around line 182 in vdjbase_metadata_transform.py) once the object definitions are in place.

@schristley
Copy link
Copy Markdown
Member Author

Hi William,

This is implemented in transform_airr_repertoires.py.

Ok, great. I haven't had a chance to test the integration with your re-factoring, but hopefully I can before the next schema meeting.

Are you going to implement these? I'm ready to write the code (at around line 182 in vdjbase_metadata_transform.py) once the object definitions are in place.

Yes, I just pushed changes to the genotype branch for ak-schema which adds two slots.

@williamdlees
Copy link
Copy Markdown
Contributor

Hi Scott,

A couple of things I'll raise this afternoon:

  1. I think there may need to be a slot in the container for InputOutputDataMap.

  2. The mapping of sequencing_files to GenotypeSet is a little complex, given that GenotypeSet will often contain Genotypes for more than one locus, and these Genotypes may (but not always) be derived from multiple SequencingFiles. I am not sure that we want to get too explicit about the mapping between the specific file and the specific genotype: that could well be difficult to discern from the data anyway, But I think it would be good to allow multiple InputOutputDataMap to take a list of inputs.

All the best

William

@williamdlees
Copy link
Copy Markdown
Contributor

Sorry one additional thing for this afternnoon.

in ak_schema.py there is a definition of a GenotypeSet which takes as a member a list of Genotypes (genotype_class_list).

however AIRRGenotypeData takes a list of Genotypes rather than a GenotypeSet. I think it needs to be modified to take the GenotypeSet instead, or else we could just drop GenotypeSet.

@williamdlees
Copy link
Copy Markdown
Contributor

This is another thing I mentioned on the call: creating a DataTransformation object seems to throw an error:

        '''
        This throws an error:
        Exception has occurred: ValueError
            DataTransformation({
            'akc_id': 'AKC:337ce816-a54c-432f-8208-47d63388e5ca',
            'data_transformation_types': [DataTransformationTypeEnum(text='genotype_inference')]
            }) is not a valid URI or CURIE
        '''
        data_transformation = DataTransformation(
            akc_id(),
            data_transformation_types=['genotype_inference'],
        )


@schristley
Copy link
Copy Markdown
Member Author

Hi William, can you try this code?

         data_transformation = DataTransformation(
             akc_id(),
             data_transformation_types=[DataTransformationTypeEnum('genotype_inference')],
         )
This is another thing I mentioned on the call: creating a DataTransformation object seems to throw an error:

        data_transformation = DataTransformation(
            akc_id(),
            data_transformation_types=['genotype_inference'],
        )

@williamdlees
Copy link
Copy Markdown
Contributor

williamdlees commented Oct 16, 2025

Thanks, that works now, once I change the code to pass datatransformation.akc_id to InputOutputDataMap rather than passing the object. Without that it gives this weird 'is not a CURIE' error still, even though it's expecting a string.

Is there a reason we're using an out of date version of Python by the way? More recent versions like 3.12 and 3.13 have better stack traces and error reporting. 3.13 is a year old now, so not exactly bleeding edge.

@schristley
Copy link
Copy Markdown
Member Author

Is there a reason we're using an out of date version of Python by the way? More recent versions like 3.12 and 3.13 have better stack traces and error reporting. 3.13 is a year old now, so not exactly bleeding edge.

Only because linkml was linked to that version, and the newer pythons have conflicts in the docker build. This will hopefully be cleaned up when upgrading to the new build system linkml uses.

@schristley
Copy link
Copy Markdown
Member Author

Hi William,

  1. I think there may need to be a slot in the container for InputOutputDataMap.

I added this and also had to make some changes to the output routines for InputOutputDataMap. It seems to be working now.

@schristley
Copy link
Copy Markdown
Member Author

Hi William,

  1. The mapping of sequencing_files to GenotypeSet is a little complex, given that GenotypeSet will often contain Genotypes for more than one locus, and these Genotypes may (but not always) be derived from multiple SequencingFiles. I am not sure that we want to get too explicit about the mapping between the specific file and the specific genotype: that could well be difficult to discern from the data anyway, But I think it would be good to allow multiple InputOutputDataMap to take a list of inputs.

I was originally thinking of AIRRGenotypeData to be mostly equivalent to AIRR GenotypeSet, but that might be too coarse-grained. Maybe a better idea is for AIRRGenotypeData to be mostly equivalent to AIRR Genotype, so can be more specific with the input files to output mapping.

InputOutputDataMap allows multiple inputs; they get specified as multiple objects, e.g., with same data transformation ID and output ID but different input ID. The three IDs together is a compound identifier.

@williamdlees
Copy link
Copy Markdown
Contributor

Yes the genotype code seems to be working now.

Personally I think we should keep the input files to output mapping as it is, with potentially multiple InputOutputDataMap objects. I think the granularity is fine and we may not get a consistent mapping between the input files and the genotype at locus level, because it depends how the files were organised at the point of SRA submission

@schristley schristley marked this pull request as ready for review November 25, 2025 18:14
@schristley schristley merged commit 0db10f6 into main Nov 25, 2025
@schristley schristley deleted the genotype branch March 2, 2026 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants