scverse datastructure for AIRR data

Now that scirpy is part of scverse, we could think of an improved data structure for scAIRR data. See also the discussion at https://github.com/theislab/scanpy/issues/1387. 

The challenge with scAIRR data is that
 * `1` cell can have `n` chains. Up to four of them are biologically meaningful but there could be more for technical reasons.
 * Each chain has a lot of fields. See the [AIRR rearrangement standard](https://docs.airr-community.org/en/latest/datarep/rearrangements.html). 

The current pragmatic solution is to store all fields in `adata.obs`. 
 * All columns from the AIRR rearrangement schema are repeated four times, once for each of the up to four chains per cell.
 * Excess chains are serialized into JSON and stored in an extra column. These chains are not used by scirpy, but enable lossless conversions. 
 * The downside is that there can easily be 100+ columns in `adata.obs`. Also serializing excess chains is not really elegant. 
 * The advantage is that it works really well with scanpy, i.e. any AIRR variable can immediately be used for grouping, plotting etc. 
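
To make the trade-off concrete, here is a minimal sketch of the wide layout described above. It is not scirpy's actual implementation: the column naming scheme (`chain_1_locus`, ...), the `extra_chains` column, the `flatten_cell` helper and the three-field subset of the rearrangement schema are all assumptions made for illustration, and the example values are toy data.

```python
import json

# Hypothetical subset of AIRR rearrangement fields, chosen for illustration only.
FIELDS = ("locus", "junction_aa", "v_call")
N_SLOTS = 4  # number of chains that get their own set of columns


def flatten_cell(chains):
    """Flatten the chains of one cell into a single wide row (column -> value)."""
    row = {}
    for slot in range(N_SLOTS):
        # Missing chains leave their columns empty (None).
        chain = chains[slot] if slot < len(chains) else {}
        for field in FIELDS:
            row[f"chain_{slot + 1}_{field}"] = chain.get(field)
    # Excess chains are serialized to JSON so that the conversion stays lossless.
    row["extra_chains"] = json.dumps(chains[N_SLOTS:])
    return row


# Toy cell with two chains (made-up values).
cell = [
    {"locus": "TRA", "junction_aa": "CAVSDF", "v_call": "TRAV1-2"},
    {"locus": "TRB", "junction_aa": "CASSLG", "v_call": "TRBV6-5"},
]
row = flatten_cell(cell)
```

With only three fields this already yields 13 columns per cell; with the full rearrangement schema the column count grows accordingly, which is the clutter described above.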

New options are
 * **mudata**. AIRR data could be saved as a separate modality. Even if we keep the current representation of a wide data frame, it would at least not clutter the rest of `adata.obs`. 
 * **awkward array support**. Allows storing an arbitrary number of values per row. See https://github.com/theislab/anndata/pull/647

The new representation should also aim at being a community standard for the scverse ecosystem and should build upon the AIRR rearrangement standard. Ideally, we could get additional stakeholders onboard, including conga, dandelion, tcrdist3 and possibly members of the AIRR community. 

- [x] What's the state of the AIRR single-cell schema, and what are its advantages over the rearrangement schema?


