Now that scirpy is part of scverse, we could think of an improved data structure for scAIRR data. See also the discussion at scverse/scanpy#1387.
The challenge with scAIRR data is that
- 1 cell can have n chains. Up to four of them are biologically meaningful, but there could be more for technical reasons.
- Each chain has a lot of fields. See the AIRR rearrangement standard.
The current pragmatic solution is to store all fields in adata.obs.
- All columns from the airr rearrangement schema are repeated four times
- Excess chains are serialized into JSON and stored in an extra column. These chains are not used by scirpy, but enable lossless conversions.
- The downside is that there can easily be 100+ columns in adata.obs. Also, serializing excess chains is not really elegant.
- The advantage is that it works really well with scanpy, i.e. any AIRR variable can immediately be used for grouping, plotting etc.
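To make the current layout concrete, here is a minimal sketch in plain Python. It is not scirpy's actual code: the column names (`chain_1_locus` etc.), the `to_wide_row` helper, the choice of three fields and the example values are all assumptions made up for illustration. Only the idea of four repeated column groups plus one JSON column for excess chains is taken from the description above.

```python
import json

MAX_CHAINS = 4  # number of chain slots that get their own columns
FIELDS = ("locus", "junction_aa", "v_call")  # small subset of AIRR fields


def to_wide_row(chains):
    """Flatten a list of chain dicts into one wide row (hypothetical helper)."""
    row = {}
    for slot in range(MAX_CHAINS):
        # Unused slots are filled with None for every field.
        chain = chains[slot] if slot < len(chains) else {}
        for field in FIELDS:
            row[f"chain_{slot + 1}_{field}"] = chain.get(field)
    # Chains beyond the fixed slots are kept losslessly as a JSON string.
    row["extra_chains"] = json.dumps(chains[MAX_CHAINS:])
    return row


# Made-up example cell with five chains, i.e. one more than fits the slots.
cell_chains = [
    {"locus": "TRA", "junction_aa": "CAVSDLEPNSSASKIIF", "v_call": "TRAV21"},
    {"locus": "TRB", "junction_aa": "CASSLGQAYEQYF", "v_call": "TRBV7-9"},
    {"locus": "TRA", "junction_aa": "CAASGGSYIPTF", "v_call": "TRAV13-1"},
    {"locus": "TRB", "junction_aa": "CASSPGTGGNEQFF", "v_call": "TRBV5-1"},
    {"locus": "TRB", "junction_aa": "CASSQDRGNTEAFF", "v_call": "TRBV4-1"},
]

row = to_wide_row(cell_chains)
print(len(row))  # 4 slots x 3 fields + 1 JSON column
```

With the full set of rearrangement fields instead of three, the same scheme is what pushes the column count past 100.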
New options are
- mudata. AIRR data could be saved as a separate modality. Even if we keep the current representation of a wide data frame, it would at least not clutter the rest of adata.obs.
- awkward array support. Allows storing an arbitrary number of values per row. See first attempt to support awkward arrays (anndata#647).
The new representation should also aim at being a community standard for the scverse ecosystem and should build upon the AIRR rearrangement standard. Ideally, we could get additional stakeholders onboard, including conga, dandelion, tcrdist3 and possibly members of the AIRR community.
- What's the state of the AIRR single-cell schema, and what are its advantages over the rearrangement schema?