Hi @AnnaChristina,
I am interested in generating Nicheformer embeddings using the available checkpoint. Is there an end-to-end tutorial showing how to tokenize a data source into the format expected by the model? I've tried the following, but the `data/model_means/model.h5ad` file seems to contain only a single observation (which appears to be unexpected given the related tokenization notebooks).
- I installed the `nicheformer` package and downloaded the pretrained model weights from Mendeley.
- I downloaded and preprocessed the exemplar spatial and dissociated datasets using the `download_*` and `preprocess_*` scripts in `data/spatialcorpus-110M/spatial/examplary-Xenium` and `data/spatialcorpus-110M/dissociated/Lu_2021`, respectively. I updated the default paths in the constants file.
- Following `nicheformer/tree/main/notebooks/tokenization/xenium_human_lung.ipynb`, I tried to run the tokenization process for `.../spatial/preprocessed/Xenium_Preview_Human_Non_diseased_Lung_With_Add_on_FFPE_outs.h5ad`. I mapped `DATA_PATH` to this `h5ad` (which corresponds to `healthy` in the notebook?), `xenium_mean` to `data/model_means/xenium_mean_script.npy`, and `model` to `data/model_means/model.h5ad`.
It appears that my `xenium` object contains the expected `obs`, `var`, etc. subobjects, with fewer samples than the shapes logged in the notebook. However, `model` seems to be missing observations:
```
AnnData object with n_obs × n_vars = 1 × 20310
    obs: 'soma_joinid', 'is_primary_data', 'dataset_id', 'donor_id', 'assay', 'cell_type', 'development_stage', 'disease', 'tissue', 'tissue_general', 'specie', 'technology', 'dataset', 'x', 'y', 'assay_ontology_term_id', 'sex_ontology_term_id', 'organism_ontology_term_id', 'tissue_ontology_term_id', 'suspension_type', 'condition_id', 'tissue_type', 'library_key', 'organism', 'sex', 'niche', 'region', 'nicheformer_split', 'author_cell_type', 'batch'
```
I believe this breaks the inner join in the following block, since the resulting post-join `xenium` object ends up with a shape of `AnnData object with n_obs × n_vars = 295883 × 391` rather than the notebook's logged `AnnData object with n_obs × n_vars = 827048 × 20310`:
```python
adata = ad.concat([model, xenium], join='inner', axis=0)
# dropping the first observation
xenium = adata[1:].copy()
# for memory efficiency
del adata
```
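In case it helps diagnose the issue: with `axis=0` and `join='inner'`, `ad.concat` stacks observations but keeps only the variables shared by both objects, so the template's gene panel directly determines which variables survive the join. A minimal sketch of that semantics using pandas (purely illustrative; the column names and shapes are made up, not the real gene panel):

```python
import pandas as pd

# "model" plays the role of the one-observation template carrying the
# full panel; "xenium" carries real cells but a smaller panel.
model = pd.DataFrame([[0.0] * 5], columns=list("ABCDE"))      # 1 obs x 5 vars
xenium = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))   # 2 obs x 2 vars

# axis=0 + join='inner': observations stack, variables intersect.
joined = pd.concat([model, xenium], join="inner", axis=0)
print(joined.shape)  # (3, 2) -- vars collapse to the intersection

# Mirroring the notebook: drop the dummy template row afterwards.
cells = joined.iloc[1:]
print(cells.shape)   # (2, 2)
```

This is why a `model.h5ad` whose contents do not match the notebook's template would change both the post-join shape and, downstream, the tokenized representation.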
Would it be possible to add an updated `model.h5ad` file and a more detailed end-to-end example, so that we can format our datasets to match the tokenized representations expected by the `Nicheformer.get_embeddings()` method?