Hi @AnnaChristina,
I am interested in generating Nicheformer embeddings using the available checkpoint. Is there an end-to-end tutorial showing how to tokenize a data source into the format expected by the model? I've tried the following, but the `data/model_means/model.h5ad` file seems to contain only a single observation (which appears to be unexpected given the related tokenization notebooks).
- I installed the `nicheformer` package and downloaded the pretrained model weights from Mendeley.
- I downloaded and preprocessed the exemplar spatial and dissociated datasets using the `download_*` and `preprocess_*` scripts in `data/spatialcorpus-110M/spatial/examplary-Xenium` and `data/spatialcorpus-110M/dissociated/Lu_2021`, respectively. I updated the default paths in the constants file.
- Following `nicheformer/tree/main/notebooks/tokenization/xenium_human_lung.ipynb`, I tried to run the tokenization process for `.../spatial/preprocessed/Xenium_Preview_Human_Non_diseased_Lung_With_Add_on_FFPE_outs.h5ad`. I mapped `DATA_PATH` to this `h5ad` (which corresponds to `healthy` in the notebook?), `xenium_mean` to `data/model_means/xenium_mean_script.npy`, and `model` to `data/model_means/model.h5ad`.
It appears that my `xenium` object contains the expected `obs`, `var`, etc. subobjects, with fewer samples than the shapes logged in the notebook. However, `model` seems to be missing observations:
```
AnnData object with n_obs × n_vars = 1 × 20310
    obs: 'soma_joinid', 'is_primary_data', 'dataset_id', 'donor_id', 'assay', 'cell_type', 'development_stage', 'disease', 'tissue', 'tissue_general', 'specie', 'technology', 'dataset', 'x', 'y', 'assay_ontology_term_id', 'sex_ontology_term_id', 'organism_ontology_term_id', 'tissue_ontology_term_id', 'suspension_type', 'condition_id', 'tissue_type', 'library_key', 'organism', 'sex', 'niche', 'region', 'nicheformer_split', 'author_cell_type', 'batch'
```
I believe this breaks the inner join in the following block, since the resulting post-join `xenium` object ends up with a shape of `AnnData object with n_obs × n_vars = 295883 × 391` rather than the notebook's logged `AnnData object with n_obs × n_vars = 827048 × 20310`:
```python
adata = ad.concat([model, xenium], join='inner', axis=0)
# dropping the first observation
xenium = adata[1:].copy()
# for memory efficiency
del adata
```
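In case it helps diagnose the issue: with `axis=0` and `join='inner'`, `ad.concat` stacks observations but keeps only the variables shared by both objects, so the template's gene panel directly determines which variables survive the join. A minimal sketch of that semantics using pandas (purely illustrative; the column names and shapes are made up, not the real gene panel):

```python
import pandas as pd

# "model" plays the role of the one-observation template carrying the
# full panel; "xenium" carries real cells but a smaller panel.
model = pd.DataFrame([[0.0] * 5], columns=list("ABCDE"))      # 1 obs x 5 vars
xenium = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))   # 2 obs x 2 vars

# axis=0 + join='inner': observations stack, variables intersect.
joined = pd.concat([model, xenium], join="inner", axis=0)
print(joined.shape)  # (3, 2) -- vars collapse to the intersection

# Mirroring the notebook: drop the dummy template row afterwards.
cells = joined.iloc[1:]
print(cells.shape)   # (2, 2)
```

This is why a `model.h5ad` whose contents do not match the notebook's template would change both the post-join shape and, downstream, the tokenized representation.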
Would it be possible to add an updated `model.h5ad` file and a more detailed end-to-end example, so that we can format our datasets to match the tokenized representations expected by the `Nicheformer.get_embeddings()` method?