
spaLDVAE


Source directory: src/spaVAE

spaLDVAE is spaVAE with a linear decoder, which can be used for detecting spatially variable genes or peaks.

Wrapper script to run the spaLDVAE model for SRT (spatially resolved transcriptomics) data.

Parameters:

--data_file: data file name.
--select_genes: number of genes to select for analysis; default = 0, which means no filtering. If non-zero, informative genes are selected using the mean-variance relationship.
--batch_size: mini-batch size, default = 256.
--maxiter: number of max training iterations, default = 2000.
--lr: learning rate, default = 1e-3.
--weight_decay: weight decay coefficient, default = 1e-6.
--noise: coefficient of random Gaussian noise for the encoder, default = 0.
--dropoutE: dropout probability for encoder, default = 0.
--dropoutD: dropout probability for decoder, default = 0.
--encoder_layers: hidden layer sizes of encoder, default = [128, 64].
--z_dim: size of bottleneck layer, default = 15. Both the GP and Gaussian embeddings will be set to have dimensions of z_dim.
--beta: coefficient of the reconstruction loss, default = 10.
--num_samples: number of samplings of the posterior distribution of latent embedding during training, default = 1.
--fix_inducing_points: fixed or trainable inducing points, default = True, which means inducing points are fixed.
--grid_inducing_points: whether to use 2D grid inducing points or k-means centroids of positions as inducing points, default = True. "True" for 2D grid, "False" for k-means centroids.
--inducing_point_steps: if using 2D grid inducing points, set the number of 2D grid steps, default = None. Needed when grid_inducing_points = True.
--inducing_point_nums: if using k-means centroids on positions, set the number of inducing points, default = None. Needed when grid_inducing_points = False.
--fixed_gp_params: kernel scale is fixed or not, default = False, which means kernel scale is trainable.
--loc_range: spatial locations are scaled to the specified range; for example, loc_range = 20 means x and y locations are scaled to the range 0 to 20. Default = 20.
--kernel_scale: initial kernel scale, default = 20.
--model_file: file name to save weights of the model, default = model.pt.
--final_latent_file: file name to output final latent representations, default = final_latent.txt.
--denoised_counts_file: file name to output denoised counts, default = denoised_mean.txt.
--device: pytorch device, default = cuda.

The most critical parameter is inducing_point_steps or inducing_point_nums, which controls the number of inducing points in the Gaussian process prior. Fewer inducing points give higher computational efficiency, while more inducing points can capture more complex spatial patterns. If using inducing_point_steps, then n_inducing_points = $(\text{inducing\_point\_steps}+1)^2$.
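For intuition, here is a minimal sketch (an illustration only, not the repository's implementation) of how a 2D grid of $(\text{inducing\_point\_steps}+1)^2$ inducing points can be laid out over locations scaled to [0, loc_range]:

```python
# Illustrative sketch: 2D grid of inducing points over scaled spatial locations.
import numpy as np

def grid_inducing_points(inducing_point_steps, loc_range=20.0):
    # (steps + 1) evenly spaced coordinates per axis -> (steps + 1)^2 grid points
    axis = np.linspace(0.0, loc_range, inducing_point_steps + 1)
    xx, yy = np.meshgrid(axis, axis)
    return np.column_stack([xx.ravel(), yy.ravel()])

points = grid_inducing_points(inducing_point_steps=6)
print(points.shape)  # (49, 2), i.e. (6 + 1)^2 inducing points
```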

Unlike the GP embedding in the spaVAE model, in the spaLDVAE model each dimension of the GP embedding is allowed to have its own kernel scale.

Wrapper script to run the spaPeakLDVAE model for spatial ATAC-seq data.

Parameters:

--data_file: data file name.
--select_genes: number of features to select for analysis; default = 0, which means no filtering. If non-zero, informative features are selected using the mean-variance relationship.
--batch_size: mini-batch size, default = 256.
--maxiter: number of max training iterations, default = 2000.
--lr: learning rate, default = 1e-3.
--weight_decay: weight decay coefficient, default = 1e-6.
--noise: coefficient of random Gaussian noise for the encoder, default = 0.
--dropoutE: dropout probability for encoder, default = 0.
--dropoutD: dropout probability for decoder, default = 0.
--encoder_layers: hidden layer sizes of encoder, default = [1024, 128].
--z_dim: size of bottleneck layer, default = 15. Both the GP and Gaussian embeddings will be set to have dimensions of z_dim.
--beta: coefficient of the reconstruction loss, default = 10.
--num_samples: number of samplings of the posterior distribution of latent embedding during training, default = 1.
--fix_inducing_points: fixed or trainable inducing points, default = True, which means inducing points are fixed.
--grid_inducing_points: whether to use 2D grid inducing points or k-means centroids of positions as inducing points, default = True. "True" for 2D grid, "False" for k-means centroids.
--inducing_point_steps: if using 2D grid inducing points, set the number of 2D grid steps, default = None. Needed when grid_inducing_points = True.
--inducing_point_nums: if using k-means centroids on positions, set the number of inducing points, default = None. Needed when grid_inducing_points = False.
--fixed_gp_params: kernel scale is fixed or not, default = False, which means kernel scale is trainable.
--loc_range: spatial locations are scaled to the specified range; for example, loc_range = 20 means x and y locations are scaled to the range 0 to 20. Default = 20.
--kernel_scale: initial kernel scale, default = 20.
--model_file: file name to save weights of the model, default = model.pt.
--final_latent_file: file name to output final latent representations, default = final_latent.txt.
--denoised_counts_file: file name to output denoised counts, default = denoised_mean.txt.
--device: pytorch device, default = cuda.

The most critical parameter is inducing_point_steps or inducing_point_nums, which controls the number of inducing points in the Gaussian process prior. Fewer inducing points give higher computational efficiency, while more inducing points can capture more complex spatial patterns. If using inducing_point_steps, then n_inducing_points = $(\text{inducing\_point\_steps}+1)^2$.

Unlike the GP embedding in the spaPeakVAE model, in the spaPeakLDVAE model each dimension of the GP embedding is allowed to have its own kernel scale.

spaLDVAE model for SRT data.

forward:

Forward pass.

PARAMETERS:

  • x: tensor, mini-batch of spatial locations.
  • y: tensor, mini-batch of preprocessed counts.
  • raw_y: tensor, mini-batch of raw counts.
  • size_factor: tensor, mini-batch of size factors.
  • num_samples: tensor, number of samplings of the posterior distribution of latent embedding.
  • Note: raw_y and size_factor are used for the NB likelihood.

RETURNS:

  • Tuple of tensors needed for model training.

spatial_score:

Returns the spatial score for each gene, quantified by the reconstruction importance of the GP embedding part.

PARAMETERS:

  • batch_size: default = 256, mini-batch size.
  • n_samples: default = 25, number of samplings of the posterior distribution of the latent embedding; the scores are averaged over the samplings.
  • gene_name: numpy array, shape (n_genes,), gene names used to label the rows of the returned dataframe.

RETURNS:

  • Pandas dataframe with columns "spatial_score" (reconstruction importance of the GP embedding part) and "non_spatial_score" (reconstruction importance of the Gaussian embedding part).
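For example, the returned dataframe can be ranked to shortlist candidate spatially variable genes (a hedged sketch; score_df stands for the dataframe returned by spatial_score):

```python
# Hypothetical post-processing of the dataframe returned by spatial_score.
# Genes driven mainly by the GP (spatial) embedding have a high spatial_score.
top_svg = score_df.sort_values("spatial_score", ascending=False)
print(top_svg.head(20))  # top 20 candidate spatially variable genes
```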

train_model:

Model training function.

PARAMETERS:

  • pos: numpy array, shape (n_spots, 2), location information.
  • ncounts: numpy array, shape (n_spots, n_genes), preprocessed count matrix.
  • raw_counts: numpy array, shape (n_spots, n_genes), raw count matrix.
  • size_factor: numpy array, shape (n_spots), the size factor of each spot, which is needed for the NB loss.
  • lr: default = 0.001, learning rate for AdamW optimizer.
  • weight_decay: default = 0.001, weight decay for AdamW optimizer.
  • maxiter: default = 2000, maximum number of iterations.
  • save_model: default = True, whether to save the model weights.
  • model_weights: default = "model.pt", file name to save the model weights.
  • print_kernel_scale: default = True, whether to print current kernel scale during training steps.
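A hedged usage sketch of the training and scoring interface is shown below; model is assumed to be an already constructed spaLDVAE object (see the wrapper script for construction), and the file names and gene_names array are placeholders:

```python
import numpy as np

# Placeholder inputs; in practice these come from your SRT dataset.
pos = np.loadtxt("positions.txt")          # (n_spots, 2) location information
ncounts = np.loadtxt("ncounts.txt")        # (n_spots, n_genes) preprocessed counts
raw_counts = np.loadtxt("raw_counts.txt")  # (n_spots, n_genes) raw counts
size_factor = raw_counts.sum(axis=1) / np.median(raw_counts.sum(axis=1))  # one common choice of size factor

# Train with the documented parameters (defaults written out explicitly).
model.train_model(pos=pos, ncounts=ncounts, raw_counts=raw_counts, size_factor=size_factor,
                  lr=0.001, weight_decay=0.001, maxiter=2000,
                  save_model=True, model_weights="model.pt", print_kernel_scale=True)

# Spatial and non-spatial scores per gene, returned as a pandas dataframe.
score_df = model.spatial_score(batch_size=256, n_samples=25, gene_name=gene_names)
```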

spaPeakLDVAE model for spatial ATAC-seq data.

forward:

Forward pass.

PARAMETERS:

  • x: tensor, mini-batch of spatial locations.
  • y: tensor, mini-batch of preprocessed (binarized) counts.
  • num_samples: tensor, number of samplings of the posterior distribution of latent embedding.

RETURNS:

  • Tuple of tensors needed for model training.

spatial_score:

Returns the spatial score for each peak, quantified by the reconstruction importance of the GP embedding part.

PARAMETERS:

  • batch_size: default = 256, mini-batch size.
  • n_samples: default = 25, number of samplings of the posterior distribution of the latent embedding; the scores are averaged over the samplings.
  • peak_name: numpy array, shape (n_peaks,), peak names used to label the rows of the returned dataframe.

RETURNS:

  • Pandas dataframe with columns "spatial_score" (reconstruction importance of the GP embedding part) and "non_spatial_score" (reconstruction importance of the Gaussian embedding part).

train_model:

Model training function.

PARAMETERS:

  • pos: numpy array, shape (n_spots, 2), location information.
  • ncounts: numpy array, shape (n_spots, n_peaks), preprocessed count matrix.
  • lr: default = 0.001, learning rate for AdamW optimizer.
  • weight_decay: default = 0.001, weight decay for AdamW optimizer.
  • maxiter: default = 2000, maximum number of iterations.
  • save_model: default = True, whether to save the model weights.
  • model_weights: default = "model.pt", file name to save the model weights.
  • print_kernel_scale: default = True, whether to print current kernel scale during training steps.
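Analogously for the spatial ATAC-seq model, a hedged sketch (again, model is assumed to be an already constructed spaPeakLDVAE object, and the arrays are placeholders):

```python
# Placeholder inputs: spot locations and a preprocessed (binarized) peak matrix.
model.train_model(pos=pos,                  # (n_spots, 2) location information
                  ncounts=binary_peaks,     # (n_spots, n_peaks) preprocessed (binarized) counts
                  lr=0.001, weight_decay=0.001, maxiter=2000,
                  save_model=True, model_weights="model.pt", print_kernel_scale=True)

# Spatial and non-spatial scores per peak.
peak_scores = model.spatial_score(batch_size=256, n_samples=25, peak_name=peak_names)
```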

Sparse variational Gaussian process.

kernel_matrix:

Computes the GP kernel matrix $K(x, y)$ for the $l$-th dimension.

PARAMETERS:

  • x: tensor, position vector x.
  • y: tensor, position vector y.
  • l: scalar, l-th dimension of GP embedding.
  • diag_only: whether or not to only compute diagonal terms of the kernel matrix.

RETURNS:

  • Kernel matrix $K(x, y)$ for the $l$-th dimension.

variational_loss:

Computes the variational loss of the Gaussian process ($L_H$ in equation (4)) for the current mini-batch.

PARAMETERS:

  • x: tensor, shape (batch, 2), auxiliary (location) information for current batch.
  • y: tensor, shape (batch, 1), latent mean vector for current dimension, output by the encoder network.
  • noise: tensor, shape (batch, 1), latent variance vector for current dimension, output by the encoder network.
  • mu_hat: tensor, posterior mean for current dimension (equation (5)).
  • A_hat: tensor, (diagonal of) posterior covariance matrix for current dimension (equation (5)).
  • l: scalar, l-th dimension of GP embedding.

RETURNS:

  • sum_term, KL_term (variational loss = sum_term + KL_term).

approximate_posterior_params:

Computes the posterior parameters for the current mini-batch data ($\boldsymbol{\mu}_b^l$ and $\boldsymbol{A}_b^l$ in equation (5)).

PARAMETERS:

  • index_points_test: tensor, testing set of auxiliary (location) information.
  • index_points_train: tensor, training set of auxiliary (location) information.
  • y: tensor, shape (batch, 1), latent mean vector for current dimension, output by the encoder network.
  • noise: tensor, shape (batch, 1), latent variance vector for current dimension, output by the encoder network.
  • l: scalar, l-th dimension of GP embedding.

RETURNS:

  • mean_vector, B: $\boldsymbol{m}_b^l$ and $\boldsymbol{B}_b^l$ in equation (7).
  • mu_hat, A_hat: $\boldsymbol{\mu}_b^l$ and $\boldsymbol{A}_b^l$ in equation (5).

Kernel functions.

MaternKernel:

Matern kernel. Default nu = 1.5.

  • forward: calculate $K(x,y)$. x, y are auxiliary (location) information.
  • forward_diag: calculate diagonal elements of $K(x,y)$. x, y are auxiliary (location) information.

MultiMaternKernel:

Matern kernel that allows each dimension to have its own kernel scale. Default nu = 1.5.

  • forward: calculate $K(x,y)$ for l-th dimension. x, y are auxiliary (location) information.
  • forward_diag: calculate diagonal elements of $K(x,y)$ for l-th dimension. x, y are auxiliary (location) information.
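For reference, a minimal PyTorch sketch of a Matern kernel with nu = 1.5 and one scale per latent dimension (an illustration of the idea behind MultiMaternKernel, not the repository's class, which may differ in parameterization and initialization):

```python
import math
import torch

class ToyMultiMaternKernel(torch.nn.Module):
    """Illustrative Matern-3/2 kernel with one trainable scale per GP-embedding dimension."""

    def __init__(self, z_dim, kernel_scale=20.0):
        super().__init__()
        # one kernel scale per GP-embedding dimension
        self.scale = torch.nn.Parameter(torch.full((z_dim,), float(kernel_scale)))

    def forward(self, x, y, l):
        # x: (n, 2), y: (m, 2) spatial locations; l: dimension index of the GP embedding
        dist = torch.cdist(x, y)                  # pairwise Euclidean distances, (n, m)
        r = math.sqrt(3.0) * dist / self.scale[l]
        return (1.0 + r) * torch.exp(-r)          # Matern kernel with nu = 1.5
```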