Question about the loss function: is the KL divergence term (Eq. 12) implemented?

Hi, great work! The paper mentions a KL divergence term in the loss (Eq. 12), but I can't find it in the training code. Was it actually used in practice?