The different mutation types do seem to evolve quite distinctly. Nevertheless, they could all share a common 3mer embedding, with a more complex NN downstream for rate estimation.
This would be a 4-fold reduction in the number of parameters for the embedding step. Right now we have 16 × embedding_dim parameters per mutation type, of which there are 16, for a total of 256 × embedding_dim. A model with a common embedding would have 64 × embedding_dim parameters for the embedding, full stop.
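
As a quick check of that arithmetic, here is a minimal sketch in Python. The function name `embedding_param_counts` and the value `embedding_dim=8` are made up for illustration and are not taken from the actual model code; the counts of 16 mutation types and 16 embedding rows per type are the ones stated above.

```python
def embedding_param_counts(embedding_dim):
    """Return (per_type_total, shared_total) embedding parameter counts."""
    n_mutation_types = 16    # number of mutation types, as stated above
    rows_per_type = 16       # embedding rows per mutation type, as stated above
    n_3mers = 4 ** 3         # 64 possible 3mers over the alphabet A, C, G, T

    # Current setup: one separate embedding table per mutation type.
    per_type_total = n_mutation_types * rows_per_type * embedding_dim
    # Proposed setup: a single 3mer embedding table shared by all types.
    shared_total = n_3mers * embedding_dim
    return per_type_total, shared_total


# embedding_dim=8 is an arbitrary illustrative value.
per_type, shared = embedding_param_counts(embedding_dim=8)
print(per_type, shared, per_type / shared)  # prints: 2048 512 4.0
```

The ratio is 4 regardless of embedding_dim, since that factor cancels. This only counts the embedding step; a more complex downstream NN would add its own parameters.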