I've replaced the attention layers in Enformer with the sparse attention layers from BigBird, but the memory usage reported by tf.config.experimental.get_memory_info is still basically the same (within 1%). Do I also need to port code from BigBird's encoder or decoder to see a decrease in memory usage?
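For context, this is roughly how I'm measuring memory (a minimal sketch; measure_peak_memory is just an illustrative helper, and the model/input construction is omitted):

```python
import tensorflow as tf

def measure_peak_memory(model, inputs, device="GPU:0"):
    """Run one forward pass and report peak device memory in MiB."""
    # Zero the peak counter so we only see this pass's allocations.
    tf.config.experimental.reset_memory_stats(device)
    _ = model(inputs, training=False)  # single forward pass
    info = tf.config.experimental.get_memory_info(device)
    return info["peak"] / 1024**2
```

I call this once with the original Enformer and once with the BigBird-attention variant, on the same input batch, and the two peak numbers come out within 1% of each other.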
Thanks!