According to the README, most of the CUDA memory is used by the Aggregator, and the Aggregator's output is not fully used. Only 4 layers of output_list are processed in the following stages, so you can safely discard the other layers. The layers that are used are the ones listed in intermediate_layer_idx:
intermediate_layer_idx: List[int] = [4, 11, 17, 23]
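
A minimal sketch of the idea, not the repository's actual code: the helper name keep_selected_layers and the string stand-ins for the per-layer outputs are assumptions made for illustration. It replaces every entry of output_list that is not in intermediate_layer_idx with None, so the original indices stay valid for the later stages.

```python
from typing import Any, List, Optional, Sequence

intermediate_layer_idx: List[int] = [4, 11, 17, 23]


def keep_selected_layers(
    output_list: Sequence[Any], keep_idx: Sequence[int]
) -> List[Optional[Any]]:
    """Hypothetical helper: keep only the layers in keep_idx.

    Dropped entries become None so that indexing by the original
    layer number (e.g. pruned[11]) still works afterwards.
    """
    keep = set(keep_idx)
    return [out if i in keep else None for i, out in enumerate(output_list)]


# Stand-in for the aggregator output: 24 placeholder "layers".
# In practice these would be the per-layer tensors.
dummy_outputs = [f"layer_{i}" for i in range(24)]
pruned = keep_selected_layers(dummy_outputs, intermediate_layer_idx)
```

Note that the memory of a dropped layer is only released once nothing else references it, so for real tensors the original full list also has to go out of scope (or the unused layers should never be appended in the first place).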