This is the best inference code i've seen, and I'd like to call it from python.
Specifically I'd like to create a class that stores a reference to the model and then create a generate function that does the forward pass.
Could you point me to some examples on how to call the CUDA code from python? thanks!