After following the example with a simple image and running it, I get a long list of meaningless letter combinations or words, and the results differ each time I run it. I see that the gpt2 and vit model parameters are used, so I assumed this was a trained model that could be used directly. Or do I have to train it locally with [IAMFineTune.ipynb] before I can use it?