Missing evaluation code and metric implementations

Hi,

Great work on this project!

I was going through the repo and noticed that the evaluation code and metric implementations do not seem to be included. Are there plans to release them, or is there any guidance on how to reproduce the evaluation results?

That would be super helpful for benchmarking and comparing custom models.

Thanks!