Dear authors,
Thank you for sharing this awesome project!
I am trying to look deeper into the evaluation results, specifically how the model performs on individual test samples.
I successfully ran the example code provided in README.md and got pretty good results. However, those results are limited to high-level metrics.
So I would like to examine the performance on each test sample, to find clues about:
- What do the test samples actually look like to a human reader?
- How does the model perform on each test sample, and which movies does it recommend based on the dialog history?
Do you know how I can manually check what the model actually takes as input and produces as output for each test sample?
Thanks in advance!
Sincerely,