I reproduced your code in full, but when I computed the metrics, my accuracy came out to 0.01: only one question in the test set was answered correctly. When tag is 1, is it correct to judge a prediction by checking whether gt == answer? Could you share your evaluation.py file?
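To make the question concrete, here is a minimal sketch of the check I am describing. The record layout, the field names `tag`, `gt`, and `answer`, and the choice to score only records with `tag == 1` are my own assumptions, not taken from your repository, so please correct me if your evaluation works differently.

```python
from typing import Iterable, Mapping


def accuracy(records: Iterable[Mapping[str, object]]) -> float:
    """Fraction of tag == 1 records whose answer exactly equals gt.

    Assumption: only records with tag == 1 are scored, and a prediction
    counts as correct only on strict equality with the ground truth.
    """
    scored = [r for r in records if r.get("tag") == 1]
    if not scored:
        return 0.0
    correct = sum(1 for r in scored if r["gt"] == r["answer"])
    return correct / len(scored)


# Hypothetical records, for illustration only.
sample = [
    {"tag": 1, "gt": "42", "answer": "42"},    # exact match, counted correct
    {"tag": 1, "gt": "42", "answer": "42.0"},  # same value, different string
    {"tag": 0, "gt": "7", "answer": "7"},      # skipped because tag != 1
]
print(accuracy(sample))
```

With strict equality, an answer that differs from gt only in formatting is counted as wrong, which may be part of why my score is so low if your evaluation.py normalizes answers before comparing.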