Can't reproduce results

Hi, I've recently been trying to reproduce your results, using DeepSeek-V3 as the base model. Raw V3 gets 41.89% EM on the full CWQ test set, and Think-on-Graph gives roughly 45%.
Could you give some advice about that? Thanks!