Critical reproducibility issue: Missing decoder parameters in ChartQAPro baseline results #5

@utkuu-cerebras


Hi,
First of all, thanks for your work. We are trying to reproduce the baseline results from the ChartQAPro paper, but we cannot match some of the posted results. We believe this is because key decoder parameters are omitted from both the paper and the provided implementation code. For instance, here are our results for Microsoft-Phi-3.5-Vision-4B with Direct Prompting, using the provided prompts and evaluation script (mean of 3 trials):

  • Factoid: 14.82% (your result is 17.48%)
  • Conversational: 21.91% (your result is 28.54%)
  • Hypothetical: 31.10% (your result is 37.27%)
  • Fact Checking: 37.24% (your result is 41.99%)
  • Multi Choice: 28.01% (your result is 30.37%)

As you can see above, our results fall 2.36 to 6.63 percentage points below your posted results in every category.
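For reference, the per-category gaps between the two sets of numbers listed above can be recomputed with a short script. This is only a sketch of the arithmetic; the variable names are ours, and the values are copied from the list above.

```python
# Accuracy (%) from our reproduction (mean of 3 trials) and from the posted results.
ours = {
    "Factoid": 14.82,
    "Conversational": 21.91,
    "Hypothetical": 31.10,
    "Fact Checking": 37.24,
    "Multi Choice": 28.01,
}
reported = {
    "Factoid": 17.48,
    "Conversational": 28.54,
    "Hypothetical": 37.27,
    "Fact Checking": 41.99,
    "Multi Choice": 30.37,
}

# Gap in percentage points: posted result minus our result, per category.
gaps = {name: round(reported[name] - ours[name], 2) for name in ours}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points below the posted result")

# Unweighted mean of the five per-category gaps.
mean_gap = sum(gaps.values()) / len(gaps)
print(f"Mean gap across categories: {mean_gap:.2f} points")
```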

Could you please release the full decoding config, including values for (but not limited to):

  • max_tokens
  • top_p
  • frequency_penalty
  • presence_penalty
  • temperature

Could you also share the experiment settings (are the posted results the best of n runs, the mean of n runs, or a single run?) so that the reported results are reproducible? With access to the inference settings, we should be able to produce closer results.
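To make the request concrete, here is a minimal sketch of the kind of record that would let us rerun the baseline. Everything in it is an assumption on our side: the class `DecodingConfig`, the helper names, and the protocol labels are hypothetical, the fields are left unset because we do not know the actual values, and the scores in the usage lines are made-up numbers, not measurements.

```python
from dataclasses import dataclass, asdict
from statistics import mean
from typing import List, Optional


@dataclass
class DecodingConfig:
    """Hypothetical template for the requested settings.

    None means "not disclosed"; none of these defaults are the authors' values.
    """
    max_tokens: Optional[int] = None
    top_p: Optional[float] = None
    frequency_penalty: Optional[float] = None
    presence_penalty: Optional[float] = None
    temperature: Optional[float] = None


def missing_fields(cfg: DecodingConfig) -> List[str]:
    """Return the names of the settings that are still unspecified."""
    return [name for name, value in asdict(cfg).items() if value is None]


def aggregate(scores: List[float], protocol: str) -> float:
    """Combine per-run accuracies under one of three reporting protocols."""
    if not scores:
        raise ValueError("at least one run is required")
    if protocol == "mean_of_n":
        return mean(scores)
    if protocol == "best_of_n":
        return max(scores)
    if protocol == "single_run":
        return scores[0]
    raise ValueError(f"unknown protocol: {protocol}")


# Illustrative only: these scores are invented to show how the protocols differ.
example_scores = [10.0, 12.0, 14.0]
print(missing_fields(DecodingConfig()))
print(aggregate(example_scores, "mean_of_n"))
print(aggregate(example_scores, "best_of_n"))
```

The same set of runs yields different headline numbers under each protocol, which is why we are asking which one was used for the posted results.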

Thanks!
