Subject: Inquiry about the dataset generation methodology for benchmark reproducibility
Hello LLM-SRBench development team,
First and foremost, thank you for creating and sharing llm-srbench. It is a crucial and challenging benchmark for evaluating symbolic regression capabilities in LLMs, and a great contribution to the scientific community.
I am currently working with the benchmark in depth, and a key part of my work is programmatically reproducing the provided datasets. The goal is to ensure full transparency and to allow extensions or modifications for my own research.
However, I have encountered a systematic issue with data reproducibility. Across the multiple datasets I have examined (for example, the ones involving v = 2*mom/(q*r) and n = 2*pi*L/h), I cannot obtain a bit-for-bit match with the ground-truth values stored in the .h5 files when I evaluate the provided formulas.
I have conducted a thorough investigation to rule out common causes on my end, including:
- Testing various computational orders of operations.
- Systematically searching for potential intermediate rounding steps at different levels of precision.
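A minimal sketch of this kind of order-of-operations check is shown here. It assumes NumPy float64 arithmetic and uses randomly generated stand-in inputs (`mom`, `q`, `r` are hypothetical arrays, not the benchmark's actual data). The helper `bits` is an illustrative name, not part of any library.

```python
import numpy as np

def bits(x):
    """View float64 values as raw 64-bit integers so comparison is exact."""
    return np.asarray(x, dtype=np.float64).view(np.uint64)

# Hypothetical stand-in inputs; the real values would be read from the .h5 files.
rng = np.random.default_rng(0)
mom, q, r = rng.uniform(1.0, 10.0, size=(3, 1000))

# Algebraically equivalent ways of evaluating v = 2*mom/(q*r).
variants = {
    "2*mom/(q*r)": 2 * mom / (q * r),
    "2*mom/q/r": 2 * mom / q / r,
    "(2/q)*(mom/r)": (2 / q) * (mom / r),
}

# Count how many entries differ at the bit level from the first ordering.
reference = variants["2*mom/(q*r)"]
mismatch_counts = {
    name: int(np.count_nonzero(bits(v) != bits(reference)))
    for name, v in variants.items()
}

for name, count in mismatch_counts.items():
    print(f"{name}: {count} of {reference.size} values differ bitwise")
```

Different orderings round at different intermediate steps, so they can agree to many digits while still differing in the last bits. Replacing `reference` with the stored column would show which ordering, if any, matches it exactly.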
The persistence of these minute discrepancies across different datasets leads me to believe there might be a specific, consistent methodology, script, or computational environment used to generate all the data for the benchmark.
My Questions:
To aid my work and to enhance the overall reproducibility of this valuable benchmark for all users, would it be possible for you to provide details on the data generation process? Specifically:
- Could you point me to the data generation script(s) within the repository?
- If scripts are not available, could you share some details about the environment used (e.g., Python version, key libraries like NumPy/SciPy/SymPy versions, hardware architecture)?
- Were there any global settings for numerical precision (e.g., float32 vs float64) or specific data type conversions applied during the process?
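To illustrate why the precision question matters, the sketch below simulates one possible scenario: values computed in float64, stored as float32, and read back as float64. This is only an assumption about what might have happened, not a claim about the actual pipeline, and `y64` is hypothetical stand-in data.

```python
import numpy as np

# Hypothetical stand-in for a column of generated target values.
rng = np.random.default_rng(0)
y64 = rng.uniform(1.0, 10.0, size=1000)

# Simulate writing the values as float32 and reading them back as float64.
y_roundtrip = y64.astype(np.float32).astype(np.float64)

# Number of values altered by the round trip, and the largest relative error.
changed = int(np.count_nonzero(y_roundtrip != y64))
max_rel_err = float(np.max(np.abs(y_roundtrip - y64) / np.abs(y64)))

print(f"{changed} of {y64.size} values changed")
print(f"max relative error: {max_rel_err:.3e}")
```

A round trip like this bounds the relative error by the float32 unit roundoff (2**-24, about 6e-8), which is far larger than the last-bit differences caused by reordering float64 operations. The size of the observed discrepancy can therefore help distinguish between the two causes.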
Having this information would be immensely beneficial for the community and would solidify llm-srbench's role as a standard for rigorous, reproducible research.
Thank you for your time and for all your hard work on this project.
Best regards,
Ziwen Zhang