Need help with an honesty benchmark against vanilla llama2

Hey lab. I am working on a POC with a customer support AI company. They have asked us to provide an honesty benchmark, with data showing that we are more honest than vanilla llama2. Do we have a public dataset and test results for this? If not, could we use https://people.eecs.berkeley.edu/~normanmu/llm_rules/ for testing? Its pipeline seems to be different, so we cannot use it as is.