Description
There are some odd values in the "Bug" column of the benchmark results, which makes the column hard to interpret. I also think "Bug" is too vague a name in this context: the column is supposed to indicate whether the bug is in the driver code or in the project code.
Consider the following result: https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-09-29-655-d-cov-103-all/benchmark/output-htslib-sam_index_build2/index.html
The two results that crash have:
- Triaging --> `Driver` for both
- Diagnosis --> semantic vs. non-semantic
- Crashes --> `True` for both

Both are clearly false positives, yet `Bug` is set to opposite values for them.
`Bug`, in the report, is defined as:

`sample.result.crashes and not sample.result.is_semantic_error`

So if the sample crashes and there is no semantic error, `Bug` becomes `True` and is colored red.
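For concreteness, the quoted expression can be sketched as a small predicate. Note that the `Result` dataclass below is a minimal stand-in for the actual report sample object, not the real oss-fuzz-gen type:

```python
# Minimal sketch of the "Bug" flag as currently computed in the report
# template; field names mirror the expression quoted above, but the
# Result class itself is a hypothetical stand-in.
from dataclasses import dataclass


@dataclass
class Result:
    crashes: bool
    is_semantic_error: bool


def bug_flag(result: Result) -> bool:
    # True (rendered red) when the target crashes and the crash is not
    # classified as a semantic (driver-side) error.
    return result.crashes and not result.is_semantic_error


print(bug_flag(Result(crashes=True, is_semantic_error=False)))  # True
print(bug_flag(Result(crashes=True, is_semantic_error=True)))   # False
```

This makes the surprising behavior easy to see: the flag depends only on `is_semantic_error`, so two crashes with the same triage verdict can still end up with opposite `Bug` values.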
Based on the definition in the template, I think `True` means the crash is considered a valid bug. The color coding confuses me a bit, though -- my intuition would be to color it green if it were a true positive.
I think we should make a few improvements here:
- Rename `Bug` to something more descriptive.
- Move the classification logic into the core rather than the web app itself (i.e. include `sample.result.crashes and not sample.result.is_semantic_error` in the core).
- Add the LLM-based triage verdict into the conclusion of whether a bug is a TP or FP.
- Include all semantic validations in the crash triaging logic, and show them all in the UI. The relevant logic starts here:
  - `oss-fuzz-gen/experiment/builder_runner.py`, lines 362 to 365 in d26a523
  - `oss-fuzz-gen/experiment/builder_runner.py`, line 370 in d26a523
  - `oss-fuzz-gen/experiment/builder_runner.py`, line 410 in d26a523
  - `oss-fuzz-gen/experiment/builder_runner.py`, line 419 in d26a523
- Based on the above improvements, come up with a new definition of "True Positive vs. False Positive", e.g. based on a more fine-grained scoring system.
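To illustrate the direction, here is a hedged sketch of what a core-level, more fine-grained classification could look like. The enum, the field names, and the `'Driver'`/`'Project'` verdict strings are all illustrative assumptions, not existing oss-fuzz-gen APIs:

```python
# Hypothetical sketch of moving TP/FP classification into the core, so the
# reports and the web app share one definition instead of each re-deriving
# it. All names here are assumptions for illustration.
import enum
from dataclasses import dataclass


class CrashVerdict(enum.Enum):
    NO_CRASH = enum.auto()
    FALSE_POSITIVE = enum.auto()        # crash traced to the fuzz driver
    LIKELY_TRUE_POSITIVE = enum.auto()  # crash, but triage is inconclusive
    TRUE_POSITIVE = enum.auto()         # crash in project code, triage agrees


@dataclass
class Result:
    crashes: bool
    is_semantic_error: bool
    triage_verdict: str  # e.g. 'Driver' or 'Project' from the LLM triage


def classify(result: Result) -> CrashVerdict:
    """Single source of truth for TP/FP, usable by reports and the web app."""
    if not result.crashes:
        return CrashVerdict.NO_CRASH
    if result.is_semantic_error or result.triage_verdict == "Driver":
        # Any semantic validation failure, or a driver-side triage verdict,
        # marks the crash as a false positive.
        return CrashVerdict.FALSE_POSITIVE
    if result.triage_verdict == "Project":
        return CrashVerdict.TRUE_POSITIVE
    return CrashVerdict.LIKELY_TRUE_POSITIVE


# Example: a crash that the LLM triage attributes to the driver is a false
# positive, regardless of the semantic-error flag.
print(classify(Result(crashes=True, is_semantic_error=False,
                      triage_verdict="Driver")))
```

A descriptive verdict like this would also fix the naming and color-coding confusion: the UI could render each enum member with its own label and color instead of a binary red/green `Bug` flag.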