Description
There are some odd values in the "Bug" column of the benchmark results, which makes the column hard to interpret. I also think "Bug" is too vague a name in this context: the column is supposed to indicate whether the bug is in the driver code or in the project code.
Consider the following result: https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-09-29-655-d-cov-103-all/benchmark/output-htslib-sam_index_build2/index.html
The two results that crash have:
- Triaging --> `Driver` for both
- Diagnosis --> semantic vs. non-semantic
- Crashes --> `True` for both

Both are clearly false positives, yet `Bug` is set to opposite values for them.
`Bug`, in the report, is defined as:

`sample.result.crashes and not sample.result.is_semantic_error`

So if the sample crashes and there is no semantic error, `Bug` becomes `True` and is colored red.
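For concreteness, the quoted expression can be sketched as a small predicate. Note that the `Result` dataclass below is a minimal stand-in for the actual report sample object, not the real oss-fuzz-gen type:

```python
# Minimal sketch of the "Bug" flag as currently computed in the report
# template; field names mirror the expression quoted above, but the
# Result class itself is a hypothetical stand-in.
from dataclasses import dataclass


@dataclass
class Result:
    crashes: bool
    is_semantic_error: bool


def bug_flag(result: Result) -> bool:
    # True (rendered red) when the target crashes and the crash is not
    # classified as a semantic (driver-side) error.
    return result.crashes and not result.is_semantic_error


print(bug_flag(Result(crashes=True, is_semantic_error=False)))  # True
print(bug_flag(Result(crashes=True, is_semantic_error=True)))   # False
```

This makes the surprising behavior easy to see: the flag depends only on `is_semantic_error`, so two crashes with the same triage verdict can still end up with opposite `Bug` values.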
Based on the definition in the template, I think `True` means the crash is considered a valid bug. The color coding confuses me a bit, though -- my intuition would be to color it green if it were a true positive.
I think we should make a few improvements here:
- Rename `Bug` to something more descriptive.
- Move the classification logic into the core rather than the web app itself (i.e. include `sample.result.crashes and not sample.result.is_semantic_error` in the core).
- Add the LLM-based triage verdict into the conclusion of whether a bug is a TP or FP.
- Include all semantic validations in the crash triaging logic, and show them all in the UI. The relevant logic starts here:
  - `oss-fuzz-gen/experiment/builder_runner.py`, lines 362 to 365 in d26a523
  - `oss-fuzz-gen/experiment/builder_runner.py`, line 370 in d26a523
  - `oss-fuzz-gen/experiment/builder_runner.py`, line 410 in d26a523
  - `oss-fuzz-gen/experiment/builder_runner.py`, line 419 in d26a523
- Based on the above improvements, come up with a new definition of "True Positive vs. False Positive", e.g. based on a more fine-grained scoring system.
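To illustrate the direction, here is a hedged sketch of what a core-level, more fine-grained classification could look like. The enum, the field names, and the `'Driver'`/`'Project'` verdict strings are all illustrative assumptions, not existing oss-fuzz-gen APIs:

```python
# Hypothetical sketch of moving TP/FP classification into the core, so the
# reports and the web app share one definition instead of each re-deriving
# it. All names here are assumptions for illustration.
import enum
from dataclasses import dataclass


class CrashVerdict(enum.Enum):
    NO_CRASH = enum.auto()
    FALSE_POSITIVE = enum.auto()        # crash traced to the fuzz driver
    LIKELY_TRUE_POSITIVE = enum.auto()  # crash, but triage is inconclusive
    TRUE_POSITIVE = enum.auto()         # crash in project code, triage agrees


@dataclass
class Result:
    crashes: bool
    is_semantic_error: bool
    triage_verdict: str  # e.g. 'Driver' or 'Project' from the LLM triage


def classify(result: Result) -> CrashVerdict:
    """Single source of truth for TP/FP, usable by reports and the web app."""
    if not result.crashes:
        return CrashVerdict.NO_CRASH
    if result.is_semantic_error or result.triage_verdict == "Driver":
        # Any semantic validation failure, or a driver-side triage verdict,
        # marks the crash as a false positive.
        return CrashVerdict.FALSE_POSITIVE
    if result.triage_verdict == "Project":
        return CrashVerdict.TRUE_POSITIVE
    return CrashVerdict.LIKELY_TRUE_POSITIVE


# Example: a crash that the LLM triage attributes to the driver is a false
# positive, regardless of the semantic-error flag.
print(classify(Result(crashes=True, is_semantic_error=False,
                      triage_verdict="Driver")))
```

A descriptive verdict like this would also fix the naming and color-coding confusion: the UI could render each enum member with its own label and color instead of a binary red/green `Bug` flag.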