
Refine triaging logic, improve web crash details and add better "TP/FP" indicator #656

Open
@DavidKorczynski

Description

The "Bug" column in the benchmark results shows some odd values, which makes its meaning difficult to understand. I also think "Bug" is too vague a name in this context. The column is supposed to indicate whether the bug is in the driver code or the project code.

Consider the following result: https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-09-29-655-d-cov-103-all/benchmark/output-htslib-sam_index_build2/index.html

The two results that crash have:

  • Triaging --> Driver for both
  • Diagnosis --> Semantics vs non-semantic
  • Crashes --> True for both

Both are clearly false positives, but "Bug" is set to opposite values for the two.

Bug, in the report template, is defined as:

<td style="color: {{ 'red' if sample.result.crashes and not sample.result.is_semantic_error else 'black' }}">{{ sample.result.crashes and not sample.result.is_semantic_error }}</td>

i.e. sample.result.crashes and not sample.result.is_semantic_error

So if the sample crashes and there is no semantic error, Bug becomes True and is colored red.
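
To make the behaviour concrete, here is a minimal sketch of the same expression as a plain Python function. The SampleResult class and the is_reported_bug name are hypothetical stand-ins for illustration; only the two field names come from the template expression above.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    """Hypothetical stand-in for the object behind sample.result."""
    crashes: bool
    is_semantic_error: bool


def is_reported_bug(result: SampleResult) -> bool:
    """Mirror the template expression: crashes and not is_semantic_error."""
    return result.crashes and not result.is_semantic_error


# Two crashing samples that differ only in the semantic-error flag.
semantic = SampleResult(crashes=True, is_semantic_error=True)
non_semantic = SampleResult(crashes=True, is_semantic_error=False)

print(is_reported_bug(semantic))      # False
print(is_reported_bug(non_semantic))  # True
```

This would explain how two crashing samples that are both triaged as driver issues can still end up with opposite "Bug" values: the expression only looks at the semantic-error flag and ignores the triage verdict.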

Based on the definition in the template, I think True means the bug is considered a valid bug. The color coding confuses me a bit though: my intuition would be to have it green if it was a true positive.

I think we should do a couple of improvements here:

  1. Rename Bug to be a bit more descriptive
  2. Add the classification logic to the core rather than in the web app itself (i.e. include sample.result.crashes and not sample.result.is_semantic_error in the core)
  3. Add the LLM-based triage verdict into the conclusion of whether a bug is a TP or FP
  4. Include all semantic validations in the crash triaging logic, and show them all in the UI. The logic starting here:
    symptom = SemanticCheckResult.extract_symptom(fuzzlog)
    crash_stacks = self._parse_stacks_from_libfuzzer_logs(lines)
    crash_func = self._parse_func_from_stacks(project_name, crash_stacks)
    crash_info = SemanticCheckResult.extract_crash_info(fuzzlog)
    only includes a single semantic validation. However, the semantic checks are not mutually exclusive, and in many cases it would be good to know them all, e.g. "the crash is a NULL-deref (if symptom == 'null-deref':) and happens in the first iterations of the run (if lastround is None or lastround <= EARLY_FUZZING_ROUND_THRESHOLD:) and in a trace close to the harness (if len(crash_stacks) > 0:)".
  5. Based on the above improvements, come up with a new definition of "True Positive vs False Positive", e.g. based on a more fine-grained scoring system
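
As a rough sketch of how points 4 and 5 could fit together, the checks could be evaluated independently so that every one that fires is recorded, and a score could then be derived from the collected findings. Everything named here is hypothetical: CrashFindings, collect_findings, false_positive_score, the finding labels, the weights, and the value given to EARLY_FUZZING_ROUND_THRESHOLD are assumptions for illustration, not the project's code. Only the three conditions are taken from the snippets quoted in point 4.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed value for illustration; the real constant is defined in the project.
EARLY_FUZZING_ROUND_THRESHOLD = 3


@dataclass
class CrashFindings:
    """Hypothetical container holding every check that fired."""
    findings: List[str] = field(default_factory=list)


def collect_findings(symptom: str, lastround: Optional[int],
                     crash_stacks: List[List[str]]) -> CrashFindings:
    """Evaluate all checks independently instead of stopping at the first."""
    result = CrashFindings()
    if symptom == 'null-deref':
        result.findings.append('null-deref')
    if lastround is None or lastround <= EARLY_FUZZING_ROUND_THRESHOLD:
        result.findings.append('early-crash')
    # Only the outer condition is reproduced here; the real check for
    # "close to the harness" inspects the stack frames further.
    if len(crash_stacks) > 0:
        result.findings.append('has-crash-stack')
    return result


# Placeholder weights, purely illustrative: a higher total would suggest
# a driver-side false positive more strongly.
FP_WEIGHTS = {'null-deref': 1, 'early-crash': 2, 'has-crash-stack': 1}


def false_positive_score(findings: CrashFindings) -> int:
    """Sum the weights of all findings that fired."""
    return sum(FP_WEIGHTS.get(name, 0) for name in findings.findings)


early_null = collect_findings('null-deref', 1, [['LLVMFuzzerTestOneInput']])
print(early_null.findings)               # ['null-deref', 'early-crash', 'has-crash-stack']
print(false_positive_score(early_null))  # 4
```

Keeping the findings as a list would let the UI show all of them, while the score (or a threshold on it) could replace the current boolean. The LLM-based triage verdict from point 3 could be added as one more weighted finding.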
