This is the dataset used in a work accepted to the 17th International Conference on Agents and Artificial Intelligence (ICAART). The proposed article title is: Evaluating Biased Synthetic Data Effects on Large Language Model-based Software Vulnerability Detection.

Brief review of all techniques used to clean bias of the SARD Juliet C/C++ 1.3 dataset

A vulnerability delimiter is written around the vulnerable lines: <START> is placed immediately before the vulnerability and <END> immediately after it.
The #ifndef directives are used to separate the "good" and "bad" files. Each test case has one "good" file and one "bad" file.
Function and variable names that may bias the outcome are replaced with generic names, giving a symbolic representation of those variables and functions. Variables become VAR0, VAR1, etc., and functions become FUN0, FUN1, etc. Note that not all functions and variables are renamed this way, only those with bias in their names, for example those containing "good" or "bad".
All comments are removed.
Finally, we noticed that the classes (vulnerable or not vulnerable) are biased by two patterns inside the code. In this final step we remove these two patterns before training the model: the static function pattern and the "cascade" pattern, both defined at the end of this page.
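
The comment removal and symbolic renaming steps can be illustrated with a minimal Python sketch. This is not the script used to build the dataset. The names `strip_comments` and `symbolize` are hypothetical, and the rule "an identifier followed by `(` is a function" is a simplifying assumption, as is matching "good"/"bad" anywhere in an identifier.

```python
import re

# Matches string/char literals (kept as-is) or C/C++ comments (dropped).
_COMMENT_RE = re.compile(
    r'"(?:\\.|[^"\\])*"'      # double-quoted string literal
    r"|'(?:\\.|[^'\\])*'"     # character literal
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment
    re.DOTALL,
)

_IDENT_RE = re.compile(r"\b[A-Za-z_]\w*\b")
_BIAS_RE = re.compile(r"good|bad", re.IGNORECASE)  # assumed bias markers
_CALL_RE = re.compile(r"\s*\(")


def strip_comments(code: str) -> str:
    """Remove comments while leaving string and character literals intact."""
    def replace(match):
        text = match.group(0)
        # A comment becomes a single space so adjacent tokens do not merge.
        return " " if text.startswith(("//", "/*")) else text
    return _COMMENT_RE.sub(replace, code)


def symbolize(code: str) -> str:
    """Rename identifiers containing "good"/"bad" to FUNn or VARn."""
    names = {}
    counters = {"FUN": 0, "VAR": 0}

    def rename(match):
        name = match.group(0)
        if not _BIAS_RE.search(name):
            return name  # unbiased identifiers are left untouched
        if name not in names:
            # Assumption: an identifier followed by "(" is a function.
            kind = "FUN" if _CALL_RE.match(code, match.end()) else "VAR"
            names[name] = f"{kind}{counters[kind]}"
            counters[kind] += 1
        return names[name]

    return _IDENT_RE.sub(rename, code)


snippet = "static void goodG2B() { char * dataBad; goodG2B(); } /* good */"
print(symbolize(strip_comments(snippet)))
```

Comments are stripped first so that words inside them are never renamed. The sketch is deliberately crude: it would also rename harmless identifiers such as `badge`, and identifiers inside string literals, so a real pipeline would need a proper C tokenizer or parser.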

Static function pattern:

These are two code snippets taken from the original SARD Juliet dataset:

/* bad function declaration */
void CWE121_Stack_Based_Buffer_Overflow_dest_char_alloca_cat_51b_badSink (char * data);
void CWE121_Stack_Based_Buffer_Overflow_dest_char_alloca_cat_51_bad()

/* good function declarations */
void CWE121_Stack_Based_Buffer_Overflow_dest_char_alloca_cat_51b_goodG2BSink (char * data);
static void goodG2B()

In short, we noticed that the non-vulnerable files have a static void function, while the vulnerable ones mostly do not. The static function appears in 99.7% of the non-vulnerable files, but in only 8% of the vulnerable files.
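
A frequency check of this kind can be sketched as follows. This is only an illustration: the regular expression, the function names `has_static_function` and `pattern_rate`, and the label convention (1 = vulnerable, 0 = non-vulnerable) are assumptions, and the two samples are toy inputs built from the declarations above, not dataset statistics.

```python
import re
from collections import Counter

# Assumed shape of the pattern: a line starting with "static void name(".
STATIC_FUNC_RE = re.compile(r"^\s*static\s+void\s+\w+\s*\(", re.MULTILINE)


def has_static_function(code: str) -> bool:
    """Return True if the file declares or defines a static void function."""
    return STATIC_FUNC_RE.search(code) is not None


def pattern_rate(samples, predicate):
    """Fraction of files per label for which predicate(code) is true.

    samples is an iterable of (code, label) pairs.
    """
    totals, hits = Counter(), Counter()
    for code, label in samples:
        totals[label] += 1
        if predicate(code):
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Toy samples for illustration only (label 1 = vulnerable).
samples = [
    ("void CWE121_Stack_Based_Buffer_Overflow_dest_char_alloca_cat_51_bad()\n{\n}\n", 1),
    ("static void goodG2B()\n{\n}\n", 0),
]
print(pattern_rate(samples, has_static_function))
```

A large gap between the two rates means a model can separate the classes from this keyword alone, without looking at the vulnerability, which is why the pattern is removed before training.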

"Cascade" pattern:

We noticed that after some data processing, specifically after the symbolic representation step, a pattern appeared at the end of each non-vulnerable file:

void FUN2(){
    FUN0();
    FUN1();
}

The "cascade" pattern appears in 99.6% of the non-vulnerable files, but in only 0.01% of the vulnerable files.
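
One way to detect and strip such a trailing block is sketched below. The regular expression and the name `remove_cascade` are hypothetical, and the sketch assumes the pattern is a `void FUNn()` function at the very end of the file whose body contains only argument-free calls to other `FUNn` functions. The actual removal procedure used for the dataset may differ.

```python
import re

# Assumed shape of the "cascade" pattern after symbolic renaming.
CASCADE_RE = re.compile(
    r"void\s+FUN\d+\s*\(\s*(?:void)?\s*\)\s*\{"  # void FUNn() {
    r"(?:\s*FUN\d+\s*\(\s*\)\s*;)+"              # one or more bare FUNn(); calls
    r"\s*\}\s*\Z"                                # closing brace at end of file
)


def remove_cascade(code: str) -> str:
    """Drop a trailing cascade function, leaving other code unchanged."""
    return CASCADE_RE.sub("", code).rstrip() + "\n"


example = (
    "void FUN0(){\n"
    "    int VAR0 = 1;\n"
    "}\n"
    "void FUN2(){\n"
    "    FUN0();\n"
    "    FUN1();\n"
    "}\n"
)
print(remove_cascade(example))
```

Anchoring the match at the end of the file keeps the sketch from deleting ordinary functions that merely call other functions elsewhere in the code.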
