diff --git a/.ci/benchmark.txt b/.ci/benchmark.txt
Lines changed: 132 additions & 132 deletions
diff --git a/README.md b/README.md
Lines changed: 15 additions & 15 deletions
diff --git a/benchmark/scanner/file_type_stat.py b/benchmark/scanner/file_type_stat.py
Lines changed: 0 additions & 1 deletion
diff --git a/benchmark/scanner/scanner.py b/benchmark/scanner/scanner.py
Lines changed: 13 additions & 24 deletions
diff --git a/meta/02dfa7ec.csv b/meta/02dfa7ec.csv
Lines changed: 13 additions & 13 deletions
diff --git a/meta/0436af4a.csv b/meta/0436af4a.csv
Lines changed: 2 additions & 2 deletions
--- a/README.md
+++ b/README.md
@@ -4,10 +4,10 @@
    * [Introduction](#introduction)
    * [How To Use](#how-to-use)
    * [Data Overview](#data-overview)
-	  * [Data statistics](#data-statistics)
+     * [Data statistics](#data-statistics)
    * [Data](#data)
-	  * [Selecting Target Repositories](#selecting-target-repositories)
-	  * [Ground Rules for Labeling Suspected Credential Information](#ground-rules-for-labeling-suspected-credential-information)
+     * [Selecting Target Repositories](#selecting-target-repositories)
+     * [Ground Rules for Labeling Suspected Credential Information](#ground-rules-for-labeling-suspected-credential-information)
    * [Metadata](#metadata)
    * [Obfuscation](#obfuscation)
    * [License](#license)
@@ -89,11 +89,11 @@ please, find wide info in https://raw.githubusercontent.com/Samsung/CredSweeper/
 
 ## Data
 ### Selecting Target Repositories
-In order to collect various cases in which credentials exist, we selected publicly accessible repositories on Github through the following process:
-1. We wanted to collect credentials from repositories for various languages, frameworks, and topics, so we primarily collected 181 topics on Github.
+In order to collect various cases in which credentials exist, we selected publicly accessible repositories on GitHub through the following process:
+1. We wanted to collect credentials from repositories for various languages, frameworks, and topics, so we primarily collected 181 topics on GitHub.
 
    In this process, to select widely known repositories for each topic, we limited repositories with more than a certain number of stars. 19,486 repositories were selected in this process.
-2. We filtered repositories which have the license that can not be used for dataset according to the license information provided by Github. 
+2. We filtered repositories which have the license that can not be used for dataset according to the license information provided by GitHub. 
 
    In some cases, the provided license was inaccurate. So we conducted with manual review.
 3. Filtering was carried out by checking whether strings related to the most common credentials such as 'password' and 'secret' among the result repositories are included and how many are included. After that, we executed several [open source credential scanning tools.](#used-tools-for-benchmarking)
@@ -106,21 +106,21 @@ It is difficult to know whether a line included in the source code is a real cre
 However, based on human cognitive abilities, we can expect the possibility that the detected result contains actual credential information.
 We classify the detection results to the three credential type.
 
-- True : It looks like a real credential value.
-- False : It looks like a false positive case, not the actual credential value.
-- Template : It seems that it is not an actual credential, but it is a placeholder. It might be helpful in cases such as ML.
-		
+- **T** (True) : It looks like a real credential value.
+- **F** (False) : It looks like a false positive case, not the actual hardcoded credential value.
+- **X** (Unknown/Other) : It seems that it is not a real credential but a test value, an example, or it is a placeholder.
+
 In order to compose an accurate Ground Truth set, we proceed data review based on the following 'Ground Rules':
 1. All credentials in test (example) directories should be labeled as True.
 2. Credentials with obvious placeholders (`password = <YOUR_PASSWORD>;`) should be labeled as False.
 3. Function calls without string literals (`password=getPass();`) and environmental variable assignments (`password=${pass}`) should be labeled as False.
 4. Base64 and other encoded data: the decision must be after research. Use True if original data contain are credentials. 
 5. Package and resource version hash is not a credential, so common hash string (`integrity sha512-W7s+uC5bikET2twEFg==`) is False.
 6. Be careful about filetype when checking variable assignment:
-   
    In .yaml file row (`password=my_password`) can be a credential 
    but in .js or .py it cannot. This languages require quotations (' or ") for string declaration (`password="my_password"`).
 7. Check if the file you are labeling is not a localization file. For example `config/locales/pt-BR.yml` is not a credentials, just a translation. So those should be labeled as False.
+8. Any possible markers like "example.com" mean the credential is False for ML rules. And may be True for not-ML rules.
 
 > We could see that many credentials exist in directories/files that have the same test purpose as test/tests.
 > In the case of these values, people often judge that they contain a real credential, but we do not know whether this value is an actual usable credential or a value used only for testing purposes.
@@ -140,9 +140,9 @@ Metadata includes Ground Truth values and additional information for credential
 | FilePath           | String    | File path where credential information was included                                                                                                      |
 | LineStart          | Integer   | Line start in file from 1, like in most editors. In common cases it equals LineEnd.                                                                      |
 | LineEnd            | Integer   | End line of credential MUST be great or equal like LineStart. Sort line_data_list with line_num for this.                                                |
-| GroundTruth        | String    | Ground Truth of this credential. True (T) / False (F) or Template                                                                                        |
+| GroundTruth        | String    | Ground Truth of this credential. True (T) / False (F,X)                                                                                                  |
 | ValueStart         | Integer   | Index of value on the line like in CredSweeper report. This is start position on LineStart. Empty or -1 means the markup for whole line (for False only) |
-| ValueEnd           | Integer   | Index of character right after value ends in the line. This is end position on LineEnd. May be -1 or empty.                               |
+| ValueEnd           | Integer   | Index of character right after value ends in the line. This is end position on LineEnd. May be -1 or empty.                                              |
 | CryptographyKey    | String    | Type of a key: Private or Public                                                                                                                         |
 | PredefinedPattern  | String    | Credential with defined regex patterns (AWS token with `AKIA...` pattern)                                                                                |
 | Category           | String    | Labeled data according CredSweeper rules. see [Category](#category).                                                                                     |
@@ -202,15 +202,15 @@ If the suspicious lines are included in the dataset as it is, the credential val
 To avoid such cases we proceeded:
 1. Credential values obfuscation in files.
 2. Directory & file name and directory hierarchy obfuscation.
-	
+
 ### Credential values obfuscation in files
 To prevent leakage of the actual credential value in the file, we can mask the line that is supposed to be credential or change it to a random string.
 However, this masking and changing to a random string can make the side effects to the detection performance of several tools.
 We have applied other methods to substitute the actual credential values within the limits of ensuring the detectability of these various tools.
 - Replacing the real value to a example value for a case where a fixed pattern is clear (ex. AWS Access Key)
 - Replacing the entire file with credential information to a example file. (ex. X.509 Key)
 - Random key generation using regex pattern from the character set of real string and length.
-		
+
 ### Directory & file name and directory hierarchy obfuscation
 Even if the line suspected of having a credential in the file is obfuscated, you can easily check the original credential value and where it came from by the information of the repository (repo name, directory structure, file name).
 To prevent this from happening, we obfuscated the directory structure and file names.
 
--- a/benchmark/scanner/file_type_stat.py
+++ b/benchmark/scanner/file_type_stat.py
@@ -7,4 +7,3 @@ class FileTypeStat:
     valid_lines: int
     true_markup: int
     false_markup: int
-    template_markup: int
--- a/benchmark/scanner/scanner.py
+++ b/benchmark/scanner/scanner.py
@@ -31,8 +31,7 @@ def __init__(self, scanner_type: ScannerType, scanner_url: str, working_dir: str
         self.result_dict: dict = {}
         self.total_true_cnt = 0
         self.total_false_cnt = 0
-        self.total_template_cnt = 0
-        self.rules_markup_counters: Dict[str, Tuple[int, int, int]] = {}  # category: true_cnt, false_cnt, template_cnt
+        self.rules_markup_counters: Dict[str, Tuple[int, int]] = {}  # category: true_cnt, false_cnt
         self.meta_next_id = 0  # used in suggestion
         self.file_types: Dict[str, FileTypeStat] = {}
         self.total_data_valid_lines = 0
@@ -74,24 +73,20 @@ def _prepare_meta(self):
             # get file extension like in CredSweeper
             _, file_type = os.path.splitext(meta_row.FilePath)
             file_type_lower = file_type.lower()
-            type_stat = self.file_types.get(file_type_lower, FileTypeStat(0, 0, 0, 0, 0))
+            type_stat = self.file_types.get(file_type_lower, FileTypeStat(0, 0, 0, 0))
             rules = meta_row.Category.split(':')
             for rule in rules:
-                true_cnt, false_cnt, template_cnt = self.rules_markup_counters.get(rule, (0, 0, 0))
+                true_cnt, false_cnt = self.rules_markup_counters.get(rule, (0, 0))
                 if 'T' == meta_row.GroundTruth:
                     true_cnt += 1
                     self.total_true_cnt += 1
                     type_stat.true_markup += 1
-                elif 'F' == meta_row.GroundTruth:
+                else:
+                    # all not TRUE marked up cases are FALSE
                     self.total_false_cnt += 1
                     false_cnt += 1
                     type_stat.false_markup += 1
-                else:
-                    # "Template" - correctness will be checked in MetaRow
-                    self.total_template_cnt += 1
-                    template_cnt += 1
-                    type_stat.template_markup += 1
-                self.rules_markup_counters[rule] = (true_cnt, false_cnt, template_cnt)
+                self.rules_markup_counters[rule] = (true_cnt, false_cnt)
             self.file_types[file_type_lower] = type_stat
             if self.meta_next_id <= meta_row.Id:
                 self.meta_next_id = meta_row.Id + 1
@@ -103,7 +98,7 @@ def _prepare_meta(self):
             for file in files:
                 file_name, file_ext = os.path.splitext(str(file))
                 file_ext_lower = file_ext.lower()
-                file_type_stat = self.file_types.get(file_ext_lower, FileTypeStat(0, 0, 0, 0, 0))
+                file_type_stat = self.file_types.get(file_ext_lower, FileTypeStat(0, 0, 0, 0))
                 file_type_stat.files_number += 1
                 self.file_types[file_ext_lower] = file_type_stat
                 with open(os.path.join(root, file), "rb") as f:
@@ -129,26 +124,22 @@ def _prepare_meta(self):
         check_data_valid_lines = 0
         check_true_cnt = 0
         check_false_cnt = 0
-        check_template_cnt = 0
         for key, val in self.file_types.items():
             types_rows.append([key,
                                val.files_number or None,
                                val.valid_lines or None,
                                val.true_markup or None,
-                               val.false_markup or None,
-                               val.template_markup or None])
+                               val.false_markup or None])
             check_files_number += val.files_number
             check_data_valid_lines += val.valid_lines
             check_true_cnt += val.true_markup
             check_false_cnt += val.false_markup
-            check_template_cnt += val.template_markup
         types_rows.sort()
         types_rows.append(["TOTAL:",
                            check_files_number,
                            check_data_valid_lines,
                            check_true_cnt,
-                           check_false_cnt,
-                           check_template_cnt])
+                           check_false_cnt])
         print(tabulate.tabulate(types_rows, types_headers), flush=True)
 
     @property
@@ -274,7 +265,7 @@ def check_line_from_meta(self,
             "FilePath": data_path,
             "LineStart": line_start,
             "LineEnd": line_end,
-            "GroundTruth": 'F',
+            "GroundTruth": 'X',
             "ValueStart": value_start,
             "ValueEnd": value_end,
             "CryptographyKey": '',
@@ -347,7 +338,7 @@ def check_line_from_meta(self,
                         self.true_cnt += 1
                         return LineStatus.FALSE, repo_name, file_id
                     else:
-                        # MetaRow class checks the correctness of row.GroundTruth = ['T', 'F', "Template"]
+                        # MetaRow class checks the correctness of row.GroundTruth = ['T', 'F']
                         self._increase_result_dict_cnt(meta_rule, False)
                         self.false_cnt += 1
                         return LineStatus.TRUE, repo_name, file_id
@@ -377,7 +368,7 @@ def analyze_result(self) -> None:
             f", true_cnt : {self.true_cnt}, false_cnt : {self.false_cnt}"
         )
 
-        header = ["Rules", "Positives", "Negatives", "Templates", "Reported",
+        header = ["Rules", "Positives", "Negatives", "Reported",
                   "TP", "FP", "TN", "FN", "FPR", "FNR", "ACC", "PRC", "RCL", "F1"]
         rows: List[List[Any]] = []
 
@@ -399,12 +390,11 @@ def analyze_result(self) -> None:
             total_true_cnt, total_false_cnt = self._get_total_true_false_count(rule)
             result = Result(true_cnt, false_cnt, total_true_cnt, total_false_cnt)
             if rule not in self.rules_markup_counters:
-                self.rules_markup_counters[rule] = (0, 0, 0)
+                self.rules_markup_counters[rule] = (0, 0)
             rows.append([
                 rule,
                 self.rules_markup_counters[rule][0],
                 self.rules_markup_counters[rule][1],
-                self.rules_markup_counters[rule][2],
                 self.reported.get(rule),
                 result.true_positive,
                 result.false_positive,
@@ -430,7 +420,6 @@ def analyze_result(self) -> None:
             "",
             self.total_true_cnt,
             self.total_false_cnt,
-            self.total_template_cnt,
             reported_sum,
             total_result.true_positive,
             total_result.false_positive,
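
Note on the counting change in `scanner.py`: after this diff, any `GroundTruth` value other than `T` (that is, both `F` and `X`) is counted as false. The following is a minimal standalone sketch of that rule, not the benchmark's actual code. The function name `count_markup` and the reduced three-column sample are hypothetical and exist only for illustration; real meta files carry more columns.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical, reduced sample: only the columns this sketch needs.
SAMPLE_META = """Id,GroundTruth,Category
1,T,Password
2,X,Auth:Token
3,F,Key
4,X,Password
"""


def count_markup(meta_csv: str) -> Dict[str, Tuple[int, int]]:
    """Return {rule: (true_cnt, false_cnt)} for a meta CSV given as text.

    Every row whose GroundTruth is not 'T' (so 'F' and 'X' alike)
    is counted as false, mirroring the two-way split in the diff above.
    """
    counters: Dict[str, Tuple[int, int]] = defaultdict(lambda: (0, 0))
    for row in csv.DictReader(io.StringIO(meta_csv)):
        # A row may belong to several rules, separated by ':'.
        for rule in row["Category"].split(":"):
            true_cnt, false_cnt = counters[rule]
            if row["GroundTruth"] == "T":
                true_cnt += 1
            else:
                false_cnt += 1
            counters[rule] = (true_cnt, false_cnt)
    return dict(counters)


if __name__ == "__main__":
    for rule, (true_cnt, false_cnt) in sorted(count_markup(SAMPLE_META).items()):
        print(rule, true_cnt, false_cnt)
```

Because `X` falls into the `else` branch, the separate template counter and the "Templates" report column are no longer needed.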
 
--- a/meta/02dfa7ec.csv
+++ b/meta/02dfa7ec.csv
@@ -16,25 +16,25 @@ Id,FileID,Domain,RepoName,FilePath,LineStart,LineEnd,GroundTruth,ValueStart,Valu
 28832,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,74,74,T,32,73,,,Auth:Secret
 29294,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,82,82,T,28,50,,,Key
 29502,e43ec22b,GitHub,02dfa7ec,data/02dfa7ec/test/e43ec22b.py,15,15,T,25,41,,,Password
-29657,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,70,70,Template,18,33,,,Auth:Token
-30366,5fb5a784,GitHub,02dfa7ec,data/02dfa7ec/src/5fb5a784.env,50,50,Template,16,36,,,API:Gitlab Feed Token:Key
-31246,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,25,25,Template,18,30,,,Key:Secret
+29657,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,70,70,X,18,33,,,Auth:Token
+30366,5fb5a784,GitHub,02dfa7ec,data/02dfa7ec/src/5fb5a784.env,50,50,X,16,36,,,API:Gitlab Feed Token:Key
+31246,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,25,25,X,18,30,,,Key:Secret
 32637,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,94,94,T,30,35,,,URL Credentials
 32675,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,22,22,T,27,44,,,URL Credentials
-32701,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,47,47,Template,15,28,,,Password
+32701,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,47,47,X,15,28,,,Password
 32757,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,28,28,T,18,82,,,Key:Secret:Bitbucket Client Secret
 34024,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,83,83,T,31,73,,,Secret
 35963,3279f9a1,GitHub,02dfa7ec,data/02dfa7ec/src/3279f9a1.py,100,100,T,26,90,,,Bitbucket Client Secret
-36233,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,32,32,Template,29,33,,,Password
-37642,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,49,49,Template,29,33,,,Password
-37643,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,60,60,Template,29,33,,,Password
-37644,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,70,70,Template,29,33,,,Password
-37645,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,83,83,Template,29,33,,,Password
+36233,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,32,32,X,29,33,,,Password
+37642,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,49,49,X,29,33,,,Password
+37643,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,60,60,X,29,33,,,Password
+37644,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,70,70,X,29,33,,,Password
+37645,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,83,83,X,29,33,,,Password
 50371,a5c0c9aa,GitHub,02dfa7ec,data/02dfa7ec/test/a5c0c9aa.py,60,60,T,25,41,,,Password
-58202,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,61,61,Template,16,36,,,API:Gitlab Feed Token:Key
-58203,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,43,43,Template,16,36,,,API:Gitlab Feed Token:Key
-62037,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,55,55,Template,18,33,,,Auth:Token
-62038,5fb5a784,GitHub,02dfa7ec,data/02dfa7ec/src/5fb5a784.env,65,65,Template,18,33,,,Auth:Token
+58202,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,61,61,X,16,36,,,API:Gitlab Feed Token:Key
+58203,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,43,43,X,16,36,,,API:Gitlab Feed Token:Key
+62037,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,55,55,X,18,33,,,Auth:Token
+62038,5fb5a784,GitHub,02dfa7ec,data/02dfa7ec/src/5fb5a784.env,65,65,X,18,33,,,Auth:Token
 64018,a5c0c9aa,GitHub,02dfa7ec,data/02dfa7ec/test/a5c0c9aa.py,537,537,F,-1,-1,,,Key
 64019,a5c0c9aa,GitHub,02dfa7ec,data/02dfa7ec/test/a5c0c9aa.py,551,551,F,-1,-1,,,Key
 74230,0f25fb09,GitHub,02dfa7ec,data/02dfa7ec/src/0f25fb09.js,262,262,F,-1,-1,,,Other
 
--- a/meta/0436af4a.csv
+++ b/meta/0436af4a.csv
@@ -6,7 +6,7 @@ Id,FileID,Domain,RepoName,FilePath,LineStart,LineEnd,GroundTruth,ValueStart,Valu
 71444,4433e6a4,GitHub,0436af4a,data/0436af4a/test/4433e6a4.cs,210,210,T,37,47,,,Auth
 71445,17ed1848,GitHub,0436af4a,data/0436af4a/other/17ed1848.md,318,318,T,58,66,,,URL Credentials
 71446,4663e076,GitHub,0436af4a,data/0436af4a/other/4663e076.md,39,39,T,26,36,,,Credential
-71447,4663e076,GitHub,0436af4a,data/0436af4a/other/4663e076.md,65,65,Template,26,31,,,Credential
+71447,4663e076,GitHub,0436af4a,data/0436af4a/other/4663e076.md,65,65,X,26,31,,,Credential
 71448,a9108d0e,GitHub,0436af4a,data/0436af4a/other/a9108d0e.md,730,730,F,37,50,,,Credential
 71449,94b788c0,GitHub,0436af4a,data/0436af4a/other/94b788c0.md,77,77,T,20,56,,,Auth:UUID
 71450,94b788c0,GitHub,0436af4a,data/0436af4a/other/94b788c0.md,78,78,T,24,44,,,Secret:Auth
@@ -84,7 +84,7 @@ Id,FileID,Domain,RepoName,FilePath,LineStart,LineEnd,GroundTruth,ValueStart,Valu
 71525,c9fde51c,GitHub,0436af4a,data/0436af4a/test/c9fde51c.cs,97,97,T,36,43,,,Password
 71526,d197c0ec,GitHub,0436af4a,data/0436af4a/test/d197c0ec.cs,131,131,T,44,58,,,Password
 71527,d197c0ec,GitHub,0436af4a,data/0436af4a/test/d197c0ec.cs,132,132,T,44,58,,,Password
-71528,d2b2603a,GitHub,0436af4a,data/0436af4a/test/d2b2603a.cs,25,25,Template,36,43,,,Key
+71528,d2b2603a,GitHub,0436af4a,data/0436af4a/test/d2b2603a.cs,25,25,X,36,43,,,Key
 71529,d2b2603a,GitHub,0436af4a,data/0436af4a/test/d2b2603a.cs,150,150,T,41,49,,,Key
 71530,df627f1b,GitHub,0436af4a,data/0436af4a/test/df627f1b.cs,169,169,T,38,48,,,Password
 71531,df627f1b,GitHub,0436af4a,data/0436af4a/test/df627f1b.cs,170,170,T,38,49,,,Password
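
Note on the meta CSV changes: every `Template` label in the hunks above becomes `X`, with all other fields left unchanged. A relabeling pass of that kind could look like the sketch below. This is an assumption about how such a migration might be scripted, not the tool used for this change. The helper name `relabel_templates` and the reduced sample columns are hypothetical.

```python
import csv
import io


def relabel_templates(meta_csv: str, old: str = "Template", new: str = "X") -> str:
    """Return the CSV text with GroundTruth `old` replaced by `new`.

    All other columns and the row order are preserved.
    """
    reader = csv.DictReader(io.StringIO(meta_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["GroundTruth"] == old:
            row["GroundTruth"] = new
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical, reduced sample; real meta files have more columns.
    sample = "Id,GroundTruth,Category\n1,Template,Password\n2,T,Key\n"
    print(relabel_templates(sample), end="")
```

Running the pass a second time leaves the file unchanged, since no `Template` labels remain.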