Commit 88e8732

Label "Template" migration (#231)
* missed Bearer Authorization
* migration script
* additional info
* migrated all meta
* Try to use X'
* rollback meta
* the script
* X
* readme fix
* readme fix
* readme fix 2
1 parent 30969b5 commit 88e8732

238 files changed, +4316 -4299 lines changed

.ci/benchmark.txt

Lines changed: 132 additions & 132 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 15 additions & 15 deletions
@@ -4,10 +4,10 @@
 * [Introduction](#introduction)
 * [How To Use](#how-to-use)
 * [Data Overview](#data-overview)
-* [Data statistics](#data-statistics)
+* [Data statistics](#data-statistics)
 * [Data](#data)
-* [Selecting Target Repositories](#selecting-target-repositories)
-* [Ground Rules for Labeling Suspected Credential Information](#ground-rules-for-labeling-suspected-credential-information)
+* [Selecting Target Repositories](#selecting-target-repositories)
+* [Ground Rules for Labeling Suspected Credential Information](#ground-rules-for-labeling-suspected-credential-information)
 * [Metadata](#metadata)
 * [Obfuscation](#obfuscation)
 * [License](#license)
@@ -89,11 +89,11 @@ please, find wide info in https://raw.githubusercontent.com/Samsung/CredSweeper/
 
 ## Data
 ### Selecting Target Repositories
-In order to collect various cases in which credentials exist, we selected publicly accessible repositories on Github through the following process:
-1. We wanted to collect credentials from repositories for various languages, frameworks, and topics, so we primarily collected 181 topics on Github.
+In order to collect various cases in which credentials exist, we selected publicly accessible repositories on GitHub through the following process:
+1. We wanted to collect credentials from repositories for various languages, frameworks, and topics, so we primarily collected 181 topics on GitHub.
 
 In this process, to select widely known repositories for each topic, we limited repositories with more than a certain number of stars. 19,486 repositories were selected in this process.
-2. We filtered repositories which have the license that can not be used for dataset according to the license information provided by Github.
+2. We filtered repositories which have the license that can not be used for dataset according to the license information provided by GitHub.
 
 In some cases, the provided license was inaccurate. So we conducted with manual review.
 3. Filtering was carried out by checking whether strings related to the most common credentials such as 'password' and 'secret' among the result repositories are included and how many are included. After that, we executed several [open source credential scanning tools.](#used-tools-for-benchmarking)
@@ -106,21 +106,21 @@ It is difficult to know whether a line included in the source code is a real cre
 However, based on human cognitive abilities, we can expect the possibility that the detected result contains actual credential information.
 We classify the detection results to the three credential type.
 
-- True : It looks like a real credential value.
-- False : It looks like a false positive case, not the actual credential value.
-- Template : It seems that it is not an actual credential, but it is a placeholder. It might be helpful in cases such as ML.
-
+- **T** (True) : It looks like a real credential value.
+- **F** (False) : It looks like a false positive case, not the actual hardcoded credential value.
+- **X** (Unknown/Other) : It seems that it is not a real credential but a test value, an example, or it is a placeholder.
+
 In order to compose an accurate Ground Truth set, we proceed data review based on the following 'Ground Rules':
 1. All credentials in test (example) directories should be labeled as True.
 2. Credentials with obvious placeholders (`password = <YOUR_PASSWORD>;`) should be labeled as False.
 3. Function calls without string literals (`password=getPass();`) and environmental variable assignments (`password=${pass}`) should be labeled as False.
 4. Base64 and other encoded data: the decision must be after research. Use True if original data contain are credentials.
 5. Package and resource version hash is not a credential, so common hash string (`integrity sha512-W7s+uC5bikET2twEFg==`) is False.
 6. Be careful about filetype when checking variable assignment:
-
 In .yaml file row (`password=my_password`) can be a credential
 but in .js or .py it cannot. This languages require quotations (' or ") for string declaration (`password="my_password"`).
 7. Check if the file you are labeling is not a localization file. For example `config/locales/pt-BR.yml` is not a credentials, just a translation. So those should be labeled as False.
+8. Any possible markers like "example.com" mean the credential is False for ML rules. And may be True for not-ML rules.
 
 > We could see that many credentials exist in directories/files that have the same test purpose as test/tests.
 > In the case of these values, people often judge that they contain a real credential, but we do not know whether this value is an actual usable credential or a value used only for testing purposes.
@@ -140,9 +140,9 @@ Metadata includes Ground Truth values and additional information for credential
 | FilePath | String | File path where credential information was included |
 | LineStart | Integer | Line start in file from 1, like in most editors. In common cases it equals LineEnd. |
 | LineEnd | Integer | End line of credential MUST be great or equal like LineStart. Sort line_data_list with line_num for this. |
-| GroundTruth | String | Ground Truth of this credential. True (T) / False (F) or Template |
+| GroundTruth | String | Ground Truth of this credential. True (T) / False (F,X) |
 | ValueStart | Integer | Index of value on the line like in CredSweeper report. This is start position on LineStart. Empty or -1 means the markup for whole line (for False only) |
-| ValueEnd | Integer | Index of character right after value ends in the line. This is end position on LineEnd. May be -1 or empty. |
+| ValueEnd | Integer | Index of character right after value ends in the line. This is end position on LineEnd. May be -1 or empty. |
 | CryptographyKey | String | Type of a key: Private or Public |
 | PredefinedPattern | String | Credential with defined regex patterns (AWS token with `AKIA...` pattern) |
 | Category | String | Labeled data according CredSweeper rules. see [Category](#category). |
@@ -202,15 +202,15 @@ If the suspicious lines are included in the dataset as it is, the credential val
 To avoid such cases we proceeded:
 1. Credential values obfuscation in files.
 2. Directory & file name and directory hierarchy obfuscation.
-
+
 ### Credential values obfuscation in files
 To prevent leakage of the actual credential value in the file, we can mask the line that is supposed to be credential or change it to a random string.
 However, this masking and changing to a random string can make the side effects to the detection performance of several tools.
 We have applied other methods to substitute the actual credential values within the limits of ensuring the detectability of these various tools.
 - Replacing the real value to a example value for a case where a fixed pattern is clear (ex. AWS Access Key)
 - Replacing the entire file with credential information to a example file. (ex. X.509 Key)
 - Random key generation using regex pattern from the character set of real string and length.
-
+
 ### Directory & file name and directory hierarchy obfuscation
 Even if the line suspected of having a credential in the file is obfuscated, you can easily check the original credential value and where it came from by the information of the repository (repo name, directory structure, file name).
 To prevent this from happening, we obfuscated the directory structure and file names.
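
The README's third obfuscation bullet ("Random key generation using regex pattern from the character set of real string and length") can be illustrated with a minimal sketch. This is not the project's actual obfuscation code. The function name `obfuscate_value` and the sample value are hypothetical, and the sketch assumes "character set" means the class of each character (lowercase, uppercase, digit), with everything else left in place.

```python
import random
import string


def obfuscate_value(value: str, rng: random.Random) -> str:
    """Replace each character with a random one of the same class, keeping the length."""
    result = []
    for ch in value:
        if ch in string.ascii_lowercase:
            result.append(rng.choice(string.ascii_lowercase))
        elif ch in string.ascii_uppercase:
            result.append(rng.choice(string.ascii_uppercase))
        elif ch in string.digits:
            result.append(rng.choice(string.digits))
        else:
            # Separators and punctuation are kept so the overall pattern survives.
            result.append(ch)
    return "".join(result)


original = "Ab3-xY9_q"  # hypothetical value, not taken from the dataset
masked = obfuscate_value(original, random.Random(0))
print(len(masked) == len(original))
```

Keeping the length and the per-position character class is one way to stay within "the limits of ensuring the detectability" that the README mentions, since pattern-based scanners usually key on those properties.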

benchmark/scanner/file_type_stat.py

Lines changed: 0 additions & 1 deletion
@@ -7,4 +7,3 @@ class FileTypeStat:
     valid_lines: int
     true_markup: int
     false_markup: int
-    template_markup: int

benchmark/scanner/scanner.py

Lines changed: 13 additions & 24 deletions
@@ -31,8 +31,7 @@ def __init__(self, scanner_type: ScannerType, scanner_url: str, working_dir: str
         self.result_dict: dict = {}
         self.total_true_cnt = 0
         self.total_false_cnt = 0
-        self.total_template_cnt = 0
-        self.rules_markup_counters: Dict[str, Tuple[int, int, int]] = {} # category: true_cnt, false_cnt, template_cnt
+        self.rules_markup_counters: Dict[str, Tuple[int, int]] = {} # category: true_cnt, false_cnt
         self.meta_next_id = 0 # used in suggestion
         self.file_types: Dict[str, FileTypeStat] = {}
         self.total_data_valid_lines = 0
@@ -74,24 +73,20 @@ def _prepare_meta(self):
             # get file extension like in CredSweeper
             _, file_type = os.path.splitext(meta_row.FilePath)
             file_type_lower = file_type.lower()
-            type_stat = self.file_types.get(file_type_lower, FileTypeStat(0, 0, 0, 0, 0))
+            type_stat = self.file_types.get(file_type_lower, FileTypeStat(0, 0, 0, 0))
             rules = meta_row.Category.split(':')
             for rule in rules:
-                true_cnt, false_cnt, template_cnt = self.rules_markup_counters.get(rule, (0, 0, 0))
+                true_cnt, false_cnt = self.rules_markup_counters.get(rule, (0, 0))
                 if 'T' == meta_row.GroundTruth:
                     true_cnt += 1
                     self.total_true_cnt += 1
                     type_stat.true_markup += 1
-                elif 'F' == meta_row.GroundTruth:
+                else:
+                    # all not TRUE marked up cases are FALSE
                     self.total_false_cnt += 1
                     false_cnt += 1
                     type_stat.false_markup += 1
-                else:
-                    # "Template" - correctness will be checked in MetaRow
-                    self.total_template_cnt += 1
-                    template_cnt += 1
-                    type_stat.template_markup += 1
-                self.rules_markup_counters[rule] = (true_cnt, false_cnt, template_cnt)
+                self.rules_markup_counters[rule] = (true_cnt, false_cnt)
             self.file_types[file_type_lower] = type_stat
             if self.meta_next_id <= meta_row.Id:
                 self.meta_next_id = meta_row.Id + 1
@@ -103,7 +98,7 @@ def _prepare_meta(self):
             for file in files:
                 file_name, file_ext = os.path.splitext(str(file))
                 file_ext_lower = file_ext.lower()
-                file_type_stat = self.file_types.get(file_ext_lower, FileTypeStat(0, 0, 0, 0, 0))
+                file_type_stat = self.file_types.get(file_ext_lower, FileTypeStat(0, 0, 0, 0))
                 file_type_stat.files_number += 1
                 self.file_types[file_ext_lower] = file_type_stat
                 with open(os.path.join(root, file), "rb") as f:
@@ -129,26 +124,22 @@ def _prepare_meta(self):
         check_data_valid_lines = 0
         check_true_cnt = 0
         check_false_cnt = 0
-        check_template_cnt = 0
         for key, val in self.file_types.items():
             types_rows.append([key,
                                val.files_number or None,
                                val.valid_lines or None,
                                val.true_markup or None,
-                               val.false_markup or None,
-                               val.template_markup or None])
+                               val.false_markup or None])
             check_files_number += val.files_number
             check_data_valid_lines += val.valid_lines
             check_true_cnt += val.true_markup
             check_false_cnt += val.false_markup
-            check_template_cnt += val.template_markup
         types_rows.sort()
         types_rows.append(["TOTAL:",
                            check_files_number,
                            check_data_valid_lines,
                            check_true_cnt,
-                           check_false_cnt,
-                           check_template_cnt])
+                           check_false_cnt])
         print(tabulate.tabulate(types_rows, types_headers), flush=True)
 
     @property
@@ -274,7 +265,7 @@ def check_line_from_meta(self,
             "FilePath": data_path,
             "LineStart": line_start,
             "LineEnd": line_end,
-            "GroundTruth": 'F',
+            "GroundTruth": 'X',
             "ValueStart": value_start,
             "ValueEnd": value_end,
             "CryptographyKey": '',
@@ -347,7 +338,7 @@ def check_line_from_meta(self,
                 self.true_cnt += 1
                 return LineStatus.FALSE, repo_name, file_id
             else:
-                # MetaRow class checks the correctness of row.GroundTruth = ['T', 'F', "Template"]
+                # MetaRow class checks the correctness of row.GroundTruth = ['T', 'F']
                 self._increase_result_dict_cnt(meta_rule, False)
                 self.false_cnt += 1
                 return LineStatus.TRUE, repo_name, file_id
@@ -377,7 +368,7 @@ def analyze_result(self) -> None:
             f", true_cnt : {self.true_cnt}, false_cnt : {self.false_cnt}"
         )
 
-        header = ["Rules", "Positives", "Negatives", "Templates", "Reported",
+        header = ["Rules", "Positives", "Negatives", "Reported",
                   "TP", "FP", "TN", "FN", "FPR", "FNR", "ACC", "PRC", "RCL", "F1"]
         rows: List[List[Any]] = []
 
@@ -399,12 +390,11 @@ def analyze_result(self) -> None:
             total_true_cnt, total_false_cnt = self._get_total_true_false_count(rule)
             result = Result(true_cnt, false_cnt, total_true_cnt, total_false_cnt)
             if rule not in self.rules_markup_counters:
-                self.rules_markup_counters[rule] = (0, 0, 0)
+                self.rules_markup_counters[rule] = (0, 0)
             rows.append([
                 rule,
                 self.rules_markup_counters[rule][0],
                 self.rules_markup_counters[rule][1],
-                self.rules_markup_counters[rule][2],
                 self.reported.get(rule),
                 result.true_positive,
                 result.false_positive,
@@ -430,7 +420,6 @@ def analyze_result(self) -> None:
             "",
             self.total_true_cnt,
             self.total_false_cnt,
-            self.total_template_cnt,
             reported_sum,
             total_result.true_positive,
             total_result.false_positive,
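
The counting change in `_prepare_meta` above can be read in isolation with the following sketch. It is a standalone approximation for illustration, not the scanner's code: the function `count_markup` and the sample rows are hypothetical. It mirrors the new two-counter logic, where every `GroundTruth` other than `'T'` (that is, both `F` and `X`) is counted as false for each rule in the colon-separated `Category`.

```python
from typing import Dict, Iterable, Tuple


def count_markup(rows: Iterable[Tuple[str, str]]) -> Dict[str, Tuple[int, int]]:
    """Take (Category, GroundTruth) pairs and return rule -> (true_cnt, false_cnt)."""
    counters: Dict[str, Tuple[int, int]] = {}
    for category, ground_truth in rows:
        for rule in category.split(':'):
            true_cnt, false_cnt = counters.get(rule, (0, 0))
            if 'T' == ground_truth:
                true_cnt += 1
            else:
                # Every label that is not 'T' (both 'F' and 'X') counts as false.
                false_cnt += 1
            counters[rule] = (true_cnt, false_cnt)
    return counters


# Hypothetical sample rows: (Category, GroundTruth)
sample = [("Auth:Token", "X"), ("Password", "T"), ("Key", "F"), ("Auth:Secret", "T")]
counters = count_markup(sample)
print(counters)
```

A row with several rules in `Category` contributes to each rule's counters, which is why per-rule totals can exceed the number of metadata rows.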

meta/02dfa7ec.csv

Lines changed: 13 additions & 13 deletions
@@ -16,25 +16,25 @@ Id,FileID,Domain,RepoName,FilePath,LineStart,LineEnd,GroundTruth,ValueStart,Valu
 28832,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,74,74,T,32,73,,,Auth:Secret
 29294,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,82,82,T,28,50,,,Key
 29502,e43ec22b,GitHub,02dfa7ec,data/02dfa7ec/test/e43ec22b.py,15,15,T,25,41,,,Password
-29657,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,70,70,Template,18,33,,,Auth:Token
-30366,5fb5a784,GitHub,02dfa7ec,data/02dfa7ec/src/5fb5a784.env,50,50,Template,16,36,,,API:Gitlab Feed Token:Key
-31246,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,25,25,Template,18,30,,,Key:Secret
+29657,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,70,70,X,18,33,,,Auth:Token
+30366,5fb5a784,GitHub,02dfa7ec,data/02dfa7ec/src/5fb5a784.env,50,50,X,16,36,,,API:Gitlab Feed Token:Key
+31246,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,25,25,X,18,30,,,Key:Secret
 32637,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,94,94,T,30,35,,,URL Credentials
 32675,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,22,22,T,27,44,,,URL Credentials
-32701,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,47,47,Template,15,28,,,Password
+32701,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,47,47,X,15,28,,,Password
 32757,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,28,28,T,18,82,,,Key:Secret:Bitbucket Client Secret
 34024,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,83,83,T,31,73,,,Secret
 35963,3279f9a1,GitHub,02dfa7ec,data/02dfa7ec/src/3279f9a1.py,100,100,T,26,90,,,Bitbucket Client Secret
-36233,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,32,32,Template,29,33,,,Password
-37642,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,49,49,Template,29,33,,,Password
-37643,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,60,60,Template,29,33,,,Password
-37644,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,70,70,Template,29,33,,,Password
-37645,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,83,83,Template,29,33,,,Password
+36233,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,32,32,X,29,33,,,Password
+37642,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,49,49,X,29,33,,,Password
+37643,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,60,60,X,29,33,,,Password
+37644,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,70,70,X,29,33,,,Password
+37645,3f04cb2a,GitHub,02dfa7ec,data/02dfa7ec/test/3f04cb2a.py,83,83,X,29,33,,,Password
 50371,a5c0c9aa,GitHub,02dfa7ec,data/02dfa7ec/test/a5c0c9aa.py,60,60,T,25,41,,,Password
-58202,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,61,61,Template,16,36,,,API:Gitlab Feed Token:Key
-58203,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,43,43,Template,16,36,,,API:Gitlab Feed Token:Key
-62037,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,55,55,Template,18,33,,,Auth:Token
-62038,5fb5a784,GitHub,02dfa7ec,data/02dfa7ec/src/5fb5a784.env,65,65,Template,18,33,,,Auth:Token
+58202,a758153b,GitHub,02dfa7ec,data/02dfa7ec/test/a758153b.example,61,61,X,16,36,,,API:Gitlab Feed Token:Key
+58203,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,43,43,X,16,36,,,API:Gitlab Feed Token:Key
+62037,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,55,55,X,18,33,,,Auth:Token
+62038,5fb5a784,GitHub,02dfa7ec,data/02dfa7ec/src/5fb5a784.env,65,65,X,18,33,,,Auth:Token
 64018,a5c0c9aa,GitHub,02dfa7ec,data/02dfa7ec/test/a5c0c9aa.py,537,537,F,-1,-1,,,Key
 64019,a5c0c9aa,GitHub,02dfa7ec,data/02dfa7ec/test/a5c0c9aa.py,551,551,F,-1,-1,,,Key
 74230,0f25fb09,GitHub,02dfa7ec,data/02dfa7ec/src/0f25fb09.js,262,262,F,-1,-1,,,Other

meta/0436af4a.csv

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@ Id,FileID,Domain,RepoName,FilePath,LineStart,LineEnd,GroundTruth,ValueStart,Valu
 71444,4433e6a4,GitHub,0436af4a,data/0436af4a/test/4433e6a4.cs,210,210,T,37,47,,,Auth
 71445,17ed1848,GitHub,0436af4a,data/0436af4a/other/17ed1848.md,318,318,T,58,66,,,URL Credentials
 71446,4663e076,GitHub,0436af4a,data/0436af4a/other/4663e076.md,39,39,T,26,36,,,Credential
-71447,4663e076,GitHub,0436af4a,data/0436af4a/other/4663e076.md,65,65,Template,26,31,,,Credential
+71447,4663e076,GitHub,0436af4a,data/0436af4a/other/4663e076.md,65,65,X,26,31,,,Credential
 71448,a9108d0e,GitHub,0436af4a,data/0436af4a/other/a9108d0e.md,730,730,F,37,50,,,Credential
 71449,94b788c0,GitHub,0436af4a,data/0436af4a/other/94b788c0.md,77,77,T,20,56,,,Auth:UUID
 71450,94b788c0,GitHub,0436af4a,data/0436af4a/other/94b788c0.md,78,78,T,24,44,,,Secret:Auth
@@ -84,7 +84,7 @@ Id,FileID,Domain,RepoName,FilePath,LineStart,LineEnd,GroundTruth,ValueStart,Valu
 71525,c9fde51c,GitHub,0436af4a,data/0436af4a/test/c9fde51c.cs,97,97,T,36,43,,,Password
 71526,d197c0ec,GitHub,0436af4a,data/0436af4a/test/d197c0ec.cs,131,131,T,44,58,,,Password
 71527,d197c0ec,GitHub,0436af4a,data/0436af4a/test/d197c0ec.cs,132,132,T,44,58,,,Password
-71528,d2b2603a,GitHub,0436af4a,data/0436af4a/test/d2b2603a.cs,25,25,Template,36,43,,,Key
+71528,d2b2603a,GitHub,0436af4a,data/0436af4a/test/d2b2603a.cs,25,25,X,36,43,,,Key
 71529,d2b2603a,GitHub,0436af4a,data/0436af4a/test/d2b2603a.cs,150,150,T,41,49,,,Key
 71530,df627f1b,GitHub,0436af4a,data/0436af4a/test/df627f1b.cs,169,169,T,38,48,,,Password
 71531,df627f1b,GitHub,0436af4a,data/0436af4a/test/df627f1b.cs,170,170,T,38,49,,,Password
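
The meta changes above all follow one pattern: a `GroundTruth` of `Template` becomes `X`. The commit message mentions a migration script, but it is not shown on this page, so the following is only a sketch of how such a rewrite could be done. The function name `migrate_ground_truth` is hypothetical, and the full CSV header is an assumption assembled from the README metadata table (the hunk headers above truncate it after `Valu`).

```python
import csv
import io


def migrate_ground_truth(src_text: str) -> str:
    """Rewrite GroundTruth 'Template' to 'X' and leave every other field as it is."""
    reader = csv.DictReader(io.StringIO(src_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["GroundTruth"] == "Template":
            row["GroundTruth"] = "X"
        writer.writerow(row)
    return out.getvalue()


# Header columns are assumed from the README metadata table; rows are from the diff above.
SAMPLE = (
    "Id,FileID,Domain,RepoName,FilePath,LineStart,LineEnd,GroundTruth,"
    "ValueStart,ValueEnd,CryptographyKey,PredefinedPattern,Category\n"
    "71446,4663e076,GitHub,0436af4a,data/0436af4a/other/4663e076.md,39,39,T,26,36,,,Credential\n"
    "71447,4663e076,GitHub,0436af4a,data/0436af4a/other/4663e076.md,65,65,Template,26,31,,,Credential\n"
)
migrated = migrate_ground_truth(SAMPLE)
print(migrated)
```

Matching on the `GroundTruth` column by name, rather than replacing the text `Template` anywhere in the line, avoids touching a path or category that happens to contain the same word.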
