Commit f5f2bca

hritankarcode, unknown, and babenek authored
Remove unused MetaRow fields (#209)
* Remove unused MetaRow fields
* Add migration script to clean meta CSV by dropping unused fields
* .ci/benchmark.txt update
* [skip actions] [tmp] 2025-05-31T12:48:02+03:00
* scanner code fix
* readme fix
* Apply unix style for lines in meta

---------

Co-authored-by: unknown <your.email@example.com>
Co-authored-by: Roman Babenko <babenek@gmail.com>
1 parent 7ab36e8 commit f5f2bca
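
The commit message mentions a migration script that drops the unused fields from the meta CSV files, but that script is not visible in this part of the diff. The following is only a minimal sketch of how such a cleanup could look, not the project's actual script. The function name `drop_unused_fields` and the `UNUSED_FIELDS` tuple are hypothetical; the column names are the ones removed in the README diff, and the `\n` line terminator follows the "unix style for lines in meta" item.

```python
import csv
from pathlib import Path

# Columns removed by this commit, as listed in the README diff.
UNUSED_FIELDS = (
    "WithWords", "InURL", "InRuntimeParameter", "CharacterSet",
    "VariableNameType", "Entropy", "Length",
    "Base64Encode", "HexEncode", "URLEncode",
)


def drop_unused_fields(src: Path, dst: Path) -> None:
    """Copy the meta CSV at src to dst without the unused columns (hypothetical helper)."""
    with src.open(newline="", encoding="utf-8") as fin:
        reader = csv.DictReader(fin)
        # Keep the original column order, minus the dropped names.
        kept = [name for name in (reader.fieldnames or []) if name not in UNUSED_FIELDS]
        rows = [{name: row[name] for name in kept} for row in reader]
    with dst.open("w", newline="", encoding="utf-8") as fout:
        # lineterminator="\n" keeps Unix line endings on every platform.
        writer = csv.DictWriter(fout, fieldnames=kept, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
```

Reading all rows before writing lets `src` and `dst` be the same path, which suits an in-place migration of files of this size; a streaming variant would need a temporary file instead.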

320 files changed: +63233 -63239 lines changed


.ci/benchmark.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-META MD5 ab1a2c8bd0f74bc42c0f749dd323002e
+META MD5 4438c6ff6ddec934d205651ddfed047d
 DATA MD5 d0c51cce420271d1e947e82fb0aa21f7
 DATA: 16707548 interested lines. MARKUP: 62260 items
 FileType FileNumber ValidLines Positives Negatives Templates

README.md

Lines changed: 3 additions & 13 deletions
@@ -142,20 +142,10 @@ Metadata includes Ground Truth values and additional information for credential
 | LineStart | Integer | Line start in file from 1, like in most editors. In common cases it equals LineEnd. |
 | LineEnd | Integer | End line of credential MUST be great or equal like LineStart. Sort line_data_list with line_num for this. |
 | GroundTruth | String | Ground Truth of this credential. True (T) / False (F) or Template |
-| WithWords | Boolean | Flag to indicate word(https://github.com/first20hours/google-10000-english) is included on the credential |
 | ValueStart | Integer | Index of value on the line like in CredSweeper report. This is start position on LineStart. Empty or -1 means the markup for whole line (for False only) |
 | ValueEnd | Integer | Index of character right after value ends in the line. This is end position on LineEnd. May be -1 or empty. |
-| InURL | Boolean | Flag to indicate if credential is a part of a URL, such as "http://user:pwd@site.com" |
-| InRuntimeParameter | Boolean | Flag to indicate if credential is in runtime parameter |
-| CharacterSet | String | Characters used in the credential (NumberOnly, CharOnly, Any) |
 | CryptographyKey | String | Type of a key: Private or Public |
 | PredefinedPattern | String | Credential with defined regex patterns (AWS token with `AKIA...` pattern) |
-| VariableNameType | String | Categorize credentials by variable name into Secret, Key, Token, SeedSalt and Auth |
-| Entropy | Float | Shanon entropy of a credential |
-| Length | Integer | Value length, similar to ValueEnd - ValueStart |
-| Base64Encode | Boolean | Is credential a base64 string? |
-| HexEncode | Boolean | Is credential a hex encoded string? (like `\xFF` or `FF 02 33`) |
-| URLEncode | Boolean | Is credential a url encoded string? (like `one%20two`) |
 | Category | String | Labeled data according CredSweeper rules. see [Category](#category). |

 ### Category
@@ -171,8 +161,8 @@ A single metadata file contains rows including line location, value index and GT
 Let's look at the [meta/02dfa7ec.csv](meta/02dfa7ec.csv). file as an example.

 ```
-Id,FileID,Domain,RepoName,FilePath,LineStart,LineEnd,GroundTruth,WithWords,ValueStart,ValueEnd,...
-34024,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,83,83,T,F,31,73,...
+Id,FileID,Domain,RepoName,FilePath,LineStart,LineEnd,GroundTruth,ValueStart,ValueEnd,...
+34024,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,83,83,T,31,73,...
 ```

 Convert the above line with only essential columns into a table format:
@@ -202,7 +192,7 @@ hfbpozfhvuwgtfosmo2imqskc73w04jf3313309829
 With them, you can use ``review_data.py`` script to review the markup in console with colorization.

 <pre>
-34024,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,83,83,T,F,31,73,F,F,Any,,,Secret,3.74,42,F,F,F,Secret:Token
+34024,61ed9af5,GitHub,02dfa7ec,data/02dfa7ec/test/61ed9af5.example,83,83,T,31,73,,,Secret:Token
 83:<font color="#00AA00"># GITHUB_ENTERPRISE_ORG_SECRET_TOKEN=</font><span style="background-color:#FFFF55"><font color="#00AA00">hfbpozfhvuwgtfosmo2imqskc73w04jf3313309829</font></span>
 </pre>

benchmark/scanner/scanner.py

Lines changed: 2 additions & 12 deletions
@@ -262,8 +262,8 @@ def check_line_from_meta(self,
 approximate = f"{self.meta_next_id},{file_id}" \
               f",GitHub,{repo_name},{data_path}" \
               f",{line_start},{line_end}" \
-              f",F,F,{value_start},{value_end}" \
-              f",F,F,,,,,0.0,0,F,F,F,{rule}"
+              f",F,{value_start},{value_end}" \
+              f",,,{rule}"
 lost_meta = MetaRow({
     "Id": self.meta_next_id,
     "FileID": file_id,
@@ -273,20 +273,10 @@ def check_line_from_meta(self,
     "LineStart": line_start,
     "LineEnd": line_end,
     "GroundTruth": 'F',
-    "WithWords": 'F',
     "ValueStart": value_start,
     "ValueEnd": value_end,
-    "InURL": 'F',
-    "InRuntimeParameter": 'F',
-    "CharacterSet": '',
     "CryptographyKey": '',
     "PredefinedPattern": '',
-    "VariableNameType": '',
-    "Entropy": 0.0,
-    "Length": 0,
-    "Base64Encode": 'F',
-    "HexEncode": 'F',
-    "URLEncode": 'F',
     "Category": rule
 })

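
The scanner.py change builds the suggested meta line by hand with f-strings, so the number of commas has to be kept in step with the column list whenever fields are added or removed. As an illustration only, the sketch below derives the line from a single column tuple with Python's `csv` module. `META_COLUMNS` and `format_meta_row` are hypothetical names, and the column order is an assumption inferred from the README table and example row, not copied from the project's `MetaRow` class.

```python
import csv
import io

# Assumed column order after this commit (inferred from the README diff).
META_COLUMNS = (
    "Id", "FileID", "Domain", "RepoName", "FilePath",
    "LineStart", "LineEnd", "GroundTruth", "ValueStart", "ValueEnd",
    "CryptographyKey", "PredefinedPattern", "Category",
)


def format_meta_row(row: dict) -> str:
    """Render one meta row as a CSV line without a trailing newline (hypothetical helper)."""
    missing = [name for name in META_COLUMNS if name not in row]
    if missing:
        raise KeyError(f"missing meta columns: {missing}")
    buffer = io.StringIO()
    # csv.writer quotes values that contain commas or quotes,
    # which a hand-built f-string would not do.
    csv.writer(buffer, lineterminator="\n").writerow([row[name] for name in META_COLUMNS])
    return buffer.getvalue().rstrip("\n")
```

With this shape, removing a column means editing only the tuple; the comma count in the output follows automatically.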