Commit

Merge pull request #45 from nishabalamurugan/file-scanner
Integrated filescanner
sai100 authored Nov 29, 2024
2 parents 5852e04 + c351098 commit ecc115d
Showing 9 changed files with 1,602 additions and 103 deletions.
123 changes: 122 additions & 1 deletion README.md
Expand Up @@ -22,6 +22,7 @@ Designed and Developed by Comcast Cybersecurity Research and Development Team</p
- [Usage](#usage)
- [Enterprise Github Secrets Detection](#enterprise-github-secrets-detection)
- [Public Github Secrets Detection](#public-github-secrets-detection)
- [FileScan](#filescan)
- [ML Model Training](#ml-model-training)
- [Custom Keyword Scan](#custom-keyword-scan)
- [License](#license)
Expand Down Expand Up @@ -120,7 +121,7 @@ Designed and Developed by Comcast Cybersecurity Research and Development Team</p
- url_validator: `https://github.<<`**`Enterprise_Name`**`>>.com/api/v3/search/code`
- enterprise_commits_url: `https://github.<<`**`Enterprise_Name`**`>>.com/api/v3/repos/{user_name}/{repo_name}/commits?path={file_path}`

### Running Enterprise Secret Detection

- Traverse into the `github-enterprise` script folder

Expand Down Expand Up @@ -515,6 +516,126 @@ Pass the Console Logging as Yes or No. Default is Yes

> **Note:** By default, the detected secrets are masked to hide sensitive data. If needed, the user can skip the masking and write raw secrets using the command-line argument `-u Yes` or `--unmask_secret Yes`. Refer to the command-line options for more details.

### FileScan

**Detecting Exposed Secrets on File System at Scale**

- xGitGuard Filescanner detects secrets, such as keys and credentials, exposed on the filesystem.
- Traverse into the `file-scanner` folder

```
cd file-scanner
```

#### Running Extension Filter

By default, the extension search script runs for the directories/files configured under `config/xgg_search_paths.csv` and `config/extensions.csv`:

```
# Run with Default configs
python xgg_extension_search.py
```

To run with specific directories or file paths:

```
# Run with targeted directories/filepaths for all extensions
python xgg_extension_search.py -p "file-path"
```

To run with specific extensions and directories/file paths:

```
# Run with targeted filepaths/directories for specific extensions
python xgg_extension_search.py -p "file-path" -e "py,txt"
```

> **Note:** By default, extensions are picked from the `extensions.csv` config file, but the user can also search for targeted extensions, either by providing them in the CLI option or by updating `extensions.csv`.

##### Command-Line Arguments

```
Run usage:
xgg_extension_search.py [-h] [-e Extensions] [-p Directory/Path] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-e Extensions, --extensions Extensions
Pass the Extensions list as a comma-separated string
-p Search Path, --search_path Search Path/File
Pass the Directory or file to be searched
-l Logger Level, --log_level Logger Level
Pass the Logging level as: CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
```
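The options above map onto a standard `argparse` setup. The following sketch mirrors the documented flags and defaults; it is illustrative only, and the script's actual internals may differ:

```python
import argparse


def build_arg_parser():
    """Sketch of an argument parser matching the documented options.

    Option names and defaults follow the help text above; everything
    else is an assumption for illustration.
    """
    parser = argparse.ArgumentParser(description="Search files by extension")
    parser.add_argument("-e", "--extensions", default="",
                        help="Comma-separated list of extensions")
    parser.add_argument("-p", "--search_path", default="",
                        help="Directory or file to be searched")
    parser.add_argument("-l", "--log_level", type=int, default=20,
                        help="Logging level (10-50). Default is 20")
    parser.add_argument("-c", "--console_logging", default="Yes",
                        choices=["Yes", "No"],
                        help="Console logging Yes/No. Default is Yes")
    return parser
```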

#### Search Output Format:

##### Output Files

```
1. Paths Detected: xgitguard\output\xgg_search_files.csv
```

#### Secrets Detection

By default, the Secrets Detection script runs against the processed search paths (`output/xgg_search_files.csv`) with the ML filter enabled, detecting both keys and credentials. xGitGuard uses this additional ML filter to reduce false positives in the detection.

```
# Run with Default configs
python secret_detection.py
```

##### Command to Run Scanner without ML Filter

```
# Run for given Searched Paths without ML model,
python secret_detection.py -m No
```
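Conceptually, a raw (pre-ML) detection pass looks for configured keywords sitting next to high-entropy values; the ML filter then prunes the false positives such a pattern match produces. The sketch below is a simplified stand-in, with an assumed keyword list and entropy threshold rather than xGitGuard's actual patterns:

```python
import math
import re


def shannon_entropy(value):
    """Bits of entropy per character; high values suggest random tokens."""
    if not value:
        return 0.0
    freq = {ch: value.count(ch) / len(value) for ch in set(value)}
    return -sum(p * math.log2(p) for p in freq.values())


def find_candidate_secrets(text, keywords=("password", "api_key", "token")):
    """Flag keyword=value pairs whose value looks random enough.

    Illustrative only: the keywords and the 3.0-bit threshold are
    assumptions, not xGitGuard's configuration.
    """
    pattern = re.compile(
        r"(?P<key>{})\s*[=:]\s*['\"]?(?P<value>[A-Za-z0-9+/_\-]{{8,}})".format(
            "|".join(keywords)
        ),
        re.IGNORECASE,
    )
    return [
        (m.group("key"), m.group("value"))
        for m in pattern.finditer(text)
        if shannon_entropy(m.group("value")) > 3.0
    ]
```

A dictionary word like `alice` scores low entropy and is skipped, while a random token passes; the ML model plays a similar discriminating role at much higher accuracy.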

##### Command-Line Arguments for Secret Scanner

```
Run usage:
secret_detection.py [-h help] [-keys Secondary Keywords] [-creds Secondary Credentials] [-m Ml Prediction ] [-f File Path] [-model_pref model_preference] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-keys Secondary Keywords, --secondary_keywords Secondary Keywords
Pass the Secondary Keyword as string
-creds Secondary Credentials, --secondary_credentials Secondary Credentials
Pass the Secondary Credentials as a string
-m ML Prediction, --ml_prediction ML Prediction
Pass the ML Filter as Yes or No. Default is Yes
-f File Path, --file_path File Path
Pass the file to be scanned
-model_preference, --model_preference
Specify whether to use the public model or the enterprise model. Default is public
-l Logger Level, --log_level Logger Level
Pass the Logging level as: CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
```

- Inputs used for search and scan

> **Note:** Command-line argument keywords take precedence over the config files (default). If no keywords are passed on the CLI, data from the config files will be used for the search.
> **Note:** If the ML Prediction flag is set to No, the `-model_preference` flag is not required.
- The `xgg_search_files.csv` file has a default list of file paths for the search, based on the extension scan, which users can update based on their requirements.

#### Output Format:

##### Output Files

```
1. Secrets Detected: xgitguard\output\xgg_file_scan_*_secrets_detected.csv
2. Log File: xgitguard\logs\xgg_file_scan_*_secret_detection*yyyymmdd_hhmmss*.log
3. Hash File: xgitguard\output\xgg_file_scan_*_hashed_file.csv
```

#### ML Model Training

#### Enterprise ML Model Training Procedure
158 changes: 129 additions & 29 deletions xgitguard/common/configs_read.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,9 +50,10 @@ def __init__(self):

def read_xgg_configs(self, file_name):
"""
Read the given xgg_configs yaml file in config path
Set the Class Variable for further use
params: file_name - string
Read the given xgg_configs YAML file in the config path and set the class variable for further use.
Args:
file_name (str): The name of the configuration file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
# Loading xgg_configs from xgg_configs_file
Expand All @@ -70,11 +71,13 @@ def read_xgg_configs(self, file_name):

def read_primary_keywords(self, file_name):
"""
Read the given primary keywords csv file in config path
Set the Class Variable for further use
params: file_name - string
Read the given primary keywords CSV file in the config path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Loading primary keywords from primary keywords file
self.primary_keywords_file = os.path.join(self.config_dir, file_name)
self.primary_keywords = read_csv_file(
Expand All @@ -87,11 +90,13 @@ def read_primary_keywords(self, file_name):

def read_secondary_keywords(self, file_name):
"""
Read the given secondary keywords csv file in config path
Set the Class Variable for further use
params: file_name - string
Read the given secondary keywords CSV file in the config directory and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Loading secondary keywords from secondary keywords file
self.secondary_keywords_file = os.path.join(self.config_dir, file_name)
self.secondary_keywords = read_csv_file(
Expand All @@ -102,37 +107,63 @@ def read_secondary_keywords(self, file_name):
]
# logger.debug(f"secondary_keywords: {self.secondary_keywords}")

def read_secondary_credentials(self, file_name):
"""
Read the given secondary credentials CSV file in the config directory and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Loading secondary Credentials from secondary credentials file
self.secondary_credentials_file = os.path.join(self.config_dir, file_name)
self.secondary_credentials = read_csv_file(
self.secondary_credentials_file, output="list", header=0
)
self.secondary_credentials = [
item for sublist in self.secondary_credentials for item in sublist
]
# logger.debug(f"secondary_credentials: {self.secondary_credentials}")

def read_extensions(self, file_name="extensions.csv"):
"""
Read the given extensions csv file in config path
Set the Class Variable for further use
params: file_name - string
Read the given extensions CSV file in the config path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Get the extensions from extensions file
self.extensions_file = os.path.join(self.config_dir, file_name)
self.extensions = read_csv_file(self.extensions_file, output="list", header=0)
self.extensions = [item for sublist in self.extensions for item in sublist]

# logger.debug(f"Extensions: {self.extensions}")

def read_hashed_url(self, file_name):
"""
Read the given hashed url csv file in output path
Set the Class Variable for further use
params: file_name - string
Read the given hashed URL CSV file in the output path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Loading Existing url hash detections
self.hashed_url_file = os.path.join(self.output_dir, file_name)
hashed_key_urls = read_csv_file(self.hashed_url_file, output="list", header=0)
self.hashed_urls = [row[0] for row in hashed_key_urls]

# logger.debug(f"hashed_urls: {self.hashed_urls}")

def read_training_data(self, file_name):
"""
Read the given training data csv file in output path
Set the Class Variable for further use
params: file_name - string
Read the given training data CSV file in the output path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
self.training_data_file = os.path.join(self.output_dir, file_name)
Expand All @@ -151,9 +182,12 @@ def read_training_data(self, file_name):

def read_confidence_values(self, file_name="confidence_values.csv"):
"""
Read the given confidence values csv file in config path
Set the key as index and the Class Variable for further use
params: file_name - string
Read the given confidence values CSV file in the config path and set the key as index.
This function sets the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
# Loading confidence levels from file
Expand All @@ -178,10 +212,12 @@ def read_confidence_values(self, file_name="confidence_values.csv"):

def read_dictionary_words(self, file_name="dictionary_words.csv"):
"""
Read the given dictionary words csv file in config path
Create dictionary similarity values
Set the Class Variables for further use
params: file_name - string
Read the given dictionary words CSV file in the config path.
This function creates dictionary similarity values and sets the class variables for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
# Creating dictionary similarity values
Expand Down Expand Up @@ -216,9 +252,10 @@ def read_dictionary_words(self, file_name="dictionary_words.csv"):

def read_stop_words(self, file_name="stop_words.csv"):
"""
Read the given stop words csv file in config path
Set the Class Variable for further use
params: file_name - string
Read the given stop words CSV file in the config path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
# Get the programming language stop words
Expand All @@ -227,6 +264,69 @@ def read_stop_words(self, file_name="stop_words.csv"):
self.stop_words = [item for sublist in self.stop_words for item in sublist]
# logger.debug(f"Total Stop Words: {len(self.stop_words)}")

def read_search_paths(self, file_name):
"""
Read the given search paths CSV file in the config directory and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Loading the search paths file to retrieve the paths that need the extension filter applied
self.search_paths_file = os.path.join(self.config_dir, file_name)
self.search_paths = read_csv_file(
self.search_paths_file, output="list", header=0
)
self.search_paths = [item for sublist in self.search_paths for item in sublist]
# logger.debug(f"search_paths: {self.search_paths}")

def read_search_files(self, file_name):
"""
Read the given search paths CSV file in the config directory and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Reading the paths of files to be searched after applying the extension filter
self.target_paths_file = os.path.join(self.output_dir, file_name)
self.search_files = read_csv_file(
self.target_paths_file, output="list", header=0
)
self.search_files = [item for sublist in self.search_files for item in sublist]
# logger.debug(f"search_files: {self.search_files}")

def read_hashed_file(self, file_name):
"""
Read the given hashed file CSV file in the output path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
# Loading Existing url hash detections
self.hashed_file = os.path.join(self.output_dir, file_name)
hashed_key_files = read_csv_file(self.hashed_file, output="", header=0)
try:
self.hashed_files = (
hashed_key_files.get("hashed_files").drop_duplicates().tolist()
)
self.hashed_file_modified_time = (
hashed_key_files.get("file_modification_hash")
.drop_duplicates()
.tolist()
)
self.hash_file_path = (
hashed_key_files.get("files").drop_duplicates().tolist()
)
except Exception:
self.hashed_files = []
self.hashed_file_modified_time = []
self.hash_file_path = []
# logger.debug(f"hashed_urls: {self.hashed_urls}")


if __name__ == "__main__":

Expand All @@ -239,4 +339,4 @@ def read_stop_words(self, file_name="stop_words.csv"):
logger = create_logger(
log_level=10, console_logging=True, log_dir=log_dir, log_file_name=log_file_name
)
configs = ConfigsData()
configs = ConfigsData()
