Commit

Merge pull request #45 from nishabalamurugan/file-scanner
Integrated filescanner
sai100 authored Nov 29, 2024
2 parents 5852e04 + c351098 commit ecc115d
Showing 9 changed files with 1,602 additions and 103 deletions.
123 changes: 122 additions & 1 deletion README.md
Expand Up @@ -22,6 +22,7 @@ Designed and Developed by Comcast Cybersecurity Research and Development Team</p
- [Usage](#usage)
- [Enterprise Github Secrets Detection](#enterprise-github-secrets-detection)
- [Public Github Secrets Detection](#public-github-secrets-detection)
- [FileScan](#filescan)
- [ML Model Training](#ml-model-training)
- [Custom Keyword Scan](#custom-keyword-scan)
- [License](#license)
Expand Down Expand Up @@ -120,7 +121,7 @@ Designed and Developed by Comcast Cybersecurity Research and Development Team</p
- url_validator: `https://github.<<`**`Enterprise_Name`**`>>.com/api/v3/search/code`
- enterprise_commits_url: `https://github.<<`**`Enterprise_Name`**`>>.com/api/v3/repos/{user_name}/{repo_name}/commits?path={file_path}`

### Running Enterprise Secret Detection

- Traverse into the `github-enterprise` script folder

Expand Down Expand Up @@ -515,6 +516,126 @@ Pass the Console Logging as Yes or No. Default is Yes

> **Note:** By default, the detected secrets are masked to hide sensitive data. If needed, the user can skip the masking and write raw secrets using the command-line argument `-u Yes` or `--unmask_secret Yes`. Refer to the command-line options for more details.

### FileScan

**Detecting Exposed Secrets on File System at Scale**

- xGitGuard Filescanner detects secrets, such as keys and credentials, exposed on the filesystem.
- Traverse into the `file-scanner` folder

```
cd file-scanner
```

#### Running Extension Filter

By default, the extension search script runs for the directories/files configured under `config/xgg_search_paths.csv` and `config/extensions.csv`:

```
# Run with Default configs
python xgg_extension_search.py
```

To run with specific directories or file paths:

```
# Run with targeted directories/filepaths for all extensions
python xgg_extension_search.py -p "file-path"
```

To run with specific extensions and directories/file paths:

```
# Run with targeted filepaths/directories for specific extensions
python xgg_extension_search.py -p "file-path" -e "py,txt"
```

> **Note:** By default, extensions are picked from the `extensions.csv` config file, but the user can also search for targeted extensions, either by providing them in the CLI option or by updating `extensions.csv`.

##### Command-Line Arguments

```
Run usage:
xgg_extension_search.py [-h] [-e Extensions] [-p Directory/Path] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-e Extensions, --extensions Extensions
Pass the Extensions list as a comma-separated string
-p Search Path, --search_path Search Path/File
Pass the Directory or file to be searched
-l Logger Level, --log_level Logger Level
Pass the Logging level as: CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
```
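The options above map onto a standard `argparse` setup. The following sketch mirrors the documented flags and defaults; it is illustrative only, and the script's actual internals may differ:

```python
import argparse


def build_arg_parser():
    """Sketch of an argument parser matching the documented options.

    Option names and defaults follow the help text above; everything
    else is an assumption for illustration.
    """
    parser = argparse.ArgumentParser(description="Search files by extension")
    parser.add_argument("-e", "--extensions", default="",
                        help="Comma-separated list of extensions")
    parser.add_argument("-p", "--search_path", default="",
                        help="Directory or file to be searched")
    parser.add_argument("-l", "--log_level", type=int, default=20,
                        help="Logging level (10-50). Default is 20")
    parser.add_argument("-c", "--console_logging", default="Yes",
                        choices=["Yes", "No"],
                        help="Console logging Yes/No. Default is Yes")
    return parser
```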

#### Search Output Format:

##### Output Files

```
1. Paths Detected: xgitguard\output\xgg_search_files.csv
```

#### Secrets Detection

By default, the Secrets Detection script runs against the processed search paths (`output/xgg_search_files.csv`) with the ML filter enabled, detecting both keys and credentials. xGitGuard uses this additional ML filter to reduce false positives in the detection.

```
# Run with Default configs
python secret_detection.py
```

##### Command to Run Scanner without ML Filter

```
# Run for given Searched Paths without ML model,
python secret_detection.py -m No
```
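Conceptually, a raw (pre-ML) detection pass looks for configured keywords sitting next to high-entropy values; the ML filter then prunes the false positives such a pattern match produces. The sketch below is a simplified stand-in, with an assumed keyword list and entropy threshold rather than xGitGuard's actual patterns:

```python
import math
import re


def shannon_entropy(value):
    """Bits of entropy per character; high values suggest random tokens."""
    if not value:
        return 0.0
    freq = {ch: value.count(ch) / len(value) for ch in set(value)}
    return -sum(p * math.log2(p) for p in freq.values())


def find_candidate_secrets(text, keywords=("password", "api_key", "token")):
    """Flag keyword=value pairs whose value looks random enough.

    Illustrative only: the keywords and the 3.0-bit threshold are
    assumptions, not xGitGuard's configuration.
    """
    pattern = re.compile(
        r"(?P<key>{})\s*[=:]\s*['\"]?(?P<value>[A-Za-z0-9+/_\-]{{8,}})".format(
            "|".join(keywords)
        ),
        re.IGNORECASE,
    )
    return [
        (m.group("key"), m.group("value"))
        for m in pattern.finditer(text)
        if shannon_entropy(m.group("value")) > 3.0
    ]
```

A dictionary word like `alice` scores low entropy and is skipped, while a random token passes; the ML model plays a similar discriminating role at much higher accuracy.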

##### Command-Line Arguments for Secret Scanner

```
Run usage:
secret_detection.py [-h help] [-keys Secondary Keywords] [-creds Secondary Credentials] [-m Ml Prediction ] [-f File Path] [-model_pref model_preference] [-l Logger Level] [-c Console Logging]
optional arguments:
-h, --help show this help message and exit
-keys Secondary Keywords, --secondary_keywords Secondary Keywords
Pass the Secondary Keyword as string
-creds Secondary Credentials, --secondary_credentials Secondary Credentials
Pass the Secondary Credentials as a string
-m ML Prediction, --ml_prediction ML Prediction
Pass the ML Filter as Yes or No. Default is Yes
-f File Path, --file_path File Path
Pass the file to be scanned
-model_preference, --model_preference
Specify whether to use the public model or the enterprise model. Default is public
-l Logger Level, --log_level Logger Level
Pass the Logging level as: CRITICAL - 50, ERROR - 40, WARNING - 30, INFO - 20, DEBUG - 10. Default is 20
-c Console Logging, --console_logging Console Logging
Pass the Console Logging as Yes or No. Default is Yes
```

- Inputs used for search and scan

> **Note:** Command-line argument keywords take precedence over the config files (default). If no keywords are passed on the CLI, data from the config files will be used for the search.
> **Note:** If the ML Prediction flag is set to No, the `-model_preference` flag is not required.
- The `xgg_search_files.csv` file has a default list of file paths for the search, based on the extension scan, which users can update based on their requirements.

#### Output Format:

##### Output Files

```
1. Secrets Detected: xgitguard\output\xgg_file_scan_*_secrets_detected.csv
2. Log File: xgitguard\logs\xgg_file_scan_*_secret_detection*yyyymmdd_hhmmss*.log
3. Hash File: xgitguard\output\xgg_file_scan_*_hashed_file.csv
```

#### ML Model Training

#### Enterprise ML Model Training Procedure
158 changes: 129 additions & 29 deletions xgitguard/common/configs_read.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,9 +50,10 @@ def __init__(self):

def read_xgg_configs(self, file_name):
"""
Read the given xgg_configs yaml file in config path
Set the Class Variable for further use
params: file_name - string
Read the given xgg_configs YAML file in the config path and set the class variable for further use.
Args:
file_name (str): The name of the configuration file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
# Loading xgg_configs from xgg_configs_file
Expand All @@ -70,11 +71,13 @@ def read_xgg_configs(self, file_name):

def read_primary_keywords(self, file_name):
"""
Read the given primary keywords csv file in config path
Set the Class Variable for further use
params: file_name - string
Read the given primary keywords CSV file in the config path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Loading primary keywords from primary keywords file
self.primary_keywords_file = os.path.join(self.config_dir, file_name)
self.primary_keywords = read_csv_file(
Expand All @@ -87,11 +90,13 @@ def read_primary_keywords(self, file_name):

def read_secondary_keywords(self, file_name):
"""
Read the given secondary keywords csv file in config path
Set the Class Variable for further use
params: file_name - string
Read the given secondary keywords CSV file in the config directory and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Loading secondary keywords from secondary keywords file
self.secondary_keywords_file = os.path.join(self.config_dir, file_name)
self.secondary_keywords = read_csv_file(
Expand All @@ -102,37 +107,63 @@ def read_secondary_keywords(self, file_name):
]
# logger.debug(f"secondary_keywords: {self.secondary_keywords}")

def read_secondary_credentials(self, file_name):
"""
Read the given secondary credentials CSV file in the config directory and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Loading secondary Credentials from secondary credentials file
self.secondary_credentials_file = os.path.join(self.config_dir, file_name)
self.secondary_credentials = read_csv_file(
self.secondary_credentials_file, output="list", header=0
)
self.secondary_credentials = [
item for sublist in self.secondary_credentials for item in sublist
]
# logger.debug(f"secondary_credentials: {self.secondary_credentials}")

def read_extensions(self, file_name="extensions.csv"):
"""
Read the given extensions csv file in config path
Set the Class Variable for further use
params: file_name - string
Read the given extensions CSV file in the config path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Get the extensions from extensions file
self.extensions_file = os.path.join(self.config_dir, file_name)
self.extensions = read_csv_file(self.extensions_file, output="list", header=0)
self.extensions = [item for sublist in self.extensions for item in sublist]

# logger.debug(f"Extensions: {self.extensions}")

def read_hashed_url(self, file_name):
"""
Read the given hashed url csv file in output path
Set the Class Variable for further use
params: file_name - string
Read the given hashed URL CSV file in the output path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Loading Existing url hash detections
self.hashed_url_file = os.path.join(self.output_dir, file_name)
hashed_key_urls = read_csv_file(self.hashed_url_file, output="list", header=0)
self.hashed_urls = [row[0] for row in hashed_key_urls]

# logger.debug(f"hashed_urls: {self.hashed_urls}")

def read_training_data(self, file_name):
"""
Read the given training data csv file in output path
Set the Class Variable for further use
params: file_name - string
Read the given training data CSV file in the output path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
self.training_data_file = os.path.join(self.output_dir, file_name)
Expand All @@ -151,9 +182,12 @@ def read_training_data(self, file_name):

def read_confidence_values(self, file_name="confidence_values.csv"):
"""
Read the given confidence values csv file in config path
Set the key as index and the Class Variable for further use
params: file_name - string
Read the given confidence values CSV file in the config path and set the key as index.
This function sets the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
# Loading confidence levels from file
Expand All @@ -178,10 +212,12 @@ def read_confidence_values(self, file_name="confidence_values.csv"):

def read_dictionary_words(self, file_name="dictionary_words.csv"):
"""
Read the given dictionary words csv file in config path
Create dictionary similarity values
Set the Class Variables for further use
params: file_name - string
Read the given dictionary words CSV file in the config path.
This function creates dictionary similarity values and sets the class variables for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
# Creating dictionary similarity values
Expand Down Expand Up @@ -216,9 +252,10 @@ def read_dictionary_words(self, file_name="dictionary_words.csv"):

def read_stop_words(self, file_name="stop_words.csv"):
"""
Read the given stop words csv file in config path
Set the Class Variable for further use
params: file_name - string
Read the given stop words CSV file in the config path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
# Get the programming language stop words
Expand All @@ -227,6 +264,69 @@ def read_stop_words(self, file_name="stop_words.csv"):
self.stop_words = [item for sublist in self.stop_words for item in sublist]
# logger.debug(f"Total Stop Words: {len(self.stop_words)}")

def read_search_paths(self, file_name):
"""
Read the given search paths CSV file in the config directory and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Loading the search paths file to retrieve the paths that need the extension filter applied
self.search_paths_file = os.path.join(self.config_dir, file_name)
self.search_paths = read_csv_file(
self.search_paths_file, output="list", header=0
)
self.search_paths = [item for sublist in self.search_paths for item in sublist]
# logger.debug(f"search_paths: {self.search_paths}")

def read_search_files(self, file_name):
"""
Read the given search paths CSV file in the config directory and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")

# Reading the paths of files to be searched after applying the extension filter
self.target_paths_file = os.path.join(self.output_dir, file_name)
self.search_files = read_csv_file(
self.target_paths_file, output="list", header=0
)
self.search_files = [item for sublist in self.search_files for item in sublist]
# logger.debug(f"search_files: {self.search_files}")

def read_hashed_file(self, file_name):
"""
Read the given hashed file CSV file in the output path and set the class variable for further use.
Args:
file_name (str): The name of the CSV file.
"""
logger.debug("<<<< 'Current Executing Function' >>>>")
# Loading Existing url hash detections
self.hashed_file = os.path.join(self.output_dir, file_name)
hashed_key_files = read_csv_file(self.hashed_file, output="", header=0)
try:
self.hashed_files = (
hashed_key_files.get("hashed_files").drop_duplicates().tolist()
)
self.hashed_file_modified_time = (
hashed_key_files.get("file_modification_hash")
.drop_duplicates()
.tolist()
)
self.hash_file_path = (
hashed_key_files.get("files").drop_duplicates().tolist()
)
except Exception:
self.hashed_files = []
self.hashed_file_modified_time = []
self.hash_file_path = []
# logger.debug(f"hashed_urls: {self.hashed_urls}")


if __name__ == "__main__":

Expand All @@ -239,4 +339,4 @@ def read_stop_words(self, file_name="stop_words.csv"):
logger = create_logger(
log_level=10, console_logging=True, log_dir=log_dir, log_file_name=log_file_name
)
configs = ConfigsData()
configs = ConfigsData()
