- Python 3.7+
- Dask
- Dask-ML
- scikit-learn
- pandas
- pyarrow
- tqdm
- requests
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/dat503.git
  cd dat503
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Configure the logging settings in `dat503.py`:

  ```python
  def configure_logging():
      """Configure logging settings."""
      logging.basicConfig(
          filename='dat503.log',
          level=logging.INFO,
          format='%(asctime)s - %(levelname)s - %(message)s - %(filename)s:%(lineno)d',
          filemode='w'  # Overwrite the log file on each run
      )
      logging.captureWarnings(True)
  ```
- Set the constants in `dat503.py`:

  ```python
  BASE_URL = "https://opentransportdata.swiss/wp-content/uploads/ist-daten-archive/"
  TRAIN_FOLDER = os.path.join(os.path.dirname(__file__), 'data', 'train')
  FORCE_DOWNLOAD = False  # Set to True to download the data
  NUM_MONTHS = 3  # Number of months to download
  TRAIN_FILTERS = {'LINIEN_TEXT': ['IC2', 'IC3', 'IC5', 'IC6', 'IC8', 'IC21']}
  TRAIN_OUTPUT_FILE_PATH = os.path.join(TRAIN_FOLDER, 'working', 'processed_data.parquet')
  TRAIN_EXCLUDE_COLUMNS = ['PRODUKT_ID', 'BETREIBER_NAME', 'BETREIBER_ID', 'UMLAUF_ID',
                           'VERKEHRSMITTEL_TEXT', 'AN_PROGNOSE_STATUS', 'AB_PROGNOSE_STATUS',
                           'HALTESTELLEN_NAME']
  ```
- Run the data processing script:

  ```bash
  python dat503.py
  ```

The script is organized around the following functions.
Download and extract a ZIP file from the specified URL.

Parameters:
- `base_url` (str): The base URL for the data files.
- `target_folder` (str): The folder where the extracted files should be saved.
- `month` (str): The month for which data should be downloaded.
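For reference, a minimal sketch of what such a download helper can look like, assuming the archives are addressed as `<base_url><month>.zip`; the function name and URL scheme here are illustrative rather than taken from the script:

```python
import io
import logging
import os
import zipfile

import requests

def download_and_extract(base_url, target_folder, month):
    """Download one month's ZIP archive and extract it into target_folder."""
    url = f"{base_url}{month}.zip"  # assumed naming scheme
    os.makedirs(target_folder, exist_ok=True)
    try:
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
            archive.extractall(target_folder)
        logging.info("Downloaded and extracted %s", url)
    except (requests.RequestException, zipfile.BadZipFile) as exc:
        logging.error("Failed to download %s: %s", url, exc)
```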
Load and preprocess data from the specified folder.

Parameters:
- `train_folder` (str): The folder containing the CSV files to be processed.
- `filters` (dict): A dictionary of filters to apply to the data.
- `output_file_path` (str): The file path where the processed data should be saved.
- `exclude_columns` (list): A list of columns to exclude from the data.
- `delimiter` (str): The delimiter used in the CSV files.

Returns:
- `str`: The path to the saved Parquet file, or `None` if an error occurred.
Get a list of CSV files in the specified folder.

Parameters:
- `folder` (str): The folder to search for CSV files.

Returns:
- `list`: A list of file paths to the CSV files in the folder.
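Such a helper can be a one-liner with `glob` (the function name is assumed):

```python
import glob
import os

def get_csv_files(folder):
    """Return the paths of all CSV files directly inside the folder."""
    return glob.glob(os.path.join(folder, '*.csv'))
```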
Load data from CSV files into a Dask DataFrame.

Parameters:
- `data_files` (list): A list of file paths to the CSV files.
- `delimiter` (str): The delimiter used in the CSV files.

Returns:
- `dask.dataframe.DataFrame`: The loaded data, or `None` if an error occurred.
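A minimal sketch using `dask.dataframe.read_csv`; forcing string dtypes (to sidestep mixed-type inference across partitions) and the semicolon default for `delimiter` are assumptions:

```python
import logging

import dask.dataframe as dd

def load_data(data_files, delimiter=';'):
    """Read the CSV files into one Dask DataFrame, or return None on failure."""
    try:
        # dtype=str avoids per-partition dtype inference conflicts; columns
        # are converted to proper types later in the pipeline
        return dd.read_csv(data_files, delimiter=delimiter, dtype=str)
    except Exception as exc:
        logging.error("Failed to load CSV files: %s", exc)
        return None
```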
Exclude specified columns from the data.

Parameters:
- `data` (dask.dataframe.DataFrame): The data from which columns should be excluded.
- `exclude_columns` (list): A list of columns to exclude.

Returns:
- `dask.dataframe.DataFrame`: The data with the specified columns excluded, or `None` if an error occurred.
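The exclusion step plausibly reduces to a `drop` call; the function name and the `errors='ignore'` behavior (skip columns that are absent) are assumptions:

```python
import logging

def exclude_columns(data, columns):
    """Drop the given columns from the Dask DataFrame."""
    try:
        return data.drop(columns=columns, errors='ignore')
    except Exception as exc:
        logging.error("Failed to exclude columns: %s", exc)
        return None
```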
Encode all columns to int64 using Dask's parallel processing.

Parameters:
- `data` (dask.dataframe.DataFrame): The data to be encoded.

Returns:
- `dask.dataframe.DataFrame`: The encoded data, or `None` if an error occurred.
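One way to do this in parallel is to categorize the object columns across partitions and then replace each with its integer category codes; this sketch assumes that approach rather than, for example, Dask-ML's encoders:

```python
def encode_columns(data):
    """Encode every object column of a Dask DataFrame as int64 category codes."""
    object_cols = [col for col in data.columns if data[col].dtype == 'object']
    # categorize() scans the partitions in parallel to build the category sets
    data = data.categorize(columns=object_cols)
    for col in object_cols:
        data[col] = data[col].cat.codes.astype('int64')
    return data
```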
Preprocess and save data to a Parquet file.

Parameters:
- `data` (dask.dataframe.DataFrame): The data to be processed.
- `filters` (dict): A dictionary of filters to apply to the data.
- `exclude_columns` (list): A list of columns to exclude from the data.
- `output_file_path` (str): The file path where the processed data should be saved.

Returns:
- `str`: The path to the saved Parquet file, or `None` if an error occurred.
Apply filters to the data.

Parameters:
- `data` (dask.dataframe.DataFrame): The data to be filtered.
- `filters` (dict): A dictionary of filters to apply.

Returns:
- `dask.dataframe.DataFrame`: The filtered data, or `None` if an error occurred.
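Given the shape of `TRAIN_FILTERS` above (a column name mapped to a list of allowed values), the filter step plausibly reduces to a chain of `isin` masks:

```python
def apply_filters(data, filters):
    """Keep only rows whose values appear in each column's allow-list."""
    for column, allowed_values in filters.items():
        data = data[data[column].isin(allowed_values)]
    return data
```

With the default `TRAIN_FILTERS`, this keeps only rows for the listed InterCity lines.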
Preprocess the data by filling missing values and inferring object types.

Parameters:
- `data` (dask.dataframe.DataFrame): The data to be preprocessed.

Returns:
- `dask.dataframe.DataFrame`: The preprocessed data, or `None` if an error occurred.
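A minimal sketch; the fill value is an assumption, and `infer_objects` is applied per partition via `map_partitions` since it is a pandas method:

```python
def preprocess_data(data):
    """Fill missing values and re-infer object column types."""
    data = data.fillna('UNKNOWN')  # assumed placeholder for missing values
    return data.map_partitions(lambda df: df.infer_objects())
```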
Calculate time differences between specified columns.

Parameters:
- `data` (dask.dataframe.DataFrame): The data for which time differences should be calculated.

Returns:
- `dask.dataframe.DataFrame`: The data with calculated time differences, or `None` if an error occurred.
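A sketch under the assumption that the standard ist-daten column names are used (`ANKUNFTSZEIT` for the scheduled arrival, `AN_PROGNOSE` for the forecast arrival); the timestamp formats and the delay column name are likewise assumptions:

```python
import dask.dataframe as dd

def calculate_time_differences(data):
    """Add an arrival-delay column in seconds (column names assumed)."""
    # scheduled times carry minute precision, forecasts second precision
    scheduled = dd.to_datetime(data['ANKUNFTSZEIT'], format='%d.%m.%Y %H:%M', errors='coerce')
    forecast = dd.to_datetime(data['AN_PROGNOSE'], format='%d.%m.%Y %H:%M:%S', errors='coerce')
    data['AN_DELAY_SECONDS'] = (forecast - scheduled).dt.total_seconds()
    return data
```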
Save the processed data to a Parquet file.

Parameters:
- `data` (dask.dataframe.DataFrame): The processed data to be saved.
- `output_file_path` (str): The file path where the data should be saved.

Returns:
- `str`: The path to the saved Parquet file, or `None` if an error occurred.
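Because `TRAIN_OUTPUT_FILE_PATH` points at a single `.parquet` file, the save step presumably computes the Dask DataFrame down to pandas before writing; a sketch under that assumption:

```python
import logging

def save_data(data, output_file_path):
    """Write the processed data to a single Parquet file via pyarrow."""
    try:
        # compute() materializes to pandas so one file (not a directory
        # of part files) is written
        data.compute().to_parquet(output_file_path, engine='pyarrow')
        return output_file_path
    except Exception as exc:
        logging.error("Failed to save Parquet file: %s", exc)
        return None
```

For data that does not fit in memory, calling `to_parquet()` on the Dask DataFrame itself would instead write a directory of partitioned files.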
- Configure Logging: Set up logging to capture the process details.
- Define Constants: Set the base URL, folder paths, and other constants.
- Download Data: Download and extract data files if needed.
- Load and Preprocess Data: Load CSV files into a Dask DataFrame, apply filters, exclude columns, preprocess data, calculate time differences, and encode categorical columns.
- Save Processed Data: Save the processed data to a Parquet file.
- Train Model: Load the processed data, split it into training and testing sets, train a RandomForest model, and evaluate its accuracy (a minimal sketch follows below).
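A sketch of that final step, assuming a classification target; `target_column` is the placeholder discussed in the notes below:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_model(parquet_path, target_column):
    """Train a RandomForest classifier and report its test accuracy."""
    data = pd.read_parquet(parquet_path)
    X = data.drop(columns=[target_column])
    y = data[target_column]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
    return model
```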
Logs are saved to `dat503.log` with detailed information about each step, including any errors encountered.
- Ensure that the `target_column` in the `dat503.py` script is replaced with the actual target column name from your dataset.
- Adjust the filters and excluded columns as needed based on your specific requirements.
This project is licensed under the MIT License.