
Data Processing with Dask

Requirements

  • Python 3.7+
  • Dask
  • Dask-ML
  • scikit-learn
  • pandas
  • pyarrow
  • tqdm
  • requests

Installation

  1. Clone the repository:

    git clone https://github.com/1naboki1/dat503.git
    cd dat503
  2. Install the required packages:

    pip install -r requirements.txt

Usage

  1. Configure the logging settings in dat503.py:

    def configure_logging():
        """Configure logging settings."""
        logging.basicConfig(
            filename='dat503.log',
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s - %(filename)s:%(lineno)d',
            filemode='w'  # Overwrite the log file on each run
        )
        logging.captureWarnings(True)
  2. Set the constants in dat503.py:

    BASE_URL = "https://opentransportdata.swiss/wp-content/uploads/ist-daten-archive/"
    TRAIN_FOLDER = os.path.join(os.path.dirname(__file__), 'data', 'train')
    FORCE_DOWNLOAD = False  # Set to True to force a fresh download of the data
    NUM_MONTHS = 3  # Number of months to download
    TRAIN_FILTERS = {'LINIEN_TEXT': ['IC2', 'IC3', 'IC5', 'IC6', 'IC8', 'IC21']}
    TRAIN_OUTPUT_FILE_PATH = os.path.join(TRAIN_FOLDER, 'working', 'processed_data.parquet')
    TRAIN_EXCLUDE_COLUMNS = ['PRODUKT_ID', 'BETREIBER_NAME', 'BETREIBER_ID', 'UMLAUF_ID', 'VERKEHRSMITTEL_TEXT', 'AN_PROGNOSE_STATUS', 'AB_PROGNOSE_STATUS', 'HALTESTELLEN_NAME']
  3. Run the data processing script:

    python dat503.py

Functions Overview

download_extract(base_url, target_folder, month)

Download and extract a ZIP file from the specified URL.

Parameters:

  • base_url (str): The base URL for the data files.
  • target_folder (str): The folder where the extracted files should be saved.
  • month (str): The month for which data should be downloaded.
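
A minimal sketch of what this function might look like, assuming the archive for each month is published as <base_url><month>.zip and using requests plus the standard-library zipfile module (the exact URL scheme is an assumption, not confirmed by the repository):

    import io
    import os
    import zipfile
    import requests

    def download_extract(base_url, target_folder, month):
        """Download one month's ZIP archive and extract it (illustrative sketch)."""
        url = f"{base_url}{month}.zip"  # assumed URL scheme
        os.makedirs(target_folder, exist_ok=True)
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
            archive.extractall(target_folder)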

load_and_preprocess_data(train_folder, filters, output_file_path, exclude_columns, delimiter=';')

Load and preprocess data from the specified folder.

Parameters:

  • train_folder (str): The folder containing the CSV files to be processed.
  • filters (dict): A dictionary of filters to apply to the data.
  • output_file_path (str): The file path where the processed data should be saved.
  • exclude_columns (list): A list of columns to exclude from the data.
  • delimiter (str): The delimiter used in the CSV files.

Returns:

  • str: The path to the saved Parquet file, or None if an error occurred.
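
For example, with the constants from the Usage section, a call might look like this (illustrative only):

    output_path = load_and_preprocess_data(
        train_folder=TRAIN_FOLDER,
        filters=TRAIN_FILTERS,
        output_file_path=TRAIN_OUTPUT_FILE_PATH,
        exclude_columns=TRAIN_EXCLUDE_COLUMNS,
        delimiter=';',
    )
    if output_path is None:
        logging.error("Preprocessing failed; see dat503.log for details.")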

_get_csv_files(folder)

Get a list of CSV files in the specified folder.

Parameters:

  • folder (str): The folder to search for CSV files.

Returns:

  • list: A list of file paths to the CSV files in the folder.
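
This likely reduces to a glob over the folder; a sketch, not necessarily the repository's actual code:

    import glob
    import os

    def _get_csv_files(folder):
        """Return the paths of all CSV files directly inside `folder`."""
        return glob.glob(os.path.join(folder, '*.csv'))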

_load_data(data_files, delimiter)

Load data from CSV files into a Dask DataFrame.

Parameters:

  • data_files (list): A list of file paths to the CSV files.
  • delimiter (str): The delimiter used in the CSV files.

Returns:

  • dask.dataframe.DataFrame: The loaded data, or None if an error occurred.
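
A minimal sketch using dask.dataframe.read_csv; reading every column as string to avoid mixed-type inference errors is an assumption, not a documented choice:

    import logging
    import dask.dataframe as dd

    def _load_data(data_files, delimiter):
        """Load CSV files into one Dask DataFrame, or return None on failure."""
        try:
            # dtype=str guards against partitions inferring conflicting dtypes
            return dd.read_csv(data_files, delimiter=delimiter, dtype=str)
        except Exception as exc:
            logging.error("Failed to load CSV files: %s", exc)
            return None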

_exclude_columns(data, exclude_columns)

Exclude specified columns from the data.

Parameters:

  • data (dask.dataframe.DataFrame): The data from which columns should be excluded.
  • exclude_columns (list): A list of columns to exclude.

Returns:

  • dask.dataframe.DataFrame: The data with specified columns excluded, or None if an error occurred.
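
This presumably wraps a single drop call; a sketch:

    import logging

    def _exclude_columns(data, exclude_columns):
        """Drop the given columns, ignoring any that are not present."""
        try:
            return data.drop(columns=exclude_columns, errors='ignore')
        except Exception as exc:
            logging.error("Failed to exclude columns: %s", exc)
            return None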

_encode_categorical_columns(data)

Encode categorical columns to int64 codes using Dask's parallel processing.

Parameters:

  • data (dask.dataframe.DataFrame): The data to be encoded.

Returns:

  • dask.dataframe.DataFrame: The encoded data, or None if an error occurred.
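
A common Dask pattern for this is categorize() followed by taking the category codes; the following is a sketch under that assumption:

    def _encode_categorical_columns(data):
        """Encode object columns as int64 category codes (illustrative sketch)."""
        # categorize() scans the partitions in parallel to build the category sets
        data = data.categorize()
        for column in data.select_dtypes(include='category').columns:
            data[column] = data[column].cat.codes.astype('int64')
        return data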

preprocess_and_save_data(data, filters, exclude_columns, output_file_path)

Preprocess and save data to a Parquet file.

Parameters:

  • data (dask.dataframe.DataFrame): The data to be processed.
  • filters (dict): A dictionary of filters to apply to the data.
  • exclude_columns (list): A list of columns to exclude from the data.
  • output_file_path (str): The file path where the processed data should be saved.

Returns:

  • str: The path to the saved Parquet file, or None if an error occurred.
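
Conceptually this chains the helpers documented below; a sketch of the likely control flow, stopping as soon as any step fails:

    def preprocess_and_save_data(data, filters, exclude_columns, output_file_path):
        """Filter, clean, encode, and persist the data (illustrative control flow)."""
        for step in (
            lambda d: _apply_filters(d, filters),
            lambda d: _exclude_columns(d, exclude_columns),
            _preprocess_data,
            _calculate_time_differences,
            _encode_categorical_columns,
        ):
            data = step(data)
            if data is None:
                return None
        return save_processed_data(data, output_file_path)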

_apply_filters(data, filters)

Apply filters to the data.

Parameters:

  • data (dask.dataframe.DataFrame): The data to be filtered.
  • filters (dict): A dictionary of filters to apply.

Returns:

  • dask.dataframe.DataFrame: The filtered data, or None if an error occurred.
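
Given the shape of TRAIN_FILTERS (a column name mapped to a list of allowed values), a sketch using isin:

    def _apply_filters(data, filters):
        """Keep only rows whose values appear in each column's allowed list."""
        for column, allowed_values in filters.items():
            data = data[data[column].isin(allowed_values)]
        return data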

_preprocess_data(data)

Preprocess the data by filling missing values and inferring appropriate dtypes for object columns.

Parameters:

  • data (dask.dataframe.DataFrame): The data to be preprocessed.

Returns:

  • dask.dataframe.DataFrame: The preprocessed data, or None if an error occurred.
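
The description suggests fillna plus per-partition dtype inference; a sketch under that reading (the sentinel fill value is an assumption):

    def _preprocess_data(data):
        """Fill missing values and infer tighter dtypes for object columns."""
        data = data.fillna('missing')  # assumed sentinel; the real code may differ
        # apply pandas' infer_objects to each partition
        return data.map_partitions(lambda df: df.infer_objects())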

_calculate_time_differences(data)

Calculate time differences between specified columns.

Parameters:

  • data (dask.dataframe.DataFrame): The data for which time differences should be calculated.

Returns:

  • dask.dataframe.DataFrame: The data with calculated time differences, or None if an error occurred.
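
The ist-daten schema pairs scheduled and forecast times (for example ANKUNFTSZEIT vs AN_PROGNOSE); a sketch computing arrival delay in minutes, where the column names and timestamp formats are assumptions:

    import dask.dataframe as dd

    def _calculate_time_differences(data):
        """Derive an arrival-delay column in minutes (illustrative sketch)."""
        scheduled = dd.to_datetime(data['ANKUNFTSZEIT'], format='%d.%m.%Y %H:%M', errors='coerce')
        actual = dd.to_datetime(data['AN_PROGNOSE'], format='%d.%m.%Y %H:%M:%S', errors='coerce')
        data['ARRIVAL_DELAY_MIN'] = (actual - scheduled).dt.total_seconds() / 60
        return data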

save_processed_data(data, output_file_path)

Save the processed data to a Parquet file.

Parameters:

  • data (dask.dataframe.DataFrame): The processed data to be saved.
  • output_file_path (str): The file path where the data should be saved.

Returns:

  • str: The path to the saved Parquet file, or None if an error occurred.
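
Presumably a thin wrapper around Dask's to_parquet with the pyarrow engine; a sketch:

    import logging

    def save_processed_data(data, output_file_path):
        """Write the DataFrame to Parquet and return the path, or None on error."""
        try:
            data.to_parquet(output_file_path, engine='pyarrow', write_index=False)
            return output_file_path
        except Exception as exc:
            logging.error("Failed to save Parquet output: %s", exc)
            return None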

Example Workflow

  1. Configure Logging: Set up logging to capture the process details.
  2. Define Constants: Set the base URL, folder paths, and other constants.
  3. Download Data: Download and extract data files if needed.
  4. Load and Preprocess Data: Load CSV files into a Dask DataFrame, apply filters, exclude columns, preprocess data, calculate time differences, and encode categorical columns.
  5. Save Processed Data: Save the processed data to a Parquet file.
  6. Train Model: Load the processed data, split it into training and testing sets, train a RandomForest model, and evaluate its accuracy (a sketch of this step follows below).
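
Step 6 is not shown in this README; a minimal sketch of how it might look with Dask-ML's train_test_split and scikit-learn, where 'target_column' is the placeholder mentioned under Notes and the hyperparameters are assumptions:

    import dask.dataframe as dd
    from dask_ml.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    data = dd.read_parquet(TRAIN_OUTPUT_FILE_PATH)
    X = data.drop(columns=['target_column'])  # placeholder target name
    y = data['target_column']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)

    model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    model.fit(X_train.compute(), y_train.compute())  # materialize for scikit-learn
    accuracy = accuracy_score(y_test.compute(), model.predict(X_test.compute()))
    print(f"Test accuracy: {accuracy:.3f}")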

Logging

Logs are saved to dat503.log with detailed information about each step, including any errors encountered.

Notes

  • Replace target_column in dat503.py with the actual target column name from your dataset.
  • Adjust the filters and excluded columns as needed based on your specific requirements.

License

This project is licensed under the MIT License.
