This repository contains an Apache Airflow project for car price prediction. The project is structured with two main folders: dags and modules.
- hw_dag.py: The main Directed Acyclic Graph (DAG) file for the car price prediction project. It defines the workflow of the tasks using Apache Airflow's PythonOperator.
- pipeline.py: Contains the data processing pipeline, including functions for filtering data, removing outliers, and creating features. It also defines the machine learning pipeline using scikit-learn with models such as Logistic Regression, Random Forest Classifier, and Support Vector Classifier (SVC).
- predict.py: Implements the prediction task using the latest trained model. It loads the most recent model from the data/models directory and makes predictions on the test data.
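
The "most recent model" lookup described for predict.py could be sketched as below. This is a minimal illustration, not the project's actual code: the function name find_latest_model, the *.pkl file pattern, and the use of file modification time to decide which model is newest are all assumptions.

```python
from pathlib import Path
from typing import Optional


def find_latest_model(models_dir: str = "data/models",
                      pattern: str = "*.pkl") -> Optional[Path]:
    """Return the most recently modified model file, or None if there is none.

    Hypothetical helper: the "*.pkl" pattern and the modification-time
    ordering are assumptions, not taken from predict.py.
    """
    # Collect regular files in the models directory that match the pattern.
    candidates = [p for p in Path(models_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    # The newest file is the one with the largest modification timestamp.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Since dill is listed among the dependencies, the returned path would presumably be opened in binary mode and passed to dill.load to restore the trained pipeline before predicting.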
- Apache Airflow installed
- Python 3.x
- Clone the repository:
  git clone [repository_url]
  cd [repository_directory]
- Install dependencies:
  pip install -r requirements.txt
- Set up Airflow:
  - Ensure the AIRFLOW_HOME environment variable is set to the path where Airflow should store its configuration.
  - Initialize the Airflow database:
    airflow db init
- Start the Airflow web server:
  airflow webserver
- Start the Airflow scheduler in a new terminal:
  airflow scheduler
- Access the Airflow web UI in your browser (default: http://localhost:8080/).
- Enable the car_price_prediction DAG from the web UI.
- Trigger the DAG manually or let it run based on the specified schedule.
- Apache Airflow
- pandas
- scikit-learn
- dill
The DAG is scheduled to run daily at 15:00 UTC.
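
To make the schedule concrete: "daily at 15:00 UTC" corresponds to the cron expression 0 15 * * *. The sketch below computes the next wall-clock trigger time using only the standard library. The names SCHEDULE_CRON and next_run_utc are illustrative and do not come from hw_dag.py, and the DAG file may express the schedule in a different form.

```python
from datetime import datetime, time, timedelta, timezone

# Cron form of "daily at 15:00 UTC" (assumed; hw_dag.py may spell it differently).
SCHEDULE_CRON = "0 15 * * *"


def next_run_utc(now: datetime) -> datetime:
    """Return the next 15:00 UTC strictly after the given aware datetime."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now = now.astimezone(timezone.utc)
    # Today's 15:00 UTC; if that moment has already passed, move to tomorrow.
    candidate = datetime.combine(now.date(), time(15, 0), tzinfo=timezone.utc)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

For example, calling next_run_utc(datetime.now(timezone.utc)) gives the time at which the scheduler would next be expected to start a run, assuming the DAG is enabled.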
Roman Kovalenko