Crawl film information from Ohitv.info and incrementally load updates daily into Postgres and MongoDB for analysis and visualization.
- Docker: Manages services like Airflow, Minio, Postgres, and MongoDB.
- Airflow: Orchestrates the ETL pipeline.
- Minio: Object storage for raw and processed data.
- Postgres: Relational database to store processed data.
- MongoDB: NoSQL database for testing queries and learning.
- PowerBI: Visualizes insights from the data.
- Python: Language used for scripting, with the libraries `BeautifulSoup`, `pandas`, `numpy`, and `requests`.
- Crawl film data from Ohitv.info using `requests` and `BeautifulSoup` (see the crawl sketch after this list).
- Save the raw data to Minio under the bucket `ohitv-raw` (upload sketch below).
- Use Python to clean and transform the data (see the transform sketch below).
- Store the transformed data in Minio under the bucket `ohitv-processed`.
- Load transformed data into Postgres for visualization and analysis with PowerBI (see the load sketch below).
- Load data into MongoDB to test and learn NoSQL querying.
- Use PowerBI to visualize and report insights from the data stored in Postgres.
- Use Apache Airflow to manage and orchestrate the ETL pipeline, which runs daily (a minimal DAG sketch is included below).
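
A minimal sketch of the crawl step, assuming a hypothetical entry URL and CSS selectors (`article.item`, the `h3` title tag); the real paths and selectors depend on Ohitv.info's current markup:

```python
import requests
from bs4 import BeautifulSoup

def crawl_films(url="https://ohitv.info/"):  # hypothetical entry page
    """Fetch one listing page and extract basic film fields."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    films = []
    # Hypothetical selectors -- adjust them to the site's actual HTML.
    for item in soup.select("article.item"):
        title = item.select_one("h3")
        link = item.select_one("a")
        films.append({
            "title": title.get_text(strip=True) if title else None,
            "url": link.get("href") if link else None,
        })
    return films
```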
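The raw result can then be written to the `ohitv-raw` bucket; a sketch using the `minio` client, assuming the default Docker endpoint `localhost:9000` and the access/secret keys generated in the setup steps below:

```python
import json
from io import BytesIO

from minio import Minio

def save_raw(films, access_key, secret_key):
    """Upload the crawl result as a JSON object to the ohitv-raw bucket."""
    client = Minio("localhost:9000", access_key=access_key,
                   secret_key=secret_key, secure=False)
    if not client.bucket_exists("ohitv-raw"):
        client.make_bucket("ohitv-raw")

    payload = json.dumps(films, ensure_ascii=False).encode("utf-8")
    client.put_object("ohitv-raw", "films.json", BytesIO(payload),
                      length=len(payload), content_type="application/json")
```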
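A sketch of the transform step with `pandas`, reading the raw object written above, cleaning it, and writing a CSV to `ohitv-processed`; the cleaning rules are illustrative:

```python
import json
from io import BytesIO

import pandas as pd
from minio import Minio

def transform(client: Minio):
    """Clean the raw crawl and store the result in ohitv-processed."""
    obj = client.get_object("ohitv-raw", "films.json")
    try:
        df = pd.DataFrame(json.loads(obj.read()))
    finally:
        obj.close()
        obj.release_conn()

    # Illustrative cleaning: drop rows without a title, de-duplicate by URL.
    df = df.dropna(subset=["title"]).drop_duplicates(subset=["url"])

    payload = df.to_csv(index=False).encode("utf-8")
    if not client.bucket_exists("ohitv-processed"):
        client.make_bucket("ohitv-processed")
    client.put_object("ohitv-processed", "films.csv", BytesIO(payload),
                      length=len(payload), content_type="text/csv")
```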
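A sketch of the two load targets, assuming the services listen on their default ports with the credentials from `keys.json`; the database, table, and collection names are illustrative:

```python
import pandas as pd
from pymongo import MongoClient
from sqlalchemy import create_engine

def load(df: pd.DataFrame):
    """Write the processed films to Postgres and MongoDB."""
    # Postgres: full refresh here; an append/upsert keyed on the film URL
    # would be the incremental alternative for the daily run.
    engine = create_engine("postgresql://airflow:airflow@localhost:5432/airflow")
    df.to_sql("films", engine, if_exists="replace", index=False)

    # MongoDB: the same records, for practicing NoSQL queries.
    mongo = MongoClient("mongodb://admin:admin@localhost:27017/")
    mongo["ohitv"]["films"].insert_many(df.to_dict("records"))
```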
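A minimal Airflow DAG sketch wiring the steps into a daily schedule; the DAG id, start date, and the `etl_tasks` module holding the callables above are all hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module collecting the crawl/transform/load callables above.
from etl_tasks import crawl_task, transform_task, load_task

with DAG(
    dag_id="ohitv_etl",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    crawl = PythonOperator(task_id="crawl", python_callable=crawl_task)
    transform = PythonOperator(task_id="transform", python_callable=transform_task)
    load = PythonOperator(task_id="load", python_callable=load_task)

    crawl >> transform >> load
```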
- Follow the instructions to install Docker from the Docker website.
- On Windows, run the `docker_compose.bat` file.
- On Linux or macOS, run the `docker_compose.sh` file.
- Open Minio in your browser at `localhost:9001`.
- Log in with:
  - Username: `admin12345`
  - Password: `admin12345`
- Create Access Keys:
- Navigate to "Access Keys" in the left menu and generate new access and secret keys.
- Make sure to store these keys securely.
- Create a `keys.json` file in the `plugins` directory with the following content:
```json
{
  "access_key": "replace your access_key",
  "secret_key": "replace your secret_key",
  "mongodb_user": "admin",
  "mongodb_password": "admin",
  "postgres_user": "airflow",
  "postgres_password": "airflow"
}
```
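The pipeline code can then load these credentials at runtime; a sketch, assuming the `plugins` directory is mounted at the usual `/opt/airflow/plugins` path inside the containers:

```python
import json
from pathlib import Path

# Assumed mount point for the plugins directory inside the Airflow containers.
KEYS_PATH = Path("/opt/airflow/plugins/keys.json")

with KEYS_PATH.open() as f:
    keys = json.load(f)

access_key = keys["access_key"]
secret_key = keys["secret_key"]
```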
- Open Airflow in your browser at `localhost:8080`.
- Run the pipeline: click the `Run` button on the DAG to execute the workflow.

