ODaX is an open-source data analytics platform designed for the analysis of open data. It serves as a tool to demonstrate to clients the added value of data and how easily and efficiently valuable insights can be extracted from it.
Install wsl (with the Ubuntu distribution) on your Windows dev machine or use Ubuntu instead. Make sure that you do a restart after you installed wsl.
- To generate an ssh-key use
ssh-keygen -t ecdsain the .ssh folder of your Windows user account, e.g.,C:\Users\{USER}\.ssh. - Then open the Windows explorer and type
\\wsl.localhost\Ubuntu\home\{USER}in the explorer. If there is not already a folder.ssh, create one. Then copy the generated ssh keys fromC:\Users\{USER}\.sshto the.sshfolder in the wsl file system. - Please add your public ssh key to your GitHub account.
- Install pycharm and run it as admin.
- As you have previously installed wsl and restarted the computer, there should be an Ubuntu terminal based on wsl available in your Pycharm. If so, please open this terminal. If not, please check your wsl installation.
- You can check with
pwdif the selected directory in your Ubuntu terminal session is/home/{USER}. If not, navigate there and perform the following steps.
- Clone the repository https://github.com/AkrosAG/ODaX-Open-Data-eXplorer via git in the wsl filesystem, e.g., under
/home/{USER}. - Checkout the branch
develop. - Delete the folder
.vitualenvsin/home/{USER}if it exists. - Navigate to the root directory of your project
/home/{USER}/ODaX-Open-Data-eXplorer. - Create a virtual environment via
python3.11 -m venv venv. At the moment, apache superset requires python 3.11. - If step 4 does not work, you may need to install the venv extension for your python, i.e., by
sudo apt install python3.X-venv. After the installation, try again to create a virtual environment. - Using
wsl, activate the virtual environment bysource venv/bin/activate. - Update pip via
pip install --upgrade pip - Install
poetryin the virtual environment viapip install poetry. - Install the dependencies via
poetry install --no-root. - Go to the Pycharm
Settings, then theProject: ODaX-Open-Data-eXplorer, and then theProject Interpreter. - Click on
Add interpreterand selectOn WSL. - Add an
existingpython interpreter and select the python in the previously created venv/home/{USER}/ODaX-Open-Data-eXplorer/venv/bin/python3.X. - Run the python scripts from the root directory in the terminal or use the Pycharm
Run/Debugfunctionality. Remember to set theWorking directoryin theRun/ Debug configurationto the project root directory. - Side note: If you want to run a Python file as a Jupyter Notebook, you can right-click on the file and select
Convert to Jupyter Notebook.
In WSL, the linux distribution Ubuntu is emulated. Please follow the installation steps for Ubuntu documented at https://podman.io/docs/installation. Mainly, sudo apt -y install podman needs to be executed.
For using superset, several Linux packages need to be added. Please follow the installation steps documented at https://superset.apache.org/docs/installation/pypi for the Ubuntu 24.04 LTS distribution.
Please follow the installation steps documented at https://www.postgresql.org/download/linux/ubuntu/ to install postgresql in your distribution packages.
Mainly, the command sudo apt install postgresql that needs to be executed.
sudo apt install libpq-dev
poetry add psycopg2
Please make sure to get an API key from the website https://home.openweathermap.org/ for receiving the current air quality data via a REST API. Optionally, if you want to use the AirIQ air quality data, you can request an API key from https://www.iqair.com/ . Create an .env file in the project root and add the API keys as values of the following variables:
APIKeyOpenWeatherMap = ""
APIKeyAirIQ = ''
Due to an error when using superset which was based on a bug in the marshmellow package 4.0.0, we have downgraded the marshmellow package to the previous version. This may get obsolete in the future.
In case, the execution of a bash script fails, please execute the following command:
sudo apt install dos2unix and next dos2unix BASH-Script.sh. It ensures, that the line endings are correct and an execution on a Linux system is possible.
ODaX follows a modular architecture designed for flexibility and extensibility in data analysis workflows. The architecture consists of the following key components:
-
Data Import Layer: Specialized modules in the
impingpackage handle importing data from various sources, including CSV files, Excel files, and external APIs. Each data source has its own dedicated module with specific functions for data retrieval and initial processing. -
Data Processing Layer: Once imported, data is processed using pandas DataFrames for manipulation, transformation, and analysis. This layer includes functions for coordinate transformations, data interpolation, and statistical calculations.
-
Analysis Layer: This layer combines data from different sources to extract insights. It includes functions for correlating data (e.g., air quality with health insurance fees) and performing statistical analyses.
-
Visualization Layer: The project uses Apache Superset for creating interactive visualizations and dashboards to present the analysis results.
-
Notebook Interface: Jupyter notebooks provide an interactive environment for data exploration and analysis, allowing users to combine code, visualizations, and documentation.
The architecture is designed to be modular, allowing new data sources to be added easily by creating new modules in the imping package.
ODaX is organized into the following main modules:
Contains raw and processed data files used by the application:
healthinsurance/: Health insurance data from the Swiss Federal Office of Public Health (BAG)nabel/: Air quality data from the National Air Pollution Monitoring Network (NABEL)
Modules for importing and processing health insurance data:
lib_healthinsurance.py: Functions for loading health insurance data, retrieving fee information, and mapping municipalities to fee regions
Modules for importing and processing meteorological data from MeteoSwiss:
getStations.py: Functions for retrieving station information from MeteoSwiss
Modules for importing and processing air quality data:
lib_openweathermap.py: Functions for retrieving current air quality data from the OpenWeatherMap APIlib_geocoordinates.py: Functions for coordinate transformations (Swiss LV95 to WGS84) and spatial interpolation of air quality data
run_openweathermap.py: Script for fetching air quality data from OpenWeatherMap APIrun_GeocoordinateTransformation.py: Script for transforming Swiss coordinates to WGS84airquality_healthinsurancefees.py: Comprehensive analysis script that combines air quality and health insurance data
airquality_healthinsurancefees.ipynb: Interactive notebook version of the analysis script for exploring the relationship between air quality and health insurance fees
- Add the configuration to the
.envfile and set the SUPERSET_ADMIN_PASSWORD to a custom one of your choice:
SUPERSET_PORT=8088
SUPERSET_SECRET_KEY=q2l3r6YHD0l8C6p7xzFh4WyiU6k7uHs1y+V5rQ0CgLg=
SUPERSET_ADMIN_USERNAME=admin
SUPERSET_ADMIN_PASSWORD={PASSWORD}
[email protected]
SUPERSET_CONTAINER_NAME=odax-superset
- By following the set setup 2) in "Database setup and seeding", you also create a superset instance in your podman container.
- To the existing .env file, please add the following lines and choose a password for postgres.
CONTAINER_NAME="odax-pg"
POSTGRES_VERSION="15"
POSTGRES_PASSWORD="{PASSWORD}"
POSTGRES_DB="odax_test"
HOST_PORT="5433"
VOLUME_NAME="pgdata_odax"
- Next, create the database setup by executing the following from the root directory:
./deployment/database/setup_odax_health.sh
./deployment/database/setup_odax_airpollution.sh
In case, you get an error about the encoding used, follow these steps:
sudo apt-get update && sudo apt-get install -y dos2unix
dos2unix deployment/database/setup_odax_health.sh
dos2unix deployment/database/setup_odax_airpollution.sh
- Next, seed the database by executing the following scripts within your virtual environment:
python deployment/database/seed_airpollution_scheme.py
python deployment/database/seed_healthinsurance_scheme.py
The Professional version of Pycharm offers the functionality to visualize the structure and content of a database.
- Determine your internal WSL ip address via the command:
ip addr show eth0
- Open the Database action in Pycharm and select
Newand click onData Sourcenext. Then, selectPostgreSQLand againPostgreSQL. - In the configuration insert the previously determined ip address as a host.
- Fill the postgres port and password defined in your
.envfile. The use istpostgres. - The URL is a connection string like
jdbc:postgresql://{IP}:{PORT}/{DATABASE}where {IP} is the previously determined ip address, {PORT} is the port defined in the.envfile, and {DATABASE} is also defined in the.envfile. - Test the connection by clicking on
Test connection. If the connection is fine, click onOK. - Refresh the connections to load and visualize the content of your containers databases.