This repository provides technical training on data lake technologies for IoT data management. It focuses on hands-on practice through notebooks that you can run, modify, and extend yourself. Feel free to experiment with your own data and settings. Contributions and pull requests are highly welcome!
Learn practical skills for IoT data management and analytics:
- Apache Parquet file formats, encodings, and compression strategies
- Apache Iceberg table operations, schema evolution, and time travel
- Data lake ecosystem tools, integration patterns, and the medallion architecture
- Performance optimization techniques for large-scale data processing
- Module 1: Apache Parquet
- Module 2: Apache Iceberg
- Module 3: Apache Spark and the medallion architecture
Note: Modules 2 and 3 are currently work in progress.
The training is aimed at people with some development background and technical interest. The following technology stack is used:
- Python - Core programming language
- Jupyter Lab - Interactive development environment
- Apache Parquet - Columnar storage format
- Apache Iceberg - Table format for data lakes
- PyArrow - Fast columnar data processing
- Pandas - Data manipulation and analysis
- Daft - Distributed dataframe library for large-scale data processing
- Plotly - Interactive visualizations
You do not need to know the entire stack, but some level of Python and SQL knowledge is required. It is also helpful to have a basic understanding of Jupyter Notebooks. If you are a VS Code user, there are extensions for both notebooks and Python.
Since different programming languages use different optimization heuristics when writing Parquet files, examples in other languages are also being developed.
If you haven't already, download and install Python. There are several options to use the notebooks:
- From Visual Studio Code.
- From Jupyter Lab.
To use the notebooks directly from Visual Studio Code:
- Install the Python and Jupyter extensions from Microsoft in Visual Studio Code.
- Run "Python: Create Environment" and make sure to check the "requirements.txt" box so that the dependencies are installed as well.
- Open 01_parquet/00_from_json_to_sql.ipynb to begin.
- Click "Run all" to run the notebook. You may be asked to select a runtime environment. Select the ".venv" environment that you just created.
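If you prefer the terminal, a rough equivalent of VS Code's "Python: Create Environment" step looks like this (a sketch; run it from the repository root so requirements.txt resolves):

```shell
python -m venv .venv                        # create a local virtual environment
. .venv/bin/activate                        # activate it (Windows: .venv\Scripts\activate)
python -m pip install -r requirements.txt   # install the course dependencies
```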
To use the notebooks through Jupyter Lab: install the requirements and launch Jupyter Lab, then open notebooks/01_getting_started.ipynb to begin.
```shell
pip install -r requirements.txt
jupyter lab
```

Please remember to clear the output in your notebook before committing it.
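One way to clear outputs from the command line is `jupyter nbconvert` (the notebook path below is just an example; point it at whichever notebook you changed):

```shell
# Strip all cell outputs in place before committing
jupyter nbconvert --clear-output --inplace 01_parquet/00_from_json_to_sql.ipynb
```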
While creating the course, I used Copilot and Claude Code to review the curriculum for correctness. Parts of the notebooks and helper code are generated.