Welcome to my GitHub page! My name is Kasia, and I am a Data Science student at the Warsaw University of Technology.
Focus areas
- Machine Learning, AutoML
- Data Visualization and Analysis
- Interpretability of ML/Deep Learning Models
- Data Preprocessing
- Big Data, Data Warehousing
- Statistics, Optimization Methods
Here you can read more about my most recent projects.
AutoPrep - AutoML open-source Python package
AutoPrep is an open-source Python package, published on PyPI, for automated machine learning with advanced data preprocessing techniques and model explainability methods. Its core idea is to build k independent pipelines, each with a different set of preprocessing methods, train models on each of them, and select the 3 best-performing pipelines. It then generates a comprehensive report on the preprocessing methods used in the winning pipelines.
Key features of the package are:
- Automated data cleaning and preprocessing - scaling, encoding, missing data imputation, detection of highly correlated features, detection of zero-variance and unique-valued features
- Intelligent feature type detection to assess the task type - binary/multiclass classification and regression
- Feature selection and dimensionality reduction - PCA, UMAP, VIF, RF Feature importance, Correlation thresholds
- Automated model training - KNN, Linear/Logistic Regression, SVM/SVR, Decision Tree, Random Forest, GBM, Bayesian Ridge
- Hyperparameter tuning - random search with cross-validation
- Generation of a LaTeX report with numerical scores, descriptions of the preprocessing techniques, plots and SHAP interpretability metrics
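The "k pipelines, keep the best 3" idea described above can be sketched in plain Python. This is only an illustration, not AutoPrep's API: the step pool, the `run_pipelines` helper, the leave-one-out 1-nearest-neighbour scorer and the toy data are all assumptions made for this example.

```python
import random
from statistics import mean, pstdev

def min_max(col):
    lo, hi = min(col), max(col)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in col]

def standardize(col):
    mu, sd = mean(col), pstdev(col)
    return [0.0 if sd == 0 else (v - mu) / sd for v in col]

def drop_zero_variance(rows):
    # Keep only the columns that take more than one distinct value.
    keep = [j for j, col in enumerate(zip(*rows)) if len(set(col)) > 1]
    return [[row[j] for j in keep] for row in rows]

def scale_with(fn):
    # Turn a per-column scaler into a step that works on a list of rows.
    # For brevity the scaler sees all rows; a real pipeline would fit it
    # on the training folds only.
    def step(rows):
        if not rows or not rows[0]:
            return rows
        cols = [fn(list(col)) for col in zip(*rows)]
        return [list(row) for row in zip(*cols)]
    return step

# Hypothetical pool of preprocessing steps (applied in this order when chosen).
STEP_POOL = {
    "drop_zero_variance": drop_zero_variance,
    "min_max": scale_with(min_max),
    "standardize": scale_with(standardize),
}

def loo_1nn_accuracy(rows, labels):
    # Leave-one-out accuracy of a 1-nearest-neighbour classifier.
    hits = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, rows[j])),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(rows)

def run_pipelines(rows, labels, k=6, top_n=3, seed=0):
    # Build k pipelines from random subsets of the pool, score each, keep the best.
    rng = random.Random(seed)
    results = []
    for _ in range(k):
        names = [name for name in STEP_POOL if rng.random() < 0.5]
        data = rows
        for name in names:
            data = STEP_POOL[name](data)
        results.append((loo_1nn_accuracy(data, labels), names))
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:top_n]

# Toy data: the first feature separates the classes, the third is constant.
rows = [[1.0, 200.0, 5.0], [1.2, 180.0, 5.0], [0.9, 220.0, 5.0],
        [3.0, 210.0, 5.0], [3.2, 190.0, 5.0], [2.9, 205.0, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
top3 = run_pipelines(rows, labels)
for score, names in top3:
    print(f"{score:.2f}", names or ["no preprocessing"])
```

In the package itself the candidate models and the search are richer (several model families, random search with cross-validation); the sketch only shows the ranking of independently preprocessed pipelines.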
ARSA ML - Rashomon Set analysis open-source Python package
ARSA ML is an open-source Python package for automated analysis of the Rashomon Set (a set of models that achieve similarly high scores yet can have different characteristics) and of the predictive multiplicity problem (a situation in which multiple models produce conflicting predictions for the same observations). The package is published on PyPI, along with a demo Streamlit application, ARSA ML Website, which allows experimenting with the package features without writing code. Key features of the package are:
- Compatibility with AutoGluon and H2O AutoML frameworks
- Pipelines for automated construction of the Rashomon Set from a user-provided dataset and trained models
- Detailed Streamlit dashboard allowing analysis of the predictive multiplicity metrics and model characteristics
- Rashomon Intersection - a novel approach that evaluates similarly well-performing models based on two evaluation metrics instead of just one
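To make these notions concrete, here is a minimal sketch in plain Python. It does not use ARSA ML's API: the function names, the tolerance `epsilon`, the model names and all scores are made up for illustration. Treating the Rashomon Intersection as the overlap of two single-metric Rashomon Sets is an assumption based on the description above. Ambiguity is one common predictive multiplicity metric: the share of observations on which at least one model in the set disagrees with a chosen base model.

```python
def rashomon_set(scores, epsilon):
    """Names of models whose score is within epsilon of the best (higher is better)."""
    best = max(scores.values())
    return {name for name, score in scores.items() if score >= best - epsilon}

def rashomon_intersection(scores_a, scores_b, epsilon):
    """Models that are near-optimal on both metrics (assumed reading of the idea)."""
    return rashomon_set(scores_a, epsilon) & rashomon_set(scores_b, epsilon)

def ambiguity(predictions, base, members):
    """Share of observations where any member disagrees with the base model."""
    base_preds = predictions[base]
    conflicts = sum(
        any(predictions[m][i] != p for m in members)
        for i, p in enumerate(base_preds)
    )
    return conflicts / len(base_preds)

# Hypothetical validation scores for four trained models.
accuracy = {"gbm": 0.91, "rf": 0.90, "svm": 0.89, "knn": 0.80}
f1 = {"gbm": 0.88, "rf": 0.84, "svm": 0.87, "knn": 0.79}

acc_set = rashomon_set(accuracy, epsilon=0.03)
both = rashomon_intersection(accuracy, f1, epsilon=0.03)

# Hypothetical class predictions on five observations.
preds = {
    "gbm": [1, 0, 1, 1, 0],
    "rf":  [1, 0, 0, 1, 0],
    "svm": [1, 1, 1, 1, 0],
}
amb = ambiguity(preds, base="gbm", members=acc_set - {"gbm"})

print(sorted(acc_set), sorted(both), amb)
```

Here `rf` is close to the best model on accuracy but not on F1, so it drops out of the intersection, and the three near-optimal models disagree on two of the five observations.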
Reddit posts + Cryptocurrency market analysis - Big Data project using Apache
This project was developed to determine whether spikes in negative political Reddit posts are related to the stability of the cryptocurrency market. We used Apache Hadoop to store historical data, Apache NiFi to transform incoming streaming data, and Apache Spark and HBase to generate and save views with the processed analyses.
Key features of this project are:
- Apache NiFi - fetching streaming data from an API, selecting the necessary features, using regex to identify political posts and saving data in Parquet format
- Apache Hadoop - storing incoming large volumes of data in Parquet
- Apache Spark - creating datasets and performing analyses, joining multiple datasets to obtain correlations and create final views
- Apache HBase - storing views created with Apache Spark to enable random reads
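The regex-based selection of political posts happens inside NiFi in the project; the snippet below only illustrates the same idea in plain Python. The keyword list, the field names (`id`, `created_utc`, `title`, `body`) and the sample posts are hypothetical, since the actual pattern and schema are not given here.

```python
import re

# Hypothetical keyword pattern; the project's actual regex is not shown in this text.
POLITICAL = re.compile(
    r"\b(election|senate|congress|president|parliament|sanctions?)\b",
    re.IGNORECASE,
)

def select_political(posts):
    """Keep only the needed fields of posts whose title or body matches the pattern."""
    selected = []
    for post in posts:
        text = f"{post.get('title', '')} {post.get('body', '')}".strip()
        if POLITICAL.search(text):
            selected.append(
                {"id": post["id"], "created_utc": post["created_utc"], "text": text}
            )
    return selected

# Made-up sample records standing in for the incoming stream.
posts = [
    {"id": "a1", "created_utc": 1700000000, "title": "Senate votes on new crypto bill", "body": ""},
    {"id": "b2", "created_utc": 1700000100, "title": "My cat learned a new trick", "body": "So proud!"},
    {"id": "c3", "created_utc": 1700000200, "title": "Markets react", "body": "New SANCTIONS announced today"},
]

political = select_political(posts)
print([p["id"] for p in political])
```

The word boundaries (`\b`) keep the pattern from matching inside longer words, and `re.IGNORECASE` makes the match independent of capitalisation.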
This project analyzes the Kaggle Sales Dataset using Power BI and Power Query to draw valuable business insights.
Key features of this project are:
- Raw data analysis
- Data transformation using Power Query - data types, new columns, data cleaning
- Data modeling - establishing relationships between tables
- Dashboard creation in Power BI
- Drawing conclusions from the generated report
And many more, which can be found in the corresponding repositories.
During my studies I completed the following relevant courses:
- Programming - Python, Java, R, SQL
- Data Visualization Techniques
- Machine Learning, AutoML
- Statistics
- Optimization Methods
- Linear Algebra
- Data Warehousing, Big Data
- Numerical Methods
- Stochastic Processes
For collaboration, please contact me!

