katarzynarogalska

Hi there 👋

Welcome to my GitHub page! My name is Kasia and I am a Data Science student at Warsaw University of Technology.

Focus areas

🤖 Machine Learning, AutoML
📊 Data Visualization and Analysis
🧠 Interpretability of ML/Deep Learning Models
🛠️ Data Preprocessing
☁️ Big Data, Data Warehousing
📐 Statistics, Optimization Methods

Projects

Here you can read more about my most recent projects.

🛠️ AutoPrep - AutoML open-source Python package

AutoPrep is an open-source Python package published on PyPI, focused on automated machine learning with advanced data preprocessing techniques and model explainability methods. Its core idea is building k independent pipelines with different sets of preprocessing methods, training models on each of them, and selecting the 3 best-performing pipelines to generate a comprehensive report on the preprocessing methods used in the winning pipelines.

Key features of the package are:

Automated data cleaning and preprocessing - Scaling, Encoding, Missing data imputation, Detecting highly correlated features, Detecting features with 0 variance and unique values
Intelligent feature type detection to assess the task type - binary/multiclass classification and regression
Feature selection and dimensionality reduction - PCA, UMAP, VIF, RF Feature importance, Correlation thresholds
Automated model training - KNN, Linear/Logistic Regression, SVM/SVR, Decision Tree, Random Forest, GBM, Bayesian Ridge
Hyperparameter tuning - Random Search with CV
Generating a LaTeX report with numerical scores, descriptions of the preprocessing techniques, plots, and SHAP interpretability metrics
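
The "build k pipelines, keep the 3 best" idea can be illustrated with a minimal sketch. This is not AutoPrep's actual API: the search space, the function names (`sample_pipelines`, `select_best`) and the `toy_score` function are all hypothetical, and `toy_score` only stands in for training a model and returning its validation score.

```python
import random
from typing import Callable, Dict, List, Tuple

Pipeline = Dict[str, str]

# Hypothetical search space; step and method names are illustrative only.
SEARCH_SPACE: Dict[str, List[str]] = {
    "imputation": ["mean", "median", "most_frequent"],
    "scaling": ["standard", "minmax", "robust"],
    "encoding": ["onehot", "ordinal"],
    "reduction": ["none", "pca", "umap"],
}


def sample_pipelines(space: Dict[str, List[str]], k: int, seed: int = 0) -> List[Pipeline]:
    """Draw k pipelines, each picking one method per preprocessing step."""
    rng = random.Random(seed)
    return [
        {step: rng.choice(methods) for step, methods in space.items()}
        for _ in range(k)
    ]


def select_best(
    pipelines: List[Pipeline],
    score_fn: Callable[[Pipeline], float],
    n: int = 3,
) -> List[Tuple[float, Pipeline]]:
    """Score every pipeline and return the n highest-scoring ones."""
    scored = [(score_fn(p), p) for p in pipelines]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


def toy_score(pipeline: Pipeline) -> float:
    """Placeholder for 'train a model and return its validation score'."""
    bonus = 0.05 * (pipeline["scaling"] == "standard")
    bonus += 0.03 * (pipeline["reduction"] == "pca")
    return 0.80 + bonus


pipelines = sample_pipelines(SEARCH_SPACE, k=10)
winners = select_best(pipelines, toy_score)
for score, pipeline in winners:
    print(f"{score:.2f}", pipeline)
```

In a real run the scoring function would fit and validate a model for each pipeline, and the winners would feed the report.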

📊 ARSA ML - Rashomon Set analysis open-source Python package

ARSA ML is an open-source Python package for automated analysis of the Rashomon Set (a set of models that achieve similarly high scores but can have different characteristics) and the predictive multiplicity problem (a situation in which multiple models produce conflicting predictions for the same observations). The package is published on PyPI, along with a demo Streamlit application, ARSA ML Website, which allows experimenting with the package features without writing code. Key features of the package are:

Compatibility with AutoGluon and H2O AutoML frameworks
Pipelines allowing automated construction of the Rashomon Set from a user-provided dataset and trained models
Detailed Streamlit dashboard allowing analysis of the predictive multiplicity metrics and model characteristics
Rashomon Intersection - a novel approach proposing evaluation of similarly well-performing models based on 2, instead of just one, evaluation metrics
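
The concepts above can be sketched in a few lines of plain Python. This does not use the ARSA ML API; all function names, scores and predictions are made up for illustration. The `rashomon_intersection` function is only one plausible reading of the idea (models that are near-best on both metrics), and `ambiguity` is a commonly used predictive multiplicity measure assumed here as an example.

```python
from typing import Dict, List


def rashomon_set(scores: Dict[str, float], epsilon: float) -> List[str]:
    """Models whose score is within epsilon of the best one (higher is better)."""
    best = max(scores.values())
    return sorted(name for name, s in scores.items() if s >= best - epsilon)


def rashomon_intersection(
    metric_a: Dict[str, float], metric_b: Dict[str, float], epsilon: float
) -> List[str]:
    """Models that are near-best on both metrics (assumed interpretation)."""
    return sorted(
        set(rashomon_set(metric_a, epsilon)) & set(rashomon_set(metric_b, epsilon))
    )


def ambiguity(predictions: Dict[str, List[int]], reference: str) -> float:
    """Share of observations where some model disagrees with the reference model."""
    ref = predictions[reference]
    others = [p for name, p in predictions.items() if name != reference]
    conflicts = sum(any(p[i] != ref[i] for p in others) for i in range(len(ref)))
    return conflicts / len(ref)


# Made-up scores for four hypothetical models.
accuracy = {"rf": 0.91, "gbm": 0.90, "svm": 0.895, "knn": 0.84}
f1 = {"rf": 0.88, "gbm": 0.89, "svm": 0.83, "knn": 0.80}

near_best = rashomon_set(accuracy, epsilon=0.02)
near_best_both = rashomon_intersection(accuracy, f1, epsilon=0.02)

# Made-up predictions of the near-best models on four observations.
preds = {"rf": [1, 0, 1, 1], "gbm": [1, 0, 0, 1], "svm": [1, 1, 1, 1]}
share_conflicting = ambiguity(preds, reference="rf")

print(near_best, near_best_both, share_conflicting)
```

Here three models are near-best on accuracy, but only two of them stay near-best once the second metric is considered, and the near-best models disagree on half of the observations.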

📈 Reddit posts + Cryptocurrency market analysis - Big Data project using Apache

This project was developed to determine whether spikes in negative political Reddit posts are related to the stability of the cryptocurrency market. We used Apache Hadoop to store historical data, Apache NiFi to transform incoming stream data, and Apache Spark and HBase to generate and save views with the processed analyses.

Key features of this project are:

Apache NiFi - fetching streaming data from an API, selecting the necessary features, using regex to identify political posts, and saving data in Parquet format
Apache Hadoop - storing incoming large volumes of data in Parquet
Apache Spark - creating datasets and performing analyses, joining multiple datasets to obtain correlations and create final views
Apache HBase - storing views created with Apache Spark to enable random reads
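
The regex-based filtering step can be illustrated outside NiFi with a small Python sketch. The keyword list, the `is_political` helper and the sample posts are all hypothetical; the pattern actually used in the project is not shown here.

```python
import re
from typing import Dict, List

# Hypothetical keyword list, matched as whole words regardless of case.
POLITICAL_PATTERN = re.compile(
    r"\b(election|senate|congress|president|parliament|sanctions?)\b",
    re.IGNORECASE,
)


def is_political(text: str) -> bool:
    """Return True if the text contains at least one political keyword."""
    return POLITICAL_PATTERN.search(text) is not None


# Made-up posts standing in for records fetched from the API.
posts: List[Dict[str, str]] = [
    {"id": "1", "title": "Senate votes on new crypto bill"},
    {"id": "2", "title": "Bitcoin hits a new high"},
    {"id": "3", "title": "New sanctions announced after the ELECTION"},
]

political_posts = [post for post in posts if is_political(post["title"])]
print([post["id"] for post in political_posts])
```

A keyword pattern like this is cheap enough to apply to streaming data, at the cost of missing posts that are political without using any listed word.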

🎯 Business Analysis in PowerBI - Sales data analysis

This project analyzes a Kaggle sales dataset using PowerBI and Power Query in order to draw valuable business insights.

Key features of this project are:

Raw data analysis
Data transformation using Power Query - data types, new columns, data cleaning
Data modeling - establishing relationships between tables
Dashboard creation in PowerBI
Drawing conclusions from the generated report

And many more, which can be found in the corresponding repositories.

Courses and skills

During my studies I completed the following relevant courses:

💻 Programming - Python, Java, R, SQL
📊 Data Visualization Techniques
🤖 Machine Learning, AutoML
📐 Statistics
⚡ Optimization Methods
➗ Linear Algebra
☁️ Data Warehousing, Big Data
🔢 Numerical Methods
🎲 Stochastic Processes

Development tools

Socials

For collaboration, please contact me!
