katarzynarogalska/README.md

Hi there 👋


Welcome to my GitHub page! My name is Kasia, and I am a Data Science student at Warsaw University of Technology.

Focus areas

  • 🤖 Machine Learning, AutoML
  • 📊 Data Visualization and Analysis
  • 🧠 Interpretability of ML/Deep Learning Models
  • 🛠️ Data Preprocessing
  • ☁️ Big Data, Data Warehousing
  • 📐 Statistics, Optimization Methods

Projects

Here you can read more about my most recent projects.

๐Ÿ› ๏ธ AutoPrep - AutoML open-source Python package

AutoPrep is an open-source Python package, published on PyPI, that focuses on automated machine learning with advanced data preprocessing techniques and model explainability methods. Its core idea is to build k independent pipelines with different sets of preprocessing methods, train models on each, and select the 3 best-performing pipelines to generate a comprehensive report on the preprocessing methods used in the winning pipelines.

Key features of the package are:

  1. Automated data cleaning and preprocessing - Scaling, Encoding, Missing data imputation, Detecting highly correlated features, Detecting features with 0 variance and unique values
  2. Intelligent feature type detection to assess the task type - binary/multiclass classification and regression
  3. Feature selection and dimensionality reduction - PCA, UMAP, VIF, RF Feature importance, Correlation thresholds
  4. Automated model training - KNN, Linear/Logistic Regression, SVM/SVR, Decision Tree, Random Forest, GBM, Bayesian Ridge
  5. Hyperparameter tuning - Random Search with CV
  6. Generating a LaTeX report with numerical scores, descriptions of the preprocessing techniques, plots, and SHAP interpretability metrics
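
The pipeline-selection idea described above can be sketched in plain Python. This is not AutoPrep's actual API: the step names, the `build_pipelines` and `top_pipelines` helpers, and the `toy_score` function are all hypothetical, chosen only to show how k candidate pipelines might be assembled from different preprocessing choices, scored, and reduced to the best three.

```python
import itertools
import random

# Hypothetical preprocessing options; AutoPrep's real step names may differ.
IMPUTERS = ["mean", "median"]
SCALERS = ["standard", "minmax", "none"]
REDUCERS = ["pca", "correlation_threshold", "none"]


def build_pipelines(k, seed=0):
    """Sample k distinct (imputer, scaler, reducer) combinations."""
    combos = list(itertools.product(IMPUTERS, SCALERS, REDUCERS))
    rng = random.Random(seed)
    return rng.sample(combos, k)


def top_pipelines(pipelines, score_fn, n_best=3):
    """Score every pipeline and return the n_best highest-scoring ones."""
    scored = [(score_fn(p), p) for p in pipelines]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n_best]


def toy_score(pipeline):
    """Stand-in for 'train a model and return its validation score'."""
    imputer, scaler, reducer = pipeline
    score = 0.70
    score += 0.05 * (scaler != "none")
    score += 0.03 * (reducer == "pca")
    score += 0.01 * (imputer == "median")
    return score


candidates = build_pipelines(k=6)
winners = top_pipelines(candidates, toy_score)
for score, steps in winners:
    print(f"{score:.2f}", steps)
```

In a real setting, `toy_score` would be replaced by training a model on the preprocessed data and returning a cross-validated metric; the selection logic stays the same.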

📊 ARSA ML - Rashomon Set analysis open-source Python package

ARSA ML is an open-source Python package for automated analysis of the Rashomon Set (a set of models that achieve similarly high scores but can have different characteristics) and the predictive multiplicity problem (a situation in which multiple models produce conflicting predictions for the same observations). The package is published on PyPI, along with a demo Streamlit application, ARSA ML Website, that allows experimenting with the package features in a no-code manner. Key features of the package are:

  1. Compatibility with AutoGluon and H2O AutoML frameworks
  2. Pipelines allowing automated construction of the Rashomon Set on user-provided dataset and trained models
  3. Detailed Streamlit dashboard allowing analysis of the predictive multiplicity metrics and model characteristics
  4. Rashomon Intersection - a novel approach that evaluates similarly well-performing models based on two evaluation metrics instead of just one
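
The concepts above can be illustrated with a small, self-contained sketch. It does not use the ARSA ML API. The function names, the additive `epsilon` tolerance, and the reading of Rashomon Intersection as "near-optimal under both metrics" are assumptions made for illustration. The `ambiguity` function follows one commonly used predictive multiplicity measure: the share of observations on which at least one model in the set disagrees with a chosen base model. The package's own metrics may be defined differently.

```python
def rashomon_set(scores, epsilon):
    """Return models whose score is within epsilon of the best score.

    Assumes higher scores are better and an additive tolerance.
    """
    best = max(scores.values())
    return {name for name, s in scores.items() if s >= best - epsilon}


def rashomon_intersection(scores_a, scores_b, epsilon):
    """Models that are near-optimal under BOTH metrics (assumed reading)."""
    return rashomon_set(scores_a, epsilon) & rashomon_set(scores_b, epsilon)


def ambiguity(predictions, base_model, members):
    """Share of observations where any member disagrees with the base model."""
    base = predictions[base_model]
    conflicts = 0
    for i, base_label in enumerate(base):
        if any(predictions[m][i] != base_label for m in members):
            conflicts += 1
    return conflicts / len(base)


# Made-up scores and predictions, purely for illustration.
accuracy = {"gbm": 0.91, "rf": 0.90, "svm": 0.90, "knn": 0.80}
f1 = {"gbm": 0.88, "rf": 0.84, "svm": 0.87, "knn": 0.75}
preds = {
    "gbm": [1, 0, 1, 1, 0],
    "rf": [1, 0, 0, 1, 0],
    "svm": [1, 1, 1, 1, 0],
}

by_accuracy = rashomon_set(accuracy, epsilon=0.02)
by_both = rashomon_intersection(accuracy, f1, epsilon=0.02)
print(sorted(by_accuracy))
print(sorted(by_both))
print(ambiguity(preds, "gbm", by_both))
```

Requiring near-optimality under a second metric can only shrink the set, which is why the intersection here ends up smaller than the accuracy-only Rashomon Set.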

This project was developed to determine whether spikes in negative political Reddit posts are related to the stability of the cryptocurrency market. In the project, we used Apache Hadoop to store historical data, Apache NiFi to transform incoming stream data, and Apache Spark and HBase to generate and save views with the processed analyses.

Key features of this project are:

  1. Apache NiFi - fetching streaming data from an API, selecting the necessary features, using regex to identify political posts, and saving data in Parquet format
  2. Apache Hadoop - storing incoming large volumes of data in Parquet
  3. Apache Spark - creating datasets and performing analyses, joining multiple datasets to obtain correlations and create final views
  4. Apache HBase - storing views created with Apache Spark to enable random reads
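
The regex filtering and correlation steps can be illustrated without the Apache stack. The sketch below is plain Python rather than NiFi or Spark. The keyword pattern, the `sentiment` and `date` fields, and the sample records are all hypothetical; the project's actual expressions, schema, and sentiment method are not documented here.

```python
import re
from collections import Counter

# Hypothetical keyword pattern; the project's real regex is not shown here.
POLITICAL = re.compile(
    r"\b(election|senate|president|sanctions|parliament)\b", re.IGNORECASE
)


def is_political(text):
    """True when the post text matches the political keyword pattern."""
    return POLITICAL.search(text) is not None


def daily_negative_political(posts):
    """Count posts per day that are political and have negative sentiment."""
    counts = Counter()
    for post in posts:
        if is_political(post["text"]) and post["sentiment"] < 0:
            counts[post["date"]] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up records, for illustration only.
posts = [
    {"date": "d1", "text": "Election chaos again", "sentiment": -0.8},
    {"date": "d1", "text": "Cute cat picture", "sentiment": 0.9},
    {"date": "d2", "text": "New sanctions announced", "sentiment": -0.5},
    {"date": "d2", "text": "The President failed us", "sentiment": -0.7},
    {"date": "d3", "text": "Senate passes a popular bill", "sentiment": 0.4},
]
volatility = {"d1": 0.02, "d2": 0.05, "d3": 0.01}  # made-up daily volatility

counts = daily_negative_political(posts)
days = sorted(volatility)
print(pearson([counts[d] for d in days], [volatility[d] for d in days]))
```

In the project itself, the filtering would run as a NiFi processor and the join and correlation as Spark jobs over Parquet data; the logic per record is the same in spirit.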

🎯 Business Analysis in PowerBI - Sales data analysis

This project analyzes the Kaggle Sales Dataset using PowerBI and Power Query to draw business insights.

Key features of this project are:

  1. Raw data analysis
  2. Data transformation using Power Query - data types, new columns, data cleaning
  3. Data modeling - establishing relationships between tables
  4. Dashboard creation in PowerBI
  5. Drawing conclusions from the generated report

And many more, which can be found in the corresponding repositories.

Courses and skills

During my studies I completed the following relevant courses:

  • 💻 Programming - Python, Java, R, SQL
  • 📊 Data Visualization Techniques
  • 🤖 Machine Learning, AutoML
  • 📐 Statistics
  • ⚡ Optimization Methods
  • ➗ Linear Algebra
  • ☁️ Data Warehousing, Big Data
  • 🔢 Numerical Methods
  • 🎲 Stochastic Processes

Development tools

Java, Python, Git, R, VS Code, Microsoft Azure

Socials

For collaboration, please contact me!

LinkedIn

Pinned

  1. ARSA-Automated-Rashomon-Set-Analysis (Public) - Jupyter Notebook
  2. AutoPrep (Public, forked from Pawlo77/AutoPrep) - Automatic ML library for enhanced data preprocessing and explainability - Python
  3. Big-Data-Apache-project (Public, forked from sienkozuzanna/big-data-war-news) - Java
  4. Neural-Networks (Public) - Jupyter Notebook
  5. PowerBI-Business-Analysis (Public)