📊 data science 🧬 bioinformatics 🧮 algorithm development </> software development 📖 machine learning 🐍 Python
This page provides an overview of my public GitHub projects. For an overview of the projects that I have worked on professionally, please view my LinkedIn profile. If you have an idea for a project, please reach out. I'd love to collaborate!
Keywords: bioinformatics | Python | numba | numpy. Homepage | API | Repo
This is an open-source bioinformatics project that implements fast and memory-efficient genome k-mer calculations. It can be installed from PyPI with `pip install genome-kmers`. A genome is often composed of multiple chromosomes, and the sequence of each chromosome can be represented as a long string of bases (A, T, G, or C). The project uses numba to run orders of magnitude faster than pure Python. Check out the homepage for more information.
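For readers new to the term, a k-mer is a substring of length k taken from a sequence. The snippet below is a minimal pure-Python sketch of k-mer counting, written only to illustrate the idea. It does not use the genome-kmers API, and the function name `count_kmers` is hypothetical.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every length-k substring made up only of A, T, G, or C.

    Hypothetical helper for illustration; not part of genome-kmers.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    valid_bases = set("ATGC")
    counts = Counter()
    # Slide a window of width k across the sequence, one base at a time.
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Skip windows that contain ambiguous bases such as N.
        if set(kmer) <= valid_bases:
            counts[kmer] += 1
    return counts


counts = count_kmers("ATGATGC", k=3)
print(counts.most_common())
```

A loop like this is simple but slow on chromosome-scale strings, which is the kind of workload where compiled numba code is expected to pay off.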
Keywords: bioinformatics | Python | Streamlit | Docker | Azure | web application | single cell RNA-Seq | scanpy. Repo | Demo Video | Demo website
This is a bioinformatics web application that provides a simple, customizable user interface for the common single cell RNA-Seq preprocessing and analysis tasks, which include:
- Quality Control
- Doublet Detection
- Normalization
- Feature Selection
- PCA for dimensional reduction
- UMAP / t-SNE for dimensional reduction and visualization
- Clustering
The web application uses Streamlit as the web framework and scanpy for some of the analysis. A Dockerfile is provided for deployment, and a demo version of the web application has been deployed to Azure and is available at https://mattperkett.com/single-cell/.
Keywords: scikit-learn | signal processing. View Repo
This project demonstrates how to use Independent Component Analysis (ICA) to deconvolute mixed audio signals. A classic application of ICA is to the cocktail party problem of trying to listen to a single person talking in a noisy room. To mimic this situation, I programmatically mix audio recordings and then attempt to deconvolute the signal into separate recordings. Since we have the original recordings, it is possible to quantify the level of success.
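The mix-then-separate workflow can be sketched roughly as follows. This is an illustration under assumptions rather than the repository's code: it uses two synthetic signals instead of audio recordings, an arbitrary mixing matrix, and absolute correlation as the success measure. `FastICA` is scikit-learn's ICA implementation.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two synthetic "speakers" (assumed stand-ins for real recordings).
t = np.linspace(0, 8, 2000)
source_1 = np.sin(2 * t)            # smooth sine wave
source_2 = np.sign(np.sin(3 * t))   # square wave
S = np.c_[source_1, source_2]       # shape (n_samples, n_sources)

# Mix the sources as two microphones would hear them (arbitrary weights).
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
X = S @ A.T

# Estimate the independent components from the mixtures alone.
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)

# ICA recovers sources only up to order, sign, and scale, so compare
# each original with its best-matching estimate via absolute correlation.
cross_corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
best_match = cross_corr.max(axis=1)
print(best_match)
```

Values of `best_match` close to 1 indicate that each original signal has a near-perfect counterpart among the estimated components.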
I decided to undertake a broad review of data science and machine learning to reinforce my knowledge of the fundamental statistics/algorithms and get experience with a broader range of tools and libraries. I have outlined the larger trainings that I am working through below.
Since I already have a foundation in machine learning, data science, and computer science fundamentals from my academic and professional background, I have used these trainings as a broad review to refresh my background, fill in gaps, and work on targeted projects.
- Book: An Introduction to Statistical Learning $\color{purple}{\textsf{In Progress}}$
  - Reading and working through exercises
- Online: Introduction to Machine Learning with TensorFlow $\color{green}{\textsf{Complete}}$
- Online: Introduction to Machine Learning with PyTorch $\color{green}{\textsf{Complete}}$
- Online: AI Programming with Python $\color{green}{\textsf{Complete}}$
Keywords: unsupervised learning | dimensional reduction | clustering | Docker. View Repo
This project identifies distinct demographic groups in German census data using PCA for dimensional reduction and K-Means clustering for group identification. It then uses the demographic clusters to explore a company's customer base and identify potential opportunities (e.g., expanding the user base or targeted marketing). Like any project using realistic data, this project devotes a significant portion of the analysis to data preprocessing.
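The PCA and K-Means steps can be chained in a few lines of scikit-learn. The sketch below is illustrative only: the data is randomly generated rather than census data, and the feature count, number of components, and number of clusters are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic "population": 300 people, 10 features, 3 underlying groups.
centers = rng.normal(scale=5.0, size=(3, 10))
population = np.vstack([c + rng.normal(size=(100, 10)) for c in centers])

# Scale features, reduce dimensions with PCA, then cluster with K-Means.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
population_labels = pipeline.fit_predict(population)

# Synthetic "customers" drawn mostly from the first group (an assumption).
customers = centers[0] + rng.normal(size=(50, 10))
customer_labels = pipeline.predict(customers)

# Compare the share of each cluster in the population and among customers.
population_share = np.bincount(population_labels, minlength=3) / len(population)
customer_share = np.bincount(customer_labels, minlength=3) / len(customers)
print(population_share, customer_share)
```

Clusters that are over-represented among customers relative to the general population point to the core user base, while under-represented clusters suggest possible areas for expansion.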
Keywords: supervised learning | classification | random forest | SVM | Gaussian Naive Bayes | scikit-learn. View Repo
This project builds a model to predict likely donors using basic demographic information (e.g. age, occupation, and education level). Several supervised learning classifiers were tested before moving forward with the Random Forest classifier. The best model was identified with hyperparameter tuning. Testing for accuracy was done on withheld data (accuracy = 86%,
Keywords: classification | TensorFlow | transfer learning | deep learning | Docker. View Repo
This project uses transfer learning to build a model that classifies an input image of a flower into one of 102 different species. The MobileNetV3 pretrained neural network is adapted to this task by freezing all layers except the final layer, which is replaced with a dense neural network for training. The added layers are densely connected with a ReLU activation function, and dropout layers are added for regularization. The final layer has softmax activation so that the predicted probabilities across flower types add to one. During training, loss is calculated using sparse categorical cross entropy and accuracy is used as the metric. An accuracy of 75% is achieved on the withheld test set. A Dockerfile is provided to quickly get the notebook up and running.
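To make the softmax and loss terms concrete, here is a small pure-Python sketch of both calculations for a single example. It is not taken from the notebook (which relies on TensorFlow for this), and the three-class logits are made-up values used only for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_cross_entropy(probs, true_index):
    """Loss for one example whose label is an integer class index."""
    return -math.log(probs[true_index])


# Made-up scores for a 3-class problem; the true class is index 0.
probs = softmax([2.0, 1.0, 0.1])
loss = sparse_categorical_cross_entropy(probs, true_index=0)
print(probs, loss)
```

The loss shrinks toward zero as the probability assigned to the true class approaches one. The "sparse" variant takes the label as an integer index rather than a one-hot vector, which is convenient with 102 classes.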
Keywords: classification | PyTorch | transfer learning | deep learning. View Repo
This project is very similar to Flower Image Classification with TensorFlow, but is implemented using PyTorch.
Keywords: classification | PyTorch | deep learning | Docker. View Repo
This project classifies images from the CIFAR-10 image data set into one of 10 categories (airplane, automobile, bird, etc.). To do this, a Convolutional Neural Network (CNN) is trained in PyTorch using accuracy as the metric. Multiple NNs are tested (varying the number and size of layers) before selecting the best classifier, which has an accuracy of 70%. The intent of this project is to work through all the necessary steps for training and validating a model rather than to maximize accuracy. To quickly build something with higher accuracy, we could consider transfer learning with a pretrained neural network. We would expect a good classifier to achieve an accuracy of >90% on this image set.








