This catalog is a collection of repositories for various Machine Learning techniques and algorithms implemented at the Vector Institute. The table has the following columns:
- Repository: Link to the GitHub repo.
- Description: A brief introduction to the repository, stating its purpose and linking to any published research papers.
- Algorithms: List of ML algorithms demonstrated in the repo.
- No. of datasets: Total number of datasets used in the repo.
- Public Datasets: Links to any publicly available data. This is a subset of the datasets used in the repo.
Repository | Description | Algorithms | No. of datasets | Public Datasets | Year |
---|---|---|---|---|---|
RAG | This repository contains demos for various Retrieval-Augmented Generation techniques using different libraries. | Cloud search via LlamaHub, Document search via LangChain, LlamaIndex for OpenAI and Cohere models, Hybrid Search via Weaviate Vector Store, Evaluation via RAGAS library, Websearch via LangChain | 3 | Vector's 2021 Annual Report, PubMed Doc, Banking Deposits | 2024 |
Finetuning and Alignment | This repository contains demos for finetuning techniques for LLMs focussed on reducing computational cost. | DDP, FSDP, Instruction Tuning, LoRA, DoRA, QLoRA, Supervised finetuning | 3 | samsum, imdb, Bias-DeBiased | 2024 |
Prompt Engineering Laboratory | This repository contains demos for various Prompt Engineering techniques, along with examples for bias quantification and text classification. | Stereotypical Bias Analysis, Sentiment inference, Finetuning using HF Library, Activation Generation, Train and Test Model for Activations without Prompts, RAG, ABSA, Few shot prompting, Zero shot prompting (Stochastic, Greedy, Likelihood Estimation), Role play prompting, LLM Prompt Summarization, Zero shot and few shot prompt translation, Few shot CoT, Zero shot CoT, Self-Consistent CoT prompting (Zero shot, 5-shot), Balanced Choice of Plausible Alternatives, Bootstrap Ensembling (Generation & MC formulation), Vote Ensembling | 11 | Crows-pairs, sst5, czarnowska templates, cnn_dailymail, ag_news, Weather and sports data, Other | 2024 |
bias-mitigation-unlearning | This repository contains code for the paper Can Machine Unlearning Reduce Social Bias in Language Models? which was published at EMNLP'24 in the Industry track. Authors are Omkar Dige, Diljot Arneja, Tsz Fung Yau, Qixuan Zhang, Mohammad Bolandraftar, Xiaodan Zhu, Faiza Khan Khattak. | PCGU, Task vectors and DPO for Machine Unlearning | 20 | BBQ, Stereoset, Link1, Link2 | 2024 |
cyclops-workshop | This repository contains demos for using the CyclOps package for clinical ML evaluation and monitoring. | XGBoost | 1 | Diabetes 130-US hospitals dataset for years 1999-2008 | 2024 |
odyssey | This library was created from the research behind the paper EHRMamba: Towards Generalizable and Scalable Foundation Models for Electronic Health Records, published on arXiv in 2024. Authors are Adibvafa Fallahpour, Mahshid Alinoori, Wenqian Ye, Xu Cao, Arash Afkanpour, Amrit Krishnan. | EHRMamba, XGBoost, Bi-LSTM | 1 | MIMIC-IV | 2024 |
Diffusion model bootcamp | This repository contains demos for various diffusion models for tabular and time series data. | TabDDPM, TabSyn, ClavaDDPM, CSDI, TSDiff | 12 | Physionet Challenge 2012, wiki2000 | 2024 |
News Media Bias | This repository contains code for libraries and experiments to recognise and evaluate bias and fakeness within news media articles via LLMs. | Bias evaluation via LLMs, finetuning and data annotation via LLM for fake news detection, Supervised finetuning for debiasing sentences, NER for biased phrases via LLMs, Evaluation using DeepEval library | 4 | News Media Bias Full data, Toxigen, Nela GT, Debiaser data | 2024 |
News Media Bias Plus | A continuation of the News Media Bias project, this repository contains code for libraries and experiments to collect and annotate data, and to recognise and evaluate bias and fakeness within news media articles via LLMs and VLMs. | Bias evaluation via LLMs and VLMs, finetuning and data annotation via LLM for fake news detection, supervised finetuning for debiasing sentences, NER for biased entities via LLMs | 2 | News Media Bias Plus Full Data, NMB Plus Named Entities | 2024 |
Anomaly Detection Project | This repository contains demos for various supervised and unsupervised anomaly detection techniques in domains such as Fraud Detection, Network Intrusion Detection, System Monitoring, and Image and Video Analysis. | AMNet, GCN, SAGE, OCGNN, DON, AdONE, MLP, FTTransformer, DeepSAD, XGBoost, CBLOF, CFA for Target-Oriented Anomaly Localization, Draem for surface anomaly detection, Logistic Regression, CatBoost, Random Forest, Diversity Measurable Anomaly Detection, Two-stream I3D Convolutional Network, DeepCNN, LightGBM, Isolation Forest, TabNet, AutoEncoder, Internal Contrastive Learning | 5 | On Vector Cluster | 2023 |
SSL Bootcamp | This repository contains demos for self-supervised techniques such as contrastive learning, masked modeling and self-distillation. | Internal Contrastive Learning, LatentOD-AD, TabRet, SimMTM, Data2Vec | 52 | Beijing Air Quality, BRFSS, Stroke Prediction, STL10, Link1, Link2 | 2023 |
Causal Inference Lab | This repository contains code to estimate the causal effects of an intervention on some measurable outcome primarily in the health domain. | Naive ATE, TARNet, DragonNet, Double Machine Learning, T Learner, S Learner, Inverse Propensity based Learner, PEHE, MAE | 5 | Infant Health and Development Program, Jobs, Twins, Berkeley admission, Government Census, Compas | 2023 |
HV-Ai-C | This repository implements a Reinforcement Learning agent to optimize energy consumption within Data Centers. | RL agents performing Random action, Fixed action, Q Learning, Hyperspace Neighbor Penetration | - | No public datasets available | 2023 |
Flex Model | This repository contains code for the paper FlexModel: A Framework for Interpretability of Distributed Large Language Models. Authors are Matthew Choi, Muhammad Adil Asif, John Willes, David Emerson. | Distributed Interpretability | - | No public datasets available | 2023 |
VBLL | This repository contains code for the paper Variational Bayesian Last Layers. Authors are James Harrison, John Willes, Jasper Snoek. | Variational Bayesian Last Layers | 2 | MNIST, FashionMNIST | 2023 |
Recommendation Systems | This repository contains demos for various RecSys techniques such as Collaborative Filtering, Knowledge Graph, RL-based, Sequence-Aware, and Session-based methods. | SVD++, NeuMF, Plot based, Two tower, SVD, KG based, SlateQ, BST, Simple Association Rules, first-order Markov Chains, Sequential Rules, RNN, Neural Attentive Session, BERT4rec, A2SVDModel, SLi-Rec | 7 | Amazon-recsys, careervillage, movielens-recsys, tmdb, LastFM, yoochoose | 2022 |
Forecasting with Deep Learning | This repository contains demos for a variety of forecasting techniques, including univariate, multivariate, and spatiotemporal time series forecasting. | Exponential Smoothing, Persistence Forecasting, Mean Window Forecast, Prophet, NeuralProphet, N-BEATS, DeepAR, Autoformer, DLinear, NHITS | 11 | Canadian Weather Station Data, BoC Exchange rate, Electricity Consumption, Road Traffic Occupancy, Influenza-Like Illness Patient Ratios, Walmart M5 Retail Product Sales, WeatherBench, Grocery Store Sales, Economic Data with Food CPI | 2022 |
Prompt Engineering | This repository contains demos for a variety of Prompt Engineering techniques such as fairness measurement via sentiment analysis, finetuning, prompt tuning, and prompt ensembling. | Bias Quantification & Probing, Stereotypical Bias Analysis, Binary sentiment analysis task, Finetuning using HF Library, Gradient-Search for Instruction Prefix, GRIPS for Instruction Prefix, LLM Summarization, LLM Classification | 10 | Crows-pairs, sst5, cnn_dailymail, ag_news, Tweet-data, Other | 2022 |
NAA | This repository contains code for the paper Bringing the State-of-the-Art to Customers: A Neural Agent Assistant Framework for Customer Service Support published at EMNLP'22 in the industry track. Authors are Stephen Obadinma, Faiza Khan Khattak, Shirley Wang, Tania Sidhorn, Elaine Lau, Sean Robertson, Jingcheng Niu, Winnie Au, Alif Munim, Karthik Raja Kalaiselvi Bhaskar. | Context Retrieval using SBERT bi-encoder, Context Retrieval using SBERT cross-encoder, Intent identification using BERT, Few Shot Multi-Class Text Classification with BERT, Multi-Class Text Classification with BERT, Response generation via GPT2 | 5 | ELI5, MSMARCO | 2022 |
Privacy Enhancing Technologies | This repository contains demos for Privacy, Homomorphic Encryption, Horizontal and Vertical Federated Learning, MIA, and PATE. | Vanilla SGD, DP SGD, DP Logistic Regression, Homomorphic Encryption for MLP, Horizontal FL, Horizontal FL on MLP, Membership Inference Attacks (MIA) using DP, MIA using SAM, PATE, Vertical FL | 9 | Heart Disease, Credit Card Fraud, Breast Cancer Data, TCGA, CIFAR10, Home Credit Default Risk, Yelp, Airbnb | 2021 |
SSGVQAP | This repository contains code for the paper A Smart System to Generate and Validate Question Answer Pairs for COVID-19 Literature which was accepted at ACL'20. Authors are Rohan Bhambhoria, Luna Feng, Dawn Sepehr, John Chen, Conner Cowling, Sedef Kocak, Elham Dolatabadi. | An Active Learning Strategy for Data Selection, AL-Uncertainty, AL-Clustering | 1 | CORD-19 | 2021 |
foodprice-forecasting | This repository replicates the experiments described on pages 16 and 17 of the 2022 Edition of Canada's Food Price Report. | Time series forecasting using Prophet, Time series forecasting using NeuralProphet, Interpretable time series forecasting using N-BEATS, Ensemble of the above methods | 3 | FRED Economic Data | 2021 |
Computer_Vision_Project | This repository tackles different problems such as defect detection, footprint extraction, road obstacle detection, traffic incident detection, and segmentation of medical procedures. | Semantic segmentation using Unet, Unet++, FCN, DeepLabv3, Anomaly segmentation | 11 | SpaceNet Building Detection V2, MVTEC, ICDAR2015, PASCAL_VOC, DOTA, AVA, UCF101-24, J-HMDB-21 | 2020 |
Note
- Many repositories contain code for reference purposes only. Running them may require updates to the code and environment files.
- Links are provided only for publicly available datasets. Many datasets used in the repositories are available only on the Vector cluster.