
Smart contracts are pivotal in blockchain applications but are prone to vulnerabilities that can lead to significant losses. SmartGuard: Multi-Stage Smart Contract Vulnerability Detection tackles this issue by developing a machine learning framework to identify eight vulnerability types using datasets from Kaggle and Hugging Face.


Rita94105/Smart_Contract_Vulnerability_Detector


🚀 SmartGuard: A Multi-Stage Vulnerability Detection Framework for Smart Contracts

Overview

Smart contracts, powered by blockchain technology, have revolutionized decentralized applications by enabling trustless and tamper-proof execution of agreements. However, their immutable nature makes them particularly susceptible to vulnerabilities, which can lead to significant financial losses if exploited. Identifying and mitigating these vulnerabilities before deployment is thus a critical challenge in the blockchain ecosystem.

SmartGuard is a multi-stage vulnerability detection framework designed to address this challenge. By leveraging advanced machine learning techniques and publicly available datasets, this project aims to detect vulnerabilities in smart contract code with high accuracy and robustness.

Features

  • Multi-Stage Detection: A hierarchical architecture comprising a Detector, a Reasoner, and a Verificator (implemented as VulnScreener, VulnAnalyzer, and VulnValidator) for accurate and interpretable vulnerability detection.
  • Custom Data Splitting: Ensures balanced distribution of vulnerabilities across training and testing datasets.
  • Feature Extraction: Combines CodeBERT, Longformer, and CodeT5 for semantic and syntactic feature extraction.
  • Streamlit Web Application: An interactive interface for exploring datasets, preprocessing steps, and model results.

Methodology

Data Preprocessing

The preprocessing phase involves:

  • Removing missing and duplicate values.
  • Consolidating multiple entries of raw code with different vulnerabilities into single entries.
  • Transforming the label-encoded column into an array to accommodate multiple vulnerabilities.
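
The steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual preprocessing code: the function name `consolidate`, the `(code, label_id)` row layout, and the use of a multi-hot list as the "array" of labels are all assumptions made for the example.

```python
from typing import Iterable, List, Optional, Tuple

Row = Tuple[Optional[str], Optional[int]]


def consolidate(rows: Iterable[Row], n_labels: int) -> List[Tuple[str, List[int]]]:
    """Merge rows that share the same code into one multi-label entry.

    `rows` holds hypothetical (code, label_id) pairs; either field may be
    None to represent a missing value.
    """
    merged = {}
    for code, label in rows:
        if code is None or label is None:
            continue  # drop rows with missing values
        # A set absorbs exact duplicates of the same (code, label) pair.
        merged.setdefault(code, set()).add(label)
    # Turn each label set into a multi-hot vector of length n_labels.
    return [
        (code, [1 if j in labels else 0 for j in range(n_labels)])
        for code, labels in merged.items()
    ]


rows = [("a", 0), ("a", 2), ("a", 0), ("b", 1), (None, 1), ("c", None)]
entries = consolidate(rows, n_labels=3)
```

Each distinct piece of code ends up as one entry, so a contract with two vulnerabilities is represented once with two positions set in its label vector.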

To explore effective code representations, we initially attempted to convert Solidity code into both OPCODE and bytecode formats. However, due to resource and time constraints, we instead preprocessed the source code directly by removing comments and newline characters. This simplified representation was used as the input features for model training.
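
A minimal sketch of comment and newline removal for Solidity source follows. It is an assumption about how such a step could look, not the project's own code: the name `strip_comments` is hypothetical, and collapsing every whitespace run (not only newlines) into a single space is a simplification chosen for the example.

```python
import re

# String literals are matched first so that "//" inside a string
# (for example a URL) is not mistaken for a comment.
_DQ = r'"(?:\\.|[^"\\])*"'
_SQ = r"'(?:\\.|[^'\\])*'"
_COMMENT = r"/\*.*?\*/|//[^\n]*"
_PATTERN = re.compile(f"({_DQ}|{_SQ})|({_COMMENT})", re.DOTALL)


def strip_comments(source: str) -> str:
    """Remove // and /* */ comments, then flatten the code to one line."""

    def _replace(match: re.Match) -> str:
        # Group 1 is a string literal (kept); otherwise it is a comment.
        return match.group(1) if match.group(1) is not None else " "

    without_comments = _PATTERN.sub(_replace, source)
    # Simplification: this also collapses whitespace inside string literals.
    return " ".join(without_comments.split())


example = 'contract A { // note\n string s = "http://x"; /* multi\nline */ uint a; }'
cleaned = strip_comments(example)
```

Matching string literals in the same pattern as comments is the main design choice here; a plain `//.*` substitution would corrupt any literal that contains two slashes.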

Feature Extraction

We explored different strategies for feature extraction:

  1. Code: Used CodeBERT for semantic and syntactic feature extraction.
  2. OPCODE: Extracted key parts of the code and applied LSTM or Transformer for feature extraction.
  3. Bytecodes: Combined CodeBERT-processed code with LSTM or Transformer-processed bytecodes, concatenating their output vectors for model training.

Ultimately, we focused on CodeBERT for feature extraction. However, its input limit of 512 tokens (max_length=512, two of which are reserved for special tokens) meant that any contract longer than 510 code tokens was truncated. To address this, we incorporated Longformer (supporting up to 4,096 tokens) and experimented with CodeT5, an encoder-decoder model. All models output 768-dimensional vectors, providing a rich foundation for downstream vulnerability detection.
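
The token budget described above can be illustrated without loading any model. The sketch below is an assumption-laden example: `truncate_for_encoder` is a hypothetical helper, and the default `bos_id` and `eos_id` values are placeholders for the encoder's start and end tokens. In practice a Hugging Face tokenizer called with `truncation=True, max_length=512` performs the equivalent step.

```python
from typing import List, Tuple


def truncate_for_encoder(
    token_ids: List[int],
    max_length: int = 512,
    bos_id: int = 0,  # placeholder id for the start-of-sequence token
    eos_id: int = 2,  # placeholder id for the end-of-sequence token
) -> Tuple[List[int], int]:
    """Fit a token sequence into the encoder window.

    Returns the final input ids and the number of code tokens dropped.
    """
    budget = max_length - 2  # two positions go to the special tokens
    kept = token_ids[:budget]
    dropped = max(0, len(token_ids) - budget)
    return [bos_id] + kept + [eos_id], dropped


long_contract = list(range(100, 700))  # 600 hypothetical code tokens
input_ids, dropped = truncate_for_encoder(long_contract)
```

With the defaults, a 600-token contract loses its last 90 tokens. Calling the same helper with `max_length=4096` shows why a longer-context encoder such as Longformer keeps the whole contract in this case.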

Custom Data Splitting

The multi-label nature of the data prevents standard stratified sampling. To ensure balanced distribution of vulnerabilities, we developed a custom data-splitting solution that divides the dataset into 80% training and 20% testing sets, closely matching the target proportions of nine vulnerabilities in both subsets.
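
The README does not describe the splitting algorithm itself, so the following is only one possible approach, written as a sketch under stated assumptions. It greedily assigns samples to the test set, rarest label first, so that each label's test share tracks the target ratio. The function name `multilabel_split`, the `NO_LABEL` sentinel, and the use of label sets as input are all hypothetical.

```python
import random
from collections import Counter
from fractions import Fraction
from typing import List, Set, Tuple

NO_LABEL = "__no_label__"  # sentinel bucket for samples without any label


def multilabel_split(
    label_sets: List[Set[int]], test_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with per-label balance."""
    ratio = Fraction(str(test_ratio))  # exact arithmetic, no float drift
    totals = Counter(lab for labs in label_sets for lab in labs)

    def rarest(labs):
        if not labs:
            return NO_LABEL
        return min(labs, key=lambda lab: (totals[lab], lab))

    order = list(range(len(label_sets)))
    random.Random(seed).shuffle(order)
    # Stable sort: samples carrying the rarest labels are placed first.
    order.sort(key=lambda i: totals.get(rarest(label_sets[i]), len(label_sets) + 1))

    seen, in_test = Counter(), Counter()
    train_idx, test_idx = [], []
    for i in order:
        key = rarest(label_sets[i])
        # Send to test while that label's test share is below the target.
        to_test = in_test[key] < ratio * (seen[key] + 1)
        (test_idx if to_test else train_idx).append(i)
        for lab in label_sets[i] or {NO_LABEL}:
            seen[lab] += 1
            in_test[lab] += int(to_test)
    return sorted(train_idx), sorted(test_idx)


# Toy data: label 0 on even indices, label 1 on multiples of 5,
# label 2 on multiples of 10; the rest carry no label.
label_sets = [
    {lab for lab, step in ((0, 2), (1, 5), (2, 10)) if i % step == 0}
    for i in range(100)
]
train_idx, test_idx = multilabel_split(label_sets)
```

Deciding each sample by its rarest label is a common heuristic for multi-label stratification, because rare labels have the least room for error; frequent labels can absorb the imbalance left over.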

Model Training

SmartGuard framework

The framework consists of three stages:

  1. VulnScreener: A binary classifier (MLP) that determines the presence of vulnerabilities.
  2. VulnAnalyzer: A CNN that identifies specific vulnerability types.
  3. VulnValidator: A Random Forest model that refines the outputs of VulnAnalyzer (the Reasoner stage) for improved accuracy.
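
How the three stages might chain together can be sketched with plain callables. This is a hypothetical control flow, not the repository's implementation: the function `run_pipeline`, the 0.5 threshold, the stub models, and the label names `type_a` and `type_b` are all placeholders, and the README does not state exactly which inputs each stage receives.

```python
from typing import Callable, Dict, List, Sequence

Features = Sequence[float]


def run_pipeline(
    features: Features,
    screener: Callable[[Features], float],
    analyzer: Callable[[Features], Dict[str, float]],
    validator: Callable[[Features, Dict[str, float]], List[str]],
    screen_threshold: float = 0.5,
) -> List[str]:
    """Run the three stages in order and return the final labels."""
    # Stage 1 (VulnScreener role): binary decision, vulnerable or not.
    if screener(features) < screen_threshold:
        return []
    # Stage 2 (VulnAnalyzer role): a score for each vulnerability type.
    type_scores = analyzer(features)
    # Stage 3 (VulnValidator role): refine the stage 2 scores into labels.
    return validator(features, type_scores)


# Stub stages standing in for the trained MLP, CNN, and Random Forest.
def stub_screener(f: Features) -> float:
    return sum(f) / len(f)


def stub_analyzer(f: Features) -> Dict[str, float]:
    return {"type_a": 0.9, "type_b": 0.2}


def stub_validator(f: Features, scores: Dict[str, float]) -> List[str]:
    return sorted(name for name, score in scores.items() if score >= 0.5)


flagged = run_pipeline([0.9, 0.8], stub_screener, stub_analyzer, stub_validator)
clean = run_pipeline([0.1, 0.2], stub_screener, stub_analyzer, stub_validator)
```

Returning early after the screening stage means the later, more expensive stages only run on contracts already suspected of being vulnerable.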

Experiments and Results

Model performance is evaluated with a confusion matrix, accuracy, recall, precision, and F1-score, which allows prediction outcomes to be compared across the different code representations and stages. Beyond achieving high detection accuracy, the project aims to provide insight into how effective various code formats and model architectures are for smart contract security.

Potential avenues for future improvement include incorporating additional datasets, refining feature extraction techniques, and exploring ensemble methods to further enhance detection capabilities. SmartGuard represents a step toward more secure and trustworthy smart contract ecosystems.
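
For multi-label predictions, the metrics named above are typically computed once per vulnerability type. The sketch below shows that calculation in plain Python; the function name `per_label_metrics` and the toy data are illustrative assumptions, and no result from the project is reproduced here.

```python
from typing import Dict, List


def per_label_metrics(
    y_true: List[List[int]], y_pred: List[List[int]], n_labels: int
) -> List[Dict[str, float]]:
    """Confusion counts and derived metrics for each label column."""
    results = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        # Guard against division by zero when a label is never predicted
        # or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({
            "tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": (tp + tn) / len(pairs),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        })
    return results


# Toy multi-hot labels for four contracts and two vulnerability types.
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [1, 0]]
metrics = per_label_metrics(y_true, y_pred, n_labels=2)
```

Reporting per-label values alongside any averaged score makes it easier to see whether a rare vulnerability type is being missed.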

Installation

To install the required dependencies, run:

pip install -r requirements.txt

For setting up the environment using Anaconda, please refer to the Streamlit installation guide.

Download the Project

To download the project, clone the repository using:

git clone https://github.com/Rita94105/Smart_Contract_Vulnerability_Detector

Running the Project

To run the Streamlit application, execute:

streamlit run app.py

Directory Structure

app.py
home.py
README.md
requirements.txt
.streamlit/
    .streamlit/pages.toml
conclusion/
    conclusion/future.py
    conclusion/metrics.py
data/
    data/EDA.py
    data/explore.py
feature/
    feature/code.py
    feature/split.py
model/
    model/VulnScreener.py
    model/overall.py
    model/VulnAnalyzer.py
    model/VulnValidator.py

License

This project is licensed under the MIT License.

Acknowledgements

  • Datasets from Kaggle and Hugging Face
  • Streamlit for the web application framework
  • Contributors and developers of the libraries used in this project
