Smart contracts, powered by blockchain technology, enable trustless and tamper-proof execution of agreements in decentralized applications. However, because deployed contracts are immutable, a vulnerability that ships with the code cannot easily be patched afterwards, and exploiting it can lead to significant financial losses. Identifying and mitigating these vulnerabilities before deployment is thus a critical challenge in the blockchain ecosystem.
SmartGuard is a multi-stage vulnerability detection framework designed to address this challenge. By leveraging advanced machine learning techniques and publicly available datasets, this project aims to detect vulnerabilities in smart contract code with high accuracy and robustness.
Key features:
- Multi-Stage Detection: A hierarchical architecture comprising a Detector (VulnScreener), a Reasoner (VulnAnalyzer), and a Verificator (VulnValidator) for accurate and interpretable vulnerability detection.
- Custom Data Splitting: Ensures balanced distribution of vulnerabilities across training and testing datasets.
- Feature Extraction: Combines CodeBERT, Longformer, and CodeT5 for semantic and syntactic feature extraction.
- Streamlit Web Application: An interactive interface for exploring datasets, preprocessing steps, and model results.
The preprocessing phase involves:
- Removing missing and duplicate values.
- Consolidating multiple entries of raw code with different vulnerabilities into single entries.
- Transforming the label-encoded column into an array to accommodate multiple vulnerabilities.
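The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the `(code, label)` row shape, the `consolidate` name, and the choice of a multi-hot array as the "array" form of the labels are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical row shape: (raw Solidity source, integer-encoded vulnerability label).
Row = Tuple[Optional[str], Optional[int]]


def consolidate(rows: Iterable[Row], num_classes: int) -> Dict[str, List[int]]:
    """Drop incomplete rows, merge duplicates, and build one multi-hot label array per contract."""
    merged: Dict[str, set] = defaultdict(set)
    for code, label in rows:
        if code is None or label is None:
            continue  # remove rows with missing values
        merged[code].add(label)  # a set absorbs exact duplicates
    # One entry per distinct source text, with a 0/1 flag for every class.
    return {
        code: [1 if c in labels else 0 for c in range(num_classes)]
        for code, labels in merged.items()
    }
```

For example, two rows for the same contract with labels 0 and 2 collapse into a single entry with the array `[1, 0, 1]` when `num_classes=3`.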
To explore effective code representations, we initially attempted to convert Solidity code into both OPCODE and bytecode formats. However, due to resource and time constraints, we instead kept the source code and preprocessed it by removing comments and newline characters. This simplified representation was used as the input for feature extraction and model training.
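One way to remove comments and newlines from Solidity source is a single regular-expression pass that skips over string literals, so that a `//` inside a string (a URL, for instance) is not mistaken for a comment. The sketch below is an assumption about how this could be done, not the project's implementation; `strip_comments` is a hypothetical name, and collapsing every run of whitespace to one space goes slightly beyond removing newlines only.

```python
import re

# Match string literals first so comment markers inside them are left alone,
# then line comments (// ...) and block comments (/* ... */).
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'      # double-quoted string
    r"|'(?:\\.|[^'\\])*'"     # single-quoted string
    r"|//[^\n]*"              # line comment
    r"|/\*.*?\*/",            # block comment (non-greedy)
    re.DOTALL,
)


def strip_comments(source: str) -> str:
    """Return the source on a single line with comments removed."""
    def keep_strings(match: re.Match) -> str:
        text = match.group(0)
        # Keep string literals unchanged; replace comments with a space.
        return text if text[0] in "\"'" else " "

    without_comments = _TOKEN.sub(keep_strings, source)
    # Collapse newlines and repeated whitespace (this also affects
    # whitespace inside string literals, which is acceptable for a sketch).
    return " ".join(without_comments.split())
```

Replacing a comment with a space rather than an empty string avoids accidentally gluing two tokens together.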
We explored different strategies for feature extraction:
- Code: Used CodeBERT for semantic and syntactic feature extraction.
- OPCODE: Extracted key parts of the code and applied LSTM or Transformer for feature extraction.
- Bytecodes: Combined CodeBERT-processed code with LSTM or Transformer-processed bytecodes, concatenating their output vectors for model training.
Ultimately, we focused on CodeBERT for feature extraction. However, CodeBERT accepts at most 512 tokens per input (max_length=512), two of which are reserved for special tokens, so any code beyond the first 510 tokens is truncated. To handle longer contracts, we added Longformer, which supports up to 4,096 tokens, and experimented with CodeT5, an encoder-decoder model. All three models output 768-dimensional vectors, which serve as input features for the downstream vulnerability classifiers.
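The truncation arithmetic and the extraction step might look like the sketch below, which uses the Hugging Face `transformers` library. Several details are assumptions rather than facts from this project: the checkpoint names (`microsoft/codebert-base`, and `allenai/longformer-base-4096` for longer inputs), the use of the first token's hidden state as the contract-level vector, and the function names themselves.

```python
def usable_tokens(max_length: int, num_special: int = 2) -> int:
    """Tokens left for code once the start/end special tokens are counted."""
    return max_length - num_special


def embed(code: str, checkpoint: str = "microsoft/codebert-base", max_length: int = 512):
    """Return one hidden-size vector for a contract (a sketch for encoder-only checkpoints)."""
    # Imported lazily so usable_tokens() works without these heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    model.eval()

    # truncation=True silently drops everything past max_length tokens.
    inputs = tokenizer(code, truncation=True, max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Assumption: pool by taking the first token's hidden state.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)
```

Calling `embed(code, "allenai/longformer-base-4096", max_length=4096)` would raise the limit for long contracts. CodeT5 is an encoder-decoder model, so it would need its encoder called separately rather than through this function.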
The multi-label nature of the data prevents standard stratified sampling, because a single contract can belong to several vulnerability classes at once. To ensure a balanced distribution of vulnerabilities, we developed a custom data-splitting solution that divides the dataset into 80% training and 20% testing sets, so that each of the nine vulnerability types appears in both subsets in close to the target proportion.
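The project's splitting algorithm is not described in detail here, so the following is only one plausible approach: a greedy split that handles the rarest label first and sends roughly 20% of each label's samples to the test set. The function name, the rarest-first ordering, and the multi-hot input format are assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple


def multilabel_split(
    labels: Sequence[Sequence[int]], test_frac: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Greedy multi-label split; returns (train_indices, test_indices)."""
    rng = random.Random(seed)
    n = len(labels)
    k = len(labels[0]) if n else 0
    totals = [sum(row[j] for row in labels) for j in range(k)]
    test_counts = [0] * k          # per-label positives already in the test set
    side: Dict[int, str] = {}      # sample index -> "train" or "test"

    # Rarest labels first: they have the least room for error.
    for j in sorted(range(k), key=lambda c: totals[c]):
        pool = [i for i in range(n) if labels[i][j] and i not in side]
        rng.shuffle(pool)
        need = max(0, round(test_frac * totals[j]) - test_counts[j])
        for pos, i in enumerate(pool):
            if pos < need:
                side[i] = "test"
                for c in range(k):
                    test_counts[c] += labels[i][c]
            else:
                side[i] = "train"

    # Samples with no vulnerability label are split by plain proportion.
    rest = [i for i in range(n) if i not in side]
    rng.shuffle(rest)
    cut = round(test_frac * len(rest))
    for pos, i in enumerate(rest):
        side[i] = "test" if pos < cut else "train"

    train = sorted(i for i, s in side.items() if s == "train")
    test = sorted(i for i, s in side.items() if s == "test")
    return train, test
```

Because a sample carrying several labels is assigned only once, a common label can overshoot its target when many of its samples were already placed while handling rarer labels. The greedy pass keeps the proportions close rather than exact.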
The framework consists of three stages:
- VulnScreener: A binary classifier (MLP) that determines the presence of vulnerabilities.
- VulnAnalyzer: A CNN that identifies specific vulnerability types.
- VulnValidator: A Random Forest model that refines the outputs of VulnAnalyzer (the Reasoner stage) for improved accuracy.
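The control flow between the three stages can be sketched with plain callables standing in for the trained models. How the stages are actually wired together is not specified above, so the early exit when the screener finds nothing, the 0.5 threshold, and the validator receiving both the features and the analyzer's scores are all assumptions.

```python
from typing import Callable, List, Sequence

Features = Sequence[float]
Screener = Callable[[Features], float]                       # probability that any vulnerability exists
Analyzer = Callable[[Features], List[float]]                 # one score per vulnerability type
Validator = Callable[[Features, List[float]], List[int]]     # final list of type indices


def run_pipeline(
    features: Features,
    screener: Screener,
    analyzer: Analyzer,
    validator: Validator,
    threshold: float = 0.5,
) -> List[int]:
    """Return the indices of the vulnerability types predicted for one contract."""
    # Stage 1 (VulnScreener): skip the later stages for contracts judged clean.
    if screener(features) < threshold:
        return []
    # Stage 2 (VulnAnalyzer): score each vulnerability type.
    type_scores = analyzer(features)
    # Stage 3 (VulnValidator): refine the analyzer's scores into final labels.
    return validator(features, type_scores)
```

In practice each callable would wrap a trained model, for example the MLP's predicted probability for the screener and the Random Forest's prediction for the validator.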
Model performance is evaluated using a suite of metrics (Confusion Matrix, Accuracy, Recall, Precision, and F1-score), enabling a comparison of prediction outcomes across the different code representations and stages.
Through this project, we aim not only to achieve high detection accuracy but also to provide insights into the efficacy of various code formats and model architectures for smart contract security. Finally, we discuss potential avenues for future improvement, such as incorporating additional datasets, refining feature extraction techniques, or exploring ensemble methods to further enhance detection capabilities.
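For multi-label predictions, the evaluation metrics named above are typically computed per vulnerability type from that type's confusion-matrix counts. The sketch below shows the standard definitions; the function name and the choice to report 0.0 when a denominator is zero are assumptions, and the project may aggregate the scores differently.

```python
from typing import Dict, List, Sequence


def per_label_metrics(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> List[Dict[str, float]]:
    """Confusion counts plus accuracy, precision, recall, and F1 for each label column."""
    results = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append({
            "tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": (tp + tn) / len(pairs),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        })
    return results
```

Each row of `y_true` and `y_pred` is a multi-hot array for one contract, in the same column order as the label encoding.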
To install the required dependencies, run:
pip install -r requirements.txt
For setting up the environment using Anaconda, please refer to the Streamlit installation guide.
To download the project, clone the repository using:
git clone https://github.com/Rita94105/Smart_Contract_Vulnerability_Detector
To run the Streamlit application, execute:
streamlit run app.py
The repository is organized as follows:
app.py
home.py
README.md
requirements.txt
.streamlit/
.streamlit/pages.toml
conclusion/
conclusion/future.py
conclusion/metrics.py
data/
data/EDA.py
data/explore.py
feature/
feature/code.py
feature/split.py
model/
model/VulnScreener.py
model/overall.py
model/VulnAnalyzer.py
model/VulnValidator.py
This project is licensed under the MIT License.
Acknowledgments:
- Datasets from Kaggle and Hugging Face
- Streamlit for the web application framework
- Contributors and developers of the libraries used in this project
