This repository contains our proof-of-concept for fingerprinting a dataset.
git clone https://github.com/trailofbits/datasig && cd datasig
python3 -m pip install .
The code below shows basic usage of the library. The API is experimental and subject to frequent change during early development.
from torchvision.datasets import MNIST
from datasig.dataset import TorchVisionDataset, CanonicalDataset

# Download MNIST through torchvision
torch_dataset = MNIST(root="/tmp/data", train=True, download=True)

# Wrap it in a datasig dataset adapter and canonicalize it
dataset = TorchVisionDataset(torch_dataset)
canonical = CanonicalDataset(dataset)

print("Dataset UID: ", canonical.uid)
print("Dataset fingerprint: ", canonical.fingerprint)
Tests are in the datasig/test directory. You can run them with:
python3 -m pytest # Run all tests
python3 -m pytest -s datasig/test/test_csv.py # Run only one test file
python3 -m pytest -s datasig/test/test_csv.py -k test_similarity # Run only one specific test function
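If you add your own test, the existing files are plain pytest modules. The function below is a hypothetical example, not one of the bundled tests; it reuses only the API from the usage example above and assumes that canonicalizing the same data twice yields the same UID.

# Hypothetical test, e.g. in datasig/test/test_example.py
from torchvision.datasets import MNIST
from datasig.dataset import TorchVisionDataset, CanonicalDataset

def test_uid_is_deterministic():
    # Canonicalizing the same dataset twice should give the same UID
    torch_dataset = MNIST(root="/tmp/data", train=True, download=True)
    uid_a = CanonicalDataset(TorchVisionDataset(torch_dataset)).uid
    uid_b = CanonicalDataset(TorchVisionDataset(torch_dataset)).uid
    assert uid_a == uid_b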
The profiling script uses cProfile to profile dataset processing and fingerprint generation. To profile the MNIST dataset from the torch framework, run:
python3 profiling.py torch_mnist --full
The --full argument tells the script to include dataset canonization, UID generation, and fingerprint generation in the profile. If you want to profile only some of these steps, you can cherry-pick them by using or omitting the following arguments instead:
python3 profiling.py torch_mnist --canonical --uid --fingerprint
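If you need to profile a step outside of profiling.py, the same kind of measurement can be reproduced directly with cProfile (Python 3.8+ for the context-manager form). The snippet below is a rough sketch of the idea, not the script's actual implementation, and reuses only the usage API shown earlier:

import cProfile
import pstats

from torchvision.datasets import MNIST
from datasig.dataset import TorchVisionDataset, CanonicalDataset

dataset = TorchVisionDataset(MNIST(root="/tmp/data", train=True, download=True))

# Profile canonization plus UID and fingerprint generation
with cProfile.Profile() as profiler:
    canonical = CanonicalDataset(dataset)
    _ = canonical.uid
    _ = canonical.fingerprint

# Print the 20 most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)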
You can optionally specify the datasig config version to use (at the time of writing we have only v0) with:
python3 profiling.py torch_mnist -v 0 --all
Currently we support only one target dataset: torch_mnist. To add another dataset, add a class in profiling.py similar to TorchMNISTV0 that implements the _setup() method, which is responsible for loading the dataset.
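For illustration, a new FashionMNIST target might look like the sketch below. This is hypothetical: the class name, the attribute set in _setup(), and whatever base class or extra methods profiling.py expects are assumptions, so mirror TorchMNISTV0 for the exact structure.

# Hypothetical profiling target; mirror TorchMNISTV0 in profiling.py
from torchvision.datasets import FashionMNIST
from datasig.dataset import TorchVisionDataset

class TorchFashionMNISTV0:
    def _setup(self):
        # Load the dataset and wrap it for datasig
        # (the attribute name `dataset` is an assumption)
        torch_dataset = FashionMNIST(root="/tmp/data", train=True, download=True)
        self.dataset = TorchVisionDataset(torch_dataset)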