Set up the environment (the PyTorch installation may differ depending on your GPU setup):
conda env create --file=env.yaml
conda activate spoofDetect
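If the PyTorch build in the environment does not match your GPU, you can reinstall it for your CUDA version. A minimal example, assuming CUDA 12.1 (adjust the wheel index URL to your driver):
pip install torch --index-url https://download.pytorch.org/whl/cu121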
You also need to install watermark-stealing as a pip package (as instructed below).
The code for watermark-stealing was forked from https://github.com/eth-sri/watermark-stealing/tree/main.
To use a watermark-stealing model, additional files need to be downloaded; refer to watermark-stealing/README.md.
cd watermark-stealing
pip install -e .
Optionally, you may install flash attention (the flash-attn pip package):
pip install flash-attn --no-build-isolation
To reproduce the experiments from the paper, you need to set up a config file for your model, then run:
bash reprompting_pipeline.sh "path to your config" "Y if Learning, N if Stealing" "number of queries" "dataset (either c4 or dolly)" "Y if generating only spoofed text, N if generating both spoofed and xi-watermarked text"
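For example, a hypothetical invocation in the Stealing setting with 1000 queries on c4, generating both spoofed and xi-watermarked text (the config path is illustrative; substitute your own):
bash reprompting_pipeline.sh configs/generated/example.yaml N 1000 c4 N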
This will generate text for both the Reprompting and Normal methods. To then compute p-values, run:
python generate_pvalues.py --cfg_path "path to your config" --reprompting "Y/N" --dataset "c4/dolly" --token_target "value for T"
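For example, using the same illustrative config, with Reprompting enabled on c4 and a token target T of 800 (both the config path and the T value are placeholders):
python generate_pvalues.py --cfg_path configs/generated/example.yaml --reprompting Y --dataset c4 --token_target 800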
The generated p-values are saved as .csv files in the data/pvalues folder.
All the config files used for the experiments in the paper can be found in configs/generated.
Additionally, a configuration file generator is available in generate_configs.ipynb.
Thibaud Gloaguen, [email protected]
Nikola Jovanović, [email protected]
Robin Staab, [email protected]
Martin Vechev
If you use our code, please cite the following:
@misc{gloaguen2025discoveringspoofingattemptslanguage,
  title={Discovering Spoofing Attempts on Language Model Watermarks},
  author={Thibaud Gloaguen and Nikola Jovanović and Robin Staab and Martin Vechev},
  year={2025},
  eprint={2410.02693},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2410.02693},
}