Here, we explain each field in the config files. Note that only the ramia field is new compared to the base config file meant for membership inference attacks.
- ramia: Configurations of range membership inference attacks
  - range_function: The type of range function used when running ramia.
  - sample_size: The number of samples drawn from each range to compute that range's membership score.
  - radius: The size of the range.
  - transformations: A list of transformations applied to the range center when using the `geometric` range function.
  - mask_model: The name of the LLM used to replace masked tokens in the sequence when using the `word_replace` range function.
  - mask_tokenizer: The name of the tokenizer of the `mask_model` when using the `word_replace` range function.
  - num_masks: The number of words to be replaced when using the `word_replace` range function.
- run: Configurations related to this specific run
  - random_seed: Integer specifying the random seed. Each run of experiments will use the same random seed.
  - log_dir: Path where all information is saved, including models and computed signals. If the directory already contains models, they are loaded instead of trained, so running experiments with new models requires changing the log_dir.
  - time_log: Indicates whether to log the time for each step. If `True`, a time log is saved.
  - num_experiments: Number of target models to attack. If it is more than 1, an aggregate report is generated at the end.
- audit: Configurations related to auditing
  - privacy_game: Indicates the type of privacy game/notion. We currently support the `privacy_loss_model` game. We will add more games in the future.
  - algorithm: The membership inference attack used for auditing. We currently support RMIA, introduced by Zarifzadeh et al. 2024 (https://openreview.net/pdf?id=sT7UJh5CTc), and the LOSS attack.
  - num_ref_models: Number of reference models used to audit each target model.
  - device: The device used for inferring signals and auditing models.
  - report_log: The folder name where the log and auditing report are saved.
  - batch_size: Batch size for evaluating models and inferring signals.
  - data_size: The size of the dataset used in auditing. If not specified, the entire dataset is used. Must be an even number. The sampled auditing dataset will contain equal numbers of IN and OUT data samples according to the membership information from the first target model.
- train: Configuration related to training
  - model_name: The model type. We support CNN, wrn28-1, wrn28-2, wrn28-10, vgg16, mlp, gpt2 and speedyresnet. More model types can be added in `/models/`.
  - tokenizer: The tokenizer type. It can be any tokenizer or local checkpoint supported by the `transformers` library. For non-text datasets, this field can be dropped.
  - device: The device used for training models. Note that for `transformers`, the behavior of Hugging Face's `Trainer` class is to use all available GPUs.
  - batch_size: Batch size for training models.
  - learning_rate: Learning rate for training models.
  - weight_decay: Weight decay for training models.
  - epochs: Number of epochs for training models.
  - optimizer: Optimizer for training models. We support `SGD`, `Adam` and `AdamW`. More optimizers can be added in `get_optimizer` in `trainers/default_trainer.py`.
  - peft: Configuration related to peft. It can be dropped if not needed.
- data: Configuration related to datasets
  - dataset: The name of the dataset. We support cifar10, cifar100, purchase100, texas100 and agnews by default.
  - data_dir: The directory where the dataset is stored. If the dataset is not found in the directory, it will be downloaded.
  - tokenize: Indicates whether to tokenize the dataset. If `True`, the dataset will be tokenized using the tokenizer specified in the next field. It can be dropped if not needed.
  - tokenizer: The tokenizer type. It can be any tokenizer or local checkpoint supported by the `transformers` library. For non-text datasets, this field can be dropped.
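
To make the field layout concrete, here is a minimal sketch that mirrors the five groups above as a nested Python dictionary and checks two of the stated constraints. Several things in it are assumptions and not taken from the project: the on-disk format of the config files is not described in this section, so a plain dictionary stands in for it; every value (sizes, seed, paths, the `"cpu"` device string, the exact spelling `"RMIA"`, and the transformation names) is a placeholder; and `check_config` is a hypothetical helper, not a function of the codebase. Treating the range-function-specific fields as mandatory is also an assumption, since the text only says when they are used.

```python
# Hypothetical illustration of the documented fields. All values are
# placeholders, not defaults or recommendations from the project.
example_config = {
    "ramia": {
        "range_function": "geometric",
        "sample_size": 10,
        "radius": 1,
        # Placeholder names; the accepted transformation values are not
        # listed in this section.
        "transformations": ["<transformation-1>", "<transformation-2>"],
    },
    "run": {
        "random_seed": 1234,
        "log_dir": "logs/demo",  # change this to train new models
        "time_log": True,
        "num_experiments": 1,
    },
    "audit": {
        "privacy_game": "privacy_loss_model",
        "algorithm": "RMIA",  # assumed spelling of the option value
        "num_ref_models": 1,
        "device": "cpu",
        "report_log": "report",
        "batch_size": 128,
        "data_size": 1000,  # must be even
    },
    "train": {
        "model_name": "CNN",
        "device": "cpu",
        "batch_size": 128,
        "learning_rate": 0.1,
        "weight_decay": 0.0,
        "epochs": 10,
        "optimizer": "SGD",
        # "tokenizer" and "peft" are dropped: not needed for a non-text dataset.
    },
    "data": {
        "dataset": "cifar10",
        "data_dir": "data",
        # "tokenize" and "tokenizer" are dropped for a non-text dataset.
    },
}

# Fields the text associates with a particular range function.
# Assumption: they are required whenever that range function is selected.
RANGE_FUNCTION_FIELDS = {
    "geometric": ["transformations"],
    "word_replace": ["mask_model", "mask_tokenizer", "num_masks"],
}


def check_config(config):
    """Return a list of problem descriptions; an empty list means none found."""
    problems = []

    ramia = config.get("ramia", {})
    range_function = ramia.get("range_function")
    for field in RANGE_FUNCTION_FIELDS.get(range_function, []):
        if field not in ramia:
            problems.append(
                f"ramia.{field} is expected when range_function is {range_function!r}"
            )

    # data_size is optional, but must be even when given so that the audit
    # set can hold equal numbers of IN and OUT samples.
    data_size = config.get("audit", {}).get("data_size")
    if data_size is not None and data_size % 2 != 0:
        problems.append("audit.data_size must be an even number")

    return problems


if __name__ == "__main__":
    for problem in check_config(example_config):
        print(problem)
```

Returning a list of problems, as opposed to raising on the first one, lets a single pass report every issue in a config at once.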