Skip to content

koudounasalkis/UnSLU-BENCH

Repository files navigation

Machine Unlearning on Spoken Language Understanding

📖 Overview

This repository contains the detailed experimental results for the paper "Alexa, can you forget me?” Machine Unlearning Benchmark on Spoken Language Understanding, accepted at INTERSPEECH 2025.

paper paper SLU-models-collection UNLEARNING-models-collection

🔗 Table of Contents

⚙️ Experimental Setup

Transformer models. For English datasets (FSC and SLURP), we use wav2vec 2.0 and HuBERT models, while for ITALIC and SpeechMASSIVE we employ the multilingual XLS-R-128 and XLS-R-53 models, with the latter ASR fine-tuned for the target language (i.e., Italian, German, French). On FSC, the models are trained for 2800 steps, with 8 batch size, 10% warmup steps, 4 gradient accumulation steps, 5e-4 initial learning rate, AdamW with weight decay, plateau scheduler, and early stopping criterion. SLURP follows the same configuration, but training lasts for 30000 steps. On ITALIC and SpeechMASSIVE, both XLS-R-128 and specialized XLS-R-53 models are trained with the same configuration, but for 15000 total steps, 1e-4 initial learning rate, and 1 gradient accumulation step.

Unlearning methods. All unlearning methods were tested for a single epoch and with the hyperparameters recommended in the original papers. For example, in cf-k, the entire network was frozen except the last layer (cf-1). With SCRUB, a temperature T = 4.0 was used, and with Bad Teaching, the KL temperature is 1. In UNSIR, the learning rate of the noise was set to 0.01. AdamW was used as the optimizer for all methods to remain compliant with the training.

Datasets. It is possible to obtain the datasets used in the experiments by running the relative notebooks present in the unlearning_datasets_creation folder. Once the datasets are created, they can be found in the data_name folder, where name is the name of the dataset. The datasets are FSC (data_fsc), SLURP* (data_slurp*), ITALIC (data_italic), SpeechMASSIVE (data_sm-de and data_sm-fr).

All experiments are conducted on a single NVIDIA A6000 48GB GPU.


📊 Detailed Results

1. Comparison of unlearning methods

Legend:

  • Bold = best result
  • Underlined = second best

A. FSC

Method F1_Test Acc_Test F1_Forget Acc_Forget MIA GUM Speedup F1_Test Acc_Test F1_Forget Acc_Forget MIA GUM Speedup
wav2vec 2.0 HuBERT
Orig. 0.994 0.994 1.000 1.000 0.508 0.000 1.00× 0.993 0.991 1.000 1.000 0.511 0.000 1.00×
Gold 0.993 0.992 0.997 0.998 0.503 0.000 1.00× 0.991 0.990 0.996 0.998 0.507 0.000 1.00×
FT 0.993 0.993 0.999 0.999 0.504 0.517 7.96× 0.979 0.979 0.993 0.992 0.508 0.514 7.69×
NG 0.987 0.988 0.976 0.984 0.501 0.816 206.9× 0.992 0.990 0.996 0.996 0.514 0.000 201.1×
NG+ 0.994 0.994 0.994 0.996 0.493 0.000 4.03× 0.979 0.983 0.929 0.955 0.510 0.336 3.90×
CF-k 0.994 0.994 1.000 1.000 0.501 0.606 16.97× 0.993 0.991 1.000 1.000 0.505 0.642 26.70×
UNSIR 0.991 0.992 1.000 1.000 0.506 0.447 6.55× 0.994 0.992 0.998 0.998 0.508 0.484 6.38×
BT 0.993 0.994 1.000 1.000 0.508 0.000 4.78× 0.993 0.991 0.999 0.999 0.504 0.363 4.65×
BT-L 0.994 0.994 0.996 0.998 0.506 0.431 5.87× 0.993 0.991 0.997 0.998 0.506 0.464 5.69×
SCRUB 0.994 0.994 1.000 1.000 0.506 0.439 6.21× 0.993 0.992 0.998 0.999 0.508 0.479 6.22×

B. SLURP*

Method F1_Test Acc_Test F1_Forget Acc_Forget MIA GUM Speedup F1_Test Acc_Test F1_Forget Acc_Forget MIA GUM Speedup
wav2vec 2.0 HuBERT
Orig. 0.689 0.815 1.000 0.999 0.628 0.000 1.000× 0.712 0.830 1.000 1.000 0.613 0.000 1.000×
Gold 0.707 0.825 0.711 0.822 0.506 0.000 1.000× 0.704 0.826 0.715 0.821 0.492 0.000 1.000×
FT 0.638 0.750 0.970 0.989 0.648 0.000 83.78× 0.734 0.827 1.000 1.000 0.611 0.088 79.00×
NG 0.695 0.809 0.986 0.988 0.604 0.563 1748× 0.718 0.830 0.959 0.993 0.587 0.587 1654×
NG+ 0.701 0.810 0.995 0.993 0.603 0.446 41.63× 0.630 0.777 0.852 0.913 0.453 0.578 39.30×
CF-k 0.709 0.818 1.000 1.000 0.626 0.089 291.9× 0.715 0.831 1.000 1.000 0.608 0.196 274.2×
UNSIR 0.673 0.801 1.000 1.000 0.637 0.000 64.07× 0.722 0.832 1.000 1.000 0.613 0.000 60.44×
BT 0.710 0.815 0.999 0.999 0.619 0.275 50.35× 0.711 0.830 1.000 1.000 0.613 0.000 47.42×
BT-L 0.680 0.811 0.995 0.998 0.637 0.000 61.74× 0.685 0.792 0.907 0.954 0.558 0.578 58.11×
SCRUB 0.697 0.824 0.999 0.999 0.608 0.429 64.82× 0.704 0.832 1.000 1.000 0.600 0.350 65.40×

C. ITALIC

Method F1_Test Acc_Test F1_Forget Acc_Forget MIA GUM Speedup F1_Test Acc_Test F1_Forget Acc_Forget MIA GUM Speedup
XLS-R 128 XLS-R 53-IT
Orig. 0.688 0.735 0.894 0.965 0.632 0.000 1.000× 0.778 0.837 1.000 1.000 0.615 0.000 1.000×
Gold 0.643 0.709 0.568 0.631 0.532 0.000 1.000× 0.784 0.835 0.736 0.824 0.478 0.000 1.000×
FT 0.638 0.686 0.671 0.739 0.555 0.590 30.80× 0.711 0.778 0.850 0.899 0.550 0.551 31.10×
NG 0.679 0.728 0.868 0.933 0.603 0.646 613.4× 0.590 0.688 0.621 0.767 0.525 0.766 623.0×
NG+ 0.658 0.710 0.001 0.007 0.932 0.000 15.14× 0.743 0.800 0.936 0.950 0.582 0.418 15.37×
CF-k 0.677 0.737 0.871 0.963 0.626 0.253 98.59× 0.781 0.838 1.000 1.000 0.609 0.201 98.99×
UNSIR 0.636 0.701 0.830 0.925 0.621 0.328 22.01× 0.775 0.838 1.000 1.000 0.612 0.109 22.26×
BT 0.683 0.734 0.639 0.720 0.481 0.504 17.90× 0.731 0.803 0.848 0.903 0.557 0.491 17.94×
BT-L 0.686 0.734 0.651 0.749 0.518 0.558 22.02× 0.729 0.801 0.876 0.913 0.564 0.499 22.21×
SCRUB 0.442 0.464 0.357 0.377 0.533 0.536 23.25× 0.770 0.809 0.990 0.982 0.610 0.164 22.66×

D. SpeechMASSIVE (DE)

Method F1_Test Acc_Test F1_Forget Acc_Forget MIA GUM Speedup F1_Test Acc_Test F1_Forget Acc_Forget MIA GUM Speedup
XLS-R 128 XLS-R 53-DE
Orig. 0.584 0.681 0.841 0.938 0.621 0.000 1.000× 0.778 0.804 1.000 1.000 0.622 0.000 1.000×
Gold 0.566 0.672 0.529 0.674 0.513 0.000 1.000× 0.745 0.795 0.706 0.818 0.493 0.000 1.000×
FT 0.498 0.579 0.548 0.694 0.543 0.588 34.34× 0.661 0.729 0.905 0.938 0.585 0.464 17.79×
NG 0.550 0.651 0.726 0.818 0.562 0.797 1078× 0.764 0.796 0.957 0.985 0.587 0.643 558.7×
NG+ 0.540 0.635 0.567 0.700 0.487 0.522 16.89× 0.759 0.789 0.878 0.944 0.568 0.431 8.770×
CF-k 0.587 0.682 0.865 0.941 0.622 0.000 109.9× 0.777 0.803 1.000 1.000 0.616 0.208 56.93×
UNSIR 0.565 0.649 0.788 0.924 0.616 0.197 27.46× 0.785 0.804 1.000 1.000 0.619 0.114 14.23×
BT 0.584 0.682 0.789 0.912 0.582 0.489 20.02× 0.726 0.783 0.945 0.976 0.585 0.418 10.41×
BT-L 0.584 0.681 0.786 0.912 0.576 0.523 24.87× 0.729 0.784 0.948 0.979 0.587 0.434 12.94×
SCRUB 0.584 0.682 0.780 0.918 0.600 0.429 26.86× 0.781 0.800 1.000 1.000 0.615 0.211 13.43×

E. SpeechMASSIVE (FR)

Method F1_Test Acc_Test F1_Forget Acc_Forget MIA GUM Speedup F1_Test Acc_Test F1_Forget Acc_Forget MIA GUM Speedup
XLS-R 128 XLS-R 53-FR
Orig. 0.410 0.543 0.572 0.733 0.629 0.000 1.000× 0.756 0.815 1.000 1.000 0.635 0.000 1.000×
Gold 0.469 0.618 0.460 0.580 0.509 0.000 1.000× 0.772 0.807 0.800 0.825 0.520 0.000 1.000×
FT 0.400 0.527 0.465 0.589 0.539 0.545 18.12× 0.759 0.816 0.974 0.997 0.627 0.255 18.42×
NG 0.317 0.438 0.349 0.491 0.564 0.749 597.3× 0.768 0.815 0.935 0.979 0.617 0.501 610.2×
NG+ 0.382 0.501 0.008 0.028 0.882 0.000 8.900× 0.759 0.807 0.943 0.982 0.620 0.317 9.230×
CF-k 0.436 0.551 0.594 0.767 0.612 0.414 58.23× 0.770 0.815 1.000 1.000 0.624 0.338 58.86×
UNSIR 0.420 0.548 0.591 0.755 0.620 0.259 14.67× 0.768 0.815 1.000 1.000 0.633 0.089 14.94×
BT 0.411 0.544 0.583 0.742 0.597 0.409 10.60× 0.772 0.816 0.981 0.994 0.621 0.317 10.82×
BT-L 0.412 0.543 0.574 0.739 0.591 0.447 13.18× 0.727 0.789 0.981 0.985 0.623 0.306 13.42×
SCRUB 0.409 0.539 0.532 0.702 0.611 0.358 13.68× 0.769 0.814 1.000 1.000 0.633 0.089 13.94×

2. Best Learning Rate (LR) for Unlearning Methods

The following table shows the best LR value that maximizes the Global Unlearning Metric (GUM) proposed in the paper for each method, dataset, and model.

Model fsc HuBERT fsc wav2vec 2.0 slurp HuBERT slurp wav2vec 2.0 italic XLS-R 128 italic XLS-R 53-IT sm-de XLS-R 128- sm-de XLS-R 53-DE sm-fr XLS-R 128 sm-fr XLS-R 53-FR
FT 0.0001 1e-05 1e-05 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 1e-05
NG 1e-06 5e-06 5e-06 5e-06 1e-06 5e-06 5e-06 5e-06 5e-06 5e-06
NG+ 1e-06 1e-06 5e-06 1e-06 1e-06 5e-07 5e-07 1e-06 1e-06 5e-07
CF-k 5e-05 0.0001 0.0001 5e-05 0.0001 0.0001 0.0001 1e-05 0.0001 5e-05
UNSIR 1e-05 0.0001 1e-05 0.0001 0.0001 1e-05 5e-05 1e-05 1e-05 1e-05
BT 1e-06 1e-06 1e-06 5e-06 1e-06 1e-06 1e-06 1e-06 1e-06 5e-06
BT-L 1e-06 5e-07 1e-06 1e-06 1e-06 1e-06 1e-06 1e-06 1e-06 1e-06
SCRUB 5e-06 1e-06 5e-06 5e-06 5e-06 5e-06 5e-06 5e-06 5e-06 5e-07

For further details, please refer to the accompanying paper.

📜 License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

📧 Contact

For any inquiries or feedback, please contact Alkis Koudounas and Claudio Savelli.

📄 Citation

If you find this repository useful, please consider citing our paper:

@inproceedings{koudounas2025unlearning,
  title={"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding},
  author={Koudounas, Alkis and Savelli, Claudio and Giobergia, Flavio and Baralis, Elena},
  booktitle={Proc. Interspeech 2025}, 
  year={2025},
}

About

This repo contains the code for <<"Alexa, can you forget me?” Machine Unlearning Benchmark on Spoken Language Understanding>>

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors