Audio Tampering Techniques for Deepfake Detection Evaluation

Date: 2025-11-23
Purpose: Evaluate deepfake detection models against realistic audio tampering attacks


Overview

This document describes two complementary audio tampering techniques used to evaluate the robustness of deepfake detection models:

| Technique | Dataset | Files | Purpose |
|---|---|---|---|
| Trans-Splicing | In-house Trans-Splicing | 1,932 | Test detection of TTS-generated word insertions |
| Semantic Tampering | In-house Semantic | 41 tampered + 9 original | Test detection of forensic audio modifications |

1. Trans-Splicing Technique

1.1 Overview

Trans-splicing creates tampered audio by replacing specific words in authentic speech with corresponding words generated by Text-to-Speech (TTS) systems. This simulates realistic deepfake attacks where an adversary modifies what a speaker appears to say.

1.2 Technical Process

┌─────────────────────────────────────────────────────────────────────────────┐
│                    TRANS-SPLICING PIPELINE                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
│  │   TARGET     │    │    TTS       │    │   TAMPERED   │                   │
│  │   AUDIO      │    │   SYSTEM     │    │   OUTPUT     │                   │
│  │              │    │              │    │              │                   │
│  │ "The water   │    │ XTTS/YourTTS │    │ "The [water] │                   │
│  │  is cold"    │    │   Cloning    │    │  is cold"    │                   │
│  └──────┬───────┘    └──────┬───────┘    └──────────────┘                   │
│         │                   │                    ▲                           │
│         ▼                   ▼                    │                           │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                   │
│  │  WORD-LEVEL  │    │   GENERATE   │    │   PROSODY    │                   │
│  │  ALIGNMENT   │    │   DONOR      │    │   MATCHING   │                   │
│  │  (Whisper)   │    │   WORDS      │    │  & SPLICING  │                   │
│  │              │    │              │    │              │                   │
│  │ t=6.45-6.76s │───▶│  "water"     │───▶│  Crossfade   │                   │
│  │ word="water" │    │  (cloned)    │    │   2-5ms      │                   │
│  └──────────────┘    └──────────────┘    └──────────────┘                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

1.3 Processing Steps

  1. Word-Level Transcription: Whisper ASR identifies word boundaries with timestamps
  2. Candidate Selection: Multiple words selected for replacement (typically 3-7 per file)
  3. TTS Generation: Donor words generated using voice-cloned TTS
  4. Prosody Matching: Donor segments adjusted to match target acoustic properties
  5. Splicing: Crossfade (2-5ms) applied at splice boundaries
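
As a rough illustration of step 5, the sketch below replaces one word-sized span of a target signal with a donor segment and blends the two at each boundary. It assumes mono floating-point arrays, a donor that has already been time-matched to the gap (step 4), and a linear ramp; the source does not state the crossfade shape, and the function name `splice_with_crossfade` is hypothetical.

```python
import numpy as np

def splice_with_crossfade(target, donor, sr, start_s, end_s, fade_ms=3.0):
    """Replace target[start_s:end_s] with donor, blending at both boundaries.

    Assumption: the donor is already the same length as the gap it fills.
    """
    start = int(round(start_s * sr))
    end = int(round(end_s * sr))
    seg_len = end - start
    n_fade = int(round(fade_ms * 1e-3 * sr))
    if len(donor) != seg_len:
        raise ValueError("donor must be time-matched to the target span")
    if 2 * n_fade > seg_len:
        raise ValueError("crossfade is longer than the replaced span")

    out = np.asarray(target, dtype=np.float64).copy()
    # Donor weight: ramps 0 -> 1 at the left edge and 1 -> 0 at the right edge.
    weights = np.ones(seg_len)
    ramp = np.linspace(0.0, 1.0, n_fade, endpoint=False)
    weights[:n_fade] = ramp
    weights[seg_len - n_fade:] = ramp[::-1]
    out[start:end] = weights * donor + (1.0 - weights) * out[start:end]
    return out
```

At 16 kHz a 3 ms fade covers 48 samples, so the blend stays well inside even a short word. The output has the same length as the target because the span is overwritten rather than resized.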

1.4 TTS Systems Used

| System | Description | Voice Cloning |
|---|---|---|
| XTTS | Coqui X-TTS multilingual model | Zero-shot cloning from ~10s reference |
| YourTTS | Multi-speaker TTS system | Cross-lingual voice conversion |

1.5 Processing Variants

| Variant | Description |
|---|---|
| Clean | Direct TTS output with basic normalization |
| Unclean | Additional processing (noise, compression artifacts) |

1.6 Dataset Statistics

Trans-Splicing Dataset
├── xtts-clean/     (506 files)
├── xtts-unclean/   (508 files)
├── yourtts-clean/  (536 files)
└── yourtts-unclean/(382 in protocol, 512 on disk*)
                    ─────────
Total:              1,932 in evaluation protocol (2,062 files on disk)

*130 additional yourtts-unclean files exist on disk but were excluded from the evaluation protocol.
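
The totals above can be checked with a short tally. The dictionary names are illustrative; the counts are the ones listed in the tree.

```python
# Per-subset file counts as listed in the dataset tree.
protocol = {
    "xtts-clean": 506,
    "xtts-unclean": 508,
    "yourtts-clean": 536,
    "yourtts-unclean": 382,
}
# On disk, only yourtts-unclean differs from the protocol.
on_disk = dict(protocol)
on_disk["yourtts-unclean"] = 512

protocol_total = sum(protocol.values())
disk_total = sum(on_disk.values())
excluded = disk_total - protocol_total

print(protocol_total, disk_total, excluded)  # 1932 2062 130
```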

1.7 Sample Metadata Structure

{
  "metadata": {
    "target_copy": "xtts-clean/df_sub099/audio0.wav",
    "donor_copy": "xtts-clean/df_sub099/resampled.wav",
    "replacements": [
      {
        "index": 0,
        "word": "water",
        "target_start": 6.448,
        "target_end": 6.760,
        "donor_start": 370.347,
        "donor_end": 370.680,
        "applied_duration_s": 0.312
      }
    ]
  }
}
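
A minimal sketch of reading this record follows. It computes the target and donor span lengths for each replacement and their ratio. In the sample, `applied_duration_s` equals the target span (0.312 s) while the donor span is 0.333 s, which suggests the donor word is time-scaled to fit the slot; that reading is an inference from this one record, and the helper name `summarize_replacements` is hypothetical.

```python
import json

SAMPLE = """
{
  "metadata": {
    "target_copy": "xtts-clean/df_sub099/audio0.wav",
    "donor_copy": "xtts-clean/df_sub099/resampled.wav",
    "replacements": [
      {"index": 0, "word": "water",
       "target_start": 6.448, "target_end": 6.760,
       "donor_start": 370.347, "donor_end": 370.680,
       "applied_duration_s": 0.312}
    ]
  }
}
"""

def summarize_replacements(meta):
    """Return per-word span durations and the target/donor duration ratio."""
    rows = []
    for rep in meta["metadata"]["replacements"]:
        target_dur = rep["target_end"] - rep["target_start"]
        donor_dur = rep["donor_end"] - rep["donor_start"]
        rows.append({
            "word": rep["word"],
            "target_dur_s": round(target_dur, 3),
            "donor_dur_s": round(donor_dur, 3),
            # Ratio below 1 means the donor is longer than the slot it fills.
            "stretch_ratio": round(target_dur / donor_dur, 3),
        })
    return rows

rows = summarize_replacements(json.loads(SAMPLE))
print(rows)
```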

2. Semantic Tampering Technique (Tampered_Deepfake Dataset)

2.1 Overview

Semantic tampering modifies audio to change its meaning through linguistically-motivated operations: deletions, insertions, and substitutions. This simulates forensic tampering where evidence audio is altered.

2.2 Technical Process

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SEMANTIC TAMPERING PIPELINE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  STAGE 1: PREPROCESSING                                                      │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  Input Audio  ──▶  Resample 16kHz  ──▶  Normalize -20dBFS            │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  STAGE 2: ANALYSIS                                                           │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  ┌─────────────┐    ┌─────────────────────┐                          │   │
│  │  │   Whisper   │    │  Montreal Forced    │                          │   │
│  │  │    ASR      │───▶│     Aligner         │                          │   │
│  │  │  (GPU)      │    │    (CPU)            │                          │   │
│  │  │             │    │                     │                          │   │
│  │  │ Word-level  │    │  Phoneme-level      │                          │   │
│  │  │ transcript  │    │  timestamps         │                          │   │
│  │  └─────────────┘    └─────────────────────┘                          │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  STAGE 3: CANDIDATE IDENTIFICATION                                           │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐       │   │
│  │  │   spaCy     │    │  Prosody    │    │    Difficulty       │       │   │
│  │  │   NLP       │───▶│  Analysis   │───▶│    Scoring          │       │   │
│  │  │             │    │  (librosa)  │    │                     │       │   │
│  │  │ POS tagging │    │             │    │ Score = Type_Weight │       │   │
│  │  │ Dependency  │    │ F0, Energy  │    │ + Acoustic_Penalty  │       │   │
│  │  │ parsing     │    │ Duration    │    │ + Rhythm_Penalty    │       │   │
│  │  └─────────────┘    └─────────────┘    └─────────────────────┘       │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  STAGE 4: TAMPERING                                                          │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                                                                       │   │
│  │  DELETION         INSERTION         SUBSTITUTION                     │   │
│  │  ─────────        ──────────        ────────────                     │   │
│  │  Remove word      Add "not",        Swap "all"↔"some"                │   │
│  │  at phoneme       "never" at        "many"↔"few"                     │   │
│  │  boundaries       natural breaks                                      │   │
│  │                                                                       │   │
│  │  "always on" ──▶ "on"                                                │   │
│  │  "I agree"   ──▶ "I [not] agree"                                     │   │
│  │  "all clear" ──▶ "some clear"                                        │   │
│  │                                                                       │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
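
The level normalization in Stage 1 could look like the sketch below. It assumes that "-20dBFS" refers to RMS level with full scale at amplitude 1.0; the source does not say whether RMS or peak is meant. Resampling to 16 kHz is left to an audio library and is not shown. The function names are hypothetical.

```python
import numpy as np

def rms_dbfs(audio):
    """RMS level in dB relative to full scale (amplitude 1.0)."""
    audio = np.asarray(audio, dtype=np.float64)
    rms = np.sqrt(np.mean(audio ** 2))
    return 20.0 * np.log10(rms)

def normalize_rms_dbfs(audio, target_dbfs=-20.0, eps=1e-12):
    """Scale audio so its RMS level equals target_dbfs (assumed RMS, not peak)."""
    audio = np.asarray(audio, dtype=np.float64)
    rms = np.sqrt(np.mean(audio ** 2))
    if rms < eps:
        # Silent input: nothing to scale.
        return audio.copy()
    target_rms = 10.0 ** (target_dbfs / 20.0)
    return audio * (target_rms / rms)

# One second of a 220 Hz tone at 16 kHz, then normalized to -20 dBFS RMS.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)
normalized = normalize_rms_dbfs(tone)
print(round(rms_dbfs(normalized), 2))
```

This sketch does not guard against clipping; a signal with a high crest factor may need a peak check after scaling.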

2.3 Tampering Operations

| Operation | Description | Example | Type Weight |
|---|---|---|---|
| Deletion | Remove adjectives/adverbs | "always on" → "on" | 1.0 |
| Insertion | Add negations | "I agree" → "I not agree" | 2.0 |
| Substitution | Swap quantifiers | "all done" → "some done" | 2.5 |
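
A deletion can be sketched as cutting the samples between two splice points and blending the edges that are left. The version below is an illustration only: it assumes mono floating-point audio, splice points given in milliseconds, and a short linear crossfade across the joint (the source does not specify how deletion joints are smoothed). The name `delete_segment` is hypothetical.

```python
import numpy as np

def delete_segment(audio, sr, start_ms, end_ms, fade_ms=3.0):
    """Remove audio[start_ms:end_ms] and crossfade the two remaining edges."""
    audio = np.asarray(audio, dtype=np.float64)
    start = int(round(start_ms * sr / 1000))
    end = int(round(end_ms * sr / 1000))
    n_fade = int(round(fade_ms * sr / 1000))
    if not 0 <= start < end <= len(audio):
        raise ValueError("splice points must lie inside the signal")
    if n_fade > start or n_fade > len(audio) - end:
        raise ValueError("not enough audio around the cut for the crossfade")

    left, right = audio[:start], audio[end:]
    if n_fade == 0:
        return np.concatenate([left, right])
    # Overlap the tail of the left part with the head of the right part.
    fade_out = np.linspace(1.0, 0.0, n_fade)
    blended = left[-n_fade:] * fade_out + right[:n_fade] * (1.0 - fade_out)
    return np.concatenate([left[:-n_fade], blended, right[n_fade:]])
```

Because the two edges overlap during the fade, the output is shorter than the input by the deleted span plus the fade length.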

2.4 Difficulty Scoring

Difficulty_Score = Type_Weight + Acoustic_Penalty + Rhythm_Penalty

Where:
- Type_Weight: Base difficulty (Deletion=1.0, Insertion=2.0, Substitution=2.5)
- Acoustic_Penalty: Prosodic discontinuity with neighboring phonemes (0-4)
- Rhythm_Penalty: Duration_ms / 100 (longer = easier to detect)

Classification:
- Easy:   Score < 4.0
- Medium: 4.0 ≤ Score ≤ 8.0
- Hard:   Score > 8.0
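
The scoring rule can be written as a small function. This is a sketch under the definitions above: the acoustic penalty is taken as an input in the range 0 to 4 (how it is measured is not shown here), both 4.0 and 8.0 are treated as medium, and the names `TYPE_WEIGHTS` and `difficulty` are hypothetical.

```python
TYPE_WEIGHTS = {"deletion": 1.0, "insertion": 2.0, "substitution": 2.5}

def difficulty(tamper_type, acoustic_penalty, duration_ms):
    """Return (score, level) for one tampering candidate."""
    if not 0.0 <= acoustic_penalty <= 4.0:
        raise ValueError("acoustic_penalty is defined on the range 0-4")
    rhythm_penalty = duration_ms / 100.0
    score = TYPE_WEIGHTS[tamper_type] + acoustic_penalty + rhythm_penalty
    if score < 4.0:
        level = "easy"
    elif score <= 8.0:
        level = "medium"
    else:
        level = "hard"
    return round(score, 2), level

# Illustrative inputs: a 191 ms deletion with the maximum acoustic penalty.
print(difficulty("deletion", 4.0, 191))  # (6.91, 'medium')
```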

2.5 Dataset Statistics

Semantic Tampering Dataset
├── original/    (9 bonafide files)
└── tampered/    (41 tampered files)
                 ─────────
Total:           50 files (for EER computation)

Tampering Types:
└── Deletion: 41 (100% of tampered)

2.6 Sample Metadata Structure

{
  "tamper_filename": "df_sub096_LP1_result_1_T009_M_DEL_back.wav",
  "source_audio": "df_sub096_LP1_result_1.wav",
  "candidate_details": {
    "word": "back",
    "pos_tag": "ADV",
    "type": "adjective_adverb",
    "tamper_type": "deletion",
    "difficulty": {
      "score": 6.91,
      "level": "medium",
      "type_weight": 1.0,
      "acoustic_penalty": 4.0,
      "rhythm_penalty": 1.91
    }
  },
  "tamper_details": {
    "splice_points_ms": [381, 570],
    "original_duration_ms": 13600,
    "tampered_duration_ms": 13406
  }
}

3. Comparison of Techniques

| Aspect | Trans-Splicing | Semantic Tampering |
|---|---|---|
| Goal | Voice impersonation | Evidence manipulation |
| Scale | Multiple words replaced | Single word modified |
| Detection Challenge | TTS artifacts, prosody mismatch | Subtle acoustic discontinuities |
| Real-world Application | Deepfake audio | Forensic tampering |
| Sample Size | 1,932 files | 50 files (41 tampered + 9 original) |
| Has Bonafide Samples | No | Yes |
| Evaluation Metric | Detection Rate | EER |

4. Scientific References

Trans-Splicing / TTS Voice Cloning

  • Casanova, E., et al. (2022). "YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone." ICML.
  • Coqui AI. (2023). "XTTS: A Massively Multilingual Zero-Shot Text-to-Speech Model."

Semantic Tampering

  • McAuliffe, M., et al. (2017). "Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi." Interspeech.
  • Malik, H. (2022). "Audio Forensics: Background, Methods, and Future." Multimedia Security.
  • Almutairi, Z., & Elgibreen, H. (2022). "A Review of Modern Audio Deepfake Detection Methods." Applied Sciences.

Audio Analysis Tools

  • McFee, B., et al. (2015). "librosa: Audio and Music Signal Analysis in Python." SciPy.
  • Radford, A., et al. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision." arXiv. (Whisper)

5. Evaluation Methodology

5.1 For Trans-Splicing

Since no bonafide samples are available, EER cannot be computed (it requires scores from both classes):

  • Primary Metric: Tamper Detection Rate = % classified as spoof
  • Breakdown: By TTS system (XTTS vs YourTTS) and processing (Clean vs Unclean)
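
The per-subset breakdown can be computed as in the sketch below. The input format, a list of `(subset, predicted_label)` pairs with labels `"spoof"` and `"bonafide"`, is an assumption made for illustration, as is the function name `detection_rates`.

```python
from collections import defaultdict

def detection_rates(predictions):
    """Percentage of files classified as spoof, per subset.

    predictions: iterable of (subset_name, predicted_label) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # subset -> [spoof_hits, total]
    for subset, label in predictions:
        counts[subset][1] += 1
        if label == "spoof":
            counts[subset][0] += 1
    return {s: 100.0 * hits / total for s, (hits, total) in counts.items()}

# Toy predictions, not real results.
toy = [
    ("xtts-clean", "spoof"),
    ("xtts-clean", "bonafide"),
    ("yourtts-unclean", "spoof"),
]
print(detection_rates(toy))
```

Note that this rate says nothing about false alarms: a detector that labels every file as spoof would score 100% here.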

5.2 For Semantic Tampering (Tampered_Deepfake)

With both bonafide and spoof samples:

  • Primary Metric: Equal Error Rate (EER)
  • Breakdown: By tampering type and difficulty level
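
A minimal EER sketch follows. It assumes that higher scores mean "more likely bonafide" and sweeps the observed scores as thresholds, taking the point where false acceptance and false rejection rates are closest. This is one common way to approximate EER, not necessarily the procedure used for the reported numbers; `compute_eer` is a hypothetical name.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER; assumes higher score = more likely bonafide.

    Returns (eer, threshold).
    """
    bona = np.asarray(bonafide_scores, dtype=np.float64)
    spoof = np.asarray(spoof_scores, dtype=np.float64)
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # False rejection: bonafide scored below the threshold.
    frr = np.array([np.mean(bona < t) for t in thresholds])
    # False acceptance: spoof scored at or above the threshold.
    far = np.array([np.mean(spoof >= t) for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Toy scores, not real results: perfectly separated classes give EER 0.
eer, thr = compute_eer([0.9, 0.8, 0.95], [0.1, 0.2, 0.3])
print(eer)
```

With only 9 bonafide files, the false rejection rate moves in steps of 1/9 (about 11 percentage points), so EER values on this set are coarse.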

Document Version: 1.0
Generated: 2025-11-23