Date: 2025-11-23

Purpose: Evaluate deepfake detection models against realistic audio tampering attacks
This document describes two complementary audio tampering techniques used to evaluate the robustness of deepfake detection models:
| Technique | Dataset | Files | Purpose |
|---|---|---|---|
| Trans-Splicing | In-house Trans-Splicing | 1,932 | Test detection of TTS-generated word replacements |
| Semantic Tampering | In-house Semantic | 41 tampered + 9 original | Test detection of forensic audio modifications |
Trans-splicing creates tampered audio by replacing specific words in authentic speech with corresponding words generated by Text-to-Speech (TTS) systems. This simulates realistic deepfake attacks where an adversary modifies what a speaker appears to say.
┌─────────────────────────────────────────────────────────────────────────────┐
│ TRANS-SPLICING PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ TARGET │ │ TTS │ │ TAMPERED │ │
│ │ AUDIO │ │ SYSTEM │ │ OUTPUT │ │
│ │ │ │ │ │ │ │
│ │ "The water │ │ XTTS/YourTTS │ │ "The [water] │ │
│ │ is cold" │ │ Cloning │ │ is cold" │ │
│ └──────┬───────┘ └──────┬───────┘ └──────────────┘ │
│ │ │ ▲ │
│ ▼ ▼ │ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ WORD-LEVEL │ │ GENERATE │ │ PROSODY │ │
│ │ ALIGNMENT │ │ DONOR │ │ MATCHING │ │
│ │ (Whisper) │ │ WORDS │ │ & SPLICING │ │
│ │ │ │ │ │ │ │
│ │ t=6.45-6.76s │───▶│ "water" │───▶│ Crossfade │ │
│ │ word="water" │ │ (cloned) │ │ 2-5ms │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
- Word-Level Transcription: Whisper ASR identifies word boundaries with timestamps
- Candidate Selection: Multiple words selected for replacement (typically 3-7 per file)
- TTS Generation: Donor words generated using voice-cloned TTS
- Prosody Matching: Donor segments adjusted to match target acoustic properties
- Splicing: Crossfade (2-5ms) applied at splice boundaries
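The final splicing step can be illustrated with a minimal sketch. It assumes mono audio held as a plain list of float samples, a linear crossfade, and an overlap-style join (so each splice shortens the output by one fade length). The function names `ms_to_samples`, `crossfade_join`, and `splice_word` are hypothetical and are not taken from the actual pipeline; prosody matching is not modelled here.

```python
def ms_to_samples(ms, sample_rate):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000.0))


def crossfade_join(left, right, fade_len):
    """Join two sample lists, overlapping fade_len samples with a linear crossfade."""
    fade_len = min(fade_len, len(left), len(right))
    if fade_len == 0:
        return list(left) + list(right)
    overlap = []
    for i in range(fade_len):
        w = (i + 1) / (fade_len + 1)  # weight of the incoming (right) signal
        overlap.append((1.0 - w) * left[len(left) - fade_len + i] + w * right[i])
    return list(left[:len(left) - fade_len]) + overlap + list(right[fade_len:])


def splice_word(target, donor_word, start_s, end_s, sample_rate, fade_ms=3.0):
    """Replace target[start_s:end_s] with donor_word, crossfading at both boundaries.

    fade_ms defaults to 3 ms, an assumed value inside the 2-5 ms range.
    """
    start = int(round(start_s * sample_rate))
    end = int(round(end_s * sample_rate))
    fade = ms_to_samples(fade_ms, sample_rate)
    out = crossfade_join(target[:start], donor_word, fade)
    return crossfade_join(out, target[end:], fade)
```

In this sketch `start_s` and `end_s` would come from the word-level alignment, and `donor_word` is the already prosody-matched TTS segment. A production implementation would more likely operate on NumPy arrays; plain lists are used here only to keep the example dependency-free.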
| System | Description | Voice Cloning |
|---|---|---|
| XTTS | Coqui XTTS multilingual model | Zero-shot cloning from ~10s reference |
| YourTTS | Multi-speaker TTS system | Cross-lingual voice conversion |
| Variant | Description |
|---|---|
| Clean | Direct TTS output with basic normalization |
| Unclean | Additional processing (noise, compression artifacts) |
Trans-Splicing Dataset
├── xtts-clean/ (506 files)
├── xtts-unclean/ (508 files)
├── yourtts-clean/ (536 files)
└── yourtts-unclean/ (382 in protocol, 512 on disk*)
─────────
Total: 1,932 in evaluation protocol (2,062 files on disk)
*130 additional yourtts-unclean files exist on disk but were excluded from the evaluation protocol.
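The totals above can be cross-checked with a few lines of arithmetic. The dictionary names are illustrative only; the counts are the ones listed in the tree.

```python
# Files per subset as listed in the evaluation protocol.
protocol = {
    "xtts-clean": 506,
    "xtts-unclean": 508,
    "yourtts-clean": 536,
    "yourtts-unclean": 382,
}

# On disk, only yourtts-unclean differs from the protocol.
on_disk = dict(protocol)
on_disk["yourtts-unclean"] = 512

protocol_total = sum(protocol.values())
disk_total = sum(on_disk.values())
excluded = disk_total - protocol_total

print(protocol_total, disk_total, excluded)  # 1932 2062 130
```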
{
"metadata": {
"target_copy": "xtts-clean/df_sub099/audio0.wav",
"donor_copy": "xtts-clean/df_sub099/resampled.wav",
"replacements": [
{
"index": 0,
"word": "water",
"target_start": 6.448,
"target_end": 6.760,
"donor_start": 370.347,
"donor_end": 370.680,
"applied_duration_s": 0.312
}
]
}
}

Semantic tampering modifies audio to change its meaning through linguistically motivated operations: deletions, insertions, and substitutions. This simulates forensic tampering where evidence audio is altered.
┌─────────────────────────────────────────────────────────────────────────────┐
│ SEMANTIC TAMPERING PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: PREPROCESSING │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Input Audio ──▶ Resample 16kHz ──▶ Normalize -20dBFS │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ STAGE 2: ANALYSIS │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ Whisper │ │ Montreal Forced │ │ │
│ │ │ ASR │───▶│ Aligner │ │ │
│ │ │ (GPU) │ │ (CPU) │ │ │
│ │ │ │ │ │ │ │
│ │ │ Word-level │ │ Phoneme-level │ │ │
│ │ │ transcript │ │ timestamps │ │ │
│ │ └─────────────┘ └─────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ STAGE 3: CANDIDATE IDENTIFICATION │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ spaCy │ │ Prosody │ │ Difficulty │ │ │
│ │ │ NLP │───▶│ Analysis │───▶│ Scoring │ │ │
│ │ │ │ │ (librosa) │ │ │ │ │
│ │ │ POS tagging │ │ │ │ Score = Type_Weight │ │ │
│ │ │ Dependency │ │ F0, Energy │ │ + Acoustic_Penalty │ │ │
│ │ │ parsing │ │ Duration │ │ + Rhythm_Penalty │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ STAGE 4: TAMPERING │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ DELETION INSERTION SUBSTITUTION │ │
│ │ ───────── ────────── ──────────── │ │
│ │ Remove word Add "not", Swap "all"↔"some" │ │
│ │ at phoneme "never" at "many"↔"few" │ │
│ │ boundaries natural breaks │ │
│ │ │ │
│ │ "always on" ──▶ "on" │ │
│ │ "I agree" ──▶ "I [not] agree" │ │
│ │ "all clear" ──▶ "some clear" │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
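The level normalization in Stage 1 can be sketched as follows. This assumes that "-20dBFS" refers to the RMS level of float samples in the range [-1, 1]; the pipeline may instead normalize peak level, so treat the choice as an assumption. Resampling to 16 kHz is left to an audio library and is not shown. The function names are hypothetical.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale (1.0); -inf for empty or silent input."""
    if not samples:
        return float("-inf")
    mean_square = sum(x * x for x in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def normalize_to_dbfs(samples, target_dbfs=-20.0):
    """Scale samples so their RMS level matches target_dbfs, clipping to [-1, 1]."""
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # silence: there is nothing to scale
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    # Clipping only triggers for very peaky signals; if it does, the
    # resulting RMS ends up slightly below the target.
    return [max(-1.0, min(1.0, x * gain)) for x in samples]
```

Normalizing every file to the same level before analysis keeps the energy features used in Stage 3 comparable across recordings.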
| Operation | Description | Example | Type Weight |
|---|---|---|---|
| Deletion | Remove adjectives/adverbs | "always on" → "on" | 1.0 |
| Insertion | Add negations | "I agree" → "I not agree" | 2.0 |
| Substitution | Swap quantifiers | "all done" → "some done" | 2.5 |
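As an illustration of the deletion operation, the sketch below removes the samples between two splice points and blends the remaining edges. The short linear crossfade and its 5 ms default are assumptions made for this example, as is the name `delete_segment`; the source only states that words are removed at phoneme boundaries.

```python
def delete_segment(samples, sample_rate, start_ms, end_ms, fade_ms=5.0):
    """Remove samples between start_ms and end_ms, crossfading the two edges."""
    start = int(round(start_ms * sample_rate / 1000.0))
    end = int(round(end_ms * sample_rate / 1000.0))
    if not 0 <= start < end <= len(samples):
        raise ValueError("splice points must satisfy 0 <= start < end <= len(samples)")

    left, right = samples[:start], samples[end:]
    fade = min(int(round(fade_ms * sample_rate / 1000.0)), len(left), len(right))

    # Overlap the last `fade` samples of `left` with the first `fade` of `right`.
    blended = []
    for i in range(fade):
        w = (i + 1) / (fade + 1)  # weight of the audio after the deletion
        blended.append((1.0 - w) * left[len(left) - fade + i] + w * right[i])

    return list(left[:len(left) - fade]) + blended + list(right[fade:])
```

With this overlap-style join, the output is shorter than the input by the deleted span plus one fade length.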
Difficulty_Score = Type_Weight + Acoustic_Penalty + Rhythm_Penalty
Where:
- Type_Weight: Base difficulty (Deletion=1.0, Insertion=2.0, Substitution=2.5)
- Acoustic_Penalty: Prosodic discontinuity with neighboring phonemes (0-4)
- Rhythm_Penalty: Duration_ms / 100, where Duration_ms is the length of the edited segment in milliseconds (longer = easier to detect)
Classification:
- Easy: Score < 4.0
- Medium: Score 4.0-8.0
- Hard: Score > 8.0
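The scoring rule and thresholds above translate directly into code. The sketch below is one possible reading: it assumes that a score of exactly 4.0 or 8.0 counts as Medium, and the function names are illustrative.

```python
TYPE_WEIGHTS = {"deletion": 1.0, "insertion": 2.0, "substitution": 2.5}


def difficulty_score(tamper_type, acoustic_penalty, duration_ms):
    """Difficulty_Score = Type_Weight + Acoustic_Penalty + Rhythm_Penalty."""
    if not 0.0 <= acoustic_penalty <= 4.0:
        raise ValueError("acoustic_penalty must lie in the range 0-4")
    rhythm_penalty = duration_ms / 100.0
    return TYPE_WEIGHTS[tamper_type] + acoustic_penalty + rhythm_penalty


def difficulty_level(score):
    """Map a score to easy (< 4.0), medium (4.0-8.0), or hard (> 8.0)."""
    if score < 4.0:
        return "easy"
    if score <= 8.0:
        return "medium"
    return "hard"


# A deletion with the maximum acoustic penalty and a 191 ms segment.
score = difficulty_score("deletion", 4.0, 191.0)
print(round(score, 2), difficulty_level(score))  # 6.91 medium
```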
Semantic Tampering Dataset
├── original/ (9 bonafide files)
└── tampered/ (41 tampered files)
─────────
Total: 50 files (for EER computation)
Tampering Types:
└── Deletion: 41 (100% of tampered)
{
"tamper_filename": "df_sub096_LP1_result_1_T009_M_DEL_back.wav",
"source_audio": "df_sub096_LP1_result_1.wav",
"candidate_details": {
"word": "back",
"pos_tag": "ADV",
"type": "adjective_adverb",
"tamper_type": "deletion",
"difficulty": {
"score": 6.91,
"level": "medium",
"type_weight": 1.0,
"acoustic_penalty": 4.0,
"rhythm_penalty": 1.91
}
},
"tamper_details": {
"splice_points_ms": [381, 570],
"original_duration_ms": 13600,
"tampered_duration_ms": 13406
}
}

| Aspect | Trans-Splicing | Semantic Tampering |
|---|---|---|
| Goal | Voice impersonation | Evidence manipulation |
| Scale | Multiple words replaced | Single word modified |
| Detection Challenge | TTS artifacts, prosody mismatch | Subtle acoustic discontinuities |
| Real-world Application | Deepfake audio | Forensic tampering |
| Sample Size | 1,932 files | 50 files (41 tampered + 9 original) |
| Has Bonafide Samples | No | Yes |
| Evaluation Metric | Detection Rate | EER |
- Casanova, E., et al. (2022). "YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone." ICML.
- Coqui AI. (2023). "XTTS: A Massively Multilingual Zero-Shot Text-to-Speech Model."
- McAuliffe, M., et al. (2017). "Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi." Interspeech.
- Malik, H. (2022). "Audio Forensics: Background, Methods, and Future." Multimedia Security.
- Almutairi, Z., & Elgibreen, H. (2022). "A Review of Modern Audio Deepfake Detection Methods: Challenges and Future Directions." Algorithms.
- McFee, B., et al. (2015). "librosa: Audio and Music Signal Analysis in Python." SciPy.
- Radford, A., et al. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision." arXiv. (Whisper)
For the Trans-Splicing dataset, since no bonafide samples are available:
- Primary Metric: Tamper Detection Rate = % classified as spoof
- Breakdown: By TTS system (XTTS vs YourTTS) and processing (Clean vs Unclean)
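A minimal sketch of the tamper detection rate with a per-subset breakdown follows. It assumes each prediction is available as a `(subset, label)` pair where the label is either `"spoof"` or `"bonafide"`; this record layout and the name `detection_rates` are assumptions for illustration.

```python
from collections import defaultdict


def detection_rates(records):
    """Return the % of files classified as spoof, per subset and overall.

    records: iterable of (subset_name, predicted_label) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subset, label in records:
        totals[subset] += 1
        if label == "spoof":
            hits[subset] += 1

    if not totals:
        return {}

    rates = {subset: 100.0 * hits[subset] / totals[subset] for subset in totals}
    rates["overall"] = 100.0 * sum(hits.values()) / sum(totals.values())
    return rates
```

Because every file in this set is tampered, the detection rate is simply the fraction of files flagged as spoof. It says nothing about false alarms on genuine audio, which is why the second dataset is evaluated with a different metric.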
For the Semantic Tampering dataset, with both bonafide and spoof samples:
- Primary Metric: Equal Error Rate (EER)
- Breakdown: By tampering type and difficulty level
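The EER can be approximated by sweeping a decision threshold over the observed scores. The sketch below assumes that a higher score means "more likely spoof" and that a file is flagged when its score is at or above the threshold. It returns the average of the two error rates at the threshold where they are closest, which is one common approximation; evaluation toolkits may interpolate instead. The name `compute_eer` is hypothetical.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the Equal Error Rate from two lists of spoof-likelihood scores."""
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both bonafide and spoof scores are required")

    thresholds = sorted(set(bonafide_scores) | set(spoof_scores)) + [float("inf")]
    best_gap, eer = None, None
    for t in thresholds:
        # False alarm: a bonafide file flagged as spoof.
        false_alarm = sum(s >= t for s in bonafide_scores) / len(bonafide_scores)
        # Miss: a spoof file that is not flagged.
        miss = sum(s < t for s in spoof_scores) / len(spoof_scores)
        gap = abs(false_alarm - miss)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (false_alarm + miss) / 2.0
    return eer
```

With only 9 bonafide files, the false-alarm rate can only move in steps of 1/9 (about 11.1 percentage points), so EER values on this dataset are coarse and small differences between models should be read with caution.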
Document Version: 1.0 Generated: 2025-11-23