b0r3k/thesis-robust-spoken-slot-filling

Robust Language Understanding for a Voice-Based Dialogue System

This repository contains the code accompanying the master's thesis Robust Language Understanding for a Voice-Based Dialogue System. The thesis explores spoken slot filling in the Czech public transport search domain and proposes a novel architecture that integrates spoken language understanding with Whisper speech recognition through an additional decoder layer.

Overview

The project addresses the challenge of robust slot filling in voice-based task-oriented dialogue systems, specifically focusing on extracting departure and arrival location information from Czech speech in public transport search scenarios. The main research question investigates how to leverage prior knowledge of possible slot values (from a database) to achieve robust slot filling without fine-tuning the underlying speech recognition model.

Dataset

The repository includes the loading code for the Slopts Dataset, a Czech spoken slot filling dataset for the public transport search domain:

  • Source: Real user interactions with the Alex dialogue system
  • Domain: Public transport route planning in Czech
  • Slots: Location names (departure fs, arrival ts, via points, cities, etc.)
  • Size: Training, development, and test splits with corresponding audio files
  • Format: JSONL files with transcriptions, slot annotations, and audio paths
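Since the annotations are stored as JSONL (one JSON object per line), a minimal reading sketch may help when inspecting the files by hand. The field names `audio`, `transcription` and `slots` and the sample values are assumptions made for illustration only; the actual schema is defined by slopts_dataset/slopts_dataset.py and the thesis.

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical record: the field names below are assumptions,
# not the dataset's documented schema.
sample = io.StringIO(
    '{"audio": "train/0001.wav", "transcription": "z anděla na florenc", '
    '"slots": {"fs": "Anděl", "ts": "Florenc"}}\n'
)
records = list(read_jsonl(sample))
```

For a real split, the same function could be applied to an opened file, for example `open(path, encoding="utf-8")`.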

To use the dataset, add the actual data files train.tar.gz, dev.tar.gz and test.tar.gz with the respective audio recordings from Vystadial to slopts_dataset/data/. For more details on the dataset preparation, consult the thesis document.

Architecture

The main contribution is a novel architecture that extends Whisper with an additional slot decoder:

SloptsModel

  • Base: Pre-trained Whisper large-v3 Czech model (frozen)
  • Extension: Additional WhisperDecoder layer for slot prediction
  • Input: Audio features + previous system dialogue acts
  • Output: Transcription tokens + corresponding slot labels
  • Training: Only the slot decoder and prediction head are trained

Key Components

  • SloptsModel - Main model architecture
  • SloptsDecoderPredictor - Trainable slot prediction components
  • Trie-based constrained decoding for valid location names
  • Beam search generation with slot-aware decoding
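The idea behind trie-based constrained decoding can be illustrated with a small prefix tree over token-id sequences. This is a generic sketch under stated assumptions, not the implementation in neural/: the class `TokenTrie`, the sentinel `END` and the toy token ids are all hypothetical.

```python
END = -1  # sentinel marking that a complete valid value ends here


class TokenTrie:
    """Prefix tree over token-id sequences of valid slot values."""

    def __init__(self):
        self.root = {}

    def add(self, token_ids):
        """Insert one tokenized value into the trie."""
        node = self.root
        for tok in token_ids:
            node = node.setdefault(tok, {})
        node[END] = {}

    def allowed_next(self, prefix):
        """Return the token ids that may follow `prefix`.

        END in the result means the prefix is itself a complete value.
        An empty set means the prefix matches no valid value.
        """
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()
            node = node[tok]
        return set(node.keys())


# Toy token ids standing in for tokenized location names.
trie = TokenTrie()
trie.add([5, 7])
trie.add([5, 9, 2])
```

During generation, a decoder restricted in this way would typically mask out the scores of every token not returned by `allowed_next` for the tokens emitted so far in the current slot value, so that only names present in the database can be produced.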

Repository Structure

├── data_preprocessing/          # Data extraction and preprocessing
│   ├── dialmonkey_extract.py   # NLU processing with DialMonkey
│   ├── get_alex_utts.py        # Extract utterances from Alex logs
│   └── label_transcriptions.py # Baseline slot labeling
├── neural/                     # Neural model implementation
│   ├── slopts_model.py         # Main model architecture
│   ├── train.py               # Training script
│   ├── generate.py            # Inference and generation
│   ├── final_eval.py          # Evaluation metrics
│   └── dialmonkey_da.py       # Dialogue act parsing
├── slopts_dataset/            # Dataset implementation
│   ├── slopts_dataset.py      # HuggingFace dataset loader
│   └── data/                  # Dataset files (train/dev/test.jsonl)
└── requirements.txt           # Dependencies

Installation

Install the Python packages listed in requirements.txt. Note that dialmonkey currently lists the outdated dependency sk-learn in its setup.py; replace it with scikit-learn to avoid installation issues.

Usage

Dataset Loading

from datasets import load_dataset
dataset = load_dataset("slopts_dataset/slopts_dataset.py", split="train")

Model Training

cd neural
python train.py

Inference

cd neural
python generate.py <checkpoint.pt> <num_decoder_layers>

Evaluation

cd neural
python final_eval.py <split> <preds_file.jsonl> <decode_tokens?>

Baseline

The repository includes a rule-based baseline using:

  • TinkeredPTCSNLU - Modified DialMonkey NLU
  • Location database lookup with form-to-value mapping
  • Context-aware slot disambiguation based on previous system dialogue acts
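The combination of form-to-value lookup and context-based disambiguation can be sketched roughly as follows. This is a simplified illustration, not the TinkeredPTCSNLU logic: the table `FORM_TO_VALUE`, the function `fill_slots`, the preposition cues and the `requested_slot` argument (standing in for the previous system dialogue act) are all assumptions, and only single-word location names are handled.

```python
# Hypothetical form-to-value table: inflected surface forms -> canonical names.
FORM_TO_VALUE = {
    "anděla": "Anděl",
    "anděl": "Anděl",
    "florenc": "Florenc",
}


def fill_slots(transcription, requested_slot=None):
    """Map known location forms in `transcription` to the slots fs and ts.

    `requested_slot` is the slot the system asked about in the previous
    turn ("fs" or "ts"); it is used only when the utterance itself gives
    no preposition cue.
    """
    slots = {}
    tokens = transcription.lower().split()
    for i, tok in enumerate(tokens):
        value = FORM_TO_VALUE.get(tok)
        if value is None:
            continue
        prev = tokens[i - 1] if i > 0 else ""
        if prev in ("z", "ze"):        # Czech "from"
            slots["fs"] = value
        elif prev in ("na", "do"):     # Czech "to"
            slots["ts"] = value
        elif requested_slot is not None:
            slots[requested_slot] = value
    return slots


explicit = fill_slots("z anděla na florenc")
from_context = fill_slots("anděl", requested_slot="ts")
```

A bare answer such as "anděl" carries no cue of its own, which is why the previous system turn is needed to decide between departure and arrival.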

Key Features

  • Multilingual Support: Czech language focus with morphological complexity handling
  • Real Data: Authentic user interactions from a production dialogue system
  • Constrained Decoding: Trie-based valid location name generation
  • Context Awareness: Previous system dialogue acts inform slot prediction
  • Comprehensive Evaluation: Token-level and value-level slot filling metrics
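As an illustration of what a value-level metric can look like, the sketch below computes micro-averaged precision, recall and F1 over (slot, value) pairs. It is one common definition and an assumption here; the exact metrics reported in the thesis are those computed by neural/final_eval.py, and the function name `value_level_prf` is hypothetical.

```python
def value_level_prf(gold, pred):
    """Micro-averaged precision, recall and F1 over (slot, value) pairs.

    `gold` and `pred` are equally long lists of dicts, one per utterance,
    each mapping a slot name to its value.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_pairs, p_pairs = set(g.items()), set(p.items())
        tp += len(g_pairs & p_pairs)   # predicted and correct
        fp += len(p_pairs - g_pairs)   # predicted but wrong
        fn += len(g_pairs - p_pairs)   # missed gold pairs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: one of two predicted values is correct.
gold = [{"fs": "Anděl", "ts": "Florenc"}]
pred = [{"fs": "Anděl", "ts": "Můstek"}]
scores = value_level_prf(gold, pred)
```

A token-level metric would instead compare the slot label assigned to each transcription token, so a single wrong token inside a multi-token name is penalized differently under the two views.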

Results

The proposed architecture unfortunately did not surpass the rule-based baseline, highlighting the challenges of integrating speech recognition with slot filling in a unified neural architecture while maintaining robustness to speech recognition errors.

Citation

If you find any of this useful, please cite the associated master's thesis.

License

The author's contributions are licensed under the MIT license. However, the dataset files come from the Vystadial project, which is licensed under CC BY-SA 4.0. This project also uses multiple packages that are licensed under various open-source licenses (MIT, Apache 2.0, BSD-3-Clause, GPL-3.0). Please refer to the respective licenses for more details.

About

How can knowledge of possible values improve spoken slot filling without fine-tuning speech recognition models? Research on public transport data in Czech. Novel architecture using Whisper with an additional decoder layer that toggles trie-constrained decoding. Includes dataset loading, model architecture, training, inference and evaluation.
