Robust Language Understanding for a Voice-Based Dialogue System

This repository contains the code accompanying the master's thesis Robust Language Understanding for a Voice-Based Dialogue System. The thesis explores spoken slot filling in the Czech public transport search domain and proposes a novel architecture that integrates spoken language understanding with Whisper speech recognition through an additional decoder layer.

Overview

The project addresses the challenge of robust slot filling in voice-based task-oriented dialogue systems, specifically focusing on extracting departure and arrival location information from Czech speech in public transport search scenarios. The main research question investigates how to leverage prior knowledge of possible slot values (from a database) to achieve robust slot filling without fine-tuning the underlying speech recognition model.

Dataset

The repository includes the loading code for the Slopts Dataset, a Czech spoken slot filling dataset for the public transport search domain:

Source: Real user interactions with the Alex dialogue system
Domain: Public transport route planning in Czech
Slots: Location names (departure fs, arrival ts, via points, cities, etc.)
Size: Training, development, and test splits with corresponding audio files
Format: JSONL files with transcriptions, slot annotations, and audio paths

To use the dataset, the data files train.tar.gz, dev.tar.gz and test.tar.gz with the respective audio recordings from Vystadial must be added to slopts_dataset/data/. For more details on the dataset preparation, consult the thesis document.
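
As a rough illustration of the JSONL format listed above, the sketch below parses one record per line. The field names (`transcription`, `slots`, `audio_path`), the slot keys, and the sample record are assumptions made for this example only; the actual keys are defined by the dataset files and by slopts_dataset.py.

```python
import io
import json


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical record; the real field names and slot keys may differ.
sample = io.StringIO(
    '{"transcription": "z anděla na můstek", '
    '"slots": {"departure": "Anděl", "arrival": "Můstek"}, '
    '"audio_path": "train/0001.wav"}\n'
)

records = list(read_jsonl(sample))
print(records[0]["slots"])
```

In practice the same function could be given an open file handle instead of the in-memory sample.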

Architecture

The main contribution is a novel architecture that extends Whisper with an additional slot decoder:

SloptsModel

Base: Pre-trained Whisper large-v3 Czech model (frozen)
Extension: Additional WhisperDecoder layer for slot prediction
Input: Audio features + previous system dialogue acts
Output: Transcription tokens + corresponding slot labels
Training: Only the slot decoder and prediction head are trained

Key Components

SloptsModel - Main model architecture
SloptsDecoderPredictor - Trainable slot prediction components
Trie-based constrained decoding for valid location names
Beam search generation with slot-aware decoding
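
The trie-based constraint listed above can be pictured as a prefix tree over tokenized location names: at each decoding step, only tokens that continue some valid name are allowed. The following is a minimal sketch of that idea, not the repository's implementation. Whitespace-separated words stand in for Whisper subword tokens, and the `Trie` class and the location names are hypothetical.

```python
class Trie:
    """Prefix tree over token sequences."""

    END = "<end>"  # marks that a complete name ends at this node

    def __init__(self, sequences=()):
        self.root = {}
        for seq in sequences:
            self.add(seq)

    def add(self, tokens):
        """Insert one token sequence into the trie."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[self.END] = {}

    def allowed_next(self, prefix):
        """Return the tokens that may follow `prefix` (empty set if the prefix is invalid)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()
            node = node[tok]
        return set(node)


# Hypothetical location database; words stand in for subword tokens.
names = ["Praha hlavní nádraží", "Praha Smíchov", "Brno hlavní nádraží"]
trie = Trie(name.split() for name in names)

print(trie.allowed_next([]))
print(trie.allowed_next(["Praha"]))
print(trie.allowed_next(["Praha", "Smíchov"]))
```

During beam search, the scores of all tokens outside `allowed_next(prefix)` would be masked out before candidates are selected, so that every finished hypothesis spells a name from the database.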

Repository Structure

├── data_preprocessing/          # Data extraction and preprocessing
│   ├── dialmonkey_extract.py   # NLU processing with DialMonkey
│   ├── get_alex_utts.py        # Extract utterances from Alex logs
│   └── label_transcriptions.py # Baseline slot labeling
├── neural/                     # Neural model implementation
│   ├── slopts_model.py         # Main model architecture
│   ├── train.py               # Training script
│   ├── generate.py            # Inference and generation
│   ├── final_eval.py          # Evaluation metrics
│   └── dialmonkey_da.py       # Dialogue act parsing
├── slopts_dataset/            # Dataset implementation
│   ├── slopts_dataset.py      # HuggingFace dataset loader
│   └── data/                  # Dataset files (train/dev/test.jsonl)
└── requirements.txt           # Dependencies

Installation

Install the Python packages listed in requirements.txt. Note that dialmonkey currently lists the outdated dependency sk-learn in its setup.py. Replace it with scikit-learn to avoid installation issues.

Usage

Dataset Loading

from datasets import load_dataset
dataset = load_dataset("slopts_dataset/slopts_dataset.py", split="train")

Model Training

cd neural
python train.py

Inference

cd neural
python generate.py <checkpoint.pt> <num_decoder_layers>

Evaluation

cd neural
python final_eval.py <split> <preds_file.jsonl> <decode_tokens?>
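
The evaluation covers token-level and value-level slot filling metrics. As a hedged sketch of what such scores can look like, the code below computes a micro-averaged F1 over (slot, value) pairs and a per-token label accuracy. The function names, the label scheme, and the example data are assumptions; the exact definitions used by final_eval.py may differ.

```python
def value_f1(gold, pred):
    """Micro-averaged precision, recall and F1 over (slot, value) pairs.

    `gold` and `pred` are lists with one slot -> value dict per utterance.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_pairs, p_pairs = set(g.items()), set(p.items())
        tp += len(g_pairs & p_pairs)
        fp += len(p_pairs - g_pairs)
        fn += len(g_pairs - p_pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def token_accuracy(gold_labels, pred_labels):
    """Fraction of tokens whose predicted label matches the gold label."""
    pairs = [(g, p) for gs, ps in zip(gold_labels, pred_labels) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0


# Hypothetical example: one utterance, one of two slot values predicted correctly.
gold = [{"departure": "Anděl", "arrival": "Můstek"}]
pred = [{"departure": "Anděl", "arrival": "Muzeum"}]
scores = value_f1(gold, pred)
accuracy = token_accuracy(
    [["O", "departure", "O", "arrival"]],
    [["O", "departure", "O", "O"]],
)
print(scores, accuracy)
```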

Baseline

The repository includes a rule-based baseline using:

TinkeredPTCSNLU - Modified DialMonkey NLU
Location database lookup with form-to-value mapping
Context-aware slot disambiguation based on previous system dialogue acts
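
The combination of form-to-value lookup and context-aware disambiguation can be sketched as follows. This is a simplified illustration, not the TinkeredPTCSNLU code: the lookup table, the preposition cues, the slot names, and the `requested_slot` argument (standing in for the previous system dialogue act) are all assumptions.

```python
# Hypothetical form-to-value table: inflected surface forms -> canonical name.
FORM_TO_VALUE = {
    "anděl": "Anděl", "anděla": "Anděl",
    "můstek": "Můstek", "můstku": "Můstek",
}

# Simplified preposition cues (assumption): "z"/"ze" = from, "na"/"do" = to.
PREPOSITION_SLOT = {
    "z": "departure", "ze": "departure",
    "na": "arrival", "do": "arrival",
}


def label_slots(utterance, requested_slot=None):
    """Map location mentions in `utterance` to slots.

    `requested_slot` is the slot the system asked about in its previous turn.
    It is used only when no preceding preposition disambiguates the mention.
    """
    slots = {}
    previous = None
    for word in utterance.lower().split():
        value = FORM_TO_VALUE.get(word)
        if value is not None:
            slot = PREPOSITION_SLOT.get(previous, requested_slot)
            if slot is not None:
                slots[slot] = value
        previous = word
    return slots


print(label_slots("z anděla na můstek"))
print(label_slots("anděl", requested_slot="departure"))
```

The second call shows the role of context: a bare location name is ambiguous on its own, but becomes a departure stop if the system has just asked where the user is travelling from.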

Key Features

Multilingual Support: Czech language focus with morphological complexity handling
Real Data: Authentic user interactions from production dialogue system
Constrained Decoding: Trie-based valid location name generation
Context Awareness: Previous system dialogue acts inform slot prediction
Comprehensive Evaluation: Token-level and value-level slot filling metrics

Results

The proposed architecture unfortunately did not surpass the rule-based baseline, highlighting the challenges of integrating speech recognition with slot filling in a unified neural architecture while maintaining robustness to speech recognition errors.

Citation

If you find any of this useful, please cite the associated master's thesis.

License

The author's contributions are licensed under the MIT license. However, the dataset files come from the Vystadial project, which is licensed under CC BY-SA 4.0. This project also uses multiple packages that are licensed under various open-source licenses (MIT, Apache 2.0, BSD-3-Clause, GPL-3.0). Please refer to the respective licenses for more details.
