This repository contains the code accompanying the master's thesis Robust Language Understanding for a Voice-Based Dialogue System. The thesis explores spoken slot filling in the Czech public transport search domain and proposes a novel architecture that integrates spoken language understanding with Whisper speech recognition through an additional decoder layer.
The project addresses the challenge of robust slot filling in voice-based task-oriented dialogue systems, specifically focusing on extracting departure and arrival location information from Czech speech in public transport search scenarios. The main research question investigates how to leverage prior knowledge of possible slot values (from a database) to achieve robust slot filling without fine-tuning the underlying speech recognition model.
The repository includes the loading code for the Slopts Dataset, a Czech spoken slot filling dataset for the public transport search domain:
- Source: Real user interactions with the Alex dialogue system
- Domain: Public transport route planning in Czech
- Slots: Location names (departure, arrival, via points, cities, etc.)
- Size: Training, development, and test splits with corresponding audio files
- Format: JSONL files with transcriptions, slot annotations, and audio paths
In order to use the dataset, the actual data files train.tar.gz, dev.tar.gz and test.tar.gz with respective audio recordings from Vystadial need to be added to slopts_dataset/data/. For more details on the dataset preparation, consult the thesis document.
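Before loading the dataset, it can help to confirm that the three archives are where the loader expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_archives` is hypothetical, and the default path simply follows the directory named above.

```python
from pathlib import Path

# Archive names as listed above; the helper itself is an illustrative assumption.
EXPECTED_ARCHIVES = ("train.tar.gz", "dev.tar.gz", "test.tar.gz")


def missing_archives(data_dir="slopts_dataset/data"):
    """Return the expected archive names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_ARCHIVES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_archives()
    if missing:
        print("Missing archives:", ", ".join(missing))
    else:
        print("All dataset archives are in place.")
```

Run from the repository root, the script lists any archive that still needs to be copied into slopts_dataset/data/.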
The main contribution is a novel architecture that extends Whisper with an additional slot decoder:
- Base: Pre-trained Whisper large-v3 Czech model (frozen)
- Extension: Additional WhisperDecoder layer for slot prediction
- Input: Audio features + previous system dialogue acts
- Output: Transcription tokens + corresponding slot labels
- Training: Only the slot decoder and prediction head are trained
- SloptsModel: Main model architecture
- SloptsDecoderPredictor: Trainable slot prediction components
- Trie-based constrained decoding for valid location names
- Beam search generation with slot-aware decoding
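The idea behind trie-based constrained decoding can be sketched as follows. This is an illustrative, dependency-free example and not the code in neural/: the class `TokenTrie`, the `END` marker, and the helper `mask_scores` are hypothetical names, and the token ids are made up. Valid location names are stored as token-id sequences in a prefix tree, and at each decoding step only tokens that continue some valid name keep their score.

```python
class TokenTrie:
    """Prefix tree over token-id sequences of valid location names."""

    END = -1  # hypothetical marker: a complete name ends at this node

    def __init__(self, sequences=()):
        self.root = {}
        for seq in sequences:
            self.add(seq)

    def add(self, token_ids):
        """Insert one tokenized location name into the trie."""
        node = self.root
        for tok in token_ids:
            node = node.setdefault(tok, {})
        node[self.END] = {}

    def allowed_next(self, prefix):
        """Return the token ids that may follow prefix (END if a name is complete)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()  # prefix does not match any valid name
            node = node[tok]
        return set(node)


def mask_scores(scores, allowed):
    """Set scores of disallowed token ids to -inf; scores is indexed by token id."""
    return [s if i in allowed else float("-inf") for i, s in enumerate(scores)]


if __name__ == "__main__":
    trie = TokenTrie([[5, 7], [5, 8, 2], [9]])  # made-up token ids
    print(trie.allowed_next([5]))
    print(mask_scores([0.1, 0.5, 0.2], {1}))
```

In a beam search, each hypothesis would carry its current slot-value prefix, and `allowed_next` would restrict the candidates considered for the next step.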
├── data_preprocessing/ # Data extraction and preprocessing
│ ├── dialmonkey_extract.py # NLU processing with DialMonkey
│ ├── get_alex_utts.py # Extract utterances from Alex logs
│ └── label_transcriptions.py # Baseline slot labeling
├── neural/ # Neural model implementation
│ ├── slopts_model.py # Main model architecture
│ ├── train.py # Training script
│ ├── generate.py # Inference and generation
│ ├── final_eval.py # Evaluation metrics
│ └── dialmonkey_da.py # Dialogue act parsing
├── slopts_dataset/ # Dataset implementation
│ ├── slopts_dataset.py # HuggingFace dataset loader
│ └── data/ # Dataset files (train/dev/test.jsonl)
└── requirements.txt # Dependencies
Install the Python packages listed in requirements.txt. Note that dialmonkey currently lists the outdated dependency sk-learn in its setup.py; replace it with scikit-learn to avoid installation issues.
from datasets import load_dataset
dataset = load_dataset("slopts_dataset/slopts_dataset.py", split="train")

cd neural
python train.py

cd neural
python generate.py <checkpoint.pt> <num_decoder_layers>

cd neural
python final_eval.py <split> <preds_file.jsonl> <decode_tokens?>

The repository includes a rule-based baseline using:
- TinkeredPTCSNLU: Modified DialMonkey NLU
- Location database lookup with form-to-value mapping
- Context-aware slot disambiguation based on previous system dialogue acts
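The two lookup steps above can be illustrated with a small sketch. It is not the TinkeredPTCSNLU implementation: the form table, the slot names `from_stop` and `to_stop`, the dialogue act strings, and the preposition heuristic are all assumptions chosen for the example. Inflected Czech surface forms are mapped to a canonical value, and when the utterance itself gives no cue, the previous system dialogue act decides which slot the value fills.

```python
# Hypothetical form-to-value table: inflected surface forms -> canonical value.
FORM_TO_VALUE = {
    "praha": "Praha", "prahy": "Praha", "prahu": "Praha",
    "brno": "Brno", "brna": "Brno",
}

# Assumed cues: Czech prepositions "z"/"ze" (from) and "do" (to).
PREPOSITION_TO_SLOT = {"z": "from_stop", "ze": "from_stop", "do": "to_stop"}

# Assumed string format of the previous system dialogue act.
REQUEST_TO_SLOT = {
    "request(from_stop)": "from_stop",
    "request(to_stop)": "to_stop",
}


def label_slots(utterance, prev_system_da=None):
    """Map location mentions to slots using prepositions, then dialogue context."""
    tokens = utterance.lower().split()
    slots = {}
    for i, tok in enumerate(tokens):
        value = FORM_TO_VALUE.get(tok)
        if value is None:
            continue  # not a known location form
        prev_tok = tokens[i - 1] if i > 0 else None
        slot = PREPOSITION_TO_SLOT.get(prev_tok)
        if slot is None:
            # No cue in the utterance: fall back to what the system just asked for.
            slot = REQUEST_TO_SLOT.get(prev_system_da)
        if slot is not None:
            slots[slot] = value
    return slots


if __name__ == "__main__":
    print(label_slots("z Prahy do Brna"))
    print(label_slots("Brno", prev_system_da="request(to_stop)"))
```

A bare answer such as "Brno" is ambiguous on its own, which is why the context fallback matters: the same word fills a different slot depending on the system's preceding question.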
- Multilingual Support: Czech language focus with morphological complexity handling
- Real Data: Authentic user interactions from a production dialogue system
- Constrained Decoding: Trie-based valid location name generation
- Context Awareness: Previous system dialogue acts inform slot prediction
- Comprehensive Evaluation: Token-level and value-level slot filling metrics
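As a rough illustration of value-level scoring, the sketch below computes micro-averaged precision, recall, and F1 over (slot, value) pairs. The function name `slot_value_f1` and the input format are assumptions; the metrics in final_eval.py may be defined differently.

```python
def slot_value_f1(gold, pred):
    """Micro-averaged precision, recall and F1 over (slot, value) pairs.

    gold and pred are lists of dicts, one per utterance, mapping slot name -> value.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_pairs, p_pairs = set(g.items()), set(p.items())
        tp += len(g_pairs & p_pairs)  # predicted pairs that match the gold annotation
        fp += len(p_pairs - g_pairs)  # predicted pairs that are wrong or spurious
        fn += len(g_pairs - p_pairs)  # gold pairs the prediction missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [{"from_stop": "Praha", "to_stop": "Brno"}, {"to_stop": "Plzeň"}]
    pred = [{"from_stop": "Praha", "to_stop": "Ostrava"}, {}]
    print(slot_value_f1(gold, pred))
```

A token-level variant would compare per-token slot labels instead of whole values, so a partially correct location name earns partial credit there but none here.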
The proposed architecture unfortunately did not surpass the rule-based baseline, highlighting the challenges of integrating speech recognition with slot filling in a unified neural architecture while maintaining robustness to speech recognition errors.
If you find any of this useful, please cite the associated master's thesis.
The author's contributions are licensed under the MIT license. The dataset files, however, come from the Vystadial project and are licensed under CC BY-SA 4.0. This project also uses several packages released under various open-source licenses (MIT, Apache 2.0, BSD-3-Clause, GPL-3.0). Please refer to the respective licenses for details.