Over 90% of speech protocols in SWERIK 1966-2002 matched with media recordings #19

Lauler · 2024-06-27T15:51:13Z

Greetings fellow parliamentary corpus hackers. In my previous attempt at matching the text of speech protocols to media recordings, there were quite a few speeches where I could not find a match.

I'm happy to announce that relaxing the (possible) date ranges for when a speech could have been held has yielded positive results.

About 329k out of 366k speeches in the period 1966-2002 have been matched. From inspecting the data, I'm fairly certain that 95%+ of the estimated timestamps are correct.

I have prepared the metadata in 3 different formats:

1. Parquet file, uploaded as a Hugging Face dataset: https://huggingface.co/datasets/Lauler/rixvox-alignments. You can browse and preview the data there, and there is a description of all the metadata variables as well.
2. JSON format grouped by protocol. This format may be preferable for you at SWERIK. There is one file per protocol, and each filename matches your internal protocol id. Each JSON file contains all the speeches in that protocol during the time period (including the ones I failed to match), along with metadata.
3. JSON format grouped by audio file. This format may be preferable for the Swedish Parliament. The JSON files are named after the media files, and each file contains metadata about the speeches that were matched to that media file. This format only contains the 329k matched speeches (as the non-matches cannot be assigned a media file).

Here are some stats on the number of speeches per year in the protocols (nr_speeches_total) and the number of speeches that were matched (nr_matched_speeches):

year	hours_matched	nr_speeches_total	nr_matched_speeches	match_fraction
1966	152.81	4425	1829	0.41
1967	597.33	9009	7213	0.80
1968	532.13	8391	6831	0.81
1969	569.99	8762	7365	0.84
1970	595.86	8660	7192	0.83
1971	506.13	7567	6844	0.90
1972	540.36	8036	7350	0.91
1973	525.45	7635	6898	0.90
1974	424.68	6888	6037	0.88
1975	468.00	7602	6776	0.89
1976	491.58	7831	7033	0.90
1977	443.38	7620	6881	0.90
1978	543.16	9287	8492	0.91
1979	520.66	9470	8582	0.91
1980	544.95	9867	8978	0.91
1981	469.89	8659	7848	0.91
1982	500.76	8873	7975	0.90
1983	532.66	10451	9192	0.88
1984	523.33	10175	9109	0.90
1985	502.23	9902	8799	0.89
1986	465.03	9838	8286	0.84
1987	453.88	10347	8110	0.78
1988	537.01	10248	9367	0.91
1989	559.53	11680	10617	0.91
1990	555.07	11758	10686	0.91
1991	522.12	11377	10516	0.92
1992	542.95	14271	13337	0.93
1993	551.71	14049	13087	0.93
1994	485.01	11538	10856	0.94
1995	479.64	11619	11329	0.98
1996	464.36	11757	11455	0.97
1997	500.22	12273	11510	0.94
1998	468.92	11677	11102	0.95
1999	517.12	13269	12743	0.96
2000	529.82	13636	12758	0.94
2001	546.92	13579	12974	0.96
2002	140.08	4907	3398	0.69

I'll be aligning the text and audio at a more granular level in the coming weeks, with the goal of getting sentence level timestamps.

I would appreciate it if someone from your project could help me figure out how to properly contribute this metadata to SWERIK. The TEI Parla-CLARIN documentation for timestamps is not my favorite reading material 😄

fredrik1984 (Maintainer) · 2024-06-28T04:40:41Z

Great work Fabian! Looks really promising, and I am looking forward to hearing more about this at the workshop at the Riksdag on 28–29 August! I think that it could also be good to invite you to one of our project meetings to discuss your work and the issue of metadata. Maybe @MansMeg @ninpnin or @BobBorges have some thoughts about the metadata question that Fabian poses at the end?

@Lauler – out of curiosity: we know that the MPs' oral speeches get edited for the printed records, first by the stenographers and then through adjustments by the MPs themselves. This mainly means that speeches get tightened up and grammatically polished. How does your mapping handle this discrepancy? Are you using some kind of Levenshtein distance on the word level?

MansMeg (Maintainer) · 2024-06-28T06:43:05Z

Indeed, these are really good results! I guess the matching issues might be due to issues with mapping and problems with the note seg classification. 1966 and 2002 seem kind of low. Are these edge cases?

Do you have any idea about the causes of the matching errors?

Yes, we should definitely now pick this issue up and discuss how to include this in a structured way in the corpus. As @fredrik1984 said, it would be great if you could join one of our project meetings.

Lauler (Author) · 2024-06-28T11:52:32Z

@fredrik1984 I am using fuzz.partial_ratio_alignment which first searches for the optimal alignment of a shorter string (the needle) within a longer string (the haystack), and then calculates the normalized indel distance of the alignment. For very long texts/speeches (>300 words), I split the text and call the function twice. I.e. the first 150 words of a text are used to identify the start of a speech, and the last 150 words to get a match against the end.

In my case the needle is the (normalized) text of a speech, and the haystack is an automatic transcription of an audio file. This generally works well as it can often find a good alignment even when there are some insertions/deletions/substitutions in the edited protocols. However, the suggested timestamps from this approach are rather approximate due to:

- Protocols not being verbatim transcriptions, and protocols being edited (as you mention).
- Automatic Speech Recognition accuracy varying across speakers, dialects and sociolects.
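
For anyone who wants to see the needle-in-haystack idea in code, here is a minimal sketch using only the Python standard library. It is a simplified word-level stand-in, not the fuzz.partial_ratio_alignment call itself: it slides a needle-sized window over the haystack and scores each window with a normalized indel similarity. The function names, the toy strings and the brute-force window search are assumptions made for illustration; only the 300/150-word split comes from the description above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_similarity(a, b):
    """Normalized indel similarity in [0, 1]; 1.0 means identical sequences."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    # Indel distance = len(a) + len(b) - 2 * LCS, so similarity = 2 * LCS / total.
    return 2 * lcs_length(a, b) / total


def best_window(needle, haystack):
    """Return (score, start, end) for the best needle-sized window in the haystack."""
    n = len(needle)
    best = (0.0, 0, min(n, len(haystack)))
    for start in range(max(1, len(haystack) - n + 1)):
        window = haystack[start:start + n]
        score = indel_similarity(needle, window)
        if score > best[0]:
            best = (score, start, start + len(window))
    return best


def locate_speech(speech_text, transcript_text, long_limit=300, chunk=150):
    """Approximate start/end word indices of a speech inside a transcript."""
    needle = speech_text.lower().split()
    haystack = transcript_text.lower().split()
    if len(needle) <= long_limit:
        score, start, end = best_window(needle, haystack)
        return {"score": score, "start": start, "end": end}
    # Long speech: match the first and last `chunk` words separately.
    head_score, head_start, _ = best_window(needle[:chunk], haystack)
    tail_score, _, tail_end = best_window(needle[-chunk:], haystack)
    return {"score": min(head_score, tail_score), "start": head_start, "end": tail_end}


# Toy example: the protocol says "statsrådet" where the transcript has "ministern".
transcript = "herr talman jag vill tacka ministern för svaret men frågan kvarstår tack"
speech = "jag vill tacka statsrådet för svaret"
print(locate_speech(speech, transcript))
```

The returned word indices would still need to be mapped to times through the word timestamps of the transcription, which is where the approximation mentioned above comes in.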

To increase the accuracy of the suggested start and end timestamps of a speech, I also run a speaker diarization model on the audio file. This clusters the different speakers present in the recording and attempts to segment them (i.e. output like SPEAKER_00: 0.5s-1.8s, SPEAKER_00: 2.1s-4.4s, SPEAKER_01: 6.9s-10.2s, ...).

The output of this model is post-processed into contiguous speech segments, and the "dominant" speaker (longest duration) is determined within the speech segment regions that overlap with the timestamps from the fuzzy string matching described above. These diarization-adjusted timestamps are start_segment and end_segment, whereas the timestamps from fuzzy string matching can be found in the variables start_text_time and end_text_time. There is an illustration of the process in this blog post.
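
Roughly, the adjustment step could look like the sketch below. This is one possible reading of the description, not the exact code used: the function names, the max_gap merging rule and the example segments beyond the three quoted above are assumptions for illustration.

```python
def merge_segments(segments, max_gap=1.0):
    """Merge consecutive same-speaker segments separated by at most max_gap seconds."""
    merged = []
    for speaker, start, end in sorted(segments, key=lambda s: s[1]):
        if merged and merged[-1][0] == speaker and start - merged[-1][2] <= max_gap:
            merged[-1] = (speaker, merged[-1][1], max(end, merged[-1][2]))
        else:
            merged.append((speaker, start, end))
    return merged


def refine_with_diarization(segments, start_text_time, end_text_time, max_gap=1.0):
    """Snap fuzzy-match timestamps to the dominant speaker's contiguous segments."""
    merged = merge_segments(segments, max_gap)
    # Total overlap (in seconds) of each speaker with the fuzzy-match region.
    overlap = {}
    for speaker, start, end in merged:
        shared = min(end, end_text_time) - max(start, start_text_time)
        if shared > 0:
            overlap[speaker] = overlap.get(speaker, 0.0) + shared
    if not overlap:
        return None  # no diarization segment overlaps the text match
    dominant = max(overlap, key=overlap.get)
    hits = [
        (start, end)
        for speaker, start, end in merged
        if speaker == dominant and min(end, end_text_time) > max(start, start_text_time)
    ]
    return {"speaker": dominant, "start_segment": hits[0][0], "end_segment": hits[-1][1]}


# (speaker, start, end) in seconds; the last two segments are made up.
segments = [
    ("SPEAKER_00", 0.5, 1.8),
    ("SPEAKER_00", 2.1, 4.4),
    ("SPEAKER_01", 6.9, 10.2),
    ("SPEAKER_01", 10.5, 60.0),
    ("SPEAKER_00", 61.0, 65.0),
]
print(refine_with_diarization(segments, start_text_time=8.0, end_text_time=58.0))
```

In this toy case the text match (8.0s to 58.0s) is widened to the full SPEAKER_01 segment, which is the kind of correction the diarization step is meant to provide.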

In general this method produces really good and accurate results, but there are some failure cases.

- The fuzzy string matching is more robust for longer speeches. Very short speeches can be harder to match, and more often lead to false positives.
- The false positive rate for short speeches increases the more haystacks I try to find a specific needle in (since the date metadata of audio recordings and text protocols cannot be fully trusted, I try to match a specific speech text against recordings ±3 weeks around the protocol's stated date).
- Diarization can fail to separate one speaker from another. Since the format of debates is generally SPEAKER#1 -> SPEAKER OF THE HOUSE -> SPEAKER#2 this is generally less of an issue, but it can still sometimes happen that all these speakers are identified as the same speaker.
- In the 1960s it was not wholly uncommon for the question asker in an interpellation to refer to a question they had submitted in writing for the "background details on this matter" when giving their speech. It appears the people who wrote the protocols back then sometimes added the details that were submitted in writing, even though they were not uttered by the speaker in the interpellation. In these cases I usually only get a match for the beginning or the end of a speech.
- The method is sensitive to texts of speeches being incorrectly segmented (incorrect speaker introductions, notes being classified as utterances and vice versa, etc).
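
To make the ±3 weeks search window from the second point concrete, here is a small sketch of how candidate recordings could be selected for a protocol. The function name, the file names and the dates are hypothetical.

```python
from datetime import date, timedelta


def candidate_recordings(protocol_date, recordings, window=timedelta(weeks=3)):
    """Return names of recordings dated within +/- `window` of the protocol date."""
    return [
        name
        for name, recorded_on in recordings
        if abs(recorded_on - protocol_date) <= window
    ]


# Hypothetical recordings as (file name, date taken from the file's metadata).
recordings = [
    ("rec_a.mp3", date(1975, 3, 1)),
    ("rec_b.mp3", date(1975, 3, 20)),
    ("rec_c.mp3", date(1975, 5, 2)),
]
print(candidate_recordings(date(1975, 3, 12), recordings))
```

A wider window tolerates more date errors but gives each short speech more haystacks to produce a false positive in, which is the trade-off described above.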

@MansMeg I'm not entirely sure about 1966 and 2002, but I think it can be a case of overcoverage, since I can't know for certain which debates/protocols were recorded. The provided 366k total speeches figure is simply all speeches from protocols held on dates between:

- The date of the first detected/matched speech, minus 1 week to be safe.
- The date of the last detected/matched speech, plus 1 week as additional margin.
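
As a sketch of that date filter (the function names and all dates are made up for illustration):

```python
from datetime import date, timedelta


def coverage_window(matched_dates, margin=timedelta(weeks=1)):
    """Date range from first to last matched speech, widened by `margin` on both sides."""
    return min(matched_dates) - margin, max(matched_dates) + margin


def count_candidates(speech_dates, window):
    """Count speeches whose protocol date falls inside the window (inclusive)."""
    start, end = window
    return sum(start <= d <= end for d in speech_dates)


# Hypothetical dates, for illustration only.
matched = [date(1966, 10, 18), date(1966, 11, 2), date(1966, 12, 15)]
all_speeches = [
    date(1966, 1, 20),
    date(1966, 10, 12),
    date(1966, 11, 2),
    date(1966, 12, 21),
    date(1966, 12, 23),
]
window = coverage_window(matched)
print(window, count_candidates(all_speeches, window))
```

Every speech inside the window counts toward the total, whether or not a recording of it exists, so the denominator can be inflated at the edges of the period.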

Only some of the protocols from 1966 and 2002 were recorded. It's hard to know which ones without doing time-consuming detective work. I have simply set a date filter, and this can be the reason for the "overcoverage". It's not necessarily that the method failed, but rather that many of the speeches I include as "candidates" probably were not recorded at all.

I think the number of matched speeches could be improved by investigating which audio files and which protocols got a very low ratio of matches. There are lots of oddities and weird things in the data. For example some dates on the audio files seem to be mistyped due to manual error when creating the files. There are also two audio files with invalid dates that sound like a recording of the devil (I'm guessing either corrupt recording, or the files were mistakenly recorded/digitized in reverse, meaning the people are talking backwards).
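
If someone wants to pick this up, a starting point could be to rank protocols (or audio files) by their match ratio and inspect the worst ones first. A rough sketch, where the record fields protocol_id and matched and the ids themselves are hypothetical and not the actual variable names in the dataset:

```python
from collections import defaultdict


def match_ratio_by_group(speeches, key):
    """Return (group, match ratio) pairs sorted from lowest to highest ratio."""
    totals = defaultdict(int)
    matched = defaultdict(int)
    for speech in speeches:
        totals[speech[key]] += 1
        matched[speech[key]] += bool(speech["matched"])
    ratios = {group: matched[group] / totals[group] for group in totals}
    return sorted(ratios.items(), key=lambda item: item[1])


# Hypothetical speech records, for illustration only.
speeches = [
    {"protocol_id": "prot-A", "matched": True},
    {"protocol_id": "prot-A", "matched": False},
    {"protocol_id": "prot-A", "matched": False},
    {"protocol_id": "prot-B", "matched": True},
    {"protocol_id": "prot-B", "matched": True},
    {"protocol_id": "prot-C", "matched": False},
]
for group, ratio in match_ratio_by_group(speeches, key="protocol_id"):
    print(group, round(ratio, 2))
```

The same function could be called with a media file key instead to find recordings with suspiciously few matches, such as the ones with mistyped dates.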

It's rather time-consuming to look at all these individual failure cases, and I don't think I will spend much more time investigating them. If you have someone in your project who is interested in taking it further, I can possibly help them get started.

fredrik1984 (Maintainer) · 2024-06-28T12:08:54Z

@Lauler thank you for the thorough comment!

I think one reason for the lower results in 1966–1970 is that they did not have microphones installed where MPs sat, only at the speaker's podium. Since MPs were allowed to speak from the benches, it may not have been easy to record those speeches. From the start of the unicameral parliament, I think, they had microphones installed at each chair/bench.
