Over 90% of speech protocols in SWERIK 1966-2002 matched with media recordings #19
Replies: 4 comments
Great work Fabian! This looks really promising, and I am looking forward to hearing more about it at the workshop at the Riksdag on 28–29 August! I think it would also be good to invite you to one of our project meetings to discuss your work and the issue of metadata. Maybe @MansMeg @ninpnin or @BobBorges have some thoughts about the metadata question that Fabian poses at the end? @Lauler – out of curiosity: we know that the MPs' oral speeches get edited for the printed records, first by the stenographers and then through adjustments by the MPs themselves. This mainly means that speeches get tightened up and grammatically polished. How does your mapping handle this discrepancy? Are you using some kind of Levenshtein distance on the word level?
Indeed, these are really good results! I guess the matching issues might be due to issues with mapping and problems with the note seg classification. 1966 and 2002 seem kind of low. Are these edge cases? Do you have any idea about the causes of the matching errors? Yes, we should definitely pick up this issue now and discuss how to include this in the corpus in a structured way. As @fredrik1984 said, it would be great if you could join one of our project meetings.
@fredrik1984 I am using fuzzy string matching. In my case the needle is the (normalized) text of a speech, and the haystack is an automatic transcription of an audio file. This generally works well, as it can often find a good alignment even when there are some insertions/deletions/substitutions in the edited protocols. However, the suggested timestamps from this approach are rather approximate due to:
To increase the accuracy of the suggested start and end timestamps of a speech, I also run a speaker diarization model on the audio file. This clusters the different speakers present in the recording and attempts to segment them. The output of this model is postprocessed into contiguous speech segments, and the "dominant" speaker (longest duration) is calculated within the speech segment regions that overlap with the timestamps from the previously mentioned fuzzy string matching, which yields diarization-adjusted timestamps.

In general this method produces really good and accurate results, but there are some failure cases.
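To make the two steps above a bit more concrete, here is a toy sketch of the idea. It is not the actual pipeline: Python's standard-library `difflib.SequenceMatcher` stands in for the real fuzzy matcher, the diarization output is assumed to already be a list of `(speaker, start, end)` tuples, and the function names `fuzzy_locate` and `refine_with_diarization` are made up for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_locate(needle_words, haystack):
    """Find the haystack window most similar to the needle, at word level.

    needle_words: non-empty list of normalized words from the protocol.
    haystack: non-empty list of (word, start_sec, end_sec) from the ASR output.
    Returns (similarity, approx_start_sec, approx_end_sec).
    """
    n = len(needle_words)
    hay_words = [w for w, _, _ in haystack]
    best_score, best_i = -1.0, 0
    # Brute-force sliding window; fine for a toy, too slow for real recordings.
    for i in range(max(1, len(hay_words) - n + 1)):
        window = hay_words[i:i + n]
        score = SequenceMatcher(None, needle_words, window, autojunk=False).ratio()
        if score > best_score:
            best_score, best_i = score, i
    last = min(best_i + n, len(haystack)) - 1
    return best_score, haystack[best_i][1], haystack[last][2]


def refine_with_diarization(start, end, segments):
    """Snap approximate timestamps to the dominant speaker's segments.

    segments: list of (speaker, seg_start_sec, seg_end_sec), assumed to be
    already postprocessed into contiguous speech segments.
    """
    overlap = {}
    for speaker, s, e in segments:
        shared = min(end, e) - max(start, s)
        if shared > 0:
            overlap[speaker] = overlap.get(speaker, 0.0) + shared
    if not overlap:
        return start, end  # nothing to adjust against
    dominant = max(overlap, key=overlap.get)
    hits = [(s, e) for spk, s, e in segments
            if spk == dominant and min(end, e) - max(start, s) > 0]
    return min(s for s, _ in hits), max(e for _, e in hits)


# Invented example: the spoken "verkligen" was edited out of the protocol.
spoken = ("välkomna herr talman jag vill verkligen tacka ministern "
          "för svaret tack då går vi vidare").split()
haystack = [(w, float(i), float(i) + 1.0) for i, w in enumerate(spoken)]
needle = "herr talman jag vill tacka ministern för svaret".split()
diarization = [("A", 0.0, 1.0), ("B", 1.0, 10.0), ("A", 10.0, 15.0)]

score, approx_start, approx_end = fuzzy_locate(needle, haystack)
adjusted = refine_with_diarization(approx_start, approx_end, diarization)
```

In this invented example the fuzzy window ends one word early because of the deleted word, and the diarization step extends the end timestamp to the boundary of the dominant speaker's segment.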
@MansMeg I'm not entirely sure about 1966 and 2002, but I think it may be a case of overcoverage, since I can't know for certain which debates/protocols were recorded. The provided figure of 366k total speeches is simply all speeches from protocols held on dates between:
Only part of the protocols in 1966 and 2002 are recorded. It's hard to know which without doing time-consuming detective work. I have simply set a date filter, and this can be the reason for the "overcoverage". It's not necessarily that the method failed, but rather that many of the speeches I include as "candidates" probably were not recorded at all.

I think the number of matched speeches could be improved by investigating which audio files and which protocols got a very low ratio of matches. There are lots of oddities and weird things in the data. For example, some dates on the audio files seem to be mistyped due to manual error when creating the files. There are also two audio files with invalid dates that sound like a recording of the devil (I'm guessing either a corrupt recording, or the files were mistakenly recorded/digitized in reverse, meaning the people are talking backwards).

It's rather time-consuming to look at all these individual failure cases, and I don't think I will spend much more time investigating them. If you have someone in your project interested in taking it further, I can possibly help them get started.
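For anyone who wants to pick this up, a starting point for finding the low-ratio audio files or protocols could look roughly like the sketch below. The record layout (dicts with `audio_file`, `protocol` and `matched` keys) and the function name `low_match_groups` are assumptions for illustration, not the actual format of my metadata files.

```python
from collections import defaultdict


def low_match_groups(candidates, key, threshold=0.5):
    """Return (group, match_ratio) pairs below the threshold, worst first.

    candidates: iterable of dicts with a boolean "matched" field and a
    grouping field named by `key` (e.g. "audio_file" or "protocol").
    """
    totals = defaultdict(int)
    matched = defaultdict(int)
    for row in candidates:
        totals[row[key]] += 1
        matched[row[key]] += bool(row["matched"])
    ratios = {group: matched[group] / totals[group] for group in totals}
    low = [(group, ratio) for group, ratio in ratios.items() if ratio < threshold]
    return sorted(low, key=lambda pair: pair[1])


# Invented records, only to show the call.
example = [
    {"audio_file": "a.mp3", "protocol": "p1", "matched": True},
    {"audio_file": "a.mp3", "protocol": "p1", "matched": True},
    {"audio_file": "b.mp3", "protocol": "p2", "matched": False},
    {"audio_file": "b.mp3", "protocol": "p2", "matched": False},
    {"audio_file": "c.mp3", "protocol": "p3", "matched": True},
    {"audio_file": "c.mp3", "protocol": "p3", "matched": False},
    {"audio_file": "c.mp3", "protocol": "p3", "matched": False},
    {"audio_file": "c.mp3", "protocol": "p3", "matched": False},
]
suspicious_files = low_match_groups(example, key="audio_file")
```

Groups with a ratio of zero are good candidates for mistyped dates or protocols that were never recorded in the first place, as opposed to real matching failures.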
@Lauler thank you for the thorough comment! I think one reason for the lower results in 1966–1970 is that they did not have microphones installed where MPs sat, only at the speaker's podium. Hence, since MPs were allowed to speak from the benches, it might not have been easy to record those speeches. From the start of the unicameral parliament, I think, they had microphones installed at each chair/bench.
Greetings fellow parliamentary corpus hackers. In my previous attempt at matching the text of speech protocols to media recordings, there were quite a few speeches where I could not find a match.
I'm happy to announce that relaxing the (possible) date ranges for when a speech could have been held has yielded positive results.
About 329k out of 366k speeches in the period 1966-2002 have been matched. From inspecting the data, I'm fairly certain that 95%+ of the estimated timestamps are correct.
I have formatted the metadata into 3 different formats.
Here are some stats on the number of speeches per year in protocols (nr_speeches_total) and the number of speeches that were matched (nr_matched_speeches):

I'll be aligning the text and audio at a more granular level in the coming weeks, with the goal of getting sentence-level timestamps.

It would be appreciated if someone from your project could help figure out how to properly contribute this metadata to SWERIK. The TEI Parla-CLARIN documentation for timestamps is not my favorite reading material 😄