Script: parsing transcript .srt files into readable text #76

@jdstrausb


Hello,

I am working through an online class and trying to produce notes based on the instructional video content. Since many of the concepts covered in these videos are worth taking note of, I'm finding myself writing out nearly every line spoken by the instructor. Obviously, this process is laborious and extremely time-consuming. I am wondering if there is an easier way to extract the text from these videos using an srt tool to help parse and modify the text.

The syntax of the transcript files for each video is identical to the standard srt format. Here's an example:

1
00:00:00,710 --> 00:00:03,220
Rob just showed us how we can
make things accessible to

2
00:00:03,220 --> 00:00:05,970
anyone who can't use a mouse or
pointing device.

3
00:00:05,970 --> 00:00:09,130
Whether that's because it's any
type of physical impairment or

4
00:00:09,130 --> 00:00:11,510
a technology issue or
simply personal preference.

Does pysrt currently provide any tools for reformatting the text content into something more readable? To clarify, for the above example, I would like to remove the blank lines, the record-number lines, and the timestamp lines, and then join the remaining lines with single spaces so that each sentence runs on from the previous one, like so:

Rob just showed us how we can make things accessible to anyone who can't use a mouse or pointing device. Whether that's because it's any type of physical impairment or a technology issue or simply personal preference.

I would like to produce the output above from the example and be able to apply the same transformation to the other files in the series. I am pretty rusty with Python at the moment, though I believe this could be implemented fairly easily with an understanding of common string methods.
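For reference, here is a minimal sketch of what I have in mind, using only the standard library. The function name `srt_to_text` and the `TIMING` pattern are my own placeholders, not part of pysrt, and it assumes each cue is laid out as record number, timing line, then text, with cues separated by blank lines:

```python
import re

# Matches an SRT timing line such as "00:00:00,710 --> 00:00:03,220".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Join the caption text of an SRT string into a single paragraph."""
    # Normalise line endings and drop a leading byte-order mark, if any.
    srt = srt.replace("\r\n", "\n").lstrip("\ufeff").strip()
    pieces = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        # Drop the record number, then the timing line, when present.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        if lines and TIMING.match(lines[0]):
            lines = lines[1:]
        pieces.extend(lines)
    # A single space between lines also puts a space after each period.
    return " ".join(pieces)
```

For a file on disk, something like `srt_to_text(Path("lesson01.srt").read_text(encoding="utf-8"))` (with `Path` from `pathlib`, and a made-up filename) should work, and looping over `Path(".").glob("*.srt")` would cover the rest of the series.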

Can anyone contributing to this project let me know how this is done or if the functionality already exists in pysrt?
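If pysrt is the better route, my guess is that it would look roughly like the sketch below. As far as I can tell, `pysrt.open` returns a sequence of subtitle items that each expose a `.text` attribute; `join_cues` and `srt_file_to_text` are hypothetical helper names of my own, not pysrt functions:

```python
def join_cues(cues) -> str:
    """Join the .text of each cue into a single paragraph."""
    words = []
    for cue in cues:
        # A cue's text may span several lines; split() with no arguments
        # breaks on any whitespace, so line breaks collapse to spaces.
        words.extend(cue.text.split())
    return " ".join(words)


def srt_file_to_text(path: str) -> str:
    """Read an .srt file with pysrt and return its text as one paragraph."""
    import pysrt  # third-party dependency: pip install pysrt

    return join_cues(pysrt.open(path))
```

The joining step only relies on each item having a `.text` attribute, so it can be tried out on stand-in objects without any subtitle file at hand.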

Thanks!
