Captions whose text begins with Line Separator character are parsed as blank string #87

@ontl


I occasionally see SRT files in which one or two captions begin with the Line Separator character, U+2028. PySRT incorrectly parses those captions as blank.

I believe the character originates in Word and is carried over when a transcript is copy-pasted into YouTube to use YouTube's transcript auto-timing function.

This character seems to act as a normal line break when it appears in the middle or at the end of a caption; the issue arises only when it is the first character of the caption.
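One possible explanation, which I have not confirmed against the PySRT source, is that Python's own `str.splitlines()` treats U+2028 as a line boundary. If the parser relies on that (an assumption), a caption that starts with the separator would appear to start with an empty line. A quick check in plain Python, using made-up caption strings:

```python
# U+2028 (LINE SEPARATOR) is one of the boundaries str.splitlines() splits on.
text_leading = "\u2028This caption starts with the separator"
text_middle = "This caption has a separator here:\u2028which is fine"

leading_lines = text_leading.splitlines()
middle_lines = text_middle.splitlines()

print(leading_lines)  # the first element is an empty string
print(middle_lines)   # two non-empty lines
```

The leading case yields an empty first line, which would match the blank-caption symptom; the mid-caption case just splits into two text lines.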

I think the parser should ignore this character.

VLC, for the record, ignores it and displays the caption normally.

Gotchas:
It may make sense to pre-process the file, replacing U+2028 with a more compatible line break such as \n. We should be careful, though, not to inadvertently trigger the blank-line state outlined in Issue 71 by making a caption start with \n.
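A minimal sketch of that pre-processing idea, assuming it runs on the raw file text before parsing. `normalize_line_separators` is a hypothetical helper, not part of PySRT. It drops separators that sit at the start or end of a line (so no caption ends up starting with \n) and turns the remaining ones into \n:

```python
import re

LINE_SEPARATOR = "\u2028"

# Separators at the start of a line, or directly before a line ending.
# With re.MULTILINE, ^ matches only at the start of the text and after "\n".
_EDGE_SEPARATORS = re.compile(
    "^" + LINE_SEPARATOR + "+|" + LINE_SEPARATOR + "+(?=\r?$)",
    re.MULTILINE,
)
# Any remaining run of separators inside a line.
_INNER_SEPARATORS = re.compile(LINE_SEPARATOR + "+")


def normalize_line_separators(srt_text: str) -> str:
    """Hypothetical helper: make U+2028 safe before handing text to a parser."""
    # Remove edge separators outright; replacing them with "\n" would create
    # an empty line, which is the blank-caption state from Issue 71.
    without_edges = _EDGE_SEPARATORS.sub("", srt_text)
    # Collapse each remaining run into a single ordinary line break.
    return _INNER_SEPARATORS.sub("\n", without_edges)


sample = (
    "1\n00:00:08,330 --> 00:00:13,653\n"
    "\u2028This caption starts with the separator\n\n"
    "2\n00:00:13,653 --> 00:00:18,305\n"
    "This caption has a separator here:\u2028which is fine\n"
)
cleaned = normalize_line_separators(sample)
print(cleaned)
```

Collapsing runs of separators into one \n is a design choice here: two consecutive line breaks inside a caption would otherwise look like the blank line that ends a caption block.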

Example SRT that exhibits this problem:

1
00:00:08,330 --> 00:00:13,653

This caption starts with the character
u2028, which causes PySRT to see it as blank.

2
00:00:13,653 --> 00:00:18,305
This caption has a u2028 here:
 which does not cause issues.

3
00:00:18,305 --> 00:00:22,906

This caption starts with a normal line break; VLC
and PySRT show it as blank as per Issue 71.

Output:

  • Caption 1: VLC displays the caption, PySRT parses it as blank
  • Caption 2: VLC and PySRT display the caption
  • Caption 3: VLC and PySRT show the caption as blank
