Skip to content
This repository was archived by the owner on Jan 15, 2024. It is now read-only.
This repository was archived by the owner on Jan 15, 2024. It is now read-only.

(not a bug) question about bert create_pretraining_data.tokenize_lines() #1592

@kiukchung

Description

@kiukchung

Description

In the function scripts.pretraining.bert.create_pretraining_data.tokenize_lines()

The code snippet:

for line in lines:
        if not line:
            break
        line = line.strip()
        # Empty lines are used as document delimiters
        if not line:
            results.append([])
        else:
            #<OMITTED FOR BREVITY...>
    return results

Suggests that empty or null lines (e.g. "" or None) break the for-loop returning only the lines that have been processed so far whereas stripped-empty lines (e.g. " ") are used as document delimiters.

Could someone shed light as to what the (empty line + break-from-loop) is meant to accomplish? Are empty/null lines used as delimiters?

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions