
Line number problems in C # notranslate exclusions #323

@aitap

The following message seems to be missing from data.table.pot:

https://github.com/Rdatatable/data.table/blob/b7f2106efe038d93577f427f34c06d9c00b4c486/src/fread.c#L2775

The code seems to consider this message to be subject to a # notranslate exclusion:

src_messages = drop_excluded(src_messages, exclusions[is_outside_char_array(exclusion_pos, arrays)])

debug: src_messages = drop_excluded(src_messages, exclusions[is_outside_char_array(exclusion_pos,
    arrays)])
Browse[1]> src_messages[grepl('sep=', msgid)] # <-- row 3 here
                                                                      msgid msgid_plural   fname
                                                                     <char>       <list>  <char>
1:   sep='\\\\n' passed in meaning read lines as single character column\\n       [NULL] DTPRINT
2:                                             sep=',' so dec set to '.'\\n       [NULL] DTPRINT
3:                                                    %8.3fs (%3.0f%%) sep=       [NULL] DTPRINT
                                                                                     call array_start is_marked_for_translation line_number
                                                                                   <char>       <int>                    <lgcl>       <int>
1: DTPRINT(_("  sep='\\\\n' passed in meaning read lines as single character column\\n"))       71163                      TRUE        1674
2:                                           DTPRINT(_("  sep=',' so dec set to '.'\\n"))       83411                      TRUE        1892
3:           DTPRINT(_("%8.3fs (%3.0f%%) sep="), tLayout-tMap, 100.0*(tLayout-tMap)/tTot)      129888                      TRUE        2775
Browse[1]> n
<...>
Browse[1]> src_messages[grepl('sep=', msgid)] # <-- one row less now!
                                                                      msgid msgid_plural   fname
                                                                     <char>       <list>  <char>
1:   sep='\\\\n' passed in meaning read lines as single character column\\n       [NULL] DTPRINT
2:                                             sep=',' so dec set to '.'\\n       [NULL] DTPRINT
                                                                                     call array_start is_marked_for_translation line_number
                                                                                   <char>       <int>                    <lgcl>       <int>
1: DTPRINT(_("  sep='\\\\n' passed in meaning read lines as single character column\\n"))       71163                      TRUE        1674
2:                                           DTPRINT(_("  sep=',' so dec set to '.'\\n"))       83411                      TRUE        1892
Browse[1]> exclusions[is_outside_char_array(exclusion_pos, arrays)]
          file line1 capture_lengths
        <char> <int>           <int>
1: src/fread.c   438               0
2: src/fread.c  1366               0
3: src/fread.c  1733               0
4: src/fread.c  1783               0
5: src/fread.c  2111               0
6: src/fread.c  2119               0
7: src/fread.c  2305               0
8: src/fread.c  2775               0 # <-- why is line 2775 excluded?
9: src/fread.c  2794               0
Browse[1]> readChar(file, file.size(file)) |> substr(exclusion_pos[8]-32, exclusion_pos[8]+16)
[1] "\n      DTPRINT(\"  =====\\n\"); // # notranslate\n   " # <-- exclusion no.8 corresponds to a different line!

Since the exclusions are matched against the original, non-preprocessed file contents:

exclusion_pos = gregexpr("# notranslate( (start|end))?", contents, perl=TRUE)[[1L]]

...and the newlines are matched in the preprocessed file contents, where they have different offsets due to the comments being removed:

contents_char = preprocess(strsplit(contents, NULL)[[1L]])
# as a single string
contents = paste(contents_char, collapse = "")
# NB: should still be fine to look only for \n on windows
newlines_loc = c(0L, as.integer(gregexpr("\n", contents, fixed = TRUE)[[1L]]))

...the line numbers produced from exclusion_pos and newlines_loc end up being incorrect:

exclusions = data.table(
  file = file,
  line1 = findInterval(as.integer(exclusion_pos), newlines_loc),
  capture_lengths = attr(exclusion_pos, "capture.length")[ , 1L]
)
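
The mismatch can be reproduced on a toy input. This is only a sketch: the four-line source text is made up, and the `gsub()` call is a hypothetical stand-in for `preprocess()` that strips `//` comments and nothing else.

```r
# Toy C-like source; the "# notranslate" marker sits on line 3.
original = paste(
  'foo(); // a long comment that preprocessing would strip',
  'bar(); // another long comment',
  'baz("  =====\\n"); // # notranslate',
  'qux(_("translate me"));',
  '',
  sep = "\n"
)

# Hypothetical stand-in for preprocess(): drop // comments up to end of line.
stripped = gsub("//[^\n]*", "", original)

# Offsets of newlines in a string, with a leading 0 as in the code above.
newline_pos = function(x) c(0L, as.integer(gregexpr("\n", x, fixed = TRUE)[[1L]]))

# The marker offset is taken from the ORIGINAL text, as in the issue.
marker_pos = as.integer(regexpr("# notranslate", original, fixed = TRUE))

# Mixed lookup: original offset against newlines of the stripped text.
line_mixed = findInterval(marker_pos, newline_pos(stripped))
# Consistent lookup: offset and newlines come from the same text.
line_consistent = findInterval(marker_pos, newline_pos(original))

print(c(mixed = line_mixed, consistent = line_consistent))
```

Stripping comments moves every later newline to a smaller offset, so the mixed lookup reports a line further down the file than the one that actually carries the marker.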

Matching exclusions against the original file would have given the correct line number:

Browse[1]> newlines_loc2 = c(0L, as.integer(gregexpr("\n", readChar(file, file.size(file)), fixed = TRUE)[[1L]]))
Browse[1]> data.table(
      file = file,
      line1 = findInterval(as.integer(exclusion_pos), newlines_loc2),
      capture_lengths = attr(exclusion_pos, "capture.length")[ , 1L]
    )[8]
          file line1 capture_lengths
        <char> <int>           <int>
1: src/fread.c  2113               0
Browse[1]> readLines(file)[2113]
[1] "      DTPRINT(\"  =====\\n\"); // # notranslate"
Browse[1]>

...but there must be a better solution, one that is compatible with preprocessing.
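
One possible direction, purely as a sketch and not a claim about how the package should do it: if preprocessing replaced comments with spaces of equal length instead of deleting them, offsets found in the original text would stay valid in the preprocessed text. `blank_line_comments()` below is a hypothetical helper; it only handles `//` comments and ignores `/* */` comments as well as `//` inside string literals.

```r
# Hypothetical helper: overwrite each // comment with the same number of
# spaces, so that every remaining character keeps its original offset.
blank_line_comments = function(x) {
  m = gregexpr("//[^\n]*", x)
  regmatches(x, m) = lapply(regmatches(x, m), function(s) strrep(" ", nchar(s)))
  x
}

# Made-up input with the marker on line 2.
original = paste(
  'foo(); // a comment',
  'baz("  =====\\n"); // # notranslate',
  'qux(_("translate me"));',
  sep = "\n"
)
blanked = blank_line_comments(original)

newline_pos = function(x) c(0L, as.integer(gregexpr("\n", x, fixed = TRUE)[[1L]]))
marker_pos = as.integer(regexpr("# notranslate", original, fixed = TRUE))

# Newline offsets are unchanged, so the lookup agrees with the original file.
marker_line = findInterval(marker_pos, newline_pos(blanked))
same_newlines = identical(newline_pos(original), newline_pos(blanked))
```

The trade-off is that the preprocessed text still contains runs of blanks where the comments were, which any later parsing step would have to tolerate.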
