read_fwf() incorrectly parses fixed-width files with multi-byte UTF-8 characters #1580

@alearrigo

Problem description:

When reading fixed-width files containing multi-byte UTF-8 characters (such as ° or accented letters), read_fwf() appears to count byte positions instead of character positions. The field containing the multi-byte character is cut short, and any subsequent columns on that line are misaligned.
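
To make the byte-versus-character distinction concrete, here is a minimal base-R sketch that slices line 2 of the example both ways. It is only an illustration and does not show readr's internals; the variable names are made up for this example.

```r
# Illustrative only: compare character-based and byte-based slicing of one line.
line <- enc2utf8("BBBB12\u00b0456")   # \u00b0 is the degree sign (°)

n_chars <- nchar(line, type = "chars")   # number of characters
n_bytes <- nchar(line, type = "bytes")   # number of bytes in UTF-8

# Positions 7-10 counted in characters
by_char <- substr(line, 7, 10)

# Positions 7-10 counted in bytes
by_byte <- rawToChar(charToRaw(line)[7:10])
Encoding(by_byte) <- "UTF-8"             # mark the raw bytes as UTF-8 text

print(c(chars = n_chars, bytes = n_bytes))
print(c(by_char = by_char, by_byte = by_byte))
```

The byte-based slice matches the truncated value that read_fwf() returns for field3, which is consistent with field boundaries being applied to bytes.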

# Create a minimal reproducible example
library(readr)

# Each line should be exactly 10 characters long
test_lines <- c(
  "AAAA123456",           # Line 1: 10 ASCII characters
  "BBBB12°456",           # Line 2: 10 characters, but 11 bytes (° is 2 bytes in UTF-8)
  "CCCC123456"            # Line 3: 10 ASCII characters
)

# Write to temporary file
temp_file <- tempfile(fileext = ".txt")
writeLines(test_lines, temp_file, sep = "\n")

# Define column positions for a 10-character fixed-width format
# Field 1: positions 1-4
# Field 2: positions 5-6 
# Field 3: positions 7-10

# Read with read_fwf
result <- read_fwf(
  temp_file,
  fwf_cols(
    field1 = c(1, 4),
    field2 = c(5, 6),
    field3 = c(7, 10)
  ),
  col_types = cols(.default = col_character())
)

print(result)

# Expected output:
# field1 field2 field3
# AAAA   12     3456
# BBBB   12     °456  
# CCCC   12     3456

# Actual output:
# field1 field2 field3
# AAAA   12     3456
# BBBB   12     °45    # Wrong! field3 should be "°456"
# CCCC   12     3456

# The issue:
# Line 2 contains "°", which is encoded as 2 bytes in UTF-8 (0xC2 0xB0).
# read_fwf() counts it as 2 positions instead of 1, so every subsequent
# field boundary on that line shifts by 1 position.

# Verification - check actual byte lengths:
cat("Character lengths:\n")
for (i in seq_along(test_lines)) {
  cat(sprintf("Line %d: %d chars, %d bytes\n", 
              i, nchar(test_lines[i]), nchar(test_lines[i], type = "bytes")))
}
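
A possible interim workaround, sketched under the assumption that the file is UTF-8 and small enough to read into memory: split each line with base R's substr(), which counts characters rather than bytes. The helper name read_fwf_chars and its arguments are hypothetical and not part of readr.

```r
# Hypothetical helper (not part of readr): split fixed-width lines by
# character position using substr(), which counts characters, not bytes.
read_fwf_chars <- function(path, starts, ends, col_names) {
  lines <- readLines(path, encoding = "UTF-8")
  cols <- Map(function(s, e) trimws(substr(lines, s, e)), starts, ends)
  names(cols) <- col_names
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Recreate the example file; useBytes = TRUE writes the UTF-8 bytes unchanged.
tmp <- tempfile(fileext = ".txt")
writeLines(enc2utf8(c("AAAA123456", "BBBB12\u00b0456", "CCCC123456")),
           tmp, useBytes = TRUE)

res <- read_fwf_chars(
  tmp,
  starts = c(1, 5, 7),
  ends = c(4, 6, 10),
  col_names = c("field1", "field2", "field3")
)
print(res)
```

This returns a plain data.frame of character columns and skips readr's type guessing, so it is a stopgap rather than a replacement for read_fwf().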


sessioninfo::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.2 (2024-10-31)
 os       macOS Sequoia 15.5
 system   aarch64, darwin20
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Rome
 date     2025-07-16
 rstudio  2025.05.0+496 Mariposa Orchid (desktop)
 pandoc   3.2.1 @ /opt/homebrew/bin/pandoc
 quarto   1.6.40 @ /usr/local/bin/quarto

─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────
 ! package      * version  date (UTC) lib source
 P arrow          20.0.0.2 2025-05-26 [?] CRAN (R 4.4.1)
 P assertthat     0.2.1    2019-03-21 [?] CRAN (R 4.4.0)
   bit            4.6.0    2025-03-06 [1] CRAN (R 4.4.1)
   bit64          4.6.0-1  2025-01-16 [1] CRAN (R 4.4.1)
   cli            3.6.4    2025-02-13 [1] CRAN (R 4.4.1)
   crayon         1.5.3    2024-06-20 [1] CRAN (R 4.4.1)
 P dplyr        * 1.1.4    2023-11-17 [?] RSPM (R 4.4.0)
 P farver         2.1.2    2024-05-13 [?] RSPM (R 4.4.0)
 P forcats      * 1.0.0    2023-01-29 [?] RSPM (R 4.4.0)
 P generics       0.1.4    2025-05-09 [?] CRAN (R 4.4.1)
   ggplot2      * 3.5.0    2024-02-23 [1] CRAN (R 4.3.1)
   glue           1.8.0    2024-09-30 [1] CRAN (R 4.4.1)
 P gtable         0.3.6    2024-10-25 [?] RSPM (R 4.4.0)
 P hms            1.1.3    2023-03-21 [?] CRAN (R 4.4.0)
 P lifecycle      1.0.4    2023-11-07 [?] RSPM (R 4.4.0)
   lubridate    * 1.9.4    2024-12-08 [1] CRAN (R 4.4.1)
 P magrittr       2.0.3    2022-03-30 [?] CRAN (R 4.4.0)
   pillar         1.10.1   2025-01-07 [1] CRAN (R 4.4.1)
 P pkgconfig      2.0.3    2019-09-22 [?] RSPM (R 4.4.0)
   purrr        * 1.0.4    2025-02-05 [1] CRAN (R 4.4.1)
   R6             2.6.1    2025-02-15 [1] CRAN (R 4.4.1)
 P RColorBrewer   1.1-3    2022-04-03 [?] CRAN (R 4.4.0)
   readr        * 2.1.5    2024-01-10 [1] CRAN (R 4.4.0)
   renv           1.0.0    2023-07-07 [1] CRAN (R 4.3.2)
   rlang          1.1.5    2025-01-17 [1] CRAN (R 4.4.1)
 P rstudioapi     0.17.1   2024-10-22 [?] CRAN (R 4.4.1)
 P scales         1.4.0    2025-04-24 [?] CRAN (R 4.4.1)
 P sessioninfo    1.2.3    2025-02-05 [?] CRAN (R 4.4.1)
   stringi        1.8.4    2024-05-06 [1] CRAN (R 4.4.1)
 P stringr      * 1.5.1    2023-11-14 [?] CRAN (R 4.4.0)
 P tibble       * 3.3.0    2025-06-08 [?] CRAN (R 4.4.1)
   tidyr        * 1.3.1    2024-01-24 [1] CRAN (R 4.4.1)
   tidyselect     1.2.1    2024-03-11 [1] CRAN (R 4.4.0)
 P tidyverse    * 2.0.0    2023-02-22 [?] CRAN (R 4.4.0)
   timechange     0.3.0    2024-01-18 [1] CRAN (R 4.4.1)
   tzdb           0.5.0    2025-03-15 [1] CRAN (R 4.4.1)
 P vctrs          0.6.5    2023-12-01 [?] CRAN (R 4.4.0)
 P vroom          1.6.5    2023-12-05 [?] CRAN (R 4.4.0)
   withr          3.0.2    2024-10-28 [1] CRAN (R 4.4.1)

 [1] /Users/alessandroarrigo/Documents/GitHub/VedaWare_Policlinico/renv/library/R-4.4/aarch64-apple-darwin20
 [2] /Users/alessandroarrigo/Library/Caches/org.R-project.R/R/renv/sandbox/R-4.4/aarch64-apple-darwin20/84ba8b13

 * ── Packages attached to the search path.
 P ── Loaded and on-disk path mismatch.
