Surprised by type inference: csvjoin removing underscores in columns containing values like: 1_100 #1246

@joeweaver

Hello,

I've spent a while figuring out why underscores were getting removed in what I believe is a fairly simple use case. I'm using csvjoin 2.0.0.

Here's a minimal example:

file1.csv contents:

Name,Type
MHYS,foo
JABI,bar

file2.csv contents:

Name,ID
MHYS,100_1
JABI,1030_11

Running csvjoin -c Name file1.csv file2.csv produces the following:

Name,Type ,ID
MHYS,foo,1001
JABI,bar,103011

The underscores in the ID field are getting dropped. I've tracked this down to type inference. Running csvjoin with the --no-inference option produces the desired behaviour.
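
A possible explanation, offered as an assumption since I have not traced it through csvkit's source: Python's own numeric parsers accept underscores as digit-group separators (PEP 515, Python 3.6+), so a value like 100_1 is a perfectly valid number to `int()` and `decimal.Decimal()`. The sketch below uses only the standard library to show that behaviour. The helper `looks_numeric` is a hypothetical name for illustration and is not csvkit's actual inference code.

```python
from decimal import Decimal, InvalidOperation


def looks_numeric(text):
    """Return the parsed Decimal if Python accepts `text` as a number, else None.

    Hypothetical helper for illustration; not taken from csvkit.
    """
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


# The two ID values from file2.csv, plus a plain text value for contrast.
ids = ["100_1", "1030_11", "MHYS"]
parsed = {value: looks_numeric(value) for value in ids}

for value, number in parsed.items():
    print(f"{value!r} -> {number!r}")
```

If the inference step relies on a parser like this, any column whose values are all digits and underscores would be classified as numeric and written back out without the underscores, which matches what I'm seeing.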

I was a bit surprised by this, as it seems a rather aggressive default for what I believe to be a very common text pattern. I've had my share of being bitten by type inference in the tidyverse and when using pandas, but these sorts of fields were never an issue.

Finding a type inference method that handles all situations perfectly is a pipe dream, and I don't have an exact solution, but I'd like to point out that:

  1. This happens silently. I was lucky to catch the error while debugging my data pipeline.
  2. Figuring out the root cause was a bit of a time sink. My main reason for filing this issue is to ensure that even if there isn't a good way to fix the issue, this post may help others searching for 'missing/dropped/removed underscores'.
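
Since the change is silent, a cheap guard in a pipeline is to compare the text of a column before and after the join. The following is a minimal sketch using only Python's `csv` module. The function name `altered_values` is hypothetical, and the CSV text is inlined from the example above (with the output header written as `Name,Type,ID`) so that the snippet runs on its own; in practice you would read the real files.

```python
import csv
import io


def altered_values(source_csv, joined_csv, key, column):
    """Return (key, before, after) for rows whose `column` text changed.

    Hypothetical helper: both arguments are CSV text, compared as raw strings.
    """
    before = {
        row[key]: row[column]
        for row in csv.DictReader(io.StringIO(source_csv))
    }
    changes = []
    for row in csv.DictReader(io.StringIO(joined_csv)):
        original = before.get(row[key])
        if original is not None and original != row[column]:
            changes.append((row[key], original, row[column]))
    return changes


# Input and joined output copied from the minimal example in this issue.
file2 = "Name,ID\nMHYS,100_1\nJABI,1030_11\n"
joined = "Name,Type,ID\nMHYS,foo,1001\nJABI,bar,103011\n"

changes = altered_values(file2, joined, key="Name", column="ID")
for name, old, new in changes:
    print(f"{name}: {old!r} became {new!r}")
```

A check like this would have flagged both rows immediately instead of leaving the problem to be found during debugging.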
