
Performant way of knowing in which column(s) a value (string or number) appears #1214

@athalhammer

Description

This issue already has some history, documented in #1209.

The idea is the following. Say I have a long, wide table with user feedback and other content scattered across it, and I'd like to identify the columns that mention "service". The output of the command should then look as follows:

$ command xy "service"
7 567
102 89
3 256

To be read as: the word "service" appeared 7 times in column 567, 102 times in column 89, and 3 times in column 256. Now I can investigate each of these columns further.
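For illustration, the desired behaviour can be sketched with plain awk (a hypothetical stand-in for `command xy`; it splits naively on commas, so unlike csvkit it mishandles quoted fields that contain commas):

```shell
# Hypothetical sketch of the proposed "command xy": count, per column,
# how many cells contain the search term. Naive comma splitting only.
search_in_columns() {
  term="$1"
  file="$2"
  awk -F, -v term="$term" '
    NR > 1 {                           # skip the header row
      for (i = 1; i <= NF; i++)
        if (index($i, term) > 0)       # substring match, like csvgrep -m
          count[i]++
    }
    END { for (i in count) print count[i], i }
  ' "$file"
}

# Usage: search_in_columns service data.csv
```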

In #1209 I suggested adding some form of column numbering directly to all fields, so that other csvkit and shell commands could be used to produce this output. @jpmckinney suggested that this can be done with the following command, added it to the documentation (see c4e5612), and closed the PR.

csvgrep -m 222 -a -c 1- examples/realdata/FY09_EDU_Recipients_by_State.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 222
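For readers unfamiliar with the pipeline's plumbing: `csvformat -M $'\x1e'` terminates each CSV row with the ASCII Record Separator (0x1e), and `xargs -d $'\x1e' -n1` then hands the rows one at a time to a subshell. The delimiter mechanics can be seen with toy data and no csvkit at all (note that `-d` requires GNU xargs):

```shell
# Demonstrate the record-separator splitting used by the pipeline above:
# feed three 0x1e-delimited records through xargs, one per subshell call.
RS="$(printf '\036')"   # ASCII Record Separator, the same byte as $'\x1e'
printf 'row1%srow2%srow3' "$RS" "$RS" |
  xargs -d "$RS" -n1 sh -c 'echo "record: $1"' sh
```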

At first I was happy with the conclusion and the reuse of existing tools.

However, I quickly came to realize that this approach does not perform well. Here is an example:

Prep:

cp examples/realdata/FY09_EDU_Recipients_by_State.csv test.csv
for i in {1..200}; do tail -n+2 examples/realdata/FY09_EDU_Recipients_by_State.csv; done >> test.csv

This results in a little more than 10k lines.

Now I search for 679 (standing in for "service") using the suggested method and time it:

time csvgrep -m 679 -a -c 1- FY09_EDU_Recipients_by_State.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 679 | sort | uniq -c
    201   4: 679
    201   6: 679

real	1m4.553s
user	0m57.484s
sys	0m7.391s

Compare this to the method proposed in #1209:

# explain first what -N does:
csvcut -h | grep "\-N"
              [-C NOT_COLUMNS] [-x] [-N]
  -N, --column-numbers  Add column numbering everywhere, in header and cells.
  
# actual timing:
time csvcut -N ../csvkit/FY09_EDU_Recipients_by_State.csv | sed "s/.*\([0-9]\+~679\).*/\1/gp" -n | sort | uniq -c
    201 4~679
    201 6~679

real	0m0.236s
user	0m0.237s
sys	0m0.020s
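For simple CSVs, the cell tagging that the proposed `-N` flag performs (prefixing each cell with its column number and `~`, as in the `4~679` output above) can be approximated with plain awk — a rough sketch of the output format, not csvkit's actual implementation, and again without quoted-field handling:

```shell
# Approximate the proposed "N~value" cell tagging on CSVs without quoted
# fields: prefix every cell (header and data) with its column number.
number_columns() {
  awk -F, 'BEGIN { OFS = "," }
           { for (i = 1; i <= NF; i++) $i = i "~" $i; print }' "$1"
}

# Usage: number_columns data.csv
```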

The timings differ by roughly a minute, and this is not an extremely large file yet (and it has only a few columns).

Questions to the wider audience:

  1. Is this a relevant problem for someone else?
  2. Does anyone have a good, performant solution using existing tools?
  3. If we were to "add" to csvkit, which tool should/could it be?
