
Performant way of knowing in which column(s) a value (string or number) appears #1214

@athalhammer

Description

This issue already has some history, documented in #1209.

The idea is the following. Say I have a long, wide table with user feedback and other content scattered across it, and I'd like to identify the columns that mention "service". The output of the command should then look as follows:

$ command xy "service"
7 567
102 89
3 256

To be read as: the word "service" appeared 7 times in column 567, 102 times in column 89, and 3 times in column 256. Now I can investigate each of these columns further.
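For illustration, the desired behaviour can be sketched with plain awk (a hypothetical stand-in for `command xy`; it splits naively on commas, so unlike csvkit it mishandles quoted fields that contain commas):

```shell
# Hypothetical sketch of the proposed "command xy": count, per column,
# how many cells contain the search term. Naive comma splitting only.
search_in_columns() {
  term="$1"
  file="$2"
  awk -F, -v term="$term" '
    NR > 1 {                           # skip the header row
      for (i = 1; i <= NF; i++)
        if (index($i, term) > 0)       # substring match, like csvgrep -m
          count[i]++
    }
    END { for (i in count) print count[i], i }
  ' "$file"
}

# Usage: search_in_columns service data.csv
```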

In #1209 I suggested adding some form of column numbering directly to all fields, so that other csvkit and shell commands could be used to produce this output. @jpmckinney suggested that this can be done with the following command, added it to the documentation (see c4e5612), and closed the PR.

csvgrep -m 222 -a -c 1- examples/realdata/FY09_EDU_Recipients_by_State.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 222
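For readers unfamiliar with the pipeline's plumbing: `csvformat -M $'\x1e'` terminates each CSV row with the ASCII Record Separator (0x1e), and `xargs -d $'\x1e' -n1` then hands the rows one at a time to a subshell. The delimiter mechanics can be seen with toy data and no csvkit at all (note that `-d` requires GNU xargs):

```shell
# Demonstrate the record-separator splitting used by the pipeline above:
# feed three 0x1e-delimited records through xargs, one per subshell call.
RS="$(printf '\036')"   # ASCII Record Separator, the same byte as $'\x1e'
printf 'row1%srow2%srow3' "$RS" "$RS" |
  xargs -d "$RS" -n1 sh -c 'echo "record: $1"' sh
```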

At first I was happy with the conclusion and the reuse of existing tools.

However, I quickly came to realize that this approach does not perform well. Here is an example:

Prep:

cp examples/realdata/FY09_EDU_Recipients_by_State.csv test.csv
for i in {1..200}; do tail -n+2 examples/realdata/FY09_EDU_Recipients_by_State.csv; done >> test.csv

This results in a little more than 10k lines.

Now I search for 679 (standing in for "service") using the suggested method and time it:

time csvgrep -m 679 -a -c 1- FY09_EDU_Recipients_by_State.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 679 | sort | uniq -c
    201   4: 679
    201   6: 679

real	1m4.553s
user	0m57.484s
sys	0m7.391s

Compare this to the method proposed in #1209:

# explain first what -N does:
csvcut -h | grep "\-N"
              [-C NOT_COLUMNS] [-x] [-N]
  -N, --column-numbers  Add column numbering everywhere, in header and cells.
  
# actual timing:
time csvcut -N ../csvkit/FY09_EDU_Recipients_by_State.csv | sed "s/.*\([0-9]\+~679\).*/\1/gp" -n | sort | uniq -c
    201 4~679
    201 6~679

real	0m0.236s
user	0m0.237s
sys	0m0.020s
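For simple CSVs, the cell tagging that the proposed `-N` flag performs (prefixing each cell with its column number and `~`, as in the `4~679` output above) can be approximated with plain awk — a rough sketch of the output format, not csvkit's actual implementation, and again without quoted-field handling:

```shell
# Approximate the proposed "N~value" cell tagging on CSVs without quoted
# fields: prefix every cell (header and data) with its column number.
number_columns() {
  awk -F, 'BEGIN { OFS = "," }
           { for (i = 1; i <= NF; i++) $i = i "~" $i; print }' "$1"
}

# Usage: number_columns data.csv
```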

The timings differ by roughly a minute, and this is not an extremely large file yet (and it has only a few columns).

Questions to the wider audience:

  1. Is this a relevant problem for someone else?
  2. Does anyone have a good, performant solution using existing tools?
  3. If we were to "add" to csvkit, which tool should/could it be?
