This issue already has some track record, documented in #1209.
The idea is the following. Let's say I have a long, wide table with user feedback and other things scattered throughout it, and I'd like to identify the columns that mention "service". The output of the command should then look as follows:
$ command xy "service"
7 567
102 89
3 256
To be read as: in column 567, the word "service" appeared seven times; in column 89 it appeared 102 times; and in column 256 it appeared 3 times. Now I can investigate each of these columns further.
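To make the intent concrete, here is a minimal stand-alone Python sketch of the behavior I have in mind (the script and its substring-match semantics are just an illustration, not an existing csvkit tool):

```python
import csv
import sys
from collections import Counter

# Hypothetical stand-alone script, not an existing csvkit command.
def count_matches(path, term):
    """Count, per column, how many cells contain the search term."""
    counts = Counter()
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)              # skip the header row
        for row in reader:
            for i, cell in enumerate(row, start=1):
                if term in cell:        # "mentions" = substring match
                    counts[i] += 1
    return counts

if __name__ == "__main__":
    # e.g.: python count_matches.py table.csv "service"
    for column, count in count_matches(sys.argv[1], sys.argv[2]).items():
        print(count, column)
```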
In #1209 I suggested adding some sort of column numbering directly to all fields, so that other csvkit and shell commands could be used to get this output. @jpmckinney suggested that this can be done with the following command, added it to the documentation (see c4e5612), and closed the PR.
csvgrep -m 222 -a -c 1- examples/realdata/FY09_EDU_Recipients_by_State.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 222
At first I was happy with that conclusion and with the reuse of existing tools.
However, I quickly came to realize that it does not perform well: xargs -n1 launches a separate sh and csvcut (i.e. a fresh Python interpreter) for every matching record. Here is an example:
Prep:
cp examples/realdata/FY09_EDU_Recipients_by_State.csv test.csv
for i in {1..200}; do tail -n+2 examples/realdata/FY09_EDU_Recipients_by_State.csv; done >> test.csv
This results in a little more than 10k lines.
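(A quick `wc -l test.csv` can be used to double-check the line count.)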
Now I do a search for 679 (our "service" in this case) as per the suggested method and time it:
time csvgrep -m 679 -a -c 1- FY09_EDU_Recipients_by_State.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 679 | sort | uniq -c
201 4: 679
201 6: 679
real 1m4.553s
user 0m57.484s
sys 0m7.391s
Compare this to the method suggested in #1209, where -N prefixes every cell with its column number (e.g. the value 679 in column 4 becomes 4~679), so a single sed pass can pull out the column/value pairs:
# explain first what -N does:
csvcut -h | grep "\-N"
[-C NOT_COLUMNS] [-x] [-N]
-N, --column-numbers Add column numbering everywhere, in header and cells.
# actual timing:
time csvcut -N ../csvkit/FY09_EDU_Recipients_by_State.csv | sed "s/.*\([0-9]\+~679\).*/\1/gp" -n | sort | uniq -c
201 4~679
201 6~679
real 0m0.236s
user 0m0.237s
sys 0m0.020s
The two timings are roughly a minute apart, and this is not an extremely big file yet (it only has a few columns).
Questions to the wider audience:
- Is this a relevant problem for someone else?
- Does anyone have a good, well-performing solution with existing tools?
- If we were to "add" to csvkit, which tool should/could it be?