-
Notifications
You must be signed in to change notification settings - Fork 78
Description
Hello,
This is a very neat project! I was thinking "I should collect a bunch of CSV files from the web and do statistics to see what the dialects are, and their predominance, to be able to better detect them" and then I found your paper and Python package! Congrats on this very nice contribution.
I am trying to see how clevercsv performs on FEC data. For instance, let's consider this file:
https://www.fec.gov/files/bulk-downloads/1980/indiv80.zip
$ head -5 fec-indiv-1979–1980.csv
C00078279|A|M11|P|80031492155|22Y||MCKENNON, K R|MIDLAND|MI|00000|||10031979|400|||||CONTRIBUTION REF TO INDIVIDUAL|3062020110011466469
C00078279|A|M11||79031415137|15||OREFFICE, P|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1500||||||3061920110000382948
C00078279|A|M11||79031415137|15||DOWNEY, J|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|300||||||3061920110000382949
C00078279|A|M11||79031415137|15||BLAIR, E|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1000||||||3061920110000382950
C00078287|A|Q1||79031231889|15||BLANCHARD, JOHN A|CHICAGO|IL|60685|||03201979|200||||||3061920110000383914When I try to open the file with clevercsv, it takes an inordinate of time, and seems to be hanging. So I tried to use the sniffer as suggested in your example Binder.
# downloaded, unzipped and renamed to a CSV file from:
# https://www.fec.gov/files/bulk-downloads/1980/indiv80.zip
content = open("fec-indiv-1979–1980.csv").read()
clevercsv.Sniffer().sniff(content, verbose=True)It prints out this:
Running normal form detection ...
Not normal, has potential escapechar.
Running data consistency measure ...
and then a while later (a few minutes later) starts printing:
Considering 92 dialects.
SimpleDialect(',', '', ''): P = 22104.867952 T = 0.003101 Q = 68.546737
SimpleDialect(',', '"', ''): P = 13927.762095 T = 0.003668 Q = 51.090510
SimpleDialect(',', '"', '/'): P = 13839.682333 T = 0.002461 Q = 34.060128
SimpleDialect(',', "'", ''): P = 12072.093333 T = 0.003278 Q = 39.571560
SimpleDialect(';', '', ''): P = 106.613556 T = 0.000003 Q = 0.000345
SimpleDialect(';', '"', ''): P = 99.261000 T = 0.000000 Q = 0.000000
SimpleDialect(';', '"', '/'): P = 50.238917 skip.
SimpleDialect(';', "'", ''): P = 49.981222 skip.
SimpleDialect('', '', ''): P = 308.696000 T = 0.000000 Q = 0.000000
SimpleDialect('', '"', ''): P = 194.530000 T = 0.000000 Q = 0.000000
SimpleDialect('', '"', '/'): P = 96.652000 T = 0.000000 Q = 0.000000
SimpleDialect('', "'", ''): P = 144.787000 T = 0.000000 Q = 0.000000
SimpleDialect(' ', '', ''): P = 17818.683137 T = 0.346978 Q = 6182.686103
SimpleDialect(' ', '', '/'): P = 17818.565863 T = 0.346984 Q = 6182.762051
SimpleDialect(' ', '"', ''): P = 11300.749933 T = 0.353544 Q = 3995.309179
SimpleDialect(' ', '"', '/'): P = 10372.973520 T = 0.355343 Q = 3685.960429
SimpleDialect(' ', "'", ''): P = 7231.699311 T = 0.343354 Q = 2483.032090
SimpleDialect(' ', "'", '/'): P = 7231.658120 T = 0.343362 Q = 2483.075319
SimpleDialect('#', '', ''): P = 163.330000 skip.
SimpleDialect('#', '"', ''): P = 103.253000 skip.
SimpleDialect('#', '"', '/'): P = 67.761333 skip.
SimpleDialect('#', "'", ''): P = 78.132000 skip.
SimpleDialect('$', '', ''): P = 155.096500 skip.
SimpleDialect('$', '"', ''): P = 97.764000 skip.
SimpleDialect('$', '"', '/'): P = 64.601000 skip.
SimpleDialect('$', "'", ''): P = 72.892500 skip.
SimpleDialect('%', '', ''): P = 104.950222 skip.
SimpleDialect('%', '', '\\'): P = 104.783889 skip.
SimpleDialect('%', '"', ''): P = 65.896889 skip.
SimpleDialect('%', '"', '/'): P = 65.765333 skip.
SimpleDialect('%', '"', '\\'): P = 65.730556 skip.
SimpleDialect('%', "'", ''): P = 49.648556 skip.
SimpleDialect('%', "'", '\\'): P = 49.482222 skip.
SimpleDialect('&', '', ''): P = 2940.570750 skip.
SimpleDialect('&', '', '/'): P = 2940.446000 skip.
SimpleDialect('&', '"', ''): P = 1936.209667 skip.
SimpleDialect('&', '"', '/'): P = 1441.305200 skip.
SimpleDialect('&', "'", ''): P = 1340.900250 skip.
SimpleDialect('&', "'", '/'): P = 1340.775500 skip.
SimpleDialect('*', '', ''): P = 156.344000 skip.
SimpleDialect('*', '"', ''): P = 97.514500 skip.
SimpleDialect('*', '"', '/'): P = 65.599000 skip.
SimpleDialect('*', "'", ''): P = 73.142000 skip.
SimpleDialect('+', '', ''): P = 156.344000 skip.
SimpleDialect('+', '', '\\'): P = 156.094500 skip.
SimpleDialect('+', '"', ''): P = 99.011500 skip.
SimpleDialect('+', '"', '/'): P = 65.266333 skip.
SimpleDialect('+', '"', '\\'): P = 99.011500 skip.
SimpleDialect('+', "'", ''): P = 73.890500 skip.
SimpleDialect('+', "'", '\\'): P = 73.890500 skip.
SimpleDialect('-', '', ''): P = 1456.570500 skip.
SimpleDialect('-', '"', ''): P = 914.921750 skip.
SimpleDialect('-', '"', '/'): P = 700.916267 skip.
SimpleDialect('-', "'", ''): P = 687.513667 skip.
SimpleDialect(':', '', ''): P = 155.096500 skip.
SimpleDialect(':', '"', ''): P = 97.514500 skip.
SimpleDialect(':', '"', '/'): P = 64.933667 skip.
SimpleDialect('<', '', ''): P = 155.845000 skip.
SimpleDialect('<', '"', ''): P = 97.764000 skip.
SimpleDialect('<', '"', '/'): P = 65.100000 skip.
SimpleDialect('<', "'", ''): P = 73.391500 skip.
SimpleDialect('?', '', ''): P = 155.595500 skip.
SimpleDialect('?', '"', ''): P = 98.512500 skip.
SimpleDialect('?', '"', '/'): P = 96.652000 skip.
SimpleDialect('@', '', ''): P = 156.344000 skip.
SimpleDialect('@', '"', ''): P = 98.762000 skip.
SimpleDialect('@', '"', '/'): P = 64.767333 skip.
SimpleDialect('@', "'", ''): P = 73.391500 skip.
SimpleDialect('\\', '', ''): P = 105.282889 skip.
SimpleDialect('\\', '"', ''): P = 66.063222 skip.
SimpleDialect('\\', '"', '/'): P = 66.098000 skip.
SimpleDialect('\\', "'", ''): P = 74.140000 skip.
SimpleDialect('^', '', ''): P = 154.597500 skip.
SimpleDialect('^', '"', ''): P = 97.514500 skip.
SimpleDialect('^', '"', '/'): P = 64.601000 skip.
SimpleDialect('^', "'", ''): P = 72.643000 skip.
SimpleDialect('_', '', ''): P = 156.094500 skip.
SimpleDialect('_', '"', ''): P = 98.013500 skip.
SimpleDialect('_', '"', '/'): P = 65.100000 skip.
SimpleDialect('_', "'", ''): P = 73.391500 skip.
SimpleDialect('|', '', ''): P = 293996.190476 T = 0.946519 Q = 278273.106576
SimpleDialect('|', '', '/'): P = 146998.094048 skip.
SimpleDialect('|', '', '@'): P = 146998.094048 skip.
SimpleDialect('|', '', '\\'): P = 146998.092857 skip.
SimpleDialect('|', '"', ''): P = 185266.666667 skip.
SimpleDialect('|', '"', '/'): P = 46024.763214 skip.
SimpleDialect('|', '"', '@'): P = 92633.332143 skip.
SimpleDialect('|', '"', '\\'): P = 92633.330952 skip.
SimpleDialect('|', "'", ''): P = 12535.572981 skip.
SimpleDialect('|', "'", '/'): P = 12535.572981 skip.
SimpleDialect('|', "'", '@'): P = 12535.572765 skip.
SimpleDialect('|', "'", '\\'): P = 12535.572981 skip.
I would say this takes about 30 minutes, and finally it concludes:
SimpleDialect('|', '', '')
I think I understand what's going on: You designed this for small-ish datasets, and so you reprocess the whole file for every dialog to determine what makes most sense.
I would be tempted to think this is because I feed the data as a variable content, following your example, rather than provide the filename directly. However when I tried to read the read_csv method directly with the filename, it also was really, very, very slow. So I think in all situations currently, clevercsv trips on this file, and more generally this type of file.
When I take the initiative to truncate the data arbitrarily, clevercsv works beautifully. But shouldn't the truncating be something the library, as opposed to the user does?
clevercsv.Sniffer().sniff(content[0:1000], verbose=True)provides in a few seconds:
Running normal form detection ...
Didn't match any normal forms.
Running data consistency measure ...
Considering 4 dialects.
SimpleDialect(',', '', ''): P = 4.500000 T = 0.000000 Q = 0.000000
SimpleDialect('', '', ''): P = 0.009000 T = 0.000000 Q = 0.000000
SimpleDialect(' ', '', ''): P = 1.562500 T = 0.312500 Q = 0.488281
SimpleDialect('|', '', ''): P = 8.571429 T = 0.952381 Q = 8.163265
SimpleDialect('|', '', '')
@GjjvdBurg If this is not a known problem, may I suggest using some variation of "infinite binary search"?
- We start with a small size, and we truncate the file to that small size.
- The detected dialect may be wrong because we are only working on a small subset of the file, so we double the amount of content that we provide the sniffer, and see if it answers the same thing.
- We repeat a predetermined number of times (for instance 4 times), until we've asserted the sniffer has detected the same dialect for larger portions of the file.
I have implemented this algorithm here:
https://gist.github.com/jlumbroso/c123a30a2380b58989c7b12fe4b4f49e
When I run it on the above mentioned file, it immediately (without futzing) produces the correct answer:
In[3]: probe_sniff(content)
Out[3]: SimpleDialect('|', '', '')And on the off-chance you would like me to add this algorithm to the codebase, where would it go?