--float-precision Not Being Considered in qsv sqlp #2644

brian-mendicino · 2025-03-31T16:17:13Z

brian-mendicino
Mar 31, 2025

Describe the bug
The --float-precision option in the qsv sqlp command does not appear to be applied when converting CSV files to Parquet format. Despite specifying a precision value (e.g., 16), the resulting Parquet file always uses the default precision of 6, leading to truncated decimal values.

To Reproduce

Run the following command:
qsv sqlp /tmp/table.csv 'select * from table' --float-precision 16 --infer-len 0 --format parquet --output table.parquet
Inspect the resulting table.parquet file.
Observe that decimal values are truncated to 6 decimal places, regardless of the --float-precision value.

Expected behavior
The --float-precision option should ensure that floating-point numbers in the Parquet file retain the specified precision (e.g., 16 decimal places).

Desktop (please complete the following information):

OS: Debian GNU/Linux 12
qsv Version 3.1.1-mimalloc-polars-0.46.0

Additional Notes
Ideally there would be an option to retain precision, skipping truncation...


jqnatividad · 2025-03-31T18:06:31Z

jqnatividad
Mar 31, 2025
Maintainer

Currently, the sqlp --float-precision option only applies to the CSV output format.

Looking at the Polars Parquet writer (https://docs.pola.rs/api/rust/dev/polars/prelude/struct.ParquetWriter.html), a float-precision setting is not currently available there.

Will explore setting the precision on the DataFrame and see if it's preserved downstream when saving to Parquet.

0 replies

brian-mendicino · 2025-03-31T18:26:05Z

brian-mendicino
Mar 31, 2025
Author

Thanks for the clarification.

In the meantime, could you confirm if there are any workarounds or alternative approaches to explicitly control the precision for Parquet output? Like, would defining a schema with specific precision for floating-point columns help?

0 replies

jqnatividad · 2025-03-31T19:04:47Z

jqnatividad
Mar 31, 2025
Maintainer

Your idea to define a schema beforehand may very well work @brian-mendicino

Give it a try! Just be sure you're doing it on the latest release - v3.3.0.

0 replies

rickhg12hs · 2025-03-31T19:19:59Z

rickhg12hs
Mar 31, 2025

ROUND might be useful.

$ cat table.csv 
constant,value
e,2.71828182845904523536
pi,3.14159265358979323844
phi,1.61803398874989484820

$ qsv table table.csv 
constant  value
e         2.71828182845904523536
pi        3.14159265358979323844
phi       1.61803398874989484820

$ qsv sqlp table.csv 'select constant, ROUND(value, 10) from table' --format parquet --output table.parquet --quiet && qsv sqlp SKIP_INPUT "select * from read_parquet('table.parquet')" --quiet | qsv table
constant  value
e         2.7182818285
pi        3.1415926536
phi       1.6180339887

$ qsv --version
qsv 3.3.0-mimalloc-apply;fetch;foreach;geocode;Luau 0.663;prompt;python-3.12.9 (main, Feb  4 2025, 00:00:00) [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)];to;polars-0.46.0:py-1.26.0;self_update-4-4;6.12 GiB-17.02 GiB-3.13 GiB-7.65 GiB (x86_64-unknown-linux-gnu compiled with Rust 1.85) compiled

3 replies

brian-mendicino Apr 2, 2025
Author

@rickhg12hs It appears this maxes out at rounding to 16 decimal places....
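
A ceiling around 16 digits is what one would expect if the column is held as a 64-bit float. The Python sketch below is only an illustration of float64 behaviour in general, using the values from table.csv above; it does not call qsv or Polars, and how sqlp rounds internally is not something this thread confirms.

```python
import sys
from decimal import Decimal

# Values copied from table.csv in the post above.
rows = {
    "e": "2.71828182845904523536",
    "pi": "3.14159265358979323844",
    "phi": "1.61803398874989484820",
}

# Number of decimal digits a float64 is guaranteed to round-trip.
print("guaranteed digits:", sys.float_info.dig)

for name, text in rows.items():
    as_float = float(text)      # what a Float64 column can store
    exact = Decimal(as_float)   # exact value of that binary float
    # Asking for more places than the float holds cannot recover the lost digits.
    print(name, round(as_float, 10), round(as_float, 20), exact)
```

Any digits past that limit in the printed exact value come from the binary representation, not from the CSV.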

rickhg12hs Apr 2, 2025

Yeah, I'm not sure where in the data pipeline number representation is made concrete.

This example is also a bit curious: the numbers are changed, probably through some sort of qsv/pola.rs representation "hallucination" (only the first 15 digits after the decimal point are correct in the output of qsv sqlp ... without an edited cache-schema). Using "Decimal" in cache-schema helps here, up to its apparent limits ([38, 36]).

$ BC_LINE_LENGTH=120 bc -l <<< 'scale=64;print "constant,value\ne,", e(1), "\npi,", a(1.0)*4.0, "\nphi,", (1.0+sqrt(5.0))/2.0, "\n"' > mc_table.csv

$ cat mc_table.csv 
constant,value
e,2.7182818284590452353602874713526624977572470936999595749669676277
pi,3.1415926535897932384626433832795028841971693993751058209749445920
phi,1.6180339887498948482045868343656381177203091798057628621354486227

$ qsv table mc_table.csv 
constant  value
e         2.7182818284590452353602874713526624977572470936999595749669676277
pi        3.1415926535897932384626433832795028841971693993751058209749445920
phi       1.6180339887498948482045868343656381177203091798057628621354486227

$ qsv sqlp mc_table.csv 'SELECT constant, value::decimal(38,36) FROM mc_table' --quiet | qsv table
constant  value
e         2.718281828459045182028326875131019264
pi        3.141592653589793506553605733748834304
phi       1.618033988749895000290740032944209920

$ cat mc_table.pschema.json 
{
  "fields": {
    "constant": "String",
    "value": { "Decimal": [38, 36] }
  }
}

$ qsv sqlp mc_table.csv 'SELECT constant, value FROM mc_table' --quiet --cache-schema | qsv table
constant  value
e         2.718281828459045235360287471352662497
pi        3.141592653589793238462643383279502884
phi       1.618033988749894848204586834365638117
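
One possible explanation for the difference between the two outputs above is that the cast path parses the text as a float64 before converting to a decimal, while the schema override parses the text as a decimal directly. That is a guess, not something confirmed here. The Python sketch below only mimics the two paths with the standard library; `via_float` and `via_text` are hypothetical helper names, not qsv or Polars functions.

```python
from fractions import Fraction

# Value of e copied from mc_table.csv above.
text = "2.7182818284590452353602874713526624977572470936999595749669676277"
SCALE = 36  # as in decimal(38, 36)

def via_float(s: str, scale: int) -> int:
    """Lossy path: parse as float64 first, then scale up to an integer."""
    return int(float(s) * 10**scale)

def via_text(s: str, scale: int) -> int:
    """Exact path: parse the decimal text exactly, then truncate to `scale` places."""
    return int(Fraction(s) * 10**scale)

lossy = via_float(text, SCALE)
exact = via_text(text, SCALE)

print("via float:", lossy)
print("via text: ", exact)
# The leading digits agree; everything after them differs on the float path.
print("same first 15 digits:", str(lossy)[:15] == str(exact)[:15])
```

On the float path, only the leading digits can be trusted, which matches the observation that roughly the first 15 digits after the decimal point are correct.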

jqnatividad Apr 3, 2025
Maintainer

@rickhg12hs @brian-mendicino , you may want to weigh in on this polars issue as this is really all being done in the polars engine.

pola-rs/polars#19784

brian-mendicino · 2025-03-31T19:33:51Z

brian-mendicino
Mar 31, 2025
Author

I agree that using ROUND(value, 10) in the SQL query is a viable workaround. However, it takes extra work when dealing with an arbitrary number of CSV files, dynamic headers, and varying decimal lengths, so it is not ideal.
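
For the dynamic-header case, the query could be generated rather than written by hand. The following Python sketch is one way to do that, under several assumptions: `build_round_query` is a hypothetical helper, the float detection is a naive heuristic, and it assumes the SQL dialect accepts double-quoted column names. The resulting string would be passed to qsv sqlp as the query argument.

```python
import csv
import io

def build_round_query(csv_text: str, table: str, places: int = 10) -> str:
    """Return a SELECT that wraps every float-looking column in ROUND()."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)

    def looks_float(idx: int) -> bool:
        # Heuristic: every non-empty cell parses as a float and
        # at least one cell contains a decimal point.
        cells = [r[idx] for r in rows if idx < len(r) and r[idx] != ""]
        if not cells or not any("." in c for c in cells):
            return False
        try:
            for c in cells:
                float(c)
        except ValueError:
            return False
        return True

    parts = []
    for i, name in enumerate(header):
        quoted = '"' + name.replace('"', '""') + '"'
        if looks_float(i):
            parts.append(f"ROUND({quoted}, {places}) AS {quoted}")
        else:
            parts.append(quoted)
    return f"SELECT {', '.join(parts)} FROM {table}"

sample = "constant,value\ne,2.71828182845904523536\npi,3.14159265358979323844\n"
print(build_round_query(sample, "table"))
```

This still rounds through a float, so it does not help beyond the float64 limit.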

Regarding the schema, I attempted to use a basic schema like the following:

{
  "fields": {
    "constant": "String",
    "value": "Float64"
  }
}

However, this did not resolve the issue because it does not allow specifying the precision or scale for the Float64 type. (version 3.1.1, debian glibc still on 2.36)

Another option may be to coerce decimal fields into strings.

18 replies

jqnatividad Apr 1, 2025
Maintainer

@brian-mendicino

I enabled Polars support for the decimal data type, so you can now override the generated schema file to explicitly set precision and scale.

https://github.com/dathere/qsv/pull/2646/files

For example, instead of Float64, set it to a Decimal with precision 16 and scale 10:

{
  "fields": {
    "constant": "String",
    "value": {"Decimal" : [16, 10]}
  }
}
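
As a reminder of what the two numbers mean: precision is the total number of digits and scale is how many of them sit after the decimal point, so Decimal [16, 10] leaves 6 digits for the integer part. The Python sketch below checks whether a value would keep all its digits under a given precision and scale; `fits` is a hypothetical helper for illustration and says nothing about how Polars handles values that do not fit.

```python
from decimal import Decimal

def fits(value: str, precision: int, scale: int) -> bool:
    """True if `value` keeps all of its digits in Decimal(precision, scale)."""
    _sign, digits, exponent = Decimal(value).as_tuple()
    frac_digits = max(-exponent, 0)              # digits after the point
    int_digits = max(len(digits) + exponent, 0)  # digits before the point
    return frac_digits <= scale and int_digits <= precision - scale

print(fits("123456.0123456789", 16, 10))       # 6 integer + 10 fractional digits
print(fits("2.71828182845904523536", 16, 10))  # 20 fractional digits
print(fits("1234567.5", 16, 10))               # 7 integer digits
```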

Answer selected by jqnatividad

brian-mendicino Apr 1, 2025
Author

This looks great. I assume the fix cannot be tested until the dependency is updated? Looking forward to trying it out!

rickhg12hs Apr 1, 2025

Related?

...

failures:

---- test_sqlp::sqlp_boston311_sql_cache_schema_decimal_override stdout ----

thread 'test_sqlp::sqlp_boston311_sql_cache_schema_decimal_override' panicked at tests/test_sqlp.rs:1178:5:
assertion failed: `(left == right)`'
  left: `"{\"latitude\":42.3594,\"longitude\":-71.0587}\n{\"latitude\":42.3634,\"longitude\":-71.0566}\n{\"latitude\":42.2884,\"longitude\":-71.133}\n{\"latitude\":42.3401,\"longitude\":-71.0803}\n{\"latitude\":42.3735,\"longitude\":-..."` (truncated)
 right: `"{\"latitude\":\"42.359\",\"longitude\":\"-71.058700\"}\n{\"latitude\":\"42.363\",\"longitude\":\"-71.056600\"}\n{\"latitude\":\"42.288\",\"longitude\":\"-71.133000\"}\n{\"latitude\":\"42.340\",\"longitude\":\"-71.080300\"}\n{\"latitude\":..."` (truncated)

Differences (-left|+right):
-{"latitude":42.3594,"longitude":-71.0587}
-{"latitude":42.3634,"longitude":-71.0566}
-{"latitude":42.2884,"longitude":-71.133}
-{"latitude":42.3401,"longitude":-71.0803}
-{"latitude":42.3735,"longitude":-71.0599}
-{"latitude":42.3594,"longitude":-71.07}
-{"latitude":42.3118,"longitude":-71.1152}
-{"latitude":42.3606,"longitude":-71.0638}
-{"latitude":42.3596,"longitude":-71.0634}
-{"latitude":42.3494,"longitude":-71.0811}
+{"latitude":"42.359","longitude":"-71.058700"}
+{"latitude":"42.363","longitude":"-71.056600"}
+{"latitude":"42.288","longitude":"-71.133000"}
+{"latitude":"42.340","longitude":"-71.080300"}
+{"latitude":"42.373","longitude":"-71.059900"}
+{"latitude":"42.359","longitude":"-71.070000"}
+{"latitude":"42.311","longitude":"-71.115200"}
+{"latitude":"42.360","longitude":"-71.063800"}
+{"latitude":"42.359","longitude":"-71.063400"}
+{"latitude":"42.349","longitude":"-71.081100"}


note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    test_sqlp::sqlp_boston311_sql_cache_schema_decimal_override

test result: FAILED. 1805 passed; 1 failed; 9 ignored; 0 measured; 0 filtered out; finished in 462.74s

error: test failed, to rerun pass `--test tests`

jqnatividad Apr 1, 2025
Maintainer

@rickhg12hs , I gather you're compiling from source...

What platform are you running this on? Doing compares with floats is always iffy...

rickhg12hs Apr 1, 2025

Yes, I'm compiling from source ...

$ LESS+=R git log -1
commit b35aec7c04b59d02576d53086d33ffd5f104f8f3 (HEAD -> master, origin/master, origin/HEAD)
Author: Joel Natividad <[email protected]>
Date:   Tue Apr 1 08:04:13 2025 -0400

    docs: `sqlp` expanded `sqlp` description in usage text
    
    [skip ci]

$ qsv --version
qsv 3.3.0-mimalloc-apply;fetch;foreach;geocode;Luau 0.663;prompt;python-3.12.9 (main, Feb  4 2025, 00:00:00) [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)];to;polars-0.46.0:py-1.26.0;self_update-4-4;6.12 GiB-16.75 GiB-3.21 GiB-7.65 GiB (x86_64-unknown-linux-gnu compiled with Rust 1.85) compiled

$ uname -a
Linux steely.steelersnet 6.13.8-100.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Mar 23 05:06:02 UTC 2025 x86_64 GNU/Linux

$ cat /etc/redhat-release 
Fedora release 40 (Forty)

brian-mendicino · 2025-04-01T20:54:48Z

brian-mendicino
Apr 1, 2025
Author

@jqnatividad I was able to test with the following setup and observed different results than expected.

Compiled from source

$ cargo build --release --target x86_64-unknown-linux-musl --locked --bin qsv --features feature_capable,polars

$ qsv --version
qsv 3.3.0-mimalloc-polars-0.46.0:py-1.26.0;-12-12;12.48 GiB-2.00 GiB-12.40 GiB-15.60 GiB (x86_64-unknown-linux-musl compiled with Rust 1.85)

Command

$ qsv sqlp table.csv 'select * from table' --cache-schema --format parquet --output table.parquet

Input

$ cat table.pschema.json
{
  "fields": {
    "constant": "String",
    "value": { "Decimal": [16, 10] }
  }
}

$ cat table.csv
constant,value
e,2.71828182845904523536
pi,3.14159265358979323844
phi,1.61803398874989484820

Actual Output

table.parquet
{"constant":"e","value":"\"27182818284\""}
{"constant":"pi","value":"\"31415926535\""}
{"constant":"phi","value":"\"16180339887\""}
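
One observation about these numbers: each quoted string equals the input value truncated to 10 decimal places and multiplied by 10^10, which is the unscaled integer a Decimal with scale 10 would store. Whether that is actually what happened inside the writer or the reader is not confirmed in this thread. The Python sketch below only reproduces the arithmetic; `unscaled` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_DOWN

SCALE = 10  # from {"Decimal": [16, 10]} in table.pschema.json

def unscaled(value: str, scale: int) -> int:
    """Truncate `value` to `scale` decimal places and drop the decimal point."""
    step = Decimal(1).scaleb(-scale)  # 1E-10 for scale 10
    truncated = Decimal(value).quantize(step, rounding=ROUND_DOWN)
    return int(truncated.scaleb(scale))

# Values copied from table.csv above.
for name, text in [
    ("e", "2.71828182845904523536"),
    ("pi", "3.14159265358979323844"),
    ("phi", "1.61803398874989484820"),
]:
    print(name, unscaled(text, SCALE))
```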

9 replies

jqnatividad Apr 7, 2025
Maintainer

With #2647 merged @brian-mendicino , qsv can now work with parquet, ipc/arrow, json/jsonl and gz/zst/zlib compressed files...

qsv lens table.parquet
qsv table table.csv.gz
qsv slice --start 10 --len 20 table.jsonl

brian-mendicino Apr 8, 2025
Author

Quick question - does qsv also handle large decimal values (>16 decimal places) correctly across these formats? For example, when handling high-precision financial data, this capability would be very helpful.

jqnatividad Apr 8, 2025
Maintainer

Commands that use the polars engine should be able to handle high precision values as we enabled polars' dtype-decimal feature and these extended file formats leverage the polars engine.

Right now, the sqlp --cache-schema option leverages the stats cache to infer the column data types. However, stats uses Rust's f64 type, so we're limited to ~16 decimal places for floats.

Still, you can always manually edit the .pschema.json file to change the Float64 types to Decimal, and sqlp and joinp should then be able to handle it via polars (as we do here: #2644 (reply in thread)).

Give it a try and let me know.
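
If there are many files, the manual edit could be scripted. The Python sketch below rewrites every Float64 field to a Decimal, assuming the .pschema.json layout shown earlier in this thread (a top-level "fields" object mapping column names to type names). `override_floats` is a hypothetical helper, and the precision and scale passed in are only examples.

```python
import json

def override_floats(schema_text: str, precision: int, scale: int) -> str:
    """Replace every "Float64" field type with {"Decimal": [precision, scale]}."""
    schema = json.loads(schema_text)
    for name, dtype in schema["fields"].items():
        if dtype == "Float64":
            schema["fields"][name] = {"Decimal": [precision, scale]}
    return json.dumps(schema, indent=2)

# Schema text in the same shape as the generated file shown earlier.
generated = '{"fields": {"constant": "String", "value": "Float64"}}'
print(override_floats(generated, 38, 36))
```

The output would then be written back over the generated schema file before running sqlp with --cache-schema.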

jqnatividad Apr 11, 2025
Maintainer

@brian-mendicino , our discussion inspired me to add this :)
https://github.com/dathere/qsv/pull/2678/files

brian-mendicino Apr 17, 2025
Author

Thank you for the Decimal schema fix!
The solution is working well, though I have some reservations about how Parquet Decimal inherently works. To prevent rounding I have to set an arbitrarily large scale, which adds trailing zeros to rows with fewer decimal places and increases the Parquet file size and potentially the database size. Despite this drawback, it's solving our problem effectively.

Thanks again for your help with this solution!
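
The trailing-zero effect follows from a decimal column having one fixed scale for every row. The Python sketch below illustrates it with the standard decimal module, using made-up short values next to one value from this thread; it picks the smallest scale that keeps every value exact and shows how shorter values get padded. It is an illustration of fixed-scale decimals in general, not of the Parquet encoding itself.

```python
from decimal import Decimal

# Two short hypothetical values plus one long value from table.csv.
values = ["1.5", "0.125", "2.71828182845904523536"]

# Smallest scale that loses nothing: the longest fractional part in the column.
needed_scale = max(-Decimal(v).as_tuple().exponent for v in values)

# Every row is stored at that one scale, so shorter values gain trailing zeros.
step = Decimal(1).scaleb(-needed_scale)
padded = [str(Decimal(v).quantize(step)) for v in values]

print("scale:", needed_scale)
for before, after in zip(values, padded):
    print(before, "->", after)
```

Scanning the data for the longest fractional part first, as above, is one way to avoid choosing an arbitrarily large scale.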

ondohotola · 2025-04-08T12:07:15Z

ondohotola
Apr 8, 2025

What would prevent you from creating a suitable test file and finding out?

0 replies
