Skip to content

feat(c/driver/postgresql): optionally return NUMERIC columns as double #1513

Open
@lupko

Description

@lupko

Hi all,

I am currently evaluating/trying out ABDC (Python) in the following context:

  • Have a Flight RPC data service - let's call it 'connector' - that is responsible for handling arbitrary SQL queries to PostgreSQL and producing results as stream of Arrow data
  • The result provided by 'connector' is then used by other services that do additional 'post-SQL' processing; the result of 'connector' is further 'massaged' using different algorithms (typically aggregations or ML algorithms).
  • The PostgreSQL data stored in NUMERIC columns typically represents some measurable value - which is used in the different algorithms applied by the other services.
  • It is a hard constraint that the results produced by the 'connector' can be used as-is in the subsequent steps.

In this context, the current ADBC PostgreSQL behavior where all NUMERIC columns are converted to string is not ideal. I understand the motivation - ADBC does not want to do lossy conversion by default and string is the only safe bet at the moment; I need to find a way out though :)

I was entertaining the idea that the 'connector' talking to PostgreSQL would perform the conversion on the fly so that it produces data where NUMERICs are in the end doubles (e.g. pipe the record batches produced by ADBC via Acero to perform str->double conversion on the way out of the service).

The tricky part here is that i'm unable to make this work in a simple deterministic way with just the SQL 'in hand': ADBC result gives no indication (or I was unable to find it? please let me know) that a string column in the result was in fact created from NUMERIC data.


The ideal solution in my context would be for ADBC PostgreSQL to return NUMERIC columns as double. I read through the issues and found that you have been considering this as optional behavior.

I dug into this a little bit and found in PostgreSQL source that it does numeric -> double conversion by calling strtod on the string representation (e.g. the algorithm that is already ported to ADBC's PostgreSQL driver). So if ADBC was to follow the suit, it could reuse the existing numeric -> string conversion and then do strtod on top of it - it could be quite simple.

For reference, here is a working prototype: lupko@6d52952

  • refactored reader so that the numeric -> str code & friends from PostgreSQL are in common base class
  • there is one class for Numeric to String; as before, the code populates Arrow buffers directly
  • there is another class for Numeric to Double; string representation is written into temporary location from where code does strtod and populates Arrow buffer with the double value

What is missing in the prototype is some logic that decides whether to produce string or double - code in prototype always produces double.

If you find this general direction acceptable, I could take it further and complete it. I just need a little extra guidance about making this optional. At what level should the setting be available? What should be the name? What is the best way to 'tunnel' the option down to the code that decides that Arrow type to use? (e.g. PostgresType::SetSchema seems like a good place - a simple conditional here would be enough; no need to add extra conditions into adbcpq::MakeCopyFieldReader)

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions