Implement native ESRI reader #25241

Open
wants to merge 1 commit into
base: master

Contributor

@ljw9111 ljw9111 commented Mar 6, 2025

Description

This PR implements a native ESRI reader for Esri JSON, which can be used for geospatial queries. (NOTE: this port supports only the UTC time zone.)

Users can now run geospatial queries on tables that use the ESRI SerDe.

DDL example

CREATE external TABLE earthquakes
(
 earthquake_date string,
 latitude double,
 longitude double,
 depth double,
 magnitude double,
 magtype string,
 mbstations string,
 gap string,
 distance string,
 rms string,
 source string,
 eventid string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 's3://amzn-s3-demo-bucket/my-query-log/csv/';

CREATE external TABLE IF NOT EXISTS counties
 (
 Name string,
 BoundaryShape binary
 )
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://amzn-s3-demo-bucket/my-query-log/json/';

Example data is from https://docs.aws.amazon.com/athena/latest/ug/geospatial-example-queries.html

DML example (note: the test table data is about 8 MB, which exceeds the maximum page size of 1 MB, so reading it required multiple pages)

trino:esri> SELECT c.name,
         ->         COUNT(*) cnt
         -> FROM esri.counties as c
         -> CROSS JOIN esri.earthquakes
         -> WHERE ST_CONTAINS (geometry_from_hadoop_shape(c.boundaryshape), ST_POINT(earthquakes.longitude, earthquakes.latitude))
         -> GROUP BY  c.name
         -> ORDER BY  cnt DESC;
      name       | cnt 
-----------------+-----
 Kern            | 288 
 San Bernardino  | 280 
 Imperial        | 224 
 Inyo            | 160 
 Los Angeles     | 144 
 Monterey        | 112 
 Riverside       | 112 
 Santa Clara     |  96 
 Fresno          |  88 
 San Benito      |  88 
 San Diego       |  56 
 Santa Cruz      |  40 
 San Luis Obispo |  24 
 Ventura         |  24 
 Orange          |  16 
 San Mateo       |   8 
(16 rows)

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( O ) Release notes are required, with the following suggested text:

## Section
* Add support for geospatial queries on tables that use the ESRI JSON SerDe. Only the UTC time zone is supported, for consistency. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Mar 6, 2025
@github-actions github-actions bot added the hive Hive connector label Mar 6, 2025
@ljw9111 ljw9111 force-pushed the native-esri-reader branch 3 times, most recently from 1361efe to 7f29bc9 Compare March 7, 2025 00:31
@ljw9111 ljw9111 self-assigned this Mar 7, 2025
@ljw9111 ljw9111 force-pushed the native-esri-reader branch 2 times, most recently from 5b41a78 to 16784ed Compare March 10, 2025 17:52
@@ -17,6 +17,11 @@
</properties>

<dependencies>
<dependency>
Contributor

One thing that stands out is that the library https://github.com/Esri/geometry-api-java has not received any updates for almost a year, and its latest release was over 4 years ago.

Contributor

From the community's point of view, what would you suggest?

Member

@findinpath we already ship esri so this doesn't change anything.

break;
}
catch (ParseException e) {
}
Contributor

intended?

Contributor Author

Yeah, this is intended; we return null for unsupported timestamp formats.
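For readers following this thread, the "try each format, swallow the parse failure, return null" behavior described above might look roughly like the sketch below. It is only an illustration: the class name, method name, and pattern list are hypothetical, and it uses `java.time` (so the exception is `DateTimeParseException`) rather than whatever parser the PR actually uses.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

final class LenientTimestampParser
{
    // Hypothetical pattern list; only part of the PR's real list is visible in the diff.
    private static final List<DateTimeFormatter> FORMATTERS = List.of(
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"),
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm"));

    private LenientTimestampParser() {}

    // Returns epoch milliseconds interpreted in UTC, or null when no pattern matches.
    static Long parseEpochMillisOrNull(String value)
    {
        for (DateTimeFormatter formatter : FORMATTERS) {
            try {
                return LocalDateTime.parse(value, formatter)
                        .toInstant(ZoneOffset.UTC)
                        .toEpochMilli();
            }
            catch (DateTimeParseException ignored) {
                // Deliberately ignored: fall through and try the next pattern.
            }
        }
        // No pattern matched: unsupported formats yield null instead of an error.
        return null;
    }
}
```

A short comment inside the empty catch block, as in the sketch, would also make the intent clear to the next reader without a review round trip.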

Contributor

@findinpath findinpath left a comment

After skimming the code and trying TestEsri I definitely understand the purpose of this contribution.

The referenced library geometry-api-java no longer seems to be actively maintained.

It would be useful to have a test reading all types.

However, before making any further changes, I think it is worth asking the maintainers @wendigo and @dain whether this contribution is, from a functional perspective, a good fit to be included in the Trino project code.

@ljw9111 ljw9111 force-pushed the native-esri-reader branch from 16784ed to f02ca1b Compare March 10, 2025 19:44
@ljw9111 ljw9111 force-pushed the native-esri-reader branch 3 times, most recently from 06d5677 to d28634f Compare March 17, 2025 14:00
@ljw9111 ljw9111 force-pushed the native-esri-reader branch 2 times, most recently from 8de204a to 87d9d72 Compare March 17, 2025 19:12
@ljw9111 ljw9111 force-pushed the native-esri-reader branch from 87d9d72 to 2a28c44 Compare March 19, 2025 18:30
@ljw9111 ljw9111 force-pushed the native-esri-reader branch from 2a28c44 to a7df4a8 Compare March 21, 2025 16:20
Member

@dain dain left a comment

I spent a good while reviewing this. Overall I think the approach is sound, but the code is missing defenses against bad data files. In this code we should strive to be bug-for-bug compatible with Hive, and this includes handling of "bad" files, because users often rely on these undocumented behaviors.

Additionally, Jackson has some unexpected behaviors when recursing into nested structures, and this code falls into that trap. Specifically, the code isn't properly skipping nested data, which can result in unexpected processing inside objects (I had to learn this the hard way a couple of years back). In general, I used (copied) the framework laid out in the JSON reader, which handles these issues.

Instead of adding a lot of mundane comments, I applied the changes directly to the code, which you can find in this commit: dain@8b95a73

Finally, the tests seem to be missing cases for some of the supported attribute types. I see this when running the tests with coverage.

DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm").withZone(UTC_ZONE),
DATE_FORMATTER);

private final int numColumns;
Member

Trino avoids abbreviations where possible, so in this case I would call this columnCount. That said, I don't think the field is necessary; you can simply use columnNames.size().

geometryColumn = i;
}
}
this.geometryColumn = geometryColumn;
Member

Throw an exception if no geometry column is specified?
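One way the suggested check could look is sketched below. The class and method names are hypothetical, and the detection rule (treat the `binary` column as the geometry column, last match wins, as in the loop shown in the diff) is an assumption made for illustration; the PR may identify the column differently.

```java
import java.util.List;

final class GeometryColumnLocator
{
    private GeometryColumnLocator() {}

    // Assumption: the geometry column is the one declared as "binary",
    // matching the BoundaryShape column in the DDL example.
    static int findGeometryColumn(List<String> columnTypes)
    {
        int geometryColumn = -1;
        for (int i = 0; i < columnTypes.size(); i++) {
            if ("binary".equalsIgnoreCase(columnTypes.get(i))) {
                geometryColumn = i;
            }
        }
        if (geometryColumn < 0) {
            // Fail fast at construction time instead of carrying a -1 index around.
            throw new IllegalArgumentException("No geometry column found in table schema");
        }
        return geometryColumn;
    }
}
```

Whether failing is correct depends on what Hive does for a table without a geometry column; if Hive tolerates it, skipping the geometry silently may be the compatible choice.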

this.geometryColumn = geometryColumn;
}

private ImmutableMap<String, Integer> createColumnNameToIndexMap()
Member

In Trino, return types, function arguments, and class fields should use the generic collection type (e.g. Map), not the implementation type (ImmutableMap).

Member

Also, I'd just inline this method
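As a rough illustration of the two comments above, the sketch below declares the interface type `Map` in the signature and keeps the immutable implementation as an internal detail. To stay dependency-free it uses the JDK's `Map.copyOf` where the PR would use Guava's `ImmutableMap`; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnIndex
{
    private ColumnIndex() {}

    // The signature exposes Map; callers do not need to know which
    // immutable implementation backs it.
    static Map<String, Integer> columnNameToIndex(List<String> columnNames)
    {
        Map<String, Integer> index = new LinkedHashMap<>();
        // columnNames.size() is used directly, so no separate count field is needed.
        for (int i = 0; i < columnNames.size(); i++) {
            index.put(columnNames.get(i), i);
        }
        // Immutable snapshot; stands in for ImmutableMap.copyOf in this sketch.
        return Map.copyOf(index);
    }
}
```

If the method is inlined into the constructor as suggested, the same loop can simply assign to a `Map<String, Integer>` field.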

public final class EsriDeserializer
{
private static final VarHandle INT_HANDLE_BIG_ENDIAN = MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);
private static final ZoneId UTC_ZONE = ZoneId.of("UTC");
Member

Java has a built-in constant for UTC, ZoneOffset.UTC; static import the field when using it.
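A minimal sketch of that suggestion, reusing the "yyyy-MM-dd HH:mm" pattern visible in the diff. The class, field, and method names are hypothetical and not taken from the PR.

```java
import static java.time.ZoneOffset.UTC;

import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

final class UtcFormatterExample
{
    // The statically imported UTC constant replaces a hand-rolled ZoneId.of("UTC") field.
    private static final DateTimeFormatter MINUTE_FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm").withZone(UTC);

    private UtcFormatterExample() {}

    // Parses text without zone information; the formatter's override zone (UTC) is applied.
    static long toEpochSecond(String value)
    {
        return ZonedDateTime.parse(value, MINUTE_FORMATTER).toEpochSecond();
    }
}
```

Since `ZoneOffset` extends `ZoneId`, the constant can be passed anywhere the existing `UTC_ZONE` field is used.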

continue;
}

if (GEOMETRY_FIELD_NAME.equals(parser.currentName())) {
Member

What happens if there is a field not named "geometry" or "attributes"? Is it an error, or should it be skipped?

Member

Decoding a document like this will fail:

        String json = """
                {
                    "extra-junk": {
                        "geometry": null,
                        "attributes": {
                            "id": 42
                        }
                    },
                    "attributes": {
                        "id": 1
                    },
                    "geometry": {
                        "x": 10,
                        "y": 20
                    }
                }
                """;

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino Slack.

@github-actions github-actions bot added the stale label Apr 21, 2025
5 participants