description: Glue Data Connector Documentation

Glue Data Connector

The Glue Data Connector enables federated SQL querying on tables in an AWS Glue Data Catalog.

datasets:
  - from: glue:tpch.lineitem
    name: lineitem
    params:
      glue_region: us-east-1
      glue_key: ${env:SPICE_AWS_KEY}
      glue_secret: ${env:SPICE_AWS_SECRET}

Configuration

from

Specify a table using the format glue:<database>.<table>, where <database> is the name of the Glue database and <table> is the name of the table within that database.

name

The dataset name. This will be used as the table name within Spice.

Example:

SELECT COUNT(*) FROM lineitem;
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+

params

The following parameters are supported for configuring the connection to the Glue Data Catalog:

| Parameter Name | Definition |
| --- | --- |
| glue_region | The AWS region for the Glue Data Catalog, e.g. us-west-2. |
| glue_key | Access key (e.g. AWS_ACCESS_KEY_ID for AWS). |
| glue_secret | Secret key (e.g. AWS_SECRET_ACCESS_KEY for AWS). |
| glue_session_token | Session token (e.g. AWS_SESSION_TOKEN for AWS), used with temporary credentials. |
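
As an illustration of the parameters above, the following is a minimal sketch of a dataset that uses temporary credentials. It reuses the dataset from the first example and the ${env:...} secret syntax shown there. The environment variable name SPICE_AWS_SESSION_TOKEN is an assumption chosen for this example, not a name the connector requires.

```yaml
datasets:
  - from: glue:tpch.lineitem
    name: lineitem
    params:
      glue_region: us-east-1
      glue_key: ${env:SPICE_AWS_KEY}
      glue_secret: ${env:SPICE_AWS_SECRET}
      # Hypothetical variable name; only needed for temporary credentials.
      glue_session_token: ${env:SPICE_AWS_SESSION_TOKEN}
```

With long-lived credentials, glue_session_token can be left out, as in the first example.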

Authentication

The minimum IAM policy for Glue access is:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Limitations

{% hint style="warning" %}

Data Source/Data Format Restrictions

This data connector supports only tables that use the S3 data source. Kinesis and Kafka data sources are not currently supported. In addition, only Iceberg tables and tables stored in Parquet or CSV format are currently supported.

{% endhint %}

{% hint style="warning" %}

Performance Considerations

When using the Glue Data Connector without acceleration, data is loaded into memory during query execution. Ensure sufficient memory is available, including overhead for queries and the runtime, especially with concurrent queries.

Memory limitations can be mitigated by storing acceleration data on disk. The duckdb and sqlite accelerators support this when mode: file is specified.

Without acceleration, each query retrieves data from the S3 source, which can generate a significant number of network requests and consume considerable bandwidth. This can affect network performance and incur S3 data transfer costs.

{% endhint %}
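
The on-disk acceleration mentioned in the performance note could be configured roughly as follows. This is a sketch under the assumption that the dataset accepts the standard Spice acceleration block (enabled, engine, mode); the engine shown is duckdb, and sqlite is the other accelerator named above. Check the acceleration documentation for the exact options available in your Spice version.

```yaml
datasets:
  - from: glue:tpch.lineitem
    name: lineitem
    params:
      glue_region: us-east-1
      glue_key: ${env:SPICE_AWS_KEY}
      glue_secret: ${env:SPICE_AWS_SECRET}
    acceleration:
      enabled: true
      engine: duckdb # or sqlite
      mode: file # store accelerated data on disk rather than in memory
```

With acceleration enabled, queries are served from the local accelerated copy, which reduces both memory pressure and repeated reads from S3.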