Commit afda3db

Document Interval operations limitation and better Tableau support (#16)
* Document Interval operations limitation
* Enable running Calcs and Staples tests
* Add dialect improvements, fix Staples and Calcs schema
* Add workaround for ISDATE and other functions
1 parent 41a754a commit afda3db

14 files changed

Lines changed: 1140 additions & 98 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ logs
 
 # Python virtual environment
 .venv
+__pycache__
 
 # Build artifacts
 *.taco

README.md

Lines changed: 16 additions & 0 deletions
@@ -44,6 +44,22 @@ make install
 make run-tableau-allow-unsigned
 ```
 
+## Limitations
+
+### Multiplication and Division on Intervals Are Not Supported
+
+Multiplying or dividing an interval by a number is not supported and fails during query planning with a `Cannot coerce arithmetic expression` error. For example:
+
+```sql
+SELECT "orders"."order_date" + "orders"."delivery_days" * INTERVAL '1 DAY'
+```
+
+```text
+Error during planning: Cannot coerce arithmetic expression Int64 * Interval(MonthDayNano) to valid types
+```
+
+This limitation comes from DataFusion's limited support for arithmetic on Interval values, tracked as [apache/datafusion#13850](https://github.com/apache/arrow-datafusion/issues/13850).
+
 ## Development
 
 ### Prerequisites

spice_jdbc/dialect.tdd

Lines changed: 28 additions & 2 deletions
@@ -1,4 +1,30 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <dialect name="SpiceDialect" base="PostgreSQL90Dialect" class="spice_jdbc" version="18.1">
-
-</dialect>
+  <function-map>
+    <!-- Override the default mapping to `VARIANCE`, which is not available in DataFusion -->
+    <function group="aggregate" name="VAR" return-type="real">
+      <!-- https://datafusion.apache.org/user-guide/sql/aggregate_functions.html#var -->
+      <formula>VAR(%1)</formula>
+      <unagg-formula>NULL</unagg-formula>
+      <argument type="real" />
+    </function>
+    <function group='numeric' name='SIGN' return-type='int'>
+      <!-- https://datafusion.apache.org/user-guide/sql/scalar_functions.html#signum -->
+      <formula>CAST(SIGNUM(%1) AS SMALLINT)</formula>
+      <argument type='real' />
+    </function>
+    <!-- Override %1^2, which is not supported -->
+    <function group='numeric' name='SQUARE' return-type='real'>
+      <formula>((%1)*(%1))</formula>
+      <argument type='real' />
+    </function>
+    <function group='numeric' name='SQUARE' return-type='int'>
+      <formula>((%1)*(%1))</formula>
+      <argument type='int' />
+    </function>
+    <function group='date' name='ISDATE' return-type='bool'>
+      <formula>(TRY_CAST(%1 AS DATE) IS NOT NULL)</formula>
+      <argument type='str' />
+    </function>
+  </function-map>
+</dialect>
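
Each `<formula>` in the function-map above uses the positional placeholder `%1` for the function's argument. As a rough illustration of how such a template could turn into SQL, the sketch below parses a trimmed copy of the dialect with Python's standard `xml.etree.ElementTree` and substitutes a column expression for `%1`. The `load_formulas` and `expand_formula` helpers and the `"calcs"."str0"` / `"calcs"."int0"` column references are hypothetical; Tableau's actual expansion logic is not part of this commit and may differ.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the function-map from spice_jdbc/dialect.tdd
# (subset of entries, XML declaration omitted for brevity).
DIALECT_XML = """
<dialect name="SpiceDialect" base="PostgreSQL90Dialect" class="spice_jdbc" version="18.1">
  <function-map>
    <function group='numeric' name='SQUARE' return-type='int'>
      <formula>((%1)*(%1))</formula>
      <argument type='int' />
    </function>
    <function group='date' name='ISDATE' return-type='bool'>
      <formula>(TRY_CAST(%1 AS DATE) IS NOT NULL)</formula>
      <argument type='str' />
    </function>
  </function-map>
</dialect>
"""


def load_formulas(xml_text):
    """Map (function name, argument types) to the formula template (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    formulas = {}
    for fn in root.iter("function"):
        arg_types = tuple(arg.get("type") for arg in fn.findall("argument"))
        formulas[(fn.get("name"), arg_types)] = fn.findtext("formula")
    return formulas


def expand_formula(template, *args):
    """Replace %1, %2, ... with the given SQL expressions (illustration only)."""
    return re.sub(r"%(\d+)", lambda m: args[int(m.group(1)) - 1], template)


formulas = load_formulas(DIALECT_XML)
isdate_sql = expand_formula(formulas[("ISDATE", ("str",))], '"calcs"."str0"')
square_sql = expand_formula(formulas[("SQUARE", ("int",))], '"calcs"."int0"')
print(isdate_sql)
print(square_sql)
```

Keying the lookup on both the name and the argument types mirrors the two `SQUARE` entries above, which differ only in their `real` versus `int` argument.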

tdvt/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -1,5 +1,8 @@
 /test_results_combined.csv
+/test_metadata.csv
 /tdvt_output_combined.json
 /tdvt_log_combined.txt
 /tdvt_actuals_combined.zip
 /tabquery_logs.zip
+/*.twb
+/*.twbr

tdvt/TestV1/Staples_utf8_headers.csv

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
-"Item Count","Ship Priority","Order Priority","Order Status","Order Quantity","Sales Total","Discount","Tax Rate","Ship Mode","Fill Time","Gross Profit","Price","Ship Handle Cost","Employee Name","Employee Dept","Manager Name","Employee Yrs Exp","Employee Salary","Customer Name","Customer State","Call Center Region","Customer Balance","Customer Segment","Prod Type1","Prod Type2","Prod Type3","Prod Type4","Product Name","Product Container","Ship Promo","Supplier Name","Supplier Balance","Supplier Region","Supplier State","Order ID","Order Year","Order Month","Order Day","Order Date","Order Quarter","Product Base Margin","Product ID","Receive Time","Received Date","Ship Date","Ship Charge","Total Cycle Time","Product In Stock","PID","Market Segment"
-1,0,1-URGENT,O,11,766.85,0.07,0.02,DELIVERY TRUCK,1,33.61,70.9800,26.2000,Harold Pretty,1004,"Carpenter, Jan",4,56950.0000,Shui Tom,WASHINGTON,WEST,3652,HOME OFFICE,FURNITURE,BOOKCASES,METAL BOOKCASES,METAL BOOKCASES,"Safco Value Mate Series Steel Bookcases, Baked Enamel Finish on Steel, Gray",JUMBO BOX,REGULAR SHIPPING,Supplier_042,6565,EAST,DELAWARE,4097,2002,5,24,2002-05-24 00:00:00,Q2,0.57,1006,7,2002-06-01 00:00:00,2002-05-25 00:00:00,26.2000,8,YES,49239,HOME OFFICE
+"Item Count","Ship Priority","Order Priority","Order Status","Order Quantity","Sales Total","Discount","Tax Rate","Ship Mode","Fill Time","Gross Profit","Price","Ship Handle Cost","Employee Name","Employee Dept","Manager Name","Employee Yrs Exp","Employee Salary","Customer Name","Customer State","Call Center Region","Customer Balance","Customer Segment","Prod Type1","Prod Type2","Prod Type3","Prod Type4","Product Name","Product Container","Ship Promo","Supplier Name","Supplier Balance","Supplier Region","Supplier State","Order ID","Order Year","Order Month","Order Day","Order Date","Order Quarter","Product Base Margin","Product ID","Receive Time","Received Date","Ship Date","Ship Charge","Total Cycle Time","Product In Stock","PID","Market Segment"
+1,0,1-URGENT,O,11,766.85,0.07,0.02,DELIVERY TRUCK,1,33.61,70.9800,26.2000,Harold Pretty,1004,"Carpenter, Jan",4,56950.0000,Shui Tom,WASHINGTON,WEST,3652.00,HOME OFFICE,FURNITURE,BOOKCASES,METAL BOOKCASES,METAL BOOKCASES,"Safco Value Mate Series Steel Bookcases, Baked Enamel Finish on Steel, Gray",JUMBO BOX,REGULAR SHIPPING,Supplier_042,6565,EAST,DELAWARE,4097,2002,5,24,2002-05-24 00:00:00,Q2,0.57,1006,7,2002-06-01 00:00:00,2002-05-25 00:00:00,26.2000,8,YES,49239,HOME OFFICE
 1,0,1-URGENT,O,21,76.29,0,0.05,REGULAR AIR,2,-45.65,3.2800,3.9700,Harold Pretty,1004,"Carpenter, Jan",4,56950.0000,Shui Tom,WASHINGTON,WEST,3652,HOME OFFICE,OFFICE SUPPLIES,PENS & ART SUPPLIES,ART SUPPLIES,COLORED PENS,Newell 342,WRAP BAG,REGULAR SHIPPING,Supplier_071,8180,WEST,WASHINGTON,4097,2002,5,24,2002-05-24 00:00:00,Q2,0.56,342,4,2002-05-30 00:00:00,2002-05-26 00:00:00,3.9700,6,YES,49240,HOME OFFICE
 1,0,1-URGENT,O,37,758.02,0.07,0.05,REGULAR AIR,1,431.2,20.9800,1.4900,Harold Pretty,1004,"Carpenter, Jan",4,56950.0000,Shui Tom,WASHINGTON,WEST,3652,HOME OFFICE,OFFICE SUPPLIES,BINDERS AND BINDER ACCESSORIES,ROUND RING BINDERS,ROUND RING BINDERS,Avery Legal 4-Ring Binder,SMALL BOX,FREE SHIPPING,Supplier_068,5119,WEST,CALIFORNIA,4097,2002,5,24,2002-05-24 00:00:00,Q2,0.35,1587,2,2002-05-27 00:00:00,2002-05-25 00:00:00,0.0000,3,YES,49241,HOME OFFICE
 1,0,1-URGENT,O,25,407.75,0,0.03,REGULAR AIR,1,-82.84,15.4200,10.6800,Leslie Monsalve-Jones,1007,"Zingarella, Rosie",11,59850.0000,David Wiener,COLORADO,WEST,4606,HOME OFFICE,OFFICE SUPPLIES,STORAGE & ORGANIZATION,PORTABLE STORAGE,PORTABLE STORAGE,"Decoflex Hanging Personal Folder File, Blue",SMALL BOX,REGULAR SHIPPING,Supplier_080,-40,WEST,CALIFORNIA,33856,2002,5,25,2002-05-25 00:00:00,Q2,0.58,395,3,2002-05-29 00:00:00,2002-05-26 00:00:00,10.6800,4,YES,49242,HOME OFFICE

tdvt/TestV1/arrow_utils.py

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+import pyarrow as pa
+import pyarrow.csv as csv
+import pyarrow.parquet as pq
+
+def read_csv_with_schema(csv_path, schema, skip_header=True, delimiter=",", quote_char='"'):
+    """
+    Read a CSV file using a specified PyArrow schema.
+
+    Args:
+        csv_path (str): Path to the CSV file
+        schema (pa.Schema): PyArrow schema to apply
+        skip_header (bool): Whether to skip the header row
+        delimiter (str): CSV delimiter character
+        quote_char (str): CSV quote character
+
+    Returns:
+        pa.Table: PyArrow table with the specified schema
+    """
+    read_options = csv.ReadOptions(
+        skip_rows=1 if skip_header else 0,
+        column_names=schema.names
+    )
+
+    parse_options = csv.ParseOptions(
+        delimiter=delimiter,
+        quote_char=quote_char
+    )
+
+    convert_options = csv.ConvertOptions(
+        column_types={field.name: field.type for field in schema},
+        strings_can_be_null=True,
+        auto_dict_encode=True,
+        timestamp_parsers=["%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%H:%M:%S"]
+    )
+
+    return csv.read_csv(
+        csv_path,
+        read_options=read_options,
+        parse_options=parse_options,
+        convert_options=convert_options
+    )
+
+def write_table_to_parquet(table, parquet_path):
+    """
+    Write a PyArrow table to a Parquet file.
+
+    Args:
+        table (pa.Table): PyArrow table to write
+        parquet_path (str): Output Parquet file path
+
+    Returns:
+        str: Path to the created Parquet file
+    """
+    pq.write_table(table, parquet_path)
+    return parquet_path
+
+def print_parquet_schema(parquet_path):
+    """
+    Print the schema of a Parquet file.
+
+    Args:
+        parquet_path (str): Path to the Parquet file
+    """
+    schema = pq.read_schema(parquet_path)
+    print(f"Schema in Parquet file: {parquet_path}")
+    for field in schema:
+        print(f"  {field.name}: {field.type}")
+
+    return schema
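
`read_csv_with_schema` passes three format strings to `timestamp_parsers`. As a rough, stdlib-only illustration of which format matches which kind of CSV value, the sketch below tries the same strings in order with `datetime.strptime`. The `first_matching_format` helper and the `19:36:22` and `24/05/2002` sample values are hypothetical, and PyArrow's own parser may differ in details; this only exercises the format strings themselves.

```python
from datetime import datetime

# Same format strings as the timestamp_parsers list in arrow_utils.py.
TIMESTAMP_FORMATS = ["%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%H:%M:%S"]


def first_matching_format(value, formats=TIMESTAMP_FORMATS):
    """Return (format, parsed datetime) for the first format that parses value, else None."""
    for fmt in formats:
        try:
            return fmt, datetime.strptime(value, fmt)
        except ValueError:
            # This format does not describe the whole value; try the next one.
            continue
    return None


date_only = first_matching_format("2002-05-24")
date_time = first_matching_format("2002-05-24 00:00:00")  # value style used in the Staples CSV
time_only = first_matching_format("19:36:22")             # hypothetical time-only value
unparsable = first_matching_format("24/05/2002")          # hypothetical value in an unlisted format

for label, result in [("date", date_only), ("datetime", date_time),
                      ("time", time_only), ("other", unparsable)]:
    print(label, result[0] if result else None)
```

A value that matches none of the listed formats yields `None` here; how PyArrow reports such a value depends on its conversion settings and is not modelled by this sketch.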

tdvt/TestV1/calcs.parquet

8.11 KB
Binary file not shown.
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+import pyarrow.csv as csv
+import pyarrow as pa
+import pyarrow.parquet as pq
+from datetime import datetime
+
+from arrow_utils import print_parquet_schema, read_csv_with_schema, write_table_to_parquet
+
+# CSV input path and Parquet output path
+csv_path = "./Calcs_headers.csv"
+parquet_path = "calcs.parquet"
+
+# Target Calcs schema: https://github.com/tableau/connector-plugin-sdk/blob/master/tests/datasets/TestV1/DDL/Calcs.sql
+arrow_schema = pa.schema([
+    ("key", pa.string()),
+    ("num0", pa.float64()),
+    ("num1", pa.float64()),
+    ("num2", pa.float64()),
+    ("num3", pa.float64()),
+    ("num4", pa.float64()),
+    ("str0", pa.string()),
+    ("str1", pa.string()),
+    ("str2", pa.string()),
+    ("str3", pa.string()),
+    ("int0", pa.int32()),
+    ("int1", pa.int32()),
+    ("int2", pa.int32()),
+    ("int3", pa.int32()),
+    ("bool0", pa.bool_()),
+    ("bool1", pa.bool_()),
+    ("bool2", pa.bool_()),
+    ("bool3", pa.bool_()),
+    ("date0", pa.date32()),
+    ("date1", pa.date32()),
+    ("date2", pa.date32()),
+    ("date3", pa.date32()),
+    ("time0", pa.timestamp("s")),
+    ("time1", pa.time64("us")),
+    ("datetime0", pa.timestamp("s")),
+    ("datetime1", pa.string()),
+    ("zzz", pa.string())
+])
+
+try:
+    table = read_csv_with_schema(csv_path, arrow_schema)
+    write_table_to_parquet(table, parquet_path)
+    print_parquet_schema(parquet_path)
+
+except Exception as e:
+    print(f"Error during conversion: {e}")
+
+# Keeping Table Arrow schema for future reference / troubleshooting
+# Schema in Parquet file:
+# key: string
+# num0: double
+# num1: double
+# num2: double
+# num3: double
+# num4: double
+# str0: string
+# str1: string
+# str2: string
+# str3: string
+# int0: int32
+# int1: int32
+# int2: int32
+# int3: int32
+# bool0: bool
+# bool1: bool
+# bool2: bool
+# bool3: bool
+# date0: date32[day]
+# date1: date32[day]
+# date2: date32[day]
+# date3: date32[day]
+# time0: timestamp[ms]
+# time1: time64[us]
+# datetime0: timestamp[ms]
+# datetime1: string
+# zzz: string

tdvt/TestV1/staples.parquet

2.66 MB
Binary file not shown.
Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
+import pyarrow.csv as csv
+import pyarrow as pa
+import pyarrow.parquet as pq
+from datetime import datetime
+
+from arrow_utils import print_parquet_schema, read_csv_with_schema, write_table_to_parquet
+
+# CSV input path and Parquet output path
+csv_path = "./Staples_utf8_headers.csv"
+parquet_path = "staples.parquet"
+
+# Target Staples schema: https://github.com/tableau/connector-plugin-sdk/blob/master/tests/datasets/TestV1/DDL/Staples.sql
+arrow_schema = pa.schema([
+    ("Item Count", pa.int32()),
+    ("Ship Priority", pa.string()),
+    ("Order Priority", pa.string()),
+    ("Order Status", pa.string()),
+    ("Order Quantity", pa.float64()),
+    ("Sales Total", pa.float64()),
+    ("Discount", pa.float64()),
+    ("Tax Rate", pa.float64()),
+    ("Ship Mode", pa.string()),
+    ("Fill Time", pa.float64()),
+    ("Gross Profit", pa.float64()),
+    ("Price", pa.decimal128(18, 4)),
+    ("Ship Handle Cost", pa.decimal128(18, 4)),
+    ("Employee Name", pa.string()),
+    ("Employee Dept", pa.string()),
+    ("Manager Name", pa.string()),
+    ("Employee Yrs Exp", pa.float64()),
+    ("Employee Salary", pa.decimal128(18, 4)),
+    ("Customer Name", pa.string()),
+    ("Customer State", pa.string()),
+    ("Call Center Region", pa.string()),
+    ("Customer Balance", pa.float64()),
+    ("Customer Segment", pa.string()),
+    ("Prod Type1", pa.string()),
+    ("Prod Type2", pa.string()),
+    ("Prod Type3", pa.string()),
+    ("Prod Type4", pa.string()),
+    ("Product Name", pa.string()),
+    ("Product Container", pa.string()),
+    ("Ship Promo", pa.string()),
+    ("Supplier Name", pa.string()),
+    ("Supplier Balance", pa.float64()),
+    ("Supplier Region", pa.string()),
+    ("Supplier State", pa.string()),
+    ("Order ID", pa.string()),
+    ("Order Year", pa.int32()),
+    ("Order Month", pa.int32()),
+    ("Order Day", pa.int32()),
+    ("Order Date", pa.timestamp("s")),
+    ("Order Quarter", pa.string()),
+    ("Product Base Margin", pa.float64()),
+    ("Product ID", pa.string()),
+    ("Receive Time", pa.float64()),
+    ("Received Date", pa.timestamp("s")),
+    ("Ship Date", pa.timestamp("s")),
+    ("Ship Charge", pa.decimal128(18, 4)),
+    ("Total Cycle Time", pa.float64()),
+    ("Product In Stock", pa.string()),
+    ("PID", pa.int32()),
+    ("Market Segment", pa.string())
+])
+
+try:
+    table = read_csv_with_schema(csv_path, arrow_schema)
+    write_table_to_parquet(table, parquet_path)
+    print_parquet_schema(parquet_path)
+
+except Exception as e:
+    print(f"Error during conversion: {e}")
+
+# Keeping Table Arrow schema for future reference / troubleshooting
+# Schema in Parquet file:
+# Item Count: int32
+# Ship Priority: string
+# Order Priority: string
+# Order Status: string
+# Order Quantity: double
+# Sales Total: double
+# Discount: double
+# Tax Rate: double
+# Ship Mode: string
+# Fill Time: double
+# Gross Profit: double
+# Price: decimal128(18, 4)
+# Ship Handle Cost: decimal128(18, 4)
+# Employee Name: string
+# Employee Dept: string
+# Manager Name: string
+# Employee Yrs Exp: double
+# Employee Salary: decimal128(18, 4)
+# Customer Name: string
+# Customer State: string
+# Call Center Region: string
+# Customer Balance: double
+# Customer Segment: string
+# Prod Type1: string
+# Prod Type2: string
+# Prod Type3: string
+# Prod Type4: string
+# Product Name: string
+# Product Container: string
+# Ship Promo: string
+# Supplier Name: string
+# Supplier Balance: double
+# Supplier Region: string
+# Supplier State: string
+# Order ID: string
+# Order Year: int32
+# Order Month: int32
+# Order Day: int32
+# Order Date: timestamp[ms]
+# Order Quarter: string
+# Product Base Margin: double
+# Product ID: string
+# Receive Time: double
+# Received Date: timestamp[ms]
+# Ship Date: timestamp[ms]
+# Ship Charge: decimal128(18, 4)
+# Total Cycle Time: double
+# Product In Stock: string
+# PID: int32
+# Market Segment: string
+
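
Because `read_csv_with_schema` skips the header row and assigns `schema.names` by position, the schema must list its columns in the same order as the CSV header. A minimal stdlib sketch of a pre-flight check for that, using a shortened three-column sample rather than the full Staples file; the `header_mismatches` helper is hypothetical and not part of this commit:

```python
import csv
import io


def header_mismatches(csv_text, expected_names):
    """Compare the first CSV row with the expected column names, position by position."""
    header = next(csv.reader(io.StringIO(csv_text)))
    mismatches = [
        (index, actual, expected)
        for index, (actual, expected) in enumerate(zip(header, expected_names))
        if actual != expected
    ]
    # zip() stops at the shorter list, so report a length difference explicitly.
    if len(header) != len(expected_names):
        mismatches.append(("length", len(header), len(expected_names)))
    return mismatches


# Shortened sample: the first three Staples columns and one data row.
sample_csv = '"Item Count","Ship Priority","Order Priority"\n1,0,1-URGENT\n'

ok = header_mismatches(sample_csv, ["Item Count", "Ship Priority", "Order Priority"])
swapped = header_mismatches(sample_csv, ["Item Count", "Order Priority", "Ship Priority"])
too_short = header_mismatches(sample_csv, ["Item Count", "Ship Priority"])

print(ok)
print(swapped)
print(too_short)
```

With a real schema, the expected names would come from `arrow_schema.names`; the check is kept separate from PyArrow here so it runs without any third-party dependency.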
