Commit ebbbda9

[DOCS] add docs on csv files (#1824)
1 parent 9bdf193 commit ebbbda9

2 files changed

+193
-0
lines changed

Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Apache Sedona CSV with geometry using Spark

This post shows how to read and write CSV files with geometry columns using Sedona and Spark. It also covers the advantages and disadvantages of the CSV file format for storing geometric data.

Let’s start by writing a CSV file with geometric data.

## Write CSV with geometry using Sedona and Spark

Start by creating a DataFrame with Sedona and Spark:

```python
from pyspark.sql.functions import col
from sedona.sql.st_constructors import ST_GeomFromText

# `sedona` is assumed to be an existing Sedona-enabled Spark session.
df = sedona.createDataFrame(
    [
        ("a", "LINESTRING(2.0 5.0,6.0 1.0)"),
        ("b", "POINT(1.0 2.0)"),
        ("c", "POLYGON((7.0 1.0,7.0 3.0,9.0 3.0,7.0 1.0))"),
    ],
    ["id", "geometry"],
)
# Parse the WKT strings into a geometry column.
df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
```

Here are the contents of the DataFrame:

```
+---+------------------------------+
|id |geometry                      |
+---+------------------------------+
|a  |LINESTRING (2 5, 6 1)         |
|b  |POINT (1 2)                   |
|c  |POLYGON ((7 1, 7 3, 9 3, 7 1))|
+---+------------------------------+
```

Now write the DataFrame to CSV:

```python
from sedona.sql.st_functions import ST_AsText

# CSV has no geometry type, so serialize the geometry as WKT text first.
df = df.withColumn("geom_wkt", ST_AsText(col("geometry"))).drop("geometry")
df.repartition(1).write.option("header", True).format("csv").mode("overwrite").save(
    "/tmp/my_csvs"
)
```

Notice that `repartition(1)` makes Spark write a single CSV file inside the `/tmp/my_csvs` directory. It’s usually better to write many files in parallel, which makes the write faster. A single file just keeps this example simple.

Here are the contents of the CSV file:

```
id,geom_wkt
a,"LINESTRING (2 5, 6 1)"
b,POINT (1 2)
c,"POLYGON ((7 1, 7 3, 9 3, 7 1))"
```

The file stores the `geom_wkt` column as plain text, which makes it easy to read by eye. Values that contain commas are quoted, which is why the LINESTRING and POLYGON rows are wrapped in double quotes and the POINT row is not. WKT is a standard format, so any engine that knows how to parse WKT can read the column.
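
Because the quoting follows ordinary CSV conventions, tools other than Spark can read the file as well. As a minimal sketch, Python's standard `csv` module parses the same text. The file contents are inlined as a string here, an assumption made so the example runs without the Spark output directory.

```python
import csv
import io

# The CSV text from above, inlined so the example is self-contained.
csv_text = '''id,geom_wkt
a,"LINESTRING (2 5, 6 1)"
b,POINT (1 2)
c,"POLYGON ((7 1, 7 3, 9 3, 7 1))"
'''

# DictReader uses the header row for keys and strips the CSV quoting,
# so each WKT value comes back as one intact string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    print(row["id"], "->", row["geom_wkt"])
```

The values are still plain strings at this point. Turning them into geometries requires a WKT parser, which is the role `ST_GeomFromText` plays in Sedona.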

## Read CSV with geometry using Sedona and Spark

Now read the CSV file into a DataFrame:

```python
from pyspark.sql.functions import col
from sedona.sql.st_constructors import ST_GeomFromText

df = (
    sedona.read.option("header", True)
    .format("csv")
    .load("/tmp/my_csvs")
    # Convert the WKT text back into a geometry column.
    .withColumn("geometry", ST_GeomFromText(col("geom_wkt")))
    .drop("geom_wkt")
)
```

CSV stores the `geom_wkt` column as text, so you must convert it back to a geometry column with the `ST_GeomFromText` function when you read the data. Here are the contents of the DataFrame:

```
+---+------------------------------+
|id |geometry                      |
+---+------------------------------+
|a  |LINESTRING (2 5, 6 1)         |
|b  |POINT (1 2)                   |
|c  |POLYGON ((7 1, 7 3, 9 3, 7 1))|
+---+------------------------------+
```

Verify that the schema is correct:

```
root
 |-- id: string (nullable = true)
 |-- geometry: geometry (nullable = true)
```

## Read/write CSV files with Extended Well-Known Text (EWKT)

EWKT is WKT with an SRID prefix. To write the DataFrame to CSV with EWKT, start by setting the SRID on the geometry column:

```python
from sedona.sql.st_functions import ST_AsEWKT, ST_SetSRID
df = df.withColumn("geometry", ST_SetSRID(col("geometry"), 4326))
```

Now write out the DataFrame with an EWKT column:

```python
# Serialize the geometry as EWKT so the SRID is written with each value.
df = df.withColumn("geom_ewkt", ST_AsEWKT(col("geometry"))).drop("geometry")
df.repartition(1).write.option("header", True).format("csv").mode("overwrite").save(
    "/tmp/my_ewkt_csvs"
)
```

Here are the contents of the CSV file:

```
id,geom_ewkt
a,"SRID=4326;LINESTRING (2 5, 6 1)"
b,SRID=4326;POINT (1 2)
c,"SRID=4326;POLYGON ((7 1, 7 3, 9 3, 7 1))"
```
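
Each EWKT value is an `SRID=<code>;` prefix followed by ordinary WKT. The sketch below splits the two parts with plain string handling to make that layout concrete. `split_ewkt` is a hypothetical helper written for illustration, not a Sedona function; in Sedona, `ST_GeomFromEWKT` handles this parsing.

```python
from typing import Optional, Tuple


def split_ewkt(value: str) -> Tuple[Optional[int], str]:
    """Split an EWKT string into (srid, wkt); srid is None for plain WKT."""
    if value.upper().startswith("SRID="):
        # Everything before the first ';' is the SRID prefix, the rest is WKT.
        prefix, _, wkt = value.partition(";")
        return int(prefix[len("SRID="):]), wkt
    return None, value


print(split_ewkt("SRID=4326;POINT (1 2)"))  # (4326, 'POINT (1 2)')
print(split_ewkt("POINT (1 2)"))  # (None, 'POINT (1 2)')
```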

Here’s how to read the CSV file with an EWKT column into a Sedona DataFrame:

```python
from sedona.sql.st_constructors import ST_GeomFromEWKT

df = (
    sedona.read.option("header", True)
    .format("csv")
    .load("/tmp/my_ewkt_csvs")
    # Parse the EWKT text; the SRID prefix is kept on the geometry.
    .withColumn("geometry", ST_GeomFromEWKT(col("geom_ewkt")))
    .drop("geom_ewkt")
)
```

Here are the contents of the DataFrame:

```
+---+------------------------------+
|id |geometry                      |
+---+------------------------------+
|a  |LINESTRING (2 5, 6 1)         |
|b  |POINT (1 2)                   |
|c  |POLYGON ((7 1, 7 3, 9 3, 7 1))|
+---+------------------------------+
```

You don’t see the SRID when printing the Sedona DataFrame, but this metadata is maintained internally. You can confirm it with the `ST_SRID` function.

## Advantages of CSV for data with geometry

There are a few advantages of using CSV with geometry data:

* Many engines support CSV
* It’s human-readable
* The EWKT variant stores the SRID alongside each geometry
* The format has withstood the test of time

But CSV also has many disadvantages.

## Disadvantages of CSV for datasets with geometry

Here are the disadvantages of storing geometric data in CSV files:

* CSV is a row-oriented file format, so engines can’t cherry-pick individual columns while reading data. Column-oriented files allow for column pruning, an important performance feature.
* CSV’s row-oriented layout also makes files harder to compress, because values from the same column are not stored together.
* CSV files don’t contain the schema of the data, so either the engine must infer the schema or the user must specify it manually when reading. Inferring the schema is error-prone, and specifying it manually is tedious.
* CSV doesn’t store row-group metadata, so row-group skipping isn’t possible.
* CSV doesn’t store file-level metadata, so file skipping isn’t possible.
* When the SRID is tracked, it is written on every line of the CSV file because CSV has no file-level metadata. This repetition wastes space.
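
To put a rough number on the last point, the `SRID=4326;` prefix from the EWKT example costs 10 bytes per row before compression. The row count below is a hypothetical figure chosen only for illustration.

```python
# Bytes added to every row by the EWKT SRID prefix (ASCII, 1 byte per character).
srid_prefix = "SRID=4326;"
bytes_per_row = len(srid_prefix.encode("utf-8"))

# Hypothetical dataset size, chosen only for illustration.
row_count = 100_000_000

# Total uncompressed overhead in megabytes (1 MB = 1,000,000 bytes).
overhead_mb = bytes_per_row * row_count / 1_000_000
print(f"{bytes_per_row} bytes per row, {overhead_mb:,.0f} MB of repeated SRID text")
```

A format with file-level metadata can record the CRS once per file instead of once per row.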

## Conclusion

Spark and Sedona support the CSV file format for geometric data, but it is generally slow and should be used only when necessary.

If you’re building a geospatial data lake, GeoParquet is almost always a better alternative.

And if you’re building a geospatial data lakehouse, then Iceberg is a good option.

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -61,6 +61,7 @@ nav:
 - Sedona R: api/rdocs
 - Work with GeoPandas and Shapely: tutorial/geopandas-shapely.md
 - Files:
+- CSV: tutorial/files/csv-geometry-sedona-spark.md
 - GeoParquet: tutorial/files/geoparquet-sedona-spark.md
 - GeoJSON: tutorial/files/geojson-sedona-spark.md
 - Map visualization SQL app:
