
Commit 7807aa6

Add documentation for S3FS cursor
- Add docs/s3fs.rst with comprehensive S3FSCursor and AsyncS3FSCursor documentation
- Add docs/api/s3fs.rst with API reference
- Update docs/index.rst to include s3fs in toctree
- Update docs/api.rst to include s3fs API reference

The documentation covers:

- Basic usage and connection examples
- Type conversion mappings
- Custom converter implementation
- Limitations compared to Arrow/Pandas cursors
- Use cases and recommendations
- AsyncS3FSCursor for asynchronous operations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent d74546b commit 7807aa6

File tree

4 files changed, +360 -0 lines changed


docs/api.rst

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,7 @@ This section provides comprehensive API documentation for all PyAthena classes a
    api/connection
    api/pandas
    api/arrow
+   api/s3fs
    api/spark
    api/converters
    api/filesystem
@@ -35,6 +36,7 @@ Specialized Integrations

 - :ref:`api_pandas` - pandas DataFrame integration
 - :ref:`api_arrow` - Apache Arrow columnar data integration
+- :ref:`api_s3fs` - Lightweight S3FS-based cursor (no pandas/pyarrow required)
 - :ref:`api_spark` - Apache Spark integration for big data processing

 Infrastructure

docs/api/s3fs.rst

Lines changed: 30 additions & 0 deletions
.. _api_s3fs:

S3FS Integration
================

This section covers lightweight S3FS-based cursors and data converters that use Python's built-in ``csv`` module.

S3FS Cursors
------------

.. autoclass:: pyathena.s3fs.cursor.S3FSCursor
   :members:
   :inherited-members:

.. autoclass:: pyathena.s3fs.async_cursor.AsyncS3FSCursor
   :members:
   :inherited-members:

S3FS Data Converters
--------------------

.. autoclass:: pyathena.s3fs.converter.DefaultS3FSTypeConverter
   :members:

S3FS Result Set
---------------

.. autoclass:: pyathena.s3fs.result_set.AthenaS3FSResultSet
   :members:
   :inherited-members:

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ Documentation

    cursor
    pandas
    arrow
+   s3fs
    spark
    testing
    api

docs/s3fs.rst

Lines changed: 327 additions & 0 deletions
.. _s3fs:

S3FS
====

.. _s3fs-cursor:

S3FSCursor
----------

S3FSCursor is a lightweight cursor that directly reads the CSV result files that Athena writes to S3 for each query execution.
Unlike ArrowCursor or PandasCursor, this cursor uses Python's built-in ``csv`` module to parse results,
making it ideal for environments where installing pandas or pyarrow is not desirable.

**Key features:**

- No pandas or pyarrow dependencies required
- Uses Python's built-in ``csv`` module for parsing
- Lower memory footprint for simple query results
- Full DB API 2.0 compatibility
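The parsing step relies on nothing beyond the standard library. A minimal, self-contained sketch of reading an Athena-style CSV result with the ``csv`` module (an illustration only, not PyAthena's actual implementation; the `raw` string stands in for an object downloaded from ``s3_staging_dir``):

```python
import csv
import io

# Athena writes query results as a quoted CSV file with a header row;
# this string stands in for the file downloaded from S3.
raw = '"id","name"\n"1","alice"\n"2","bob"\n'

reader = csv.reader(io.StringIO(raw))
header = next(reader)              # column names from the first row
rows = [tuple(r) for r in reader]  # every value arrives as a string

print(header)  # ['id', 'name']
print(rows)    # [('1', 'alice'), ('2', 'bob')]
```

Type conversion (covered below) then turns those raw strings into Python values.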
You can use the S3FSCursor by specifying the ``cursor_class``
with the connect method or connection object.

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.cursor import S3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2",
                     cursor_class=S3FSCursor).cursor()

.. code:: python

    from pyathena.connection import Connection
    from pyathena.s3fs.cursor import S3FSCursor

    cursor = Connection(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                        region_name="us-west-2",
                        cursor_class=S3FSCursor).cursor()

It can also be used by specifying the cursor class when calling the connection object's cursor method.

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.cursor import S3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2").cursor(S3FSCursor)

.. code:: python

    from pyathena.connection import Connection
    from pyathena.s3fs.cursor import S3FSCursor

    cursor = Connection(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                        region_name="us-west-2").cursor(S3FSCursor)
Fetching and iterating over query results are supported.

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.cursor import S3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2",
                     cursor_class=S3FSCursor).cursor()

    cursor.execute("SELECT * FROM many_rows")
    print(cursor.fetchone())
    print(cursor.fetchmany())
    print(cursor.fetchall())

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.cursor import S3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2",
                     cursor_class=S3FSCursor).cursor()

    cursor.execute("SELECT * FROM many_rows")
    for row in cursor:
        print(row)

Execution information of the query can also be retrieved.

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.cursor import S3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2",
                     cursor_class=S3FSCursor).cursor()

    cursor.execute("SELECT * FROM many_rows")
    print(cursor.state)
    print(cursor.state_change_reason)
    print(cursor.completion_date_time)
    print(cursor.submission_date_time)
    print(cursor.data_scanned_in_bytes)
    print(cursor.engine_execution_time_in_millis)
    print(cursor.query_queue_time_in_millis)
    print(cursor.total_execution_time_in_millis)
    print(cursor.query_planning_time_in_millis)
    print(cursor.service_processing_time_in_millis)
    print(cursor.output_location)
Type Conversion
~~~~~~~~~~~~~~~

S3FSCursor converts Athena data types to Python types using the built-in converter.
The following type mappings are used:

.. list-table:: Type Mappings
   :header-rows: 1
   :widths: 30 70

   * - Athena Type
     - Python Type
   * - boolean
     - bool
   * - tinyint, smallint, integer, bigint
     - int
   * - float, double, real
     - float
   * - decimal
     - decimal.Decimal
   * - char, varchar, string
     - str
   * - date
     - datetime.date
   * - timestamp
     - datetime.datetime
   * - time
     - datetime.time
   * - binary, varbinary
     - bytes
   * - array, map, row (struct)
     - Parsed as Python list/dict using JSON-like parsing
   * - json
     - Parsed JSON (dict or list)
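Since every CSV value arrives as a string, each mapping in the table is essentially a string-parsing function keyed by the Athena type name. A hypothetical standalone illustration of a few of these conversions (not PyAthena's converter code):

```python
import datetime
from decimal import Decimal

# Illustrative string-to-Python conversions mirroring the table above.
converters = {
    "boolean": lambda v: v == "true",
    "bigint": int,
    "double": float,
    "decimal": Decimal,
    "date": datetime.date.fromisoformat,
    "timestamp": lambda v: datetime.datetime.strptime(v, "%Y-%m-%d %H:%M:%S.%f"),
}

print(converters["boolean"]("true"))     # True
print(converters["bigint"]("42"))        # 42
print(converters["decimal"]("1.50"))     # 1.50
print(converters["date"]("2024-01-31"))  # 2024-01-31
```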
If you want to customize type conversion, create a converter class like this:

.. code:: python

    from typing import Any

    from pyathena.s3fs.converter import DefaultS3FSTypeConverter

    class CustomS3FSTypeConverter(DefaultS3FSTypeConverter):
        def __init__(self) -> None:
            super().__init__()
            # Override specific type mappings
            self._mappings["custom_type"] = self._convert_custom

        def _convert_custom(self, value: str) -> Any:
            # Your custom conversion logic
            return value.upper()

Then specify an instance of this class in the converter argument when creating a cursor.

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.cursor import S3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2").cursor(S3FSCursor, converter=CustomS3FSTypeConverter())
Limitations
~~~~~~~~~~~

S3FSCursor has some limitations compared to ArrowCursor or PandasCursor:

- **No UNLOAD support**: S3FSCursor reads CSV results directly and does not support the UNLOAD option
  that outputs results in Parquet format.
- **Sequential reading**: Results are read row by row from the CSV file, which may be slower
  for very large result sets compared to columnar formats.
- **No DataFrame conversion**: There is no ``as_pandas()`` or ``as_arrow()`` method.
  Use PandasCursor or ArrowCursor if you need DataFrame operations.

When to use S3FSCursor
~~~~~~~~~~~~~~~~~~~~~~

S3FSCursor is recommended when:

- You want to minimize dependencies (no pandas/pyarrow required)
- You're working in a constrained environment (e.g., AWS Lambda with size limits)
- You only need simple row-by-row result processing
- Memory efficiency is important and results don't need columnar operations

For large-scale data processing or analytical workloads, consider using ArrowCursor or PandasCursor instead.
.. _async-s3fs-cursor:

AsyncS3FSCursor
---------------

AsyncS3FSCursor is an AsyncCursor that uses the same lightweight CSV parsing as S3FSCursor.
This cursor is useful when you need to execute queries asynchronously without pandas or pyarrow dependencies.

You can use the AsyncS3FSCursor by specifying the ``cursor_class``
with the connect method or connection object.

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.async_cursor import AsyncS3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2",
                     cursor_class=AsyncS3FSCursor).cursor()

.. code:: python

    from pyathena.connection import Connection
    from pyathena.s3fs.async_cursor import AsyncS3FSCursor

    cursor = Connection(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                        region_name="us-west-2",
                        cursor_class=AsyncS3FSCursor).cursor()

It can also be used by specifying the cursor class when calling the connection object's cursor method.

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.async_cursor import AsyncS3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2").cursor(AsyncS3FSCursor)

.. code:: python

    from pyathena.connection import Connection
    from pyathena.s3fs.async_cursor import AsyncS3FSCursor

    cursor = Connection(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                        region_name="us-west-2").cursor(AsyncS3FSCursor)
The default number of workers is 5, or the number of CPUs multiplied by 5.
If you want to change the number of workers, you can specify it like the following.

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.async_cursor import AsyncS3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2",
                     cursor_class=AsyncS3FSCursor).cursor(max_workers=10)

The execute method of the AsyncS3FSCursor returns a tuple of the query ID and a `future object`_.

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.async_cursor import AsyncS3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2",
                     cursor_class=AsyncS3FSCursor).cursor()

    query_id, future = cursor.execute("SELECT * FROM many_rows")

The return value of the `future object`_ is an ``AthenaS3FSResultSet`` object.
This object has an interface similar to ``AthenaResultSet``.
.. code:: python

    from pyathena import connect
    from pyathena.s3fs.async_cursor import AsyncS3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2",
                     cursor_class=AsyncS3FSCursor).cursor()

    query_id, future = cursor.execute("SELECT * FROM many_rows")
    result_set = future.result()
    print(result_set.state)
    print(result_set.state_change_reason)
    print(result_set.completion_date_time)
    print(result_set.submission_date_time)
    print(result_set.data_scanned_in_bytes)
    print(result_set.engine_execution_time_in_millis)
    print(result_set.query_queue_time_in_millis)
    print(result_set.total_execution_time_in_millis)
    print(result_set.query_planning_time_in_millis)
    print(result_set.service_processing_time_in_millis)
    print(result_set.output_location)
    print(result_set.description)
    for row in result_set:
        print(row)

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.async_cursor import AsyncS3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2",
                     cursor_class=AsyncS3FSCursor).cursor()

    query_id, future = cursor.execute("SELECT * FROM many_rows")
    result_set = future.result()
    print(result_set.fetchall())
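The (query ID, future) pair is ordinary ``concurrent.futures`` machinery. A standalone sketch of the pattern, with a stand-in function in place of a real Athena query (``run_query`` and its return value are hypothetical):

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

def run_query(sql: str) -> list:
    # Stand-in for query execution; the real cursor polls Athena
    # for completion and then reads the CSV result from S3.
    return [("a",), ("b",)]

with ThreadPoolExecutor(max_workers=5) as executor:
    # The real execute() returns Athena's query execution ID;
    # here a random UUID stands in for it.
    query_id = str(uuid.uuid4())
    future = executor.submit(run_query, "SELECT * FROM many_rows")
    print(query_id, future.result())  # result() blocks until the rows are ready
```

Because ``future.result()`` blocks, you can submit several queries first and collect their results afterwards to run them concurrently.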
As with AsyncCursor, you need a query ID to cancel a query.

.. code:: python

    from pyathena import connect
    from pyathena.s3fs.async_cursor import AsyncS3FSCursor

    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                     region_name="us-west-2",
                     cursor_class=AsyncS3FSCursor).cursor()

    query_id, future = cursor.execute("SELECT * FROM many_rows")
    cursor.cancel(query_id)

.. _`future object`: https://docs.python.org/3/library/concurrent.futures.html#future-objects
