Add documentation for S3FS cursor

laughingman7743 · claude · laughingman7743 · commit 7807aa68576d · 2026-01-01T15:52:56.000+09:00
- Add docs/s3fs.rst with comprehensive S3FSCursor and AsyncS3FSCursor documentation - Add docs/api/s3fs.rst with API reference - Update docs/index.rst to include s3fs in toctree - Update docs/api.rst to include s3fs API reference The documentation covers: - Basic usage and connection examples - Type conversion mappings - Custom converter implementation - Limitations compared to Arrow/Pandas cursors - Use cases and recommendations - AsyncS3FSCursor for asynchronous operations 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
diff --git a/docs/api.rst b/docs/api.rst
@@ -12,6 +12,7 @@ This section provides comprehensive API documentation for all PyAthena classes a
    api/connection
    api/pandas
    api/arrow
+   api/s3fs
    api/spark
    api/converters
    api/filesystem
@@ -35,6 +36,7 @@ Specialized Integrations
 
 - :ref:`api_pandas` - pandas DataFrame integration
 - :ref:`api_arrow` - Apache Arrow columnar data integration
+- :ref:`api_s3fs` - Lightweight S3FS-based cursor (no pandas/pyarrow required)
 - :ref:`api_spark` - Apache Spark integration for big data processing
 
 Infrastructure
diff --git a/docs/api/s3fs.rst b/docs/api/s3fs.rst
@@ -0,0 +1,30 @@
+.. _api_s3fs:
+
+S3FS Integration
+================
+
+This section covers lightweight S3FS-based cursors and data converters that use Python's built-in ``csv`` module.
+
+S3FS Cursors
+------------
+
+.. autoclass:: pyathena.s3fs.cursor.S3FSCursor
+   :members:
+   :inherited-members:
+
+.. autoclass:: pyathena.s3fs.async_cursor.AsyncS3FSCursor
+   :members:
+   :inherited-members:
+
+S3FS Data Converters
+--------------------
+
+.. autoclass:: pyathena.s3fs.converter.DefaultS3FSTypeConverter
+   :members:
+
+S3FS Result Set
+---------------
+
+.. autoclass:: pyathena.s3fs.result_set.AthenaS3FSResultSet
+   :members:
+   :inherited-members:
diff --git a/docs/index.rst b/docs/index.rst
@@ -20,6 +20,7 @@ Documentation
    cursor
    pandas
    arrow
+   s3fs
    spark
    testing
    api
diff --git a/docs/s3fs.rst b/docs/s3fs.rst
@@ -0,0 +1,327 @@
+.. _s3fs:
+
+S3FS
+====
+
+.. _s3fs-cursor:
+
+S3FSCursor
+----------
+
+S3FSCursor is a lightweight cursor that directly handles the CSV file of the query execution result output to S3.
+Unlike ArrowCursor or PandasCursor, this cursor uses Python's built-in ``csv`` module to parse results,
+making it ideal for environments where installing pandas or pyarrow is not desirable.
+
+**Key features:**
+
+- No pandas or pyarrow dependencies required
+- Uses Python's built-in ``csv`` module for parsing
+- Lower memory footprint for simple query results
+- Full DB API 2.0 compatibility
+
+You can use the S3FSCursor by specifying the ``cursor_class``
+with the connect method or connection object.
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.cursor import S3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2",
+                     cursor_class=S3FSCursor).cursor()
+
+.. code:: python
+
+    from pyathena.connection import Connection
+    from pyathena.s3fs.cursor import S3FSCursor
+
+    cursor = Connection(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                        region_name="us-west-2",
+                        cursor_class=S3FSCursor).cursor()
+
+It can also be used by specifying the cursor class when calling the connection object's cursor method.
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.cursor import S3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2").cursor(S3FSCursor)
+
+.. code:: python
+
+    from pyathena.connection import Connection
+    from pyathena.s3fs.cursor import S3FSCursor
+
+    cursor = Connection(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                        region_name="us-west-2").cursor(S3FSCursor)
+
+Support fetch and iterate query results.
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.cursor import S3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2",
+                     cursor_class=S3FSCursor).cursor()
+
+    cursor.execute("SELECT * FROM many_rows")
+    print(cursor.fetchone())
+    print(cursor.fetchmany())
+    print(cursor.fetchall())
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.cursor import S3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2",
+                     cursor_class=S3FSCursor).cursor()
+
+    cursor.execute("SELECT * FROM many_rows")
+    for row in cursor:
+        print(row)
+
+Execution information of the query can also be retrieved.
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.cursor import S3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2",
+                     cursor_class=S3FSCursor).cursor()
+
+    cursor.execute("SELECT * FROM many_rows")
+    print(cursor.state)
+    print(cursor.state_change_reason)
+    print(cursor.completion_date_time)
+    print(cursor.submission_date_time)
+    print(cursor.data_scanned_in_bytes)
+    print(cursor.engine_execution_time_in_millis)
+    print(cursor.query_queue_time_in_millis)
+    print(cursor.total_execution_time_in_millis)
+    print(cursor.query_planning_time_in_millis)
+    print(cursor.service_processing_time_in_millis)
+    print(cursor.output_location)
+
+Type Conversion
+~~~~~~~~~~~~~~~
+
+S3FSCursor converts Athena data types to Python types using the built-in converter.
+The following type mappings are used:
+
+.. list-table:: Type Mappings
+   :header-rows: 1
+   :widths: 30 70
+
+   * - Athena Type
+     - Python Type
+   * - boolean
+     - bool
+   * - tinyint, smallint, integer, bigint
+     - int
+   * - float, double, real
+     - float
+   * - decimal
+     - decimal.Decimal
+   * - char, varchar, string
+     - str
+   * - date
+     - datetime.date
+   * - timestamp
+     - datetime.datetime
+   * - time
+     - datetime.time
+   * - binary, varbinary
+     - bytes
+   * - array, map, row (struct)
+     - Parsed as Python list/dict using JSON-like parsing
+   * - json
+     - Parsed JSON (dict or list)
+
+If you want to customize type conversion, create a converter class like this:
+
+.. code:: python
+
+    from pyathena.s3fs.converter import DefaultS3FSTypeConverter
+
+    class CustomS3FSTypeConverter(DefaultS3FSTypeConverter):
+        def __init__(self) -> None:
+            super().__init__()
+            # Override specific type mappings
+            self._mappings["custom_type"] = self._convert_custom
+
+        def _convert_custom(self, value: str) -> Any:
+            # Your custom conversion logic
+            return value.upper()
+
+Then specify an instance of this class in the converter argument when creating a cursor.
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.cursor import S3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2").cursor(S3FSCursor, converter=CustomS3FSTypeConverter())
+
+Limitations
+~~~~~~~~~~~
+
+S3FSCursor has some limitations compared to ArrowCursor or PandasCursor:
+
+- **No UNLOAD support**: S3FSCursor reads CSV results directly and does not support the UNLOAD option
+  that outputs results in Parquet format.
+- **Sequential reading**: Results are read row by row from the CSV file, which may be slower
+  for very large result sets compared to columnar formats.
+- **No DataFrame conversion**: There is no ``as_pandas()`` or ``as_arrow()`` method.
+  Use PandasCursor or ArrowCursor if you need DataFrame operations.
+
+When to use S3FSCursor
+~~~~~~~~~~~~~~~~~~~~~~
+
+S3FSCursor is recommended when:
+
+- You want to minimize dependencies (no pandas/pyarrow required)
+- You're working in a constrained environment (e.g., AWS Lambda with size limits)
+- You only need simple row-by-row result processing
+- Memory efficiency is important and results don't need columnar operations
+
+For large-scale data processing or analytical workloads, consider using ArrowCursor or PandasCursor instead.
+
+.. _async-s3fs-cursor:
+
+AsyncS3FSCursor
+---------------
+
+AsyncS3FSCursor is an AsyncCursor that uses the same lightweight CSV parsing as S3FSCursor.
+This cursor is useful when you need to execute queries asynchronously without pandas or pyarrow dependencies.
+
+You can use the AsyncS3FSCursor by specifying the ``cursor_class``
+with the connect method or connection object.
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.async_cursor import AsyncS3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2",
+                     cursor_class=AsyncS3FSCursor).cursor()
+
+.. code:: python
+
+    from pyathena.connection import Connection
+    from pyathena.s3fs.async_cursor import AsyncS3FSCursor
+
+    cursor = Connection(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                        region_name="us-west-2",
+                        cursor_class=AsyncS3FSCursor).cursor()
+
+It can also be used by specifying the cursor class when calling the connection object's cursor method.
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.async_cursor import AsyncS3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2").cursor(AsyncS3FSCursor)
+
+.. code:: python
+
+    from pyathena.connection import Connection
+    from pyathena.s3fs.async_cursor import AsyncS3FSCursor
+
+    cursor = Connection(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                        region_name="us-west-2").cursor(AsyncS3FSCursor)
+
+The default number of workers is 5 or cpu number * 5.
+If you want to change the number of workers you can specify like the following.
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.async_cursor import AsyncS3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2",
+                     cursor_class=AsyncS3FSCursor).cursor(max_workers=10)
+
+The execute method of the AsyncS3FSCursor returns the tuple of the query ID and the `future object`_.
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.async_cursor import AsyncS3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2",
+                     cursor_class=AsyncS3FSCursor).cursor()
+
+    query_id, future = cursor.execute("SELECT * FROM many_rows")
+
+The return value of the `future object`_ is an ``AthenaS3FSResultSet`` object.
+This object has an interface similar to ``AthenaResultSetObject``.
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.async_cursor import AsyncS3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2",
+                     cursor_class=AsyncS3FSCursor).cursor()
+
+    query_id, future = cursor.execute("SELECT * FROM many_rows")
+    result_set = future.result()
+    print(result_set.state)
+    print(result_set.state_change_reason)
+    print(result_set.completion_date_time)
+    print(result_set.submission_date_time)
+    print(result_set.data_scanned_in_bytes)
+    print(result_set.engine_execution_time_in_millis)
+    print(result_set.query_queue_time_in_millis)
+    print(result_set.total_execution_time_in_millis)
+    print(result_set.query_planning_time_in_millis)
+    print(result_set.service_processing_time_in_millis)
+    print(result_set.output_location)
+    print(result_set.description)
+    for row in result_set:
+        print(row)
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.async_cursor import AsyncS3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2",
+                     cursor_class=AsyncS3FSCursor).cursor()
+
+    query_id, future = cursor.execute("SELECT * FROM many_rows")
+    result_set = future.result()
+    print(result_set.fetchall())
+
+As with AsyncCursor, you need a query ID to cancel a query.
+
+.. code:: python
+
+    from pyathena import connect
+    from pyathena.s3fs.async_cursor import AsyncS3FSCursor
+
+    cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
+                     region_name="us-west-2",
+                     cursor_class=AsyncS3FSCursor).cursor()
+
+    query_id, future = cursor.execute("SELECT * FROM many_rows")
+    cursor.cancel(query_id)
+
+.. _`future object`: https://docs.python.org/3/library/concurrent.futures.html#future-objects

-Original file line number
+Diff line change
    cursor
    pandas
    arrow
 +   s3fs
    spark
    testing
    api