1 | 1 | # PyAthena Development Guide for AI Assistants |
2 | 2 |
3 | 3 | ## Project Overview |
4 | | -PyAthena is a Python DB API 2.0 (PEP 249) compliant client library for Amazon Athena. It enables Python applications to execute SQL queries against data stored in S3 using AWS Athena's serverless query engine. |
| 4 | +PyAthena is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena. See `pyproject.toml` for Python version support and dependencies. |
5 | 5 |
6 | | -**License**: MIT |
7 | | -**Version**: See `pyathena/__init__.py` |
8 | | -**Python Support**: See `requires-python` in `pyproject.toml` |
9 | | - |
10 | | -## Key Architectural Principles |
11 | | - |
12 | | -### 1. DB API 2.0 Compliance |
13 | | -- Strictly follow PEP 249 specifications for all cursor and connection implementations |
14 | | -- Maintain compatibility with standard Python database usage patterns |
15 | | -- All cursor implementations must support the standard methods: `execute()`, `fetchone()`, `fetchmany()`, `fetchall()`, `close()` |
16 | | - |
17 | | -### 2. Multiple Cursor Types |
18 | | -The project supports different cursor implementations for various use cases: |
19 | | -- **Standard Cursor** (`pyathena.cursor.Cursor`): Basic DB API cursor |
20 | | -- **Async Cursor** (`pyathena.async_cursor.AsyncCursor`): For asynchronous operations |
21 | | -- **Pandas Cursor** (`pyathena.pandas.cursor.PandasCursor`): Returns results as DataFrames |
22 | | -- **Arrow Cursor** (`pyathena.arrow.cursor.ArrowCursor`): Returns results in Apache Arrow format |
23 | | -- **Spark Cursor** (`pyathena.spark.cursor.SparkCursor`): For PySpark integration |
24 | | - |
25 | | -### 3. Type System and Conversion |
26 | | -- Data type conversion is handled in `pyathena/converter.py` |
27 | | -- Custom converters can be registered for specific Athena data types |
28 | | -- Always preserve type safety and handle NULL values appropriately |
29 | | -- Follow the type mapping defined in the converters for each cursor type |
30 | | - |
31 | | -## Development Guidelines |
| 6 | +## Rules and Constraints |
32 | 7 |
33 | 8 | ### Git Workflow |
| 9 | +- **NEVER** commit directly to `master` — always create a feature branch and PR |
| 10 | +- Create PRs as drafts: `gh pr create --draft` (typical flow sketched below)
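| | +
| | +A typical flow, with a descriptive branch name (e.g. `fix/null-handling`, per the project's conventions):
| | +```bash
| | +git checkout -b fix/null-handling   # feature branch, never master
| | +# ...make and commit changes...
| | +git push -u origin fix/null-handling
| | +gh pr create --draft
| | +```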
34 | 11 |
35 | | -**CRITICAL: Never Commit Directly to Master Branch** |
36 | | -- **NEVER** commit directly to the `master` branch |
37 | | -- **ALWAYS** create a feature branch for any changes |
38 | | -- **ALWAYS** create a Pull Request (PR) for review |
39 | | -- Use descriptive branch names (e.g., `feature/add-converter`, `fix/null-handling`) |
40 | | -- Create PRs as drafts using `gh pr create --draft` |
41 | | - |
42 | | -### Code Style and Quality |
43 | | - |
44 | | -#### Import Guidelines |
45 | | -**CRITICAL: Runtime Imports are Prohibited** |
46 | | -- **NEVER** use `import` or `from ... import` statements inside functions, methods, or conditional blocks |
47 | | -- **ALWAYS** place all imports at the top of the file, after the license header and module docstring |
48 | | -- This applies to all files: source code, tests, scripts, documentation examples |
49 | | -- Runtime imports cause issues with static analysis, code completion, dependency tracking, and can mask import errors |
| 12 | +### Import Rules |
| 13 | +- **NEVER** use runtime imports (inside functions, methods, or conditional blocks) |
| 14 | +- All imports must be at the top of the file, after the license header |
| 15 | +- Exception: the existing codebase uses runtime imports for optional dependencies (`pyarrow`, `pandas`, etc.). In new code, prefer `typing.TYPE_CHECKING` when an import is needed only for type hints (see the sketch below)
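| | +
| | +A minimal sketch of the preferred pattern (function and variable names are illustrative):
| | +```python
| | +# Top of the file, after the license header
| | +from __future__ import annotations
| | +
| | +from typing import TYPE_CHECKING
| | +
| | +if TYPE_CHECKING:
| | +    import pandas  # optional dependency, needed only for type hints
| | +
| | +
| | +def row_count(df: pandas.DataFrame) -> int:
| | +    # The annotation is not evaluated at runtime (postponed evaluation)
| | +    return len(df)
| | +```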
50 | 16 |
51 | | -**Bad Examples:** |
52 | | -```python |
53 | | -def my_function(): |
54 | | - from some_module import something # NEVER do this |
55 | | - import os # NEVER do this |
56 | | - if condition: |
57 | | - from optional import feature # NEVER do this |
58 | | -``` |
59 | | - |
60 | | -**Good Examples:** |
61 | | -```python |
62 | | -# At the top of the file, after license header |
63 | | -from __future__ import annotations |
64 | | - |
65 | | -import os |
66 | | -from some_module import something |
67 | | -from typing import Optional |
68 | | - |
69 | | -# Optional dependencies can be handled with TYPE_CHECKING |
70 | | -from typing import TYPE_CHECKING |
71 | | -if TYPE_CHECKING: |
72 | | - from optional import feature |
73 | | - |
74 | | -def my_function(): |
75 | | - # Use imported modules here |
76 | | - return something.process() |
77 | | -``` |
78 | | - |
79 | | -**Exception for Optional Dependencies**: The PyAthena codebase does use runtime imports for optional dependencies like `pyarrow` and `pandas` in the main source code. However, when contributing new code or modifying tests, avoid runtime imports unless absolutely necessary for optional dependency handling. |
80 | | - |
81 | | -#### Commands |
| 17 | +### Code Quality — Always Run Before Committing |
82 | 18 | ```bash |
83 | | -# Format code (auto-fix imports and format) |
84 | | -make fmt |
85 | | - |
86 | | -# Run all checks (lint, format check, type check) |
87 | | -make chk |
88 | | - |
89 | | -# Run tests (includes running checks first) |
90 | | -make test |
91 | | - |
92 | | -# Run SQLAlchemy-specific tests |
93 | | -make test-sqla |
94 | | - |
95 | | -# Run full test suite with tox |
96 | | -make tox |
97 | | - |
98 | | -# Build documentation |
99 | | -make docs |
100 | | -``` |
101 | | - |
102 | | -#### Docstring Style |
103 | | -Use Google style docstrings for all public methods and complex internal methods: |
104 | | - |
105 | | -```python |
106 | | -def method_name(self, param1: str, param2: Optional[int] = None) -> List[str]: |
107 | | - """Brief description of what the method does. |
108 | | -
109 | | - Longer description if needed, explaining the method's behavior, |
110 | | - edge cases, or important details. |
111 | | -
112 | | - Args: |
113 | | - param1: Description of the first parameter. |
114 | | - param2: Description of the optional parameter. |
115 | | -
116 | | - Returns: |
117 | | - Description of the return value. |
118 | | -
119 | | - Raises: |
120 | | - ValueError: When invalid parameters are provided. |
121 | | - """ |
| 19 | +make fmt # Auto-fix formatting and imports |
| 20 | +make chk # Lint + format check + mypy |
122 | 21 | ``` |
123 | 22 |
124 | | -### Testing Requirements |
125 | | - |
126 | | -#### General Guidelines |
127 | | -1. **Unit Tests**: All new features must include unit tests |
128 | | -2. **Integration Tests**: Test actual AWS Athena interactions when modifying query execution logic |
129 | | -3. **SQLAlchemy Compliance**: Ensure SQLAlchemy dialect tests pass when modifying dialect code |
130 | | -4. **Mock AWS Services**: Use `moto` or similar for testing AWS interactions without real resources |
131 | | -5. **LINT First**: **ALWAYS** run `make chk` before running tests - ensure code passes all quality checks first |
132 | | - |
133 | | -#### Local Testing Environment |
134 | | -To run tests locally, you need to set the following environment variables: |
135 | | - |
| 23 | +### Testing |
136 | 24 | ```bash |
137 | | -export AWS_DEFAULT_REGION=<your-region> |
138 | | -export AWS_ATHENA_S3_STAGING_DIR=s3://<your-bucket>/<path>/ |
139 | | -export AWS_ATHENA_WORKGROUP=<your-workgroup> |
140 | | -export AWS_ATHENA_SPARK_WORKGROUP=<your-spark-workgroup> |
| 25 | +# ALWAYS run `make chk` first — tests will fail if lint doesn't pass |
| 26 | +make test # Unit tests (runs chk first) |
| 27 | +make test-sqla # SQLAlchemy dialect tests |
141 | 28 | ``` |
142 | 29 |
143 | | -**Using .env file (Recommended)**: |
144 | | -Create a `.env` file in the project root (already in `.gitignore`) with your AWS settings, then load it before running tests: |
145 | | - |
| 30 | +Tests require AWS environment variables. Use a `.env` file (gitignored): |
146 | 31 | ```bash |
147 | | -# Load .env and run tests |
148 | | -export $(cat .env | xargs) && uv run pytest tests/pyathena/test_file.py -v |
| 32 | +AWS_DEFAULT_REGION=<region> |
| 33 | +AWS_ATHENA_S3_STAGING_DIR=s3://<bucket>/<path>/ |
| 34 | +AWS_ATHENA_WORKGROUP=<workgroup> |
| 35 | +AWS_ATHENA_SPARK_WORKGROUP=<spark-workgroup> |
149 | 36 | ``` |
150 | | - |
151 | | -**CRITICAL: Pre-test Requirements** |
152 | 37 | ```bash |
153 | | -# ALWAYS run quality checks first - tests will fail if code doesn't pass lint |
154 | | -make chk |
155 | | - |
156 | | -# Only after lint passes, install dependencies and run tests |
157 | | -uv sync |
158 | 38 | export $(cat .env | xargs) && uv run pytest tests/pyathena/test_file.py -v |
159 | 39 | ``` |
160 | 40 |
161 | | -#### Writing Tests |
162 | | -- Place tests in `tests/pyathena/` mirroring the source structure |
163 | | -- Use pytest fixtures for common setup (see `conftest.py`) |
164 | | -- Test both success and error cases |
165 | | -- For filesystem operations, test edge cases like empty results, missing files, etc. |
166 | | - |
167 | | -Example test structure: |
168 | | -```python |
169 | | -def test_find_maxdepth(self, fs): |
170 | | - """Test find with maxdepth parameter.""" |
171 | | - # Setup test data |
172 | | - dir_ = f"s3://{ENV.s3_staging_bucket}/test_path" |
173 | | - fs.touch(f"{dir_}/file0.txt") |
174 | | - fs.touch(f"{dir_}/level1/file1.txt") |
175 | | - |
176 | | - # Test maxdepth=0 |
177 | | - result = fs.find(dir_, maxdepth=0) |
178 | | - assert len(result) == 1 |
179 | | - assert fs._strip_protocol(f"{dir_}/file0.txt") in result |
180 | | - |
181 | | - # Test edge cases and error conditions |
182 | | - with pytest.raises(ValueError): |
183 | | - fs.find("s3://", maxdepth=0) |
184 | | -``` |
185 | | - |
186 | | -#### Test Organization |
187 | | -- Group related tests in classes (e.g., `TestS3FileSystem`) |
188 | | -- Use descriptive test names that explain what is being tested |
189 | | -- Keep tests focused and independent |
190 | | -- Clean up test data after each test when using real AWS resources |
191 | | - |
192 | | -### Common Development Tasks |
193 | | - |
194 | | -#### Adding a New Feature |
195 | | -1. Check if it aligns with DB API 2.0 specifications |
196 | | -2. Consider impact on all cursor types (standard, pandas, arrow, spark) |
197 | | -3. Update type hints and ensure mypy passes |
198 | | -4. Add comprehensive tests |
199 | | -5. Update documentation if adding public APIs |
200 | | - |
201 | | -#### Modifying Query Execution |
202 | | -- The core query execution logic is in `cursor.py` and `async_cursor.py` |
203 | | -- Always handle query cancellation properly (SIGINT should cancel running queries) |
204 | | -- Respect the `kill_on_interrupt` parameter |
205 | | -- Maintain compatibility with Athena engine versions 2 and 3 |
206 | | - |
207 | | -#### Working with AWS Services |
208 | | -- All AWS interactions use `boto3` |
209 | | -- Credentials are managed through standard AWS credential chain |
210 | | -- Always handle AWS exceptions appropriately (see `error.py`) |
211 | | -- S3 operations for result retrieval are in `result_set.py` |
212 | | - |
213 | | -### Project Structure Conventions |
214 | | - |
215 | | -``` |
216 | | -pyathena/ |
217 | | -├── {cursor_type}/ # Cursor-specific implementations |
218 | | -│ ├── __init__.py |
219 | | -│ ├── cursor.py # Cursor implementation |
220 | | -│ ├── converter.py # Type converters |
221 | | -│ └── result_set.py # Result handling |
222 | | -│ |
223 | | -├── sqlalchemy/ # SQLAlchemy dialect implementations |
224 | | -│ ├── base.py # Base dialect |
225 | | -│ ├── {dialect}.py # Specific dialects (rest, pandas, arrow) |
226 | | -│ └── requirements.py # SQLAlchemy requirements |
227 | | -│ |
228 | | -└── filesystem/ # S3 filesystem abstractions |
229 | | - ├── s3.py # S3FileSystem implementation (fsspec compatible) |
230 | | - └── s3_object.py # S3 object representations |
231 | | -``` |
232 | | - |
233 | | -### Important Implementation Details |
234 | | - |
235 | | -#### Parameter Formatting |
236 | | -- Two parameter styles supported: `pyformat` (default) and `qmark` |
237 | | -- Parameter formatting logic in `formatter.py` |
238 | | -- PyFormat: `%(name)s` style |
239 | | -- Qmark: `?` style |
240 | | -- Always escape special characters in parameter values |
241 | | - |
242 | | -#### Result Set Handling |
243 | | -- Results are typically staged in S3 (configured via `s3_staging_dir`) |
244 | | -- Large result sets should be streamed, not loaded entirely into memory |
245 | | -- Different result set implementations for different data formats (CSV, JSON, Parquet) |
246 | | - |
247 | | -#### Error Handling |
248 | | -- All exceptions inherit from `pyathena.error.Error` |
249 | | -- Follow DB API 2.0 exception hierarchy |
250 | | -- Provide meaningful error messages that include Athena query IDs when available |
251 | | - |
252 | | -#### S3 FileSystem Operations |
253 | | -- `S3FileSystem` implements fsspec's `AbstractFileSystem` interface |
254 | | -- Key methods include `ls()`, `find()`, `get()`, `put()`, `rm()`, etc. |
255 | | -- `find()` method supports: |
256 | | - - `maxdepth`: Limits directory traversal depth (uses recursive approach for efficiency) |
257 | | - - `withdirs`: Controls whether directories are included in results (default: False) |
258 | | -- Cache management uses `(path, delimiter)` as key to handle different listing modes |
259 | | -- Always extract reusable logic into helper methods (e.g., `_extract_parent_directories()`) |
| 41 | +- Tests mirror source structure under `tests/pyathena/` |
| 42 | +- Use pytest fixtures from `conftest.py` |
| 43 | +- New features require tests; changes to SQLAlchemy dialects must pass `make test-sqla` (test sketch below)
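| | +
| | +A condensed test sketch in the project's style (`fs` is a pytest fixture from `conftest.py`; `ENV` comes from the test helpers):
| | +```python
| | +import pytest
| | +
| | +
| | +class TestS3FileSystem:
| | +    def test_find_maxdepth(self, fs):
| | +        """Test find with maxdepth parameter."""
| | +        dir_ = f"s3://{ENV.s3_staging_bucket}/test_path"
| | +        fs.touch(f"{dir_}/file0.txt")
| | +
| | +        assert len(fs.find(dir_, maxdepth=0)) == 1
| | +
| | +        # Cover error cases, not just the happy path
| | +        with pytest.raises(ValueError):
| | +            fs.find("s3://", maxdepth=0)
| | +```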
260 | 44 |
261 | | -When implementing filesystem methods: |
262 | | -1. **Consider s3fs compatibility** - Many users migrate from s3fs, so matching its behavior is important |
263 | | -2. **Optimize for S3's API** - Use delimiter="/" for recursive operations to minimize API calls |
264 | | -3. **Handle edge cases** - Empty paths, trailing slashes, bucket-only paths |
265 | | -4. **Test with real S3** - Mock tests may not catch S3-specific behaviors |
| 45 | +## Architecture — Key Design Decisions |
266 | 46 |
267 | | -### Performance Considerations |
268 | | -1. **Result Caching**: Utilize Athena's result reuse feature (engine v3) when possible |
269 | | -2. **Batch Operations**: Support `executemany()` for bulk operations |
270 | | -3. **Memory Efficiency**: Stream large results instead of loading all into memory |
271 | | -4. **Connection Pooling**: Connections are relatively lightweight, but avoid creating excessive connections |
| 47 | +These are non-obvious conventions that can't be discovered by reading code alone. |
272 | 48 |
273 | | -### Security Best Practices |
274 | | -1. **Never log sensitive data** (credentials, query results with PII) |
275 | | -2. **Support encryption** (SSE-S3, SSE-KMS, CSE-KMS) for S3 operations |
276 | | -3. **Validate and sanitize** all user inputs, especially in query construction |
277 | | -4. **Use parameterized queries** to prevent SQL injection |
| 49 | +### PEP 249 Compliance |
| 50 | +All cursor types must implement: `execute()`, `fetchone()`, `fetchmany()`, `fetchall()`, `close()`. New cursor features must follow the DB API 2.0 specification. |
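| | +
| | +For reference, the standard round trip every cursor type supports (staging dir and region are placeholders):
| | +```python
| | +from pyathena import connect
| | +
| | +conn = connect(s3_staging_dir="s3://your-bucket/path/", region_name="us-east-1")
| | +cursor = conn.cursor()
| | +cursor.execute("SELECT 1")
| | +print(cursor.fetchall())
| | +cursor.close()
| | +conn.close()
| | +```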
278 | 51 |
279 | | -### Debugging Tips |
280 | | -1. Enable debug logging: `logging.getLogger("pyathena").setLevel(logging.DEBUG)` |
281 | | -2. Check Athena query history in AWS Console for failed queries |
282 | | -3. Verify S3 permissions for both staging directory and data access |
283 | | -4. Use `EXPLAIN` or `SHOW` statements to debug query plans |
| 52 | +### Cursor Module Pattern |
| 53 | +Each cursor type lives in its own subpackage (`pandas/`, `arrow/`, `polars/`, `s3fs/`, `spark/`) with a consistent structure: `cursor.py`, `async_cursor.py`, `converter.py`, `result_set.py`. When adding features, consider impact on all cursor types. |
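| | +
| | +Selecting an implementation is uniform across types; a sketch using the pandas cursor (placeholders as above):
| | +```python
| | +from pyathena import connect
| | +from pyathena.pandas.cursor import PandasCursor
| | +
| | +conn = connect(s3_staging_dir="s3://your-bucket/path/", region_name="us-east-1")
| | +cursor = conn.cursor(PandasCursor)
| | +df = cursor.execute("SELECT 1 AS col").as_pandas()  # DataFrame result
| | +```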
284 | 54 |
285 | | -### Common Pitfalls to Avoid |
286 | | -1. Don't assume all Athena data types map directly to Python types |
287 | | -2. Remember that Athena queries are asynchronous - always wait for completion |
288 | | -3. Handle the case where S3 results might be deleted or inaccessible |
289 | | -4. Don't forget to close cursors and connections to clean up resources |
290 | | -5. Be aware of Athena service quotas and rate limits |
| 55 | +### Filesystem (fsspec) Compatibility |
| 56 | +`pyathena/filesystem/s3.py` implements fsspec's `AbstractFileSystem` (usage sketched after this list). When modifying:
| 57 | +- Match `s3fs` library behavior where possible (users migrate from it) |
| 58 | +- Use `delimiter="/"` in S3 API calls to minimize requests |
| 59 | +- Handle edge cases: empty paths, trailing slashes, bucket-only paths |
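| | +
| | +A minimal usage sketch (assumes the standard AWS credential chain; the bucket path is a placeholder):
| | +```python
| | +from pyathena.filesystem.s3 import S3FileSystem
| | +
| | +fs = S3FileSystem()
| | +files = fs.ls("s3://your-bucket/path/")  # fsspec-style calls: ls/find/get/put/rm
| | +```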
291 | 60 |
292 | | -### Release Process |
293 | | -1. Update version in `pyathena/__init__.py` |
294 | | -2. Ensure all tests pass |
295 | | -3. Create a git tag for the release |
296 | | -4. Build and publish to PyPI |
| 61 | +### Version Management |
| 62 | +Versions are derived from git tags via `hatch-vcs` — never edit `pyathena/_version.py` manually. |
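| | +
| | +To check the resolved version (assuming `__version__` is re-exported from `pyathena/__init__.py`):
| | +```bash
| | +python -c "import pyathena; print(pyathena.__version__)"
| | +```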
297 | 63 |
298 | | -## Contact and Resources |
299 | | -- **Repository**: https://github.com/laughingman7743/PyAthena |
300 | | -- **Documentation**: https://laughingman7743.github.io/PyAthena/ |
301 | | -- **Issues**: Report bugs or request features via GitHub Issues |
302 | | -- **AWS Athena Docs**: https://docs.aws.amazon.com/athena/ |
| 64 | +### Google-style Docstrings |
| 65 | +Use Google-style docstrings for public methods and complex internal methods (see existing code for more examples). A condensed template:
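| | +```python
| | +def method_name(self, param1: str, param2: Optional[int] = None) -> List[str]:
| | +    """Brief description of what the method does.
| | +
| | +    Args:
| | +        param1: Description of the first parameter.
| | +        param2: Description of the optional parameter.
| | +
| | +    Returns:
| | +        Description of the return value.
| | +
| | +    Raises:
| | +        ValueError: When invalid parameters are provided.
| | +    """
| | +```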