
Commit 78b8fd7

Rewrite CLAUDE.md for maintainability
Reduce from 357 lines to 66 lines by removing information that is either discoverable from code or generic best practices.

Keep only:
- Rules and constraints (git workflow, import rules, quality checks)
- Testing setup (env vars, commands)
- Non-obvious architectural decisions (PEP 249, cursor module pattern, fsspec compatibility, version management)

Remove: detailed project structure tree, file-by-file listings, parameter formatting internals, generic security/debugging/performance tips, docstring examples, detailed release process, contact section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent c022c26 commit 78b8fd7

File tree

1 file changed

CLAUDE.md: 38 additions & 329 deletions
@@ -1,356 +1,65 @@
 # PyAthena Development Guide for AI Assistants
 
 ## Project Overview
-PyAthena is a Python DB API 2.0 (PEP 249) compliant client library for Amazon Athena. It enables Python applications to execute SQL queries against data stored in S3 using AWS Athena's serverless query engine.
+PyAthena is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena. See `pyproject.toml` for Python version support and dependencies.
 
-**License**: MIT
-**Version**: See `pyathena/__init__.py`
-**Python Support**: See `requires-python` in `pyproject.toml`
-
-## Key Architectural Principles
-
-### 1. DB API 2.0 Compliance
-- Strictly follow PEP 249 specifications for all cursor and connection implementations
-- Maintain compatibility with standard Python database usage patterns
-- All cursor implementations must support the standard methods: `execute()`, `fetchone()`, `fetchmany()`, `fetchall()`, `close()`
-
-### 2. Multiple Cursor Types
-The project supports different cursor implementations for various use cases:
-- **Standard Cursor** (`pyathena.cursor.Cursor`): Basic DB API cursor returning tuples
-- **Pandas Cursor** (`pyathena.pandas.cursor.PandasCursor`): Returns results as pandas DataFrames
-- **Arrow Cursor** (`pyathena.arrow.cursor.ArrowCursor`): Returns results in Apache Arrow format
-- **Polars Cursor** (`pyathena.polars.cursor.PolarsCursor`): Returns results as Polars DataFrames
-- **S3FS Cursor** (`pyathena.s3fs.cursor.S3FSCursor`): Lightweight CSV-based cursor using S3 filesystem (no pandas/arrow dependency)
-- **Spark Cursor** (`pyathena.spark.cursor.SparkCursor`): For PySpark integration with Athena Spark workgroups
-
-Each cursor type (except Spark) has a corresponding async variant (e.g., `AsyncCursor`, `AsyncPandasCursor`, `AsyncArrowCursor`, `AsyncPolarsCursor`, `AsyncS3FSCursor`).
-
-### 3. Type System and Conversion
-- Data type conversion is handled in `pyathena/converter.py`
-- Custom converters can be registered for specific Athena data types
-- Always preserve type safety and handle NULL values appropriately
-- Follow the type mapping defined in the converters for each cursor type
-
-## Development Guidelines
+## Rules and Constraints
 
 ### Git Workflow
+- **NEVER** commit directly to `master` — always create a feature branch and PR
+- Create PRs as drafts: `gh pr create --draft`
 
-**CRITICAL: Never Commit Directly to Master Branch**
-- **NEVER** commit directly to the `master` branch
-- **ALWAYS** create a feature branch for any changes
-- **ALWAYS** create a Pull Request (PR) for review
-- Use descriptive branch names (e.g., `feature/add-converter`, `fix/null-handling`)
-- Create PRs as drafts using `gh pr create --draft`
-
-### Code Style and Quality
+### Import Rules
+- **NEVER** use runtime imports (inside functions, methods, or conditional blocks)
+- All imports must be at the top of the file, after the license header
+- Exception: the existing codebase uses runtime imports for optional dependencies (`pyarrow`, `pandas`, etc.) in source code. For new code, use `TYPE_CHECKING` instead when possible
 
-#### Import Guidelines
-**CRITICAL: Runtime Imports are Prohibited**
-- **NEVER** use `import` or `from ... import` statements inside functions, methods, or conditional blocks
-- **ALWAYS** place all imports at the top of the file, after the license header and module docstring
-- This applies to all files: source code, tests, scripts, documentation examples
-- Runtime imports cause issues with static analysis, code completion, dependency tracking, and can mask import errors
-
-**Bad Examples:**
-```python
-def my_function():
-    from some_module import something  # NEVER do this
-    import os  # NEVER do this
-    if condition:
-        from optional import feature  # NEVER do this
-```
-
-**Good Examples:**
-```python
-# At the top of the file, after license header
-from __future__ import annotations
-
-import os
-from some_module import something
-from typing import Optional
-
-# Optional dependencies can be handled with TYPE_CHECKING
-from typing import TYPE_CHECKING
-if TYPE_CHECKING:
-    from optional import feature
-
-def my_function():
-    # Use imported modules here
-    return something.process()
-```
-
-**Exception for Optional Dependencies**: The PyAthena codebase does use runtime imports for optional dependencies like `pyarrow` and `pandas` in the main source code. However, when contributing new code or modifying tests, avoid runtime imports unless absolutely necessary for optional dependency handling.
-
-#### Commands
+### Code Quality — Always Run Before Committing
 ```bash
-# Format code (auto-fix imports and format)
-make fmt
-
-# Run all checks (lint, format check, type check)
-make chk
-
-# Run tests (includes running checks first)
-make test
-
-# Run SQLAlchemy-specific tests
-make test-sqla
-
-# Run full test suite with tox
-make tox
-
-# Build documentation
-make docs
-```
-
-#### Docstring Style
-Use Google style docstrings for all public methods and complex internal methods:
-
-```python
-def method_name(self, param1: str, param2: Optional[int] = None) -> List[str]:
-    """Brief description of what the method does.
-
-    Longer description if needed, explaining the method's behavior,
-    edge cases, or important details.
-
-    Args:
-        param1: Description of the first parameter.
-        param2: Description of the optional parameter.
-
-    Returns:
-        Description of the return value.
-
-    Raises:
-        ValueError: When invalid parameters are provided.
-    """
+make fmt  # Auto-fix formatting and imports
+make chk  # Lint + format check + mypy
 ```
 
-### Testing Requirements
-
-#### General Guidelines
-1. **Unit Tests**: All new features must include unit tests
-2. **Integration Tests**: Test actual AWS Athena interactions when modifying query execution logic
-3. **SQLAlchemy Compliance**: Ensure SQLAlchemy dialect tests pass when modifying dialect code
-4. **Mock AWS Services**: Use `moto` or similar for testing AWS interactions without real resources
-5. **LINT First**: **ALWAYS** run `make chk` before running tests - ensure code passes all quality checks first
-
-#### Local Testing Environment
-To run tests locally, you need to set the following environment variables:
-
+### Testing
 ```bash
-export AWS_DEFAULT_REGION=<your-region>
-export AWS_ATHENA_S3_STAGING_DIR=s3://<your-bucket>/<path>/
-export AWS_ATHENA_WORKGROUP=<your-workgroup>
-export AWS_ATHENA_SPARK_WORKGROUP=<your-spark-workgroup>
+# ALWAYS run `make chk` first — tests will fail if lint doesn't pass
+make test       # Unit tests (runs chk first)
+make test-sqla  # SQLAlchemy dialect tests
 ```
 
-**Using .env file (Recommended)**:
-Create a `.env` file in the project root (already in `.gitignore`) with your AWS settings, then load it before running tests:
-
+Tests require AWS environment variables. Use a `.env` file (gitignored):
 ```bash
-# Load .env and run tests
-export $(cat .env | xargs) && uv run pytest tests/pyathena/test_file.py -v
+AWS_DEFAULT_REGION=<region>
+AWS_ATHENA_S3_STAGING_DIR=s3://<bucket>/<path>/
+AWS_ATHENA_WORKGROUP=<workgroup>
+AWS_ATHENA_SPARK_WORKGROUP=<spark-workgroup>
 ```
-
-**CRITICAL: Pre-test Requirements**
 ```bash
-# ALWAYS run quality checks first - tests will fail if code doesn't pass lint
-make chk
-
-# Only after lint passes, install dependencies and run tests
-uv sync
 export $(cat .env | xargs) && uv run pytest tests/pyathena/test_file.py -v
 ```
 
-#### Writing Tests
-- Place tests in `tests/pyathena/` mirroring the source structure
-- Use pytest fixtures for common setup (see `conftest.py`)
-- Test both success and error cases
-- For filesystem operations, test edge cases like empty results, missing files, etc.
-
-Example test structure:
-```python
-def test_find_maxdepth(self, fs):
-    """Test find with maxdepth parameter."""
-    # Setup test data
-    dir_ = f"s3://{ENV.s3_staging_bucket}/test_path"
-    fs.touch(f"{dir_}/file0.txt")
-    fs.touch(f"{dir_}/level1/file1.txt")
-
-    # Test maxdepth=0
-    result = fs.find(dir_, maxdepth=0)
-    assert len(result) == 1
-    assert fs._strip_protocol(f"{dir_}/file0.txt") in result
-
-    # Test edge cases and error conditions
-    with pytest.raises(ValueError):
-        fs.find("s3://", maxdepth=0)
-```
-
-#### Test Organization
-- Group related tests in classes (e.g., `TestS3FileSystem`)
-- Use descriptive test names that explain what is being tested
-- Keep tests focused and independent
-- Clean up test data after each test when using real AWS resources
-
-### Common Development Tasks
-
-#### Adding a New Feature
-1. Check if it aligns with DB API 2.0 specifications
-2. Consider impact on all cursor types (standard, pandas, arrow, polars, s3fs, spark)
-3. Update type hints and ensure mypy passes
-4. Add comprehensive tests
-5. Update documentation if adding public APIs
-
-#### Modifying Query Execution
-- The core query execution logic is in `cursor.py` and `async_cursor.py`
-- Always handle query cancellation properly (SIGINT should cancel running queries)
-- Respect the `kill_on_interrupt` parameter
-- Maintain compatibility with Athena engine versions 2 and 3
-
-#### Working with AWS Services
-- All AWS interactions use `boto3`
-- Credentials are managed through standard AWS credential chain
-- Always handle AWS exceptions appropriately (see `error.py`)
-- S3 operations for result retrieval are in `result_set.py`
-
-### Project Structure Conventions
-
-```
-pyathena/
-├── __init__.py      # DB API 2.0 globals, connect() entry point
-├── connection.py    # Connection class
-├── cursor.py        # Standard Cursor
-├── async_cursor.py  # Standard AsyncCursor
-├── common.py        # Base cursor classes (BaseCursor, CursorIterator)
-├── converter.py     # Type conversion utilities
-├── formatter.py     # SQL parameter formatting, UNLOAD wrapping
-├── result_set.py    # Base result set handling
-├── model.py         # Data models and enums
-├── error.py         # Exception hierarchy
-├── util.py          # Utility functions
-
-├── pandas/          # Pandas cursor implementation
-│   ├── cursor.py        # PandasCursor
-│   ├── async_cursor.py  # AsyncPandasCursor
-│   ├── converter.py     # Pandas type converters
-│   └── result_set.py    # Pandas result set handling
-
-├── arrow/           # Arrow cursor implementation
-│   ├── cursor.py        # ArrowCursor
-│   ├── async_cursor.py  # AsyncArrowCursor
-│   ├── converter.py     # Arrow type converters
-│   └── result_set.py    # Arrow result set handling
-
-├── polars/          # Polars cursor implementation
-│   ├── cursor.py        # PolarsCursor
-│   ├── async_cursor.py  # AsyncPolarsCursor
-│   ├── converter.py     # Polars type converters
-│   └── result_set.py    # Polars result set handling
-
-├── s3fs/            # S3FS cursor implementation (lightweight CSV reader)
-│   ├── cursor.py        # S3FSCursor
-│   ├── async_cursor.py  # AsyncS3FSCursor
-│   ├── reader.py        # CSV reader implementation
-│   ├── converter.py     # S3FS type converters
-│   └── result_set.py    # S3FS result set handling
-
-├── spark/           # Spark cursor implementation
-│   ├── cursor.py        # SparkCursor
-│   ├── async_cursor.py  # AsyncSparkCursor
-│   └── common.py        # Spark utilities
-
-├── sqlalchemy/      # SQLAlchemy dialect implementations
-│   ├── base.py          # Base AthenaDialect
-│   ├── rest.py          # AthenaRestDialect (standard cursor)
-│   ├── pandas.py        # AthenaPandasDialect
-│   ├── arrow.py         # AthenaArrowDialect
-│   ├── polars.py        # AthenaPolarsDialect
-│   ├── s3fs.py          # AthenaS3FSDialect
-│   ├── compiler.py      # SQL compiler for Athena
-│   ├── types.py         # SQLAlchemy type mappings
-│   ├── preparer.py      # SQL identifier preparer
-│   ├── constants.py     # Dialect constants
-│   ├── util.py          # Dialect utilities
-│   └── requirements.py  # SQLAlchemy compatibility requirements
-
-└── filesystem/      # S3 filesystem abstractions
-    ├── s3.py            # S3FileSystem implementation (fsspec compatible)
-    └── s3_object.py     # S3 object representations
-```
-
-### Important Implementation Details
-
-#### Parameter Formatting
-- Parameter style: `pyformat` (`%(name)s` style) as declared in DB API 2.0 globals
-- Parameter formatting logic in `formatter.py` (`DefaultParameterFormatter`)
-- Uses Presto-style escaping (single quote doubling) for SELECT/WITH/INSERT/UPDATE/MERGE statements
-- Uses Hive-style escaping (backslash-based) for DDL statements (CREATE, DROP, etc.)
-- Always escape special characters in parameter values
-- `Formatter.wrap_unload()` wraps SELECT/WITH queries with UNLOAD for high-performance Parquet/ORC result retrieval
-
-#### Result Set Handling
-- Results are typically staged in S3 (configured via `s3_staging_dir`)
-- Large result sets should be streamed, not loaded entirely into memory
-- Different result set implementations for different data formats (CSV, JSON, Parquet)
-
-#### Error Handling
-- All exceptions inherit from `pyathena.error.Error`
-- Follow DB API 2.0 exception hierarchy
-- Provide meaningful error messages that include Athena query IDs when available
-
-#### S3 FileSystem Operations
-- `S3FileSystem` implements fsspec's `AbstractFileSystem` interface
-- Key methods include `ls()`, `find()`, `get()`, `put()`, `rm()`, etc.
-- `find()` method supports:
-  - `maxdepth`: Limits directory traversal depth (uses recursive approach for efficiency)
-  - `withdirs`: Controls whether directories are included in results (default: False)
-- Cache management uses `(path, delimiter)` as key to handle different listing modes
-- Always extract reusable logic into helper methods (e.g., `_extract_parent_directories()`)
-
-When implementing filesystem methods:
-1. **Consider s3fs compatibility** - Many users migrate from s3fs, so matching its behavior is important
-2. **Optimize for S3's API** - Use delimiter="/" for recursive operations to minimize API calls
-3. **Handle edge cases** - Empty paths, trailing slashes, bucket-only paths
-4. **Test with real S3** - Mock tests may not catch S3-specific behaviors
-
-### Performance Considerations
-1. **Result Caching**: Utilize Athena's result reuse feature (engine v3) when possible
-2. **Batch Operations**: Support `executemany()` for bulk operations
-3. **Memory Efficiency**: Stream large results instead of loading all into memory
-4. **Connection Pooling**: Connections are relatively lightweight, but avoid creating excessive connections
-
-### Security Best Practices
-1. **Never log sensitive data** (credentials, query results with PII)
-2. **Support encryption** (SSE-S3, SSE-KMS, CSE-KMS) for S3 operations
-3. **Validate and sanitize** all user inputs, especially in query construction
-4. **Use parameterized queries** to prevent SQL injection
+- Tests mirror source structure under `tests/pyathena/`
+- Use pytest fixtures from `conftest.py`
+- New features require tests; changes to SQLAlchemy dialects must pass `make test-sqla`
 
-### Debugging Tips
-1. Enable debug logging: `logging.getLogger("pyathena").setLevel(logging.DEBUG)`
-2. Check Athena query history in AWS Console for failed queries
-3. Verify S3 permissions for both staging directory and data access
-4. Use `EXPLAIN` or `SHOW` statements to debug query plans
+## Architecture — Key Design Decisions
 
-### Common Pitfalls to Avoid
-1. Don't assume all Athena data types map directly to Python types
-2. Remember that Athena queries are asynchronous - always wait for completion
-3. Handle the case where S3 results might be deleted or inaccessible
-4. Don't forget to close cursors and connections to clean up resources
-5. Be aware of Athena service quotas and rate limits
+These are non-obvious conventions that can't be discovered by reading code alone.
 
-### Build System and Release Process
+### PEP 249 Compliance
+All cursor types must implement: `execute()`, `fetchone()`, `fetchmany()`, `fetchall()`, `close()`. New cursor features must follow the DB API 2.0 specification.
 
-**Build System**: Hatchling with hatch-vcs for version control system integration.
+### Cursor Module Pattern
+Each cursor type lives in its own subpackage (`pandas/`, `arrow/`, `polars/`, `s3fs/`, `spark/`) with a consistent structure: `cursor.py`, `async_cursor.py`, `converter.py`, `result_set.py`. When adding features, consider impact on all cursor types.
 
-**Version Management**: Versions are automatically derived from git tags via `hatch-vcs`. The generated version file is `pyathena/_version.py` (auto-generated, do not edit manually).
+### Filesystem (fsspec) Compatibility
+`pyathena/filesystem/s3.py` implements fsspec's `AbstractFileSystem`. When modifying:
+- Match `s3fs` library behavior where possible (users migrate from it)
+- Use `delimiter="/"` in S3 API calls to minimize requests
+- Handle edge cases: empty paths, trailing slashes, bucket-only paths
 
-**Release Process**:
-1. Ensure all tests pass
-2. Create a git tag for the release (version is derived from the tag)
-3. Build and publish to PyPI
+### Version Management
+Versions are derived from git tags via `hatch-vcs` — never edit `pyathena/_version.py` manually.
 
-## Contact and Resources
-- **Repository**: https://github.com/laughingman7743/PyAthena
-- **Documentation**: https://laughingman7743.github.io/PyAthena/
-- **Issues**: Report bugs or request features via GitHub Issues
-- **AWS Athena Docs**: https://docs.aws.amazon.com/athena/
+### Google-style Docstrings
+Use Google-style docstrings for public methods. See existing code for examples.
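The `TYPE_CHECKING` pattern that the rewritten import rules point to can be sketched as follows. This is an illustrative example, not PyAthena code: `frame_shape` is a hypothetical helper, and `pandas` stands in for any optional dependency.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers (e.g. mypy); nothing is
    # imported at runtime, so the optional dependency stays optional.
    import pandas


def frame_shape(df: "pandas.DataFrame") -> tuple[int, int]:
    """Return (rows, columns) without importing pandas at runtime."""
    # Duck typing: any object exposing .index and .columns works,
    # which is why the annotation can stay a string forward reference.
    return (len(df.index), len(df.columns))
```

This keeps all imports at the top of the file (per the guide's rule) while avoiding a hard runtime dependency on the optional package.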

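The PEP 249 cursor surface the rewritten guide pins down (`execute()`, `fetchone()`, `fetchmany()`, `fetchall()`, `close()`) is the same shape every DB API 2.0 driver exposes. As a minimal sketch it can be exercised with the stdlib `sqlite3` driver, since a real PyAthena connection would need AWS credentials and an S3 staging directory:

```python
import sqlite3

# Any DB API 2.0 driver presents the same connection/cursor shape;
# sqlite3 is used here only as a stand-in for a live Athena connection.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

cur.execute("SELECT id, name FROM t ORDER BY id")
first = cur.fetchone()   # -> (1, 'a')
pair = cur.fetchmany(2)  # -> [(2, 'b'), (3, 'c')]
cur.close()
conn.close()
```

Note one difference the deleted guide text called out: `sqlite3` uses the `qmark` paramstyle (`?`), whereas PyAthena declares `pyformat` (`%(name)s`-style) parameters.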