Commit ac96ba5

Merge pull request #660 from laughingman7743/docs/update-claude-md
2 parents fdfdb24 + 78b8fd7 commit ac96ba5

1 file changed: +38 −275 lines


CLAUDE.md

Lines changed: 38 additions & 275 deletions
@@ -1,302 +1,65 @@
 # PyAthena Development Guide for AI Assistants
 
 ## Project Overview
-PyAthena is a Python DB API 2.0 (PEP 249) compliant client library for Amazon Athena. It enables Python applications to execute SQL queries against data stored in S3 using AWS Athena's serverless query engine.
+PyAthena is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena. See `pyproject.toml` for Python version support and dependencies.
 
-**License**: MIT
-**Version**: See `pyathena/__init__.py`
-**Python Support**: See `requires-python` in `pyproject.toml`
-
-## Key Architectural Principles
-
-### 1. DB API 2.0 Compliance
-- Strictly follow PEP 249 specifications for all cursor and connection implementations
-- Maintain compatibility with standard Python database usage patterns
-- All cursor implementations must support the standard methods: `execute()`, `fetchone()`, `fetchmany()`, `fetchall()`, `close()`
-
-### 2. Multiple Cursor Types
-The project supports different cursor implementations for various use cases:
-- **Standard Cursor** (`pyathena.cursor.Cursor`): Basic DB API cursor
-- **Async Cursor** (`pyathena.async_cursor.AsyncCursor`): For asynchronous operations
-- **Pandas Cursor** (`pyathena.pandas.cursor.PandasCursor`): Returns results as DataFrames
-- **Arrow Cursor** (`pyathena.arrow.cursor.ArrowCursor`): Returns results in Apache Arrow format
-- **Spark Cursor** (`pyathena.spark.cursor.SparkCursor`): For PySpark integration
-
-### 3. Type System and Conversion
-- Data type conversion is handled in `pyathena/converter.py`
-- Custom converters can be registered for specific Athena data types
-- Always preserve type safety and handle NULL values appropriately
-- Follow the type mapping defined in the converters for each cursor type
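The deleted type-conversion notes can be illustrated with a minimal sketch (a hypothetical conversion callable in the spirit of `converter.py`, not PyAthena's actual API):

```python
from datetime import date
from typing import Optional


def convert_date(value: Optional[str]) -> Optional[date]:
    """Convert an Athena DATE string to datetime.date, passing NULLs through as None."""
    return date.fromisoformat(value) if value is not None else None


print(convert_date("2024-01-15"))  # 2024-01-15
print(convert_date(None))          # None
```

Per the deleted notes, the real converters map Athena type names to callables like this one; see `pyathena/converter.py` for the actual mappings.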
-
-## Development Guidelines
+## Rules and Constraints
 
 ### Git Workflow
+- **NEVER** commit directly to `master` — always create a feature branch and PR
+- Create PRs as drafts: `gh pr create --draft`
 
-**CRITICAL: Never Commit Directly to Master Branch**
-- **NEVER** commit directly to the `master` branch
-- **ALWAYS** create a feature branch for any changes
-- **ALWAYS** create a Pull Request (PR) for review
-- Use descriptive branch names (e.g., `feature/add-converter`, `fix/null-handling`)
-- Create PRs as drafts using `gh pr create --draft`
-
-### Code Style and Quality
-
-#### Import Guidelines
-**CRITICAL: Runtime Imports are Prohibited**
-- **NEVER** use `import` or `from ... import` statements inside functions, methods, or conditional blocks
-- **ALWAYS** place all imports at the top of the file, after the license header and module docstring
-- This applies to all files: source code, tests, scripts, documentation examples
-- Runtime imports cause issues with static analysis, code completion, dependency tracking, and can mask import errors
+### Import Rules
+- **NEVER** use runtime imports (inside functions, methods, or conditional blocks)
+- All imports must be at the top of the file, after the license header
+- Exception: the existing codebase uses runtime imports for optional dependencies (`pyarrow`, `pandas`, etc.) in source code. For new code, use `TYPE_CHECKING` instead when possible
 
-**Bad Examples:**
-```python
-def my_function():
-    from some_module import something  # NEVER do this
-    import os  # NEVER do this
-    if condition:
-        from optional import feature  # NEVER do this
-```
-
-**Good Examples:**
-```python
-# At the top of the file, after license header
-from __future__ import annotations
-
-import os
-from some_module import something
-from typing import Optional
-
-# Optional dependencies can be handled with TYPE_CHECKING
-from typing import TYPE_CHECKING
-if TYPE_CHECKING:
-    from optional import feature
-
-def my_function():
-    # Use imported modules here
-    return something.process()
-```
-
-**Exception for Optional Dependencies**: The PyAthena codebase does use runtime imports for optional dependencies like `pyarrow` and `pandas` in the main source code. However, when contributing new code or modifying tests, avoid runtime imports unless absolutely necessary for optional dependency handling.
-
-#### Commands
+### Code Quality — Always Run Before Committing
 ```bash
-# Format code (auto-fix imports and format)
-make fmt
-
-# Run all checks (lint, format check, type check)
-make chk
-
-# Run tests (includes running checks first)
-make test
-
-# Run SQLAlchemy-specific tests
-make test-sqla
-
-# Run full test suite with tox
-make tox
-
-# Build documentation
-make docs
-```
-
-#### Docstring Style
-Use Google style docstrings for all public methods and complex internal methods:
-
-```python
-def method_name(self, param1: str, param2: Optional[int] = None) -> List[str]:
-    """Brief description of what the method does.
-
-    Longer description if needed, explaining the method's behavior,
-    edge cases, or important details.
-
-    Args:
-        param1: Description of the first parameter.
-        param2: Description of the optional parameter.
-
-    Returns:
-        Description of the return value.
-
-    Raises:
-        ValueError: When invalid parameters are provided.
-    """
+make fmt  # Auto-fix formatting and imports
+make chk  # Lint + format check + mypy
 ```
 
-### Testing Requirements
-
-#### General Guidelines
-1. **Unit Tests**: All new features must include unit tests
-2. **Integration Tests**: Test actual AWS Athena interactions when modifying query execution logic
-3. **SQLAlchemy Compliance**: Ensure SQLAlchemy dialect tests pass when modifying dialect code
-4. **Mock AWS Services**: Use `moto` or similar for testing AWS interactions without real resources
-5. **LINT First**: **ALWAYS** run `make chk` before running tests - ensure code passes all quality checks first
-
-#### Local Testing Environment
-To run tests locally, you need to set the following environment variables:
-
+### Testing
 ```bash
-export AWS_DEFAULT_REGION=<your-region>
-export AWS_ATHENA_S3_STAGING_DIR=s3://<your-bucket>/<path>/
-export AWS_ATHENA_WORKGROUP=<your-workgroup>
-export AWS_ATHENA_SPARK_WORKGROUP=<your-spark-workgroup>
+# ALWAYS run `make chk` first — tests will fail if lint doesn't pass
+make test  # Unit tests (runs chk first)
+make test-sqla  # SQLAlchemy dialect tests
 ```
 
-**Using .env file (Recommended)**:
-Create a `.env` file in the project root (already in `.gitignore`) with your AWS settings, then load it before running tests:
-
+Tests require AWS environment variables. Use a `.env` file (gitignored):
 ```bash
-# Load .env and run tests
-export $(cat .env | xargs) && uv run pytest tests/pyathena/test_file.py -v
+AWS_DEFAULT_REGION=<region>
+AWS_ATHENA_S3_STAGING_DIR=s3://<bucket>/<path>/
+AWS_ATHENA_WORKGROUP=<workgroup>
+AWS_ATHENA_SPARK_WORKGROUP=<spark-workgroup>
 ```
-
-**CRITICAL: Pre-test Requirements**
 ```bash
-# ALWAYS run quality checks first - tests will fail if code doesn't pass lint
-make chk
-
-# Only after lint passes, install dependencies and run tests
-uv sync
 export $(cat .env | xargs) && uv run pytest tests/pyathena/test_file.py -v
 ```
 
-#### Writing Tests
-- Place tests in `tests/pyathena/` mirroring the source structure
-- Use pytest fixtures for common setup (see `conftest.py`)
-- Test both success and error cases
-- For filesystem operations, test edge cases like empty results, missing files, etc.
-
-Example test structure:
-```python
-def test_find_maxdepth(self, fs):
-    """Test find with maxdepth parameter."""
-    # Setup test data
-    dir_ = f"s3://{ENV.s3_staging_bucket}/test_path"
-    fs.touch(f"{dir_}/file0.txt")
-    fs.touch(f"{dir_}/level1/file1.txt")
-
-    # Test maxdepth=0
-    result = fs.find(dir_, maxdepth=0)
-    assert len(result) == 1
-    assert fs._strip_protocol(f"{dir_}/file0.txt") in result
-
-    # Test edge cases and error conditions
-    with pytest.raises(ValueError):
-        fs.find("s3://", maxdepth=0)
-```
-
-#### Test Organization
-- Group related tests in classes (e.g., `TestS3FileSystem`)
-- Use descriptive test names that explain what is being tested
-- Keep tests focused and independent
-- Clean up test data after each test when using real AWS resources
-
-### Common Development Tasks
-
-#### Adding a New Feature
-1. Check if it aligns with DB API 2.0 specifications
-2. Consider impact on all cursor types (standard, pandas, arrow, spark)
-3. Update type hints and ensure mypy passes
-4. Add comprehensive tests
-5. Update documentation if adding public APIs
-
-#### Modifying Query Execution
-- The core query execution logic is in `cursor.py` and `async_cursor.py`
-- Always handle query cancellation properly (SIGINT should cancel running queries)
-- Respect the `kill_on_interrupt` parameter
-- Maintain compatibility with Athena engine versions 2 and 3
-
-#### Working with AWS Services
-- All AWS interactions use `boto3`
-- Credentials are managed through standard AWS credential chain
-- Always handle AWS exceptions appropriately (see `error.py`)
-- S3 operations for result retrieval are in `result_set.py`
-
-### Project Structure Conventions
-
-```
-pyathena/
-├── {cursor_type}/          # Cursor-specific implementations
-│   ├── __init__.py
-│   ├── cursor.py           # Cursor implementation
-│   ├── converter.py        # Type converters
-│   └── result_set.py       # Result handling
-
-├── sqlalchemy/             # SQLAlchemy dialect implementations
-│   ├── base.py             # Base dialect
-│   ├── {dialect}.py        # Specific dialects (rest, pandas, arrow)
-│   └── requirements.py     # SQLAlchemy requirements
-
-└── filesystem/             # S3 filesystem abstractions
-    ├── s3.py               # S3FileSystem implementation (fsspec compatible)
-    └── s3_object.py        # S3 object representations
-```
-
-### Important Implementation Details
-
-#### Parameter Formatting
-- Two parameter styles supported: `pyformat` (default) and `qmark`
-- Parameter formatting logic in `formatter.py`
-- PyFormat: `%(name)s` style
-- Qmark: `?` style
-- Always escape special characters in parameter values
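As a toy illustration of the escaping bullet in the deleted section (a hypothetical helper, not the actual `formatter.py` logic):

```python
def format_pyformat(operation: str, parameters: dict) -> str:
    """Substitute %(name)s placeholders, doubling single quotes in string values."""
    escaped = {}
    for key, value in parameters.items():
        if isinstance(value, str):
            escaped[key] = "'" + value.replace("'", "''") + "'"
        else:
            escaped[key] = str(value)
    return operation % escaped


sql = format_pyformat("SELECT * FROM t WHERE name = %(name)s", {"name": "O'Brien"})
print(sql)  # SELECT * FROM t WHERE name = 'O''Brien'
```

The real formatter handles many more types (dates, decimals, sequences); the point is only that string parameters are never interpolated unescaped.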
-
-#### Result Set Handling
-- Results are typically staged in S3 (configured via `s3_staging_dir`)
-- Large result sets should be streamed, not loaded entirely into memory
-- Different result set implementations for different data formats (CSV, JSON, Parquet)
-
-#### Error Handling
-- All exceptions inherit from `pyathena.error.Error`
-- Follow DB API 2.0 exception hierarchy
-- Provide meaningful error messages that include Athena query IDs when available
-
-#### S3 FileSystem Operations
-- `S3FileSystem` implements fsspec's `AbstractFileSystem` interface
-- Key methods include `ls()`, `find()`, `get()`, `put()`, `rm()`, etc.
-- `find()` method supports:
-  - `maxdepth`: Limits directory traversal depth (uses recursive approach for efficiency)
-  - `withdirs`: Controls whether directories are included in results (default: False)
-- Cache management uses `(path, delimiter)` as key to handle different listing modes
-- Always extract reusable logic into helper methods (e.g., `_extract_parent_directories()`)
+- Tests mirror source structure under `tests/pyathena/`
+- Use pytest fixtures from `conftest.py`
+- New features require tests; changes to SQLAlchemy dialects must pass `make test-sqla`
 
-When implementing filesystem methods:
-1. **Consider s3fs compatibility** - Many users migrate from s3fs, so matching its behavior is important
-2. **Optimize for S3's API** - Use delimiter="/" for recursive operations to minimize API calls
-3. **Handle edge cases** - Empty paths, trailing slashes, bucket-only paths
-4. **Test with real S3** - Mock tests may not catch S3-specific behaviors
+## Architecture — Key Design Decisions
 
-### Performance Considerations
-1. **Result Caching**: Utilize Athena's result reuse feature (engine v3) when possible
-2. **Batch Operations**: Support `executemany()` for bulk operations
-3. **Memory Efficiency**: Stream large results instead of loading all into memory
-4. **Connection Pooling**: Connections are relatively lightweight, but avoid creating excessive connections
+These are non-obvious conventions that can't be discovered by reading code alone.
 
-### Security Best Practices
-1. **Never log sensitive data** (credentials, query results with PII)
-2. **Support encryption** (SSE-S3, SSE-KMS, CSE-KMS) for S3 operations
-3. **Validate and sanitize** all user inputs, especially in query construction
-4. **Use parameterized queries** to prevent SQL injection
+### PEP 249 Compliance
+All cursor types must implement: `execute()`, `fetchone()`, `fetchmany()`, `fetchall()`, `close()`. New cursor features must follow the DB API 2.0 specification.
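That required surface is the standard PEP 249 call pattern; here it is sketched with stdlib `sqlite3` as a stand-in driver, since any compliant cursor (PyAthena's included) exposes the same methods:

```python
import sqlite3  # stands in for any PEP 249 driver

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.execute("INSERT INTO t VALUES (?)", (1,))
cur.execute("SELECT x FROM t")
row = cur.fetchone()   # first row as a tuple: (1,)
rest = cur.fetchall()  # remaining rows: []
cur.close()
conn.close()
```

Code written against this interface keeps working when the driver is swapped, which is the point of the compliance rule.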
 
-### Debugging Tips
-1. Enable debug logging: `logging.getLogger("pyathena").setLevel(logging.DEBUG)`
-2. Check Athena query history in AWS Console for failed queries
-3. Verify S3 permissions for both staging directory and data access
-4. Use `EXPLAIN` or `SHOW` statements to debug query plans
+### Cursor Module Pattern
+Each cursor type lives in its own subpackage (`pandas/`, `arrow/`, `polars/`, `s3fs/`, `spark/`) with a consistent structure: `cursor.py`, `async_cursor.py`, `converter.py`, `result_set.py`. When adding features, consider impact on all cursor types.
 
-### Common Pitfalls to Avoid
-1. Don't assume all Athena data types map directly to Python types
-2. Remember that Athena queries are asynchronous - always wait for completion
-3. Handle the case where S3 results might be deleted or inaccessible
-4. Don't forget to close cursors and connections to clean up resources
-5. Be aware of Athena service quotas and rate limits
+### Filesystem (fsspec) Compatibility
+`pyathena/filesystem/s3.py` implements fsspec's `AbstractFileSystem`. When modifying:
+- Match `s3fs` library behavior where possible (users migrate from it)
+- Use `delimiter="/"` in S3 API calls to minimize requests
+- Handle edge cases: empty paths, trailing slashes, bucket-only paths
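A sketch of the edge-case bullet (a hypothetical helper, not the real `S3FileSystem` code): normalize paths so trailing slashes and bucket-only forms compare consistently:

```python
def normalize_path(path: str) -> str:
    """Strip the protocol and any trailing slash so 's3://bucket/' and
    's3://bucket' resolve to the same bucket-only path."""
    if path.startswith("s3://"):
        path = path[len("s3://"):]
    return path.rstrip("/")


print(normalize_path("s3://my-bucket/prefix/"))  # my-bucket/prefix
print(normalize_path("s3://my-bucket"))          # my-bucket
```

fsspec filesystems do this kind of normalization in `_strip_protocol`; the helper above only illustrates why the edge cases need an explicit rule.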
 
-### Release Process
-1. Update version in `pyathena/__init__.py`
-2. Ensure all tests pass
-3. Create a git tag for the release
-4. Build and publish to PyPI
+### Version Management
+Versions are derived from git tags via `hatch-vcs` — never edit `pyathena/_version.py` manually.
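A typical `hatch-vcs` arrangement looks like the following `pyproject.toml` fragment (illustrative only; the repository's actual configuration may differ):

```toml
[build-system]
requires = ["hatchling", "hatch-vcs"]
build-backend = "hatchling.build"

[tool.hatch.version]
source = "vcs"

[tool.hatch.build.hooks.vcs]
version-file = "pyathena/_version.py"
```

With this setup the version file is generated at build time from the latest git tag, which is why hand-editing it is pointless.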
 
-## Contact and Resources
-- **Repository**: https://github.com/laughingman7743/PyAthena
-- **Documentation**: https://laughingman7743.github.io/PyAthena/
-- **Issues**: Report bugs or request features via GitHub Issues
-- **AWS Athena Docs**: https://docs.aws.amazon.com/athena/
+### Google-style Docstrings
+Use Google-style docstrings for public methods. See existing code for examples.
