Commit 3384cb4

Merge pull request #39 from vertti/feat/row-validation
Add optional row-level validation with Pydantic
2 parents aa61643 + 35f7fe7 commit 3384cb4

19 files changed: +2605 −12 lines changed

.claude/settings.local.json

Lines changed: 9 additions & 0 deletions (new file)

```diff
@@ -0,0 +1,9 @@
+{
+  "permissions": {
+    "allow": [
+      "WebFetch(domain:raw.githubusercontent.com)"
+    ],
+    "deny": [],
+    "ask": []
+  }
+}
```

.github/workflows/main.yml

Lines changed: 7 additions & 1 deletion

```diff
@@ -128,7 +128,7 @@ jobs:
     strategy:
       matrix:
         python-version: ["3.9", "3.13"]
-        scenario: ["pandas-only", "polars-only", "both", "none"]
+        scenario: ["pandas-only", "polars-only", "both", "pandas-no-pydantic", "none"]
     steps:
       - uses: actions/checkout@v3

@@ -164,6 +164,12 @@ jobs:
           WHEEL=$(ls dist/daffy-*.whl | head -n1)
           uv run --no-project --with "pandas>=1.5.1" --with "polars>=1.7.0" --with "$WHEEL" python scripts/test_isolated_deps.py both

+      - name: Test pandas without pydantic scenario
+        if: matrix.scenario == 'pandas-no-pydantic'
+        run: |
+          WHEEL=$(ls dist/daffy-*.whl | head -n1)
+          uv run --no-project --with "pandas>=1.5.1" --with "$WHEEL" python scripts/test_isolated_deps.py pandas-no-pydantic
+
       - name: Test no libraries scenario (expected to fail gracefully)
         if: matrix.scenario == 'none'
         run: |
```

CHANGELOG.md

Lines changed: 10 additions & 0 deletions

```diff
@@ -2,6 +2,16 @@
 
 All notable changes to this project will be documented in this file.
 
+## 0.17.0
+
+- Add optional row-level validation using Pydantic models (requires Pydantic >= 2.4.0)
+- New `row_validator` parameter for `@df_in` and `@df_out` decorators
+- Validates actual data values, not just column structure
+- Batch validation for optimal performance (10-100x faster than row-by-row)
+- Informative error messages showing which rows failed and why
+- Configuration via `pyproject.toml`: `row_validation_max_errors` and `row_validation_convert_nans`
+- Works with both Pandas and Polars DataFrames
+
 ## 0.16.1
 
 - Internal refactoring: extracted DataFrame type handling to dedicated module for better code organization and maintainability
```
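The changelog's error-reporting behavior (report which rows failed and why, capped by `row_validation_max_errors`) can be sketched in plain Python. This is a simplified illustration, not daffy's actual internals: `validate_rows` and `price_is_positive` are hypothetical names, and the callable stands in for a Pydantic model's validation.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

def validate_rows(
    rows: List[Row],
    row_validator: Callable[[Row], None],  # stand-in for a Pydantic model's validation
    max_errors: int = 5,
) -> None:
    """Collect failing row indices and reasons, then raise once with a summary."""
    errors: List[str] = []
    for index, row in enumerate(rows):
        try:
            row_validator(row)
        except ValueError as exc:
            errors.append(f"row {index}: {exc}")
            if len(errors) >= max_errors:
                errors.append("... (further errors truncated)")
                break
    if errors:
        raise AssertionError("Row validation failed:\n" + "\n".join(errors))

def price_is_positive(row: Row) -> None:
    if row["Price"] <= 0:
        raise ValueError(f"Price must be positive, got {row['Price']}")
```

Collecting all failures before raising (rather than stopping at the first bad row) is what makes the resulting error message informative enough to debug a whole DataFrame in one pass.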

CLAUDE.md

Lines changed: 298 additions & 0 deletions (new file)

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Daffy is a DataFrame column validator library that provides runtime validation decorators (`@df_in`, `@df_out`, `@df_log`) for Pandas and Polars DataFrames. It validates column names (including regex patterns), data types, and enforces strictness rules through simple function decorators.
## Workflow

When working on this repository, follow this structured approach:

### 1. Plan First
- Understand the full scope of the task before starting
- Identify which modules will be affected
- Consider edge cases and testing requirements
- Break the work into logical, incremental steps

### 2. Small Working Commits
- Each commit should be a complete, working unit of functionality
- Break large features into multiple small commits
- Each commit should pass all tests and linting
- Commit messages should be clear and descriptive

### 3. Test-Driven Development Cycle

For each commit, follow this order:

```bash
# 1. Write or update tests first
# Add tests to the appropriate tests/test_*.py file

# 2. Run tests to see them fail
uv run pytest tests/test_your_feature.py

# 3. Implement the feature
# Make changes to daffy/*.py files

# 4. Run tests to see them pass
uv run pytest tests/test_your_feature.py

# 5. Run the full test suite
uv run pytest

# 6. Run linting and formatting
uv run ruff format
uv run ruff check --fix
uv run pyrefly check .

# 7. Commit the changes
git add .
git commit -m "Descriptive commit message"
```

### 4. Before Creating a PR

Check if documentation needs updating:
- **README.md** - If public API or examples changed
- **docs/usage.md** - If usage patterns or features changed
- **docs/development.md** - If development workflow changed
- **CHANGELOG.md** - Always update with changes (see existing format)
- **Type hints** - Ensure all new functions have proper annotations

### 5. Important: Commit and PR Messages
- **NEVER mention AI tools or assistants** in commit messages, PR descriptions, or code comments
- Write commit messages as if you wrote the code yourself
- Use conventional commit format when appropriate (e.g., "fix:", "feat:", "docs:")
- Focus on what changed and why, not how it was developed

## Development Commands

### Setup
```bash
# Install dependencies using uv (preferred)
uv sync --group test --group dev

# Alternative: install only test dependencies
uv sync --group test
```

### Testing
```bash
# Run all tests
uv run pytest

# Run tests with coverage
uv run pytest --cov --cov-report=html

# Run a specific test file
uv run pytest tests/test_df_in.py

# Run tests matching a pattern
uv run pytest -k "test_missing_columns"

# Run with verbose output
uv run pytest -v
```

### Linting and Type Checking
```bash
# Run the Ruff formatter
uv run ruff format

# Run the Ruff linter
uv run ruff check

# Run the Ruff linter with auto-fix
uv run ruff check --fix

# Run the type checker (Pyrefly)
uv run pyrefly check .
```

### Pre-commit Hooks
```bash
# Install pre-commit hooks (runs ruff format + ruff check on each commit)
pre-commit install
```

### Building
```bash
# Build wheel package
uv build --wheel

# Build both wheel and sdist
uv build
```

### Testing Optional Dependencies

Daffy supports optional dependencies (pandas-only, polars-only, or both). See `TESTING_OPTIONAL_DEPS.md` for details.

```bash
# Build wheel first
uv build --wheel

# Test with pandas only
WHEEL=$(ls dist/daffy-*.whl | head -n1)
uv run --no-project --with "pandas>=1.5.1" --with "$WHEEL" python scripts/test_isolated_deps.py pandas

# Test with polars only
uv run --no-project --with "polars>=1.7.0" --with "$WHEEL" python scripts/test_isolated_deps.py polars

# Test with both libraries
uv run --no-project --with "pandas>=1.5.1" --with "polars>=1.7.0" --with "$WHEEL" python scripts/test_isolated_deps.py both
```

## Architecture

### Core Module Responsibilities

**decorators.py** - Public API and orchestration
- Exports the `df_in`, `df_out`, and `df_log` decorators
- Orchestrates validation by calling validation.py and utils.py
- Manages configuration precedence (decorator param > config file > default)
- Preserves type information using TypeVar for static type checking

**validation.py** - Core validation logic
- `validate_dataframe()` is the central validation engine
- Supports two modes: list-based (columns only) or dict-based (columns + dtypes)
- Handles regex pattern matching via patterns.py
- Accumulates all validation errors before raising a single AssertionError
- Performs strictness checking (no extra columns when strict=True)

**patterns.py** - Regex pattern handling
- Recognizes `r/pattern/` syntax for regex column matching
- Compiles regex patterns and caches them as `RegexColumnDef` tuples
- Provides matching functions used by the validation layer
- Example: `"r/Price_[0-9]+/"` matches Price_1, Price_2, etc.
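The `r/pattern/` convention can be illustrated with a small stdlib-only sketch. The function names `parse_regex_column` and `column_matches` are illustrative, not daffy's actual API:

```python
import re
from typing import Optional, Pattern

def parse_regex_column(column: str) -> Optional[Pattern[str]]:
    """Treat column specs written as r/pattern/ as regular expressions."""
    if column.startswith("r/") and column.endswith("/"):
        return re.compile(column[2:-1])
    return None

def column_matches(spec: str, actual: str) -> bool:
    """A spec matches either literally, or as a full regex for r/.../ specs."""
    pattern = parse_regex_column(spec)
    if pattern is not None:
        return pattern.fullmatch(actual) is not None
    return spec == actual
```

Using `fullmatch` rather than `search` means `"r/Price_[0-9]+/"` accepts `Price_1` and `Price_42` but rejects `Unit_Price_1`, which matches the whole-column-name semantics described above.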

**utils.py** - Cross-cutting utilities
- DataFrame type assertions using `assert_is_dataframe()`
- Parameter extraction from function signatures via `get_parameter()`
- Context formatting for error messages
- DataFrame description for logging
- Logging functions for the df_log decorator

**config.py** - Configuration management
- Loads the `[tool.daffy]` section from pyproject.toml
- Caches configuration on first access
- Only searches the current working directory (not parent dirs)
- Configuration precedence: decorator parameter > config file > False (default)

**dataframe_types.py** - Optional dependency handling
- Dynamically constructs DataFrame type unions based on installed libraries
- Supports pandas-only, polars-only, both, or neither scenarios
- Separate compile-time (TYPE_CHECKING) and runtime type definitions
- Provides `get_dataframe_types()` for isinstance() checks
- **IMPORTANT**: This file is excluded from coverage (see pyproject.toml:88) because it's tested via isolation scenarios in CI

### Data Flow

```
User calls decorated function

@df_in wrapper executes

get_parameter() extracts DataFrame from args/kwargs

assert_is_dataframe() validates type

get_strict() reads config (cached)

validate_dataframe() checks columns/dtypes/strictness

Original function executes

@df_out wrapper validates return value

Result returned to caller
```
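The flow above can be sketched as a minimal stand-alone decorator. This is a toy version that operates on plain dicts instead of DataFrames (daffy's real implementation lives in decorators.py and validation.py); `toy_df_in` is a hypothetical name:

```python
import functools
from typing import Any, Callable, List

def toy_df_in(columns: List[str], strict: bool = False) -> Callable:
    """Validate the first argument's 'columns' before the function runs."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            frame = args[0]  # daffy's get_parameter() inspects the signature instead
            actual = list(frame.keys())  # stand-in for df.columns
            errors: List[str] = []
            missing = [c for c in columns if c not in actual]
            if missing:
                errors.append(f"Missing columns: {missing}")
            if strict:
                extra = [c for c in actual if c not in columns]
                if extra:
                    errors.append(f"Extra columns: {extra}")
            if errors:  # all errors reported in one exception
                raise AssertionError("; ".join(errors))
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Note how both the missing-column and extra-column checks run before anything is raised; that is the "error context accumulation" pattern listed below.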

### Key Design Patterns

1. **Optional Dependency Injection**: dataframe_types.py dynamically builds type unions based on available libraries (pandas/polars)

2. **Lazy Configuration Loading**: The config file is read once and cached; expensive operations happen only on first access

3. **Error Context Accumulation**: Validation collects ALL errors before raising, providing complete feedback in a single exception

4. **Type-Safe Decorator Composition**: Uses TypeVar to preserve return types through the decorator stack for static type checkers

5. **Regex Pattern Abstraction**: Patterns are compiled once and reused; the validation layer doesn't handle regex directly

## Configuration

Users can set project-wide defaults in `pyproject.toml`:

```toml
[tool.daffy]
strict = false  # or true to disallow extra columns by default
```

Decorator parameters override config file settings:

```python
@df_in(columns=["A", "B"], strict=True)  # strict=True overrides config
```
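The precedence rule (decorator parameter > config file > default) can be sketched as a tiny pure function. `resolve_strict` is an illustrative name, not daffy's actual accessor, and daffy reads the file itself via tomli:

```python
from typing import Optional

def resolve_strict(decorator_param: Optional[bool], config_value: Optional[bool]) -> bool:
    """Precedence: explicit decorator parameter > [tool.daffy] config value > False."""
    if decorator_param is not None:
        return decorator_param
    if config_value is not None:
        return config_value
    return False

# With strict = false in pyproject.toml, an explicit strict=True still wins:
assert resolve_strict(True, False) is True
# With no decorator parameter, the config value applies:
assert resolve_strict(None, True) is True
# With neither set, the default is False:
assert resolve_strict(None, None) is False
```

Using `is not None` (rather than truthiness) matters here: an explicit `strict=False` on the decorator must override a config file that sets `strict = true`.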

## Testing Strategy

**Unit Tests** (tests/test_*.py):
- test_df_in.py - Input validation decorator
- test_df_out.py - Output validation decorator
- test_df_log.py - Logging decorator
- test_decorators.py - Decorator composition
- test_config.py - Configuration loading
- test_optional_dependencies.py - Library detection (always passes)
- test_type_compatibility.py - Type hint compatibility

**Isolation Tests** (CI only, via scripts/test_isolated_deps.py):
- Test pandas-only, polars-only, both, and none scenarios in true isolation
- Use built wheel packages to avoid dev environment contamination
- These tests may "fail" locally since both libraries are typically installed in dev

**Coverage Requirements**:
- Minimum 95% coverage (pyproject.toml:92)
- dataframe_types.py excluded (tested in isolation scenarios)

## Common Patterns

### Adding New Validation Logic

1. Add core validation logic to validation.py
2. Integrate it into the `validate_dataframe()` function
3. Add error message formatting in utils.py if needed
4. Update decorators.py to pass the new parameters
5. Add tests in the appropriate test_*.py file

### Supporting New DataFrame Types

1. Update dataframe_types.py to import the new library conditionally
2. Add it to the _available_types list if the library is available
3. Update get_dataframe_types() and get_available_library_names()
4. Add tests for the new library in test_optional_dependencies.py
5. Add an isolation scenario test in scripts/test_isolated_deps.py

### Modifying Configuration Options

1. Update config.py load_config() to parse the new option
2. Add an accessor function (like get_strict())
3. Update the decorators to use the new config option
4. Add tests in test_config.py
5. Document in README.md and docs/usage.md

## Important Constraints

- **Python 3.9+ compatibility**: Code must work on Python 3.9-3.14
- **Type hints required**: All functions should have proper type annotations (Ruff ANN rules)
- **No hard dependencies**: pandas and polars are optional; only tomli is required
- **Coverage threshold**: 95% minimum (excluding dataframe_types.py)
- **Import organization**: Use TYPE_CHECKING for static vs runtime type imports

## Version Management

- Version number is in pyproject.toml:3
- Update CHANGELOG.md when making changes
- Follow the existing changelog format (see CHANGELOG.md for examples)
- Avoid comments that state the obvious; prefer improving function or variable names so the comment becomes unnecessary
