This guide covers the complete development workflow for article-extractor.
Audience: Contributors updating code, docs, or release workflows.
Prerequisites: Python 3.12+, uv, and Docker if you need container validation.
Time: ~30 minutes for code changes, longer if you run Docker smoke checks.
What you'll learn: Local setup, validation loop, and release steps.
# Clone the repository
git clone https://github.com/pankaj28843/article-extractor.git
cd article-extractor
# Install uv if not available
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install all dependencies
uv sync --all-extras --dev
# Install Playwright browsers (if using playwright extra)
uv run playwright install chromium
For rapid iteration where you want the article-extractor command to reflect your local changes immediately:
# Install as editable tool (changes to src/ are reflected immediately)
uv tool install --editable --force --refresh --reinstall ".[all]"
# Now you can run without `uv run` prefix:
article-extractor --help
article-extractor https://example.com
article-extractor crawl --seed https://example.com --output-dir ./output
This is especially useful when debugging extraction issues with real URLs. The --force --refresh --reinstall flags ensure a clean install that picks up all your local changes.
Note: When done debugging, remember that any URL-specific test cases should use example.com or other generic domains. Never leak real internal URLs into tests or documentation.
# Create a new branch
git checkout -b feature/your-feature-name
# Make sure tests pass
uv run pytest -v
# Make your changes to code
# Format code (automatic fixes)
uv run ruff format .
# Lint code (automatic fixes where possible)
uv run ruff check --fix .
# Run tests with coverage
uv run pytest --cov=src/article_extractor --cov-report=term-missing -v
# Run specific test file
uv run pytest tests/test_extractor.py -v
Run the pre-commit check script:
./scripts/pre-commit-check.sh
Or manually:
# 1. Format code
uv run ruff format .
# 2. Lint code
uv run ruff check --fix .
# 3. Run all tests
uv run pytest --cov=src/article_extractor --cov-report=term-missing -v
# 4. Verify coverage is 94%+
uv run coverage report
Run the automated Docker debug harness whenever you need to validate the container image and Playwright storage wiring end-to-end:
uv run scripts/debug_docker_deployment.py
The shell entrypoint now delegates to scripts/debug_docker_deployment.py, which:
- Rebuilds article-extractor:local (unless --skip-build is supplied)
- Deletes and recreates tmp/docker-smoke-data/ via article_extractor.storage
- Runs the container with a random published port and shared storage mount
- Waits for /health and then POSTs ~20 curated URLs in parallel (configurable via --urls-file)
- Aggregates the HTTP status for every URL, printing excerpts for failures and retrying once per target by default (--retries)
- Verifies that storage_state.json exists, is non-empty, and contains both Playwright origins and cookies
- Streams the final log tail and prints a ready-to-run curl snippet before cleaning up
You can pass any harness flag through the wrapper, for example uv run scripts/debug_docker_deployment.py --concurrency 8 --urls-file urls.txt. The underlying Python script also exposes --keep-container if you want to inspect the running container manually.
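The storage_state.json verification step can be reproduced by hand when you keep the container around. The following is a minimal sketch, not the harness's actual code: check_storage_state is a hypothetical helper, and it assumes that "contains origins and cookies" means both keys are present as JSON lists, which matches the layout Playwright writes for a storage state file.

```python
import json
from pathlib import Path


def check_storage_state(path: Path) -> list[str]:
    """Return a list of problems found in a Playwright storage state file.

    Hypothetical helper: an empty list means the file looks usable.
    """
    if not path.is_file():
        return [f"{path} does not exist"]

    raw = path.read_text(encoding="utf-8")
    if not raw.strip():
        return [f"{path} is empty"]

    try:
        state = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]

    if not isinstance(state, dict):
        return [f"{path} does not contain a JSON object"]

    # Assumption: both keys must be present and hold lists.
    problems = []
    for key in ("origins", "cookies"):
        if not isinstance(state.get(key), list):
            problems.append(f"missing or non-list key: {key}")
    return problems
```

Calling check_storage_state(Path("tmp/docker-smoke-data/storage_state.json")) after a run should return an empty list if the file sits directly under the smoke-data directory; adjust the path if the harness nests it elsewhere.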
# Stage your changes
git add .
# Commit with descriptive message
git commit -m "feat: add new feature description"
# Push to your branch
git push origin feature/your-feature-name
# Create pull request on GitHub
All pull requests and commits to main trigger automated checks:
Test Job:
- Runs on Python 3.12, 3.13, 3.14 (Ubuntu latest)
- Executes full test suite with coverage reporting
- Enforces a ≥90% coverage threshold directly via pytest
- Uploads coverage to Codecov (Python 3.12 only)
Lint Job:
- Checks code formatting with Ruff
- Lints code with Ruff
- Optional: Type checking with mypy
Build Job:
- Builds Python package
- Verifies package can be imported
- Uploads build artifacts
Additional workflows:
- CodeQL (.github/workflows/codeql.yml): Security scanning
- Docker (.github/workflows/docker.yml): Container builds
- Publish (.github/workflows/publish.yml): PyPI publishing
- Security (.github/workflows/security.yml): Dependency scanning
The repository includes comprehensive Copilot instructions to help you code faster:
- .github/copilot-instructions.md - Complete project guide
- .github/instructions/validation.instructions.md - For **/*.py
- .github/instructions/tests.instructions.md - For tests/**/*.py
- .github/instructions/gh-cli.instructions.md - For GitHub CLI usage
Reusable prompt files are located in .github/prompts/:
- prpPlanOnly.prompt.md - Planning mode (creates plan without code changes)
- cleanCodeRefactor.prompt.md - Rename/restructure without behavior changes
- bugFixRapidResponse.prompt.md - Quick surgical bug fixes
- testHardening.prompt.md - Improve test coverage and reliability
- iterativeCodeSimplification.prompt.md - Reduce LOC while maintaining behavior
To use a prompt file:
- Open Copilot Chat
- Click "Attach context" icon
- Select "Prompt..." and choose a prompt file
- Add any additional context
- Submit the prompt
- Type Hints: Required on all function signatures
- Imports: from __future__ import annotations at top
- Docstrings: Required for all public APIs
- Formatting: Handled by Ruff (runs automatically)
- Linting: Enforced by Ruff
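As an illustration of the conventions above, here is a minimal sketch of a module that follows them. The function normalize_title and its parameters are hypothetical and do not come from the article-extractor code base; they only show the future import, a fully typed signature, and a docstring on a public function.

```python
from __future__ import annotations

import re

# A compiled pattern is immutable, so keeping it at module level is safe.
_WHITESPACE = re.compile(r"\s+")


def normalize_title(raw: str | None, *, max_length: int = 80) -> str:
    """Collapse runs of whitespace in a title and truncate the result.

    Args:
        raw: Title text as extracted, possibly None or empty.
        max_length: Maximum number of characters to keep.

    Returns:
        The cleaned title, or an empty string when there is no input.
    """
    if not raw:
        return ""
    collapsed = _WHITESPACE.sub(" ", raw).strip()
    return collapsed[:max_length]
```

Running uv run ruff format . and uv run ruff check --fix . on a file like this should leave it unchanged, assuming the project's Ruff configuration does not enable stricter docstring rules than shown here.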
Instance-Level State (CRITICAL):
# ❌ WRONG - Module-level mutable state
_cache = {}
# ✅ CORRECT - Instance-level state
class Processor:
    __slots__ = ("_cache",)

    def __init__(self):
        self._cache = {}
Async Patterns:
# Context managers
async with PlaywrightFetcher() as fetcher:
    html, status = await fetcher.fetch(url)
# Return tuples
async def fetch(url: str) -> tuple[str, int]:
    return html, status_code
Error Handling:
# Return structured results, don't raise
try:
    result = process()
    return ArticleResult(success=True, ...)
except Exception as e:
    return ArticleResult(success=False, error=str(e), ...)
@pytest.mark.unit
class TestFeature:
    def test_success_case(self, fixture):
        result = function_under_test()
        assert result.success is True

    def test_failure_case(self):
        result = function_with_bad_input()
        assert result.success is False
- Overall: 94%+ (enforced in CI)
- New Code: 100%
- Critical Paths: 100%
# All tests
uv run pytest -v
# Specific file
uv run pytest tests/test_extractor.py -v
# With coverage
uv run pytest --cov=src/article_extractor --cov-report=term-missing
# Generate HTML coverage report
uv run pytest --cov=src/article_extractor --cov-report=html
open htmlcov/index.html
- Plan: Review existing code for similar patterns
- Test First: Write tests for new functionality
- Implement: Write code with type hints and docstrings
- Test: Ensure 100% coverage for new code
- Document: Update README if user-facing
- Lint: Run uv run ruff format . && uv run ruff check --fix .
- Verify: Run full test suite
- Reproduce: Write a failing test
- Fix: Implement the fix
- Verify: Ensure test passes
- Regression: Run full test suite
- Document: Add to CHANGELOG if significant
- Baseline: Run tests to establish passing state
- Small Steps: Make incremental changes
- Test After Each: Ensure tests still pass
- Coverage: Maintain or improve coverage
- Lint: Ensure code style maintained
- Update version in pyproject.toml
- Update CHANGELOG.md
- Create git tag: git tag v0.2.0
- Push tag: git push origin v0.2.0
- GitHub Actions will build and publish to PyPI
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Copilot: Ask in Copilot Chat with repository context
# Development
uv sync --all-extras --dev # Install all dependencies
uv run article-extractor <url> # Run CLI
uv run article-extractor --server # Run server
# Testing
uv run pytest -v # Run all tests
uv run pytest -k test_name # Run specific test
uv run pytest --cov # Run with coverage
uv run pytest -m unit # Run unit tests only
# Code Quality
uv run ruff format . # Format code
uv run ruff check --fix . # Lint and fix
./scripts/pre-commit-check.sh # Pre-commit checks
# Building
uv build # Build package
docker build -t article-extractor . # Build Docker image
article-extractor/
├── .github/
│   ├── copilot-instructions.md # Main Copilot instructions
│   ├── instructions/ # Path-specific instructions
│   │   ├── PRP-README.md # Planning template guide
│   │   ├── gh-cli.instructions.md # GitHub CLI usage
│   │   ├── techdocs.instructions.md # TechDocs research workflow
│   │   ├── tests.instructions.md # Testing guidelines
│   │   └── validation.instructions.md # Validation checklist
│   ├── prompts/ # Reusable prompt files
│   │   ├── prpPlanOnly.prompt.md # Planning mode
│   │   ├── cleanCodeRefactor.prompt.md
│   │   ├── bugFixRapidResponse.prompt.md
│   │   ├── testHardening.prompt.md
│   │   └── iterativeCodeSimplification.prompt.md
│   └── workflows/ # GitHub Actions
│       ├── ci.yml # Main CI/CD
│       ├── codeql.yml # Security scanning
│       ├── docker.yml # Container builds
│       ├── publish.yml # PyPI publishing
│       └── security.yml # Dependency scanning
├── src/article_extractor/ # Source code
├── tests/ # Test files
└── scripts/ # Helper scripts
    └── pre-commit-check.sh # Pre-commit validation