AGENTS.md - AI Agent Development Guide for tap-pinterest

This document provides guidance for AI coding agents and developers working on this Singer tap.

Project Overview

  • Project Type: Singer Tap
  • Source: Pinterest
  • Stream Type: REST
  • Authentication: OAuth2
  • Framework: Meltano Singer SDK

Architecture

This tap follows the Singer specification and uses the Meltano Singer SDK to extract data from Pinterest.

Key Components

  1. Tap Class (tap_pinterest/tap.py): Main entry point, defines streams and configuration
  2. Client (tap_pinterest/client.py): Handles API communication and authentication
  3. Streams (tap_pinterest/streams.py): Define data streams and their schemas
  4. Authentication (tap_pinterest/auth.py): Implements OAuth2 authentication logic

Development Guidelines for AI Agents

Understanding Singer Concepts

Before making changes, ensure you understand these Singer concepts:

  • Streams: Individual data endpoints (e.g., users, orders, transactions)
  • State: Tracks incremental sync progress using bookmarks
  • Catalog: Metadata about available streams and their schemas
  • Records: Individual data items emitted by the tap
  • Schemas: JSON Schema definitions for stream data
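
To make these concepts concrete, here is a minimal sketch of the messages a tap writes to stdout, one JSON object per line. The stream name `boards` and all field values are hypothetical, and the bookmark layout inside the STATE message is an assumption about what the SDK typically writes. In practice the SDK builds these messages for you.

```python
import json

# Hypothetical stream and values, for illustration only.
schema_message = {
    "type": "SCHEMA",
    "stream": "boards",
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "string"},
            "updated_at": {"type": "string", "format": "date-time"},
        },
    },
    "key_properties": ["id"],
}
record_message = {
    "type": "RECORD",
    "stream": "boards",
    "record": {"id": "123", "updated_at": "2024-01-01T00:00:00+00:00"},
}
# STATE carries the bookmark used to resume an incremental sync.
state_message = {
    "type": "STATE",
    "value": {
        "bookmarks": {
            "boards": {
                "replication_key": "updated_at",
                "replication_key_value": "2024-01-01T00:00:00+00:00",
            }
        }
    },
}


def to_singer_lines(messages):
    """Serialize messages as newline-delimited JSON."""
    return [json.dumps(message) for message in messages]


lines = to_singer_lines([schema_message, record_message, state_message])
for line in lines:
    print(line)
```

A target reads these lines in order: the SCHEMA message describes the records that follow, and the STATE message records how far the sync has progressed.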

Common Tasks

Adding a New Stream

  1. Define stream class in tap_pinterest/streams.py
  2. Set name, path, primary_keys, and replication_key (set this to None if not applicable)
  3. Define schema using PropertiesList or JSON Schema
  4. Register stream in the tap's discover_streams() method

Example:

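from singer_sdk.typing import DateTimeType, PropertiesList, Property, StringType

# Assumes the base stream class lives in tap_pinterest/client.py
from tap_pinterest.client import PinterestStream

class MyNewStream(PinterestStream):
    name = "my_new_stream"
    path = "/api/v1/my_resource"
    primary_keys = ["id"]
    replication_key = "updated_at"  # Use None for full-table streams

    schema = PropertiesList(
        Property("id", StringType, required=True),
        Property("name", StringType),
        Property("updated_at", DateTimeType),
    ).to_dict()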

Modifying Authentication

  • The OAuth2 flow is implemented in tap_pinterest/auth.py
  • Token refresh is handled automatically
  • The configuration must provide client_id, client_secret, and redirect_uri

Handling Pagination

The SDK provides built-in pagination classes. Use these instead of overriding get_next_page_token() directly.

Built-in Paginator Classes:

  1. SimpleHeaderPaginator: For APIs that return the next page token in a named response header

    from singer_sdk.pagination import SimpleHeaderPaginator
    
    class MyStream(PinterestStream):
        def get_new_paginator(self):
            # "X-Next-Page" is a placeholder; pass the header name your API uses
            return SimpleHeaderPaginator("X-Next-Page")
  2. HeaderLinkPaginator: For APIs with Link: <url>; rel="next" headers (RFC 5988)

    from singer_sdk.pagination import HeaderLinkPaginator
    
    class MyStream(PinterestStream):
        def get_new_paginator(self):
            return HeaderLinkPaginator()
  3. JSONPathPaginator: For cursor/token in response body

    from singer_sdk.pagination import JSONPathPaginator
    
    class MyStream(PinterestStream):
        def get_new_paginator(self):
            return JSONPathPaginator("$.pagination.next_token")
  4. SinglePagePaginator: For non-paginated endpoints

    from singer_sdk.pagination import SinglePagePaginator
    
    class MyStream(PinterestStream):
        def get_new_paginator(self):
            return SinglePagePaginator()

Creating Custom Paginators:

For complex pagination logic, create a custom paginator class:

from singer_sdk.pagination import PageNumberPaginator

class MyCustomPaginator(PageNumberPaginator):
    def has_more(self, response):
        """Return True while the API reports more pages."""
        data = response.json()
        return data.get("has_more", False)

    # The base class computes the next page number, so only has_more()
    # needs overriding. get_next_url() applies to HATEOAS paginators only.

# Use in stream
class MyStream(PinterestStream):
    def get_new_paginator(self):
        return MyCustomPaginator(start_value=1)

Common Pagination Patterns:

  • Offset-based: Use OffsetPaginator
  • Page-based: Use PageNumberPaginator
  • Cursor-based: Use or extend JSONPathPaginator
  • HATEOAS/HAL: Extend BaseHATEOASPaginator with a custom get_next_url() method to extract the next URL from the response.

Only override get_next_page_token() as a last resort for very simple cases.
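
As a rough illustration of what a cursor-based paginator does under the hood, the sketch below walks a fake API in plain Python, without the SDK. The names `FAKE_PAGES`, `fetch_page`, and `iter_records`, and the `items` and `bookmark` response fields, are assumptions made for this example.

```python
# Fake API: maps a cursor (None for the first page) to a response body.
FAKE_PAGES = {
    None: {"items": [{"id": "1"}, {"id": "2"}], "bookmark": "abc"},
    "abc": {"items": [{"id": "3"}], "bookmark": None},
}


def fetch_page(cursor):
    """Stand-in for an HTTP request; returns the decoded JSON body."""
    return FAKE_PAGES[cursor]


def iter_records(fetch):
    """Yield records page by page until the response carries no cursor."""
    cursor = None
    while True:
        body = fetch(cursor)
        yield from body["items"]
        cursor = body.get("bookmark")
        if not cursor:  # No next cursor means this was the last page
            break


records = list(iter_records(fetch_page))
print([record["id"] for record in records])
```

Because the generator yields one page at a time, a consumer that iterates over it lazily does not need to hold the whole dataset in memory.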

State and Incremental Sync

  • Set replication_key to enable incremental sync (e.g., "updated_at")
  • Call get_starting_timestamp(context) to read the sync starting point: the stored bookmark if one exists, otherwise start_date from the config
  • State is managed automatically by the SDK
  • Access the current state via get_context_state(context)
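
The bookmark logic behind incremental sync can be sketched in plain Python. This is a simplified model, not the SDK's implementation: the function names and sample rows are hypothetical, and it assumes records strictly newer than the bookmark should be emitted.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 string into a timezone-aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def incremental_sync(records, bookmark, replication_key="updated_at"):
    """Return the records newer than the bookmark and the advanced bookmark."""
    start = parse_ts(bookmark)
    fresh = [r for r in records if parse_ts(r[replication_key]) > start]
    if fresh:
        newest = max(fresh, key=lambda r: parse_ts(r[replication_key]))
        bookmark = newest[replication_key]
    return fresh, bookmark


rows = [
    {"id": "a", "updated_at": "2024-01-01T00:00:00Z"},
    {"id": "b", "updated_at": "2024-01-03T12:00:00Z"},
    {"id": "c", "updated_at": "2024-01-02T08:30:00Z"},
]
fresh, new_bookmark = incremental_sync(rows, "2024-01-01T12:00:00Z")
print([r["id"] for r in fresh], new_bookmark)
```

On the next run, passing `new_bookmark` back in skips everything already synced, which is what the state file achieves between tap invocations.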

Schema Evolution

  • Use flexible schemas during development
  • Add new properties without breaking changes
  • Consider making fields optional when unsure
  • Use th.Property("field", th.StringType) for basic types
  • Nest objects with th.ObjectType(...)

Testing

Run tests to verify your changes:

# Install dependencies
uv sync

# Run all tests
uv run pytest

# Run specific test
uv run pytest tests/test_core.py -k test_name

Configuration

Configuration properties are defined in the tap class:

  • Declare each property as required or optional
  • Mark secret properties (passwords, tokens) with the secret=True parameter
  • Specify defaults in the config schema

Example configuration schema:

from singer_sdk import typing as th

config_jsonschema = th.PropertiesList(
    th.Property("api_url", th.StringType, required=True),
    th.Property("api_key", th.StringType, required=True, secret=True),
    th.Property("start_date", th.DateTimeType),
).to_dict()

Example test with config:

tap-pinterest --config config.json --discover
tap-pinterest --config config.json --catalog catalog.json

Keeping meltano.yml and Tap Settings in Sync

When this tap is used with Meltano, the settings defined in meltano.yml must stay in sync with the config_jsonschema in the tap class. Configuration drift between these two sources causes confusion and runtime errors.

When to sync:

  • Adding new configuration properties to the tap
  • Removing or renaming existing properties
  • Changing property types, defaults, or descriptions
  • Marking properties as required or secret

How to sync:

  1. Update config_jsonschema in tap_pinterest/tap.py
  2. Update the corresponding settings block in meltano.yml
  3. Update .env.example with the new environment variable

Example - adding a new batch_size setting:

# tap_pinterest/tap.py
config_jsonschema = th.PropertiesList(
    th.Property("api_url", th.StringType, required=True),
    th.Property("api_key", th.StringType, required=True, secret=True),
    th.Property("batch_size", th.IntegerType, default=100),  # New setting
).to_dict()

# meltano.yml
plugins:
  extractors:
    - name: tap-pinterest
      settings:
        - name: api_url
          kind: string
        - name: api_key
          kind: string
          sensitive: true
        - name: batch_size  # New setting
          kind: integer
          value: 100

# .env.example
TAP_PINTEREST_API_URL=https://api.example.com
TAP_PINTEREST_API_KEY=your_api_key_here
TAP_PINTEREST_BATCH_SIZE=100  # New setting

Setting kind mappings:

Python Type     Meltano Kind
StringType      string
IntegerType     integer
BooleanType     boolean
NumberType      number
DateTimeType    date_iso8601
ArrayType       array
ObjectType      object

Any properties with secret=True should be marked with sensitive: true in meltano.yml.

Best practices:

  • Always update all three files (tap.py, meltano.yml, .env.example) in the same commit
  • Use the same default values in all locations
  • Keep descriptions consistent between code docstrings and meltano.yml description fields
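
One way to catch drift early is a small check that compares the two sources. The sketch below is an assumption-laden illustration, not part of the SDK or Meltano: it takes the config schema as a plain dict (as produced by `.to_dict()`) and the meltano.yml settings as a list of dicts you have already loaded, and the helper names are hypothetical.

```python
# JSON Schema type -> Meltano kind, following the mapping table above.
KIND_BY_JSON_TYPE = {
    "string": "string",
    "integer": "integer",
    "boolean": "boolean",
    "number": "number",
    "array": "array",
    "object": "object",
}


def expected_kind(prop):
    """Derive the Meltano kind from one JSON Schema property."""
    json_type = prop.get("type")
    if isinstance(json_type, list):  # e.g. ["string", "null"] for optional
        json_type = next(t for t in json_type if t != "null")
    if json_type == "string" and prop.get("format") == "date-time":
        return "date_iso8601"
    return KIND_BY_JSON_TYPE[json_type]


def find_drift(config_schema, meltano_settings):
    """Return human-readable mismatches between the two sources."""
    problems = []
    props = config_schema.get("properties", {})
    settings = {s["name"]: s for s in meltano_settings}
    for name in sorted(props.keys() - settings.keys()):
        problems.append(f"{name}: missing from meltano.yml")
    for name in sorted(settings.keys() - props.keys()):
        problems.append(f"{name}: missing from config_jsonschema")
    for name in sorted(props.keys() & settings.keys()):
        kind = expected_kind(props[name])
        if settings[name].get("kind") != kind:
            problems.append(f"{name}: kind should be {kind}")
    return problems


schema = {
    "properties": {
        "api_url": {"type": ["string"]},
        "api_key": {"type": ["string"]},
        "batch_size": {"type": ["integer", "null"]},
        "start_date": {"type": ["string", "null"], "format": "date-time"},
    }
}
settings = [
    {"name": "api_url", "kind": "string"},
    {"name": "api_key", "kind": "string", "sensitive": True},
    {"name": "batch_size", "kind": "string"},  # Deliberately wrong kind
]
drift = find_drift(schema, settings)
print(drift)
```

A check like this could run in a test so that a mismatch fails the build instead of surfacing at runtime.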

Note: This guidance is consistent with target and mapper templates in the Singer SDK. See the SDK documentation for canonical reference.

Common Pitfalls

  1. Rate Limiting: Implement backoff using RESTStream built-in retry logic
  2. Large Responses: Use pagination, don't load entire dataset into memory
  3. Schema Mismatches: Validate data matches schema, handle null values
  4. State Management: Don't modify state directly, use SDK methods
  5. Timezone Handling: Use UTC, parse ISO 8601 datetime strings
  6. Error Handling: Let SDK handle retries, log warnings for data issues
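
For the timezone pitfall, a minimal sketch of normalizing timestamps to UTC with the standard library follows. The helper name `to_utc` is hypothetical, and treating naive timestamps as UTC is an assumption you should verify against the API's behavior.

```python
from datetime import datetime, timezone


def to_utc(value):
    """Parse an ISO 8601 string and return an aware datetime in UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: naive timestamps from the API are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


print(to_utc("2024-03-10T09:30:00+02:00").isoformat())
```

Normalizing before comparing or storing replication key values avoids bookmarks that drift when the API mixes offsets.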

SDK Resources

Best Practices

  1. Logging: Use self.logger for structured logging
  2. Validation: Validate API responses before emitting records
  3. Documentation: Update README with new streams and config options
  4. Type Hints: Add type hints to improve code clarity
  5. Testing: Write tests for new streams and edge cases
  6. Performance: Profile slow streams, optimize API calls
  7. Error Messages: Provide clear, actionable error messages

File Structure

tap-pinterest/
├── tap_pinterest/
│   ├── __init__.py
│   ├── tap.py          # Main tap class
│   ├── client.py       # API client
│   ├── auth.py         # OAuth2 authentication
│   └── streams.py      # Stream definitions
├── tests/
│   ├── __init__.py
│   └── test_core.py
├── config.json         # Example configuration
├── pyproject.toml      # Dependencies and metadata
└── README.md           # User documentation

Additional Resources

Making Changes

When implementing changes:

  1. Understand the existing code structure
  2. Follow Singer and SDK patterns
  3. Test thoroughly with real API credentials
  4. Update documentation and docstrings
  5. Ensure backward compatibility when possible
  6. Run linting and type checking

Questions?

If you're uncertain about an implementation:

  • Check SDK documentation for similar examples
  • Review other Singer taps for patterns
  • Test incrementally with small changes
  • Validate against the Singer specification

Bumping the Singer SDK Version

When upgrading the singer-sdk dependency in pyproject.toml, follow these steps to avoid breaking changes:

  1. Check the deprecation guide before upgrading: https://sdk.meltano.com/en/latest/deprecation.html

    The deprecation page lists APIs scheduled for removal in each release, along with migration instructions. Review the entries for every version between your current version and the target version.

  2. Update the dependency in pyproject.toml:

    [project]
    dependencies = [
        "singer-sdk~=X.Y",  # Bump to the new version
    ]
  3. Re-sync your environment and run the full test suite:

    uv sync
    uv run pytest
  4. Address deprecation warnings: Run with warnings enabled to catch anything that will become an error in a future release:

    uv run pytest -W error::DeprecationWarning
  5. Check the changelog for any behavioral changes that affect your tap, even if not surfaced by warnings (e.g. pagination, authentication, state handling).
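
To show what promoting deprecation warnings to errors means in practice, here is a small standard-library sketch. `old_helper` is a hypothetical stand-in for a deprecated SDK API, and `call_strictly` mimics in code the effect of the `-W error::DeprecationWarning` flag.

```python
import warnings


def old_helper():
    """Hypothetical function standing in for a deprecated SDK API."""
    warnings.warn("old_helper is deprecated", DeprecationWarning, stacklevel=2)
    return "ok"


def call_strictly(func):
    """Call func with DeprecationWarning promoted to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        return func()


try:
    call_strictly(old_helper)
    outcome = "passed"
except DeprecationWarning as exc:
    outcome = f"failed: {exc}"
print(outcome)
```

A test that would otherwise pass silently now fails at the call site, which makes it easier to locate code that relies on APIs scheduled for removal.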

Reporting SDK Issues

If you encounter a bug or missing feature in the Meltano Singer SDK itself (not in this tap), please open an issue at https://github.com/meltano/sdk/issues/new/choose.

Use the appropriate issue template:

  • Bug Report (bug.yml): For unexpected behavior, errors, or regressions in the SDK. Fill in the Singer SDK version, Python version, bug scope (e.g. "Taps"), and a clear description with reproduction steps.
  • Feature Request (feature.yml): For new SDK capabilities you'd like to see. Select the relevant scope and describe the desired behavior.
  • Documentation (docs.yml): For incorrect or missing documentation.

Before filing, search existing issues to avoid duplicates. Include the SDK version (uv run tap-pinterest --version), Python version, and a minimal reproduction case when reporting bugs.