Skip to content

[SAC-30315] - Discovery, Sync Implementation#25

Open
mittal-tushar wants to merge 28 commits intomaster-rewritefrom
SAC-30315-feature/discovery-sync
Open

[SAC-30315] - Discovery, Sync Implementation#25
mittal-tushar wants to merge 28 commits intomaster-rewritefrom
SAC-30315-feature/discovery-sync

Conversation

@mittal-tushar
Copy link
Copy Markdown

@mittal-tushar mittal-tushar commented Mar 26, 2026

Description of Changes

Ticket: https://qlik-dev.atlassian.net/browse/SAC-30315
Results: https://qlik-dev.atlassian.net/browse/SAC-30315?focusedCommentId=1056700

Breaking Changes (v2.0.0 — Major Release)

Re-discovery and re-configuration required for existing connections.

Breaking Changes

Streams migrated:
lists and segments moved to v3 Marketing API endpoints (/v3/marketing/lists, /v3/marketing/segments)

Streams removed (deprecated v2 API endpoints):

  • contacts
  • lists_members
  • segments_members
  • campaigns

New streams added:

  • single_sends
  • single_send_stats
  • stats_automations
  • senders
  • marketing_contacts_count
  • marketing_field_definitions

Updated dependencies:

  • singer-python 5.13.2 → 6.1.1
  • backoff==2.2.1 added
  • pendulum and pytz removed

Manual QA Steps

  • Discovery: Running
  • Sync: Running
  • Unit Tests: Running
  • Integration Tests: Running

Rollback Steps

Revert the PR merge if needed.

mittal-tushar and others added 26 commits February 24, 2026 16:07
…ylint 10/10

- Modularized streams into dedicated files (suppression, marketing, contacts,
  senders, suppression_groups, templates) with a backward-compatible core.py shim
- Added 3 new streams: senders, marketing_contacts_count, marketing_field_definitions
- Removed deprecated campaigns stream (API returns 401 at account level)
- All 16 stream endpoints return HTTP 200 on the live SendGrid API
- Added full JSON schemas for all 16 streams
- Added unit tests (21 tests, 100% pass rate) covering client, discovery,
  sync, abstracts, streams registry, and entrypoint
- Added integration test scaffolding in tests/base.py for all 16 streams
- Achieved pylint 10.00/10 with full docstrings across all modules
- Rewrote README in standard Singer tap format with per-stream details
- Updated setup.py with extras_require[dev] and glob-based package_data
- Added CircleCI config using stitch-tap-tester-uv image with uv, coverage,
  and daily build schedule
- Added .gitignore, MANIFEST.in, CHANGELOG.md, LICENSE, PR template,
  and Copilot instructions
Schema changes:
- Remove root-level 'additionalProperties' from all 13 affected schemas
- Add 'format: date-time' to campaigns.created_at/updated_at, templates.updated_at, single_sends.send_at
- Normalise all schema files to consistent 2-space JSON indent

Bug fixes in abstracts.py:
- IncrementalStream.sync(): fix TypeError when cursor_type='datetime' (was calling datetime.fromtimestamp on an ISO string)
- IncrementalStream.sync() & FullTableStream.sync(): fix record count always returning 0 because Singer's Counter resets counter.value to 0 on __exit__; introduce separate record_count variable

Unit tests:
- tests/unittests/test_schemas.py: 23 parametrised tests covering valid JSON, root-level additionalProperties absent, date-time format, nullable types, UTF-8 encoding
- tests/unittests/test_sync_behavior.py: 9 tests for datetime cursor, unix cursor, full-table count accuracy, discovery metadata structure
…ntal schemas

The 'created' field in blocks, bounces, spam_reports, invalid_emails,
global_suppressions and suppression_group_members is typed as
[null, integer, string]. Since the union includes 'string' (used when
a bookmark ISO value flows through), Singer spec and project conventions
require format:date-time to be present.

Also strengthen test_schemas.py: add 'created' to DATETIME_STRING_FIELD_NAMES,
and add test_incremental_suppression_created_has_datetime_format to
explicitly assert the annotation is present on all 6 streams.
…in integration tests; install tap_sendgrid and pytest in CI integration step
- Remove format:date-time from suppression created fields (API returns
  Unix integer timestamps, not ISO strings; caused SchemaMismatch crash)
- Change suppression created type from [null,integer,string] to [null,integer]
- Remove created field from suppression_group_members (ASM endpoint returns
  email strings with no timestamp)
- Fix segments.json: rename contact_count->contacts_count; add missing
  fields: query_version, created_at, updated_at, sample_updated_at,
  next_sample_update, parent_list_id, status (verified against real API)
- Fix senders.json: verified is an object {status, reason}, not boolean
- Fix marketing_contacts_count.json: identifier_counts keys AID/EID/EXTID/PHNID
  match actual API response (not email/alternate_emails/phone_number_id/...)
- Remove orphan campaigns.json (no stream class; endpoint requires scopes
  not available; new Marketing Campaigns API covered by single_sends)
- Fix SuppressionGroupMembers: override get_records to handle email-string
  list response from /v3/asm/groups/{id}/suppressions
- Update unit tests to match new expected schema contracts: 171 passing
- Pylint 10.00/10 maintained
…ited attrs, drop redundant overrides

- abstracts.py: add replication_method/replication_keys defaults to
  FullTableStream and IncrementalStream so subclasses don't repeat them
- suppression.py: extract _SuppressionIncrementalStream base; each of
  the 5 suppression classes now only declares tap_stream_id and path
- marketing.py: remove replication_method/replication_keys from all 5
  classes (now inherited from FullTableStream)
- contacts.py: same inherited-attr removal; drop next_page_params
  overrides that only returned None (CursorPagedStream already handles
  non-cursor responses by returning None)
- senders.py: same cleanup; unused Any/Dict/Optional imports removed
- templates.py: inherited-attr removal
- suppression_groups.py: inherited-attr removal; drop redundant
  next_page_params on both classes; drop get_path override on
  SuppressionGroupMembers (identical to BaseStream.get_path)

No logic changes. Pylint 10/10, 171/171 unit tests pass,
discovery 16 streams, sync 16 SCHEMA + 36 RECORD + 31 STATE.
abstracts.py:
- BaseStream.page_size = 500 (suppression endpoints support up to 500;
  add comment explaining the split)
- FullTableStream.page_size = 50 overrides the base value (most SendGrid
  marketing/transactional APIs cap page_size at 100; 50 is a safe
  conservative default that works for all full-table endpoints)
- FullTableStream.default_params now uses self.page_size consistently,
  so any subclass can tune its page size via a class attribute

tests/base.py:
- Add inline comment above expected_metadata explaining that API_LIMIT
  is the page_size value tap-tester injects during the pagination test
  run (not a ceiling on total records returned); set to 1 so pagination
  is validated even when the test account holds as few as 2 records

Pylint 10/10, 171/171 unit tests, 16 SCHEMA + 36 RECORD + 31 STATE.
…andling, page_size

sync.py  _order_streams_for_resume:
  Replace bookmark-presence categorisation with circular ordering based on
  STREAMS dict position. Starting from currently_syncing and continuing
  in original dict order (wrapping at end) correctly handles the case
  where tap-tester merges the first-sync state with manipulate_state(),
  which puts all streams in bookmarks and caused 'already_completed'
  ordering to put blocks before global_suppressions.
  Fixes test_interrupted_sync_stream_order assertSetEqual failure.

schema.py  write_schema:
  - Remove streams_to_sync parameter (was always set(STREAMS.keys()) i.e.
    all streams, making the guard meaningless).
  - Guard against catalog.get_stream(child) == None (defensive: prevents
    AttributeError crash when tap-tester supplies a trimmed catalog).
  - Use child_obj.is_selected() to decide whether to add child to
    parent.child_to_sync. Prevents suppression_groups making API calls
    when neither it nor suppression_group_members is selected.

streams/abstracts.py  FullTableStream.default_params:
  Use self.page_size unconditionally (fixed at 50). Reading page_size
  from config is dangerous: if a user sets page_size=500, marketing
  endpoints that cap at 100 return HTTP 400. Incremental streams still
  read page_size from config (suppression endpoints support up to 500).

tests/test_all_fields.py:
  MISSING_FIELDS {}  removed stale {segments: {contact_count}} entry.
  Field was renamed contacts_count in schema and IS returned by the API.

tests/unittests/test_schema_extended.py:
  Update write_schema call to match new signature (remove streams_to_sync).

.circleci/config.yml:
  - mkdir -p test_output before pytest
  - --junitxml=test_output/report.xml so CircleCI test-results tab works
  - coverage report for inline text summary
  - store_test_results: path: test_output (directory, not a single file)

README.md: clarify page_size applies to incremental streams only
CHANGELOG.md: add 0.2.0 entry
setup.py: bump version to 0.2.0

Pylint 10/10, 171/171 unit tests, 16 SCHEMA + 37 RECORD + 31 STATE.
…t branch

Ensures integration tests run against the in-memory backend version of
tap-tester (without a live target), matching the environment used during
local development and CI validation of this PR.
- Move _schema_type and _generate_value into SendgridBaseTest for reuse
- Remove standalone SendgridDynamicMockGenerationTest class; fold mock
  validation into test_all_fields directly
- Expand bookmark test to cover all 5 incremental streams (was 1)
- Remove all module-level helper functions from test files
- Fix import order (third-party before local) for pylint compliance
- Fix W0221 arguments-differ in test_start_date (class attrs vs properties)
- Remove trailing newlines in all test files
- All 7 integration tests pass; pylint 10/10
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements a Singer-compatible tap-sendgrid with discovery + sync orchestration, stream implementations, JSON schemas, and a full unit/integration test suite to validate pagination, bookmarking, and schema correctness.

Changes:

  • Added core tap runtime: HTTP client (retry/rate-limit), discovery catalog building, and sync/resume orchestration.
  • Implemented SendGrid streams (incremental suppression, full-table marketing/templates/senders, suppression group parent/child) plus registry/shims.
  • Added JSON schemas, extensive unit tests, tap-tester integration tests, and CI configuration/docs.

Reviewed changes

Copilot reviewed 57 out of 60 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/unittests/test_sync_behavior.py Unit coverage for sync/bookmark/count behavior across cursor types and discovery metadata expectations.
tests/unittests/test_streams.py Validates stream registry and suppression group member normalization behavior.
tests/unittests/test_schemas.py Validates schema file presence/shape and key field constraints.
tests/unittests/test_schema_sync.py Unit tests for schema loading and sync orchestration helpers.
tests/unittests/test_schema_extended.py Tests child-stream schema setup behavior.
tests/unittests/test_entrypoint.py Tests CLI entrypoint dispatch for discover vs sync.
tests/unittests/test_discover.py Smoke test for discovery catalog containing expected streams.
tests/unittests/test_client_extended.py Additional client/error-handling and mock-mode tests.
tests/unittests/test_client.py Tests retry-after parsing, GET success path, and rate-limit error path.
tests/unittests/test_abstracts.py Tests abstract stream base behavior, pagination helpers, unix cursor params, and bookmark write.
tests/unittests/init.py Marks unittest package.
tests/test_start_date.py tap-tester integration test for start_date behavior.
tests/test_pagination.py tap-tester integration test for pagination behavior.
tests/test_interrupted_sync.py tap-tester integration test for resume/interrupted sync recovery.
tests/test_discovery.py tap-tester integration test for discovery expectations.
tests/test_bookmark.py tap-tester integration test for incremental bookmarking behavior.
tests/test_automatic_fields.py tap-tester integration test for automatic fields (PK) behavior.
tests/test_all_fields.py tap-tester integration test for all-fields selection behavior.
tests/base.py Base tap-tester harness for metadata expectations and value generation/parsing.
tests/init.py Marks integration test package.
tap_sendgrid/sync.py Implements sync loop with resume ordering + state writes.
tap_sendgrid/streams/templates.py Templates full-table stream with generations param.
tap_sendgrid/streams/suppression_groups.py Suppression groups parent + members child stream normalization.
tap_sendgrid/streams/suppression.py Incremental suppression streams using unix cursor.
tap_sendgrid/streams/senders.py Senders full-table stream with timestamp normalization to ISO.
tap_sendgrid/streams/marketing.py Marketing full-table streams (lists/segments/single sends/stats).
tap_sendgrid/streams/core.py Backward-compatible re-export shim for stream classes.
tap_sendgrid/streams/contacts.py Marketing contacts count + field definitions streams with custom parsing.
tap_sendgrid/streams/abstracts.py Base stream framework: pagination, incremental/full-table sync, bookmarking, counters.
tap_sendgrid/streams/init.py Stream registry + ordering warning for parent/child sync.
tap_sendgrid/schemas/templates.json Templates schema.
tap_sendgrid/schemas/suppression_groups.json Suppression groups schema.
tap_sendgrid/schemas/suppression_group_members.json Suppression group members schema.
tap_sendgrid/schemas/stats_automations.json Stats automations schema.
tap_sendgrid/schemas/spam_reports.json Spam reports schema.
tap_sendgrid/schemas/single_sends.json Single sends schema.
tap_sendgrid/schemas/single_send_stats.json Single send stats schema.
tap_sendgrid/schemas/senders.json Senders schema (timestamps as date-time strings).
tap_sendgrid/schemas/segments.json Segments schema (contacts_count + datetime fields).
tap_sendgrid/schemas/marketing_field_definitions.json Marketing field definitions schema.
tap_sendgrid/schemas/marketing_contacts_count.json Marketing contacts count schema.
tap_sendgrid/schemas/lists.json Lists schema.
tap_sendgrid/schemas/invalid_emails.json Invalid emails schema.
tap_sendgrid/schemas/global_suppressions.json Global suppressions schema.
tap_sendgrid/schemas/bounces.json Bounces schema.
tap_sendgrid/schemas/blocks.json Blocks schema.
tap_sendgrid/schema.py Loads schemas + constructs Singer metadata; sets up child schemas.
tap_sendgrid/exceptions.py Defines typed exception hierarchy + status-code mapping.
tap_sendgrid/discover.py Builds Singer Catalog from schemas/metadata.
tap_sendgrid/client.py HTTP client with auth, backoff, rate-limit handling, and mock-data mode.
tap_sendgrid/init.py Tap entrypoint: discover + main() CLI dispatch.
setup.py Packaging + dependencies + console script entrypoint.
README.md Tap documentation: streams, auth scopes, config, usage, testing.
MANIFEST.in Includes license and schema JSONs in sdist.
LICENSE Adds AGPLv3 license text.
CHANGELOG.md Adds initial changelog entry.
.gitignore Adds Python/test/build artifacts ignores.
.github/pull_request_template.md Adds PR template with AI-generated code checkbox/link.
.github/copilot-instructions.md Repo-specific implementation guidelines for Singer taps/targets.
.circleci/config.yml CI: lint, unit tests, coverage artifacts, and tap-tester integration tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1 to +9
"""Unit tests for JSON schema files.

Validates that all schema files are:
- Valid JSON
- Properly structured (type, properties)
- Free of root-level additionalProperties
- Using date-time format on all datetime string fields
- Using consistent 2-space indentation
"""
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module docstring says schemas are "Free of root-level additionalProperties", but test_schema_no_root_additional_properties enforces that additionalProperties (when present) must be true. Please reconcile the docs/comments so the intended rule is unambiguous.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed:

Validates that all schema files are:

setup.py Outdated
author="Stitch",
url="http://singer.io",
classifiers=["Programming Language :: Python :: 3 :: Only"],
py_modules=["tap_sendgrid"],
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

py_modules=["tap_sendgrid"] points to a top-level tap_sendgrid.py module, but this repo only has the tap_sendgrid/ package directory. This can break packaging/install builds. Remove py_modules (since packages=find_packages() already covers the package) or add the missing module file if that was intended.

Suggested change
py_modules=["tap_sendgrid"],

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and outdated

if stripped.lstrip("-").isdigit():
return int(stripped)
dt = datetime.fromisoformat(stripped.replace("Z", "+00:00"))
return int(dt.replace(tzinfo=timezone.utc).timestamp())
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_unix_timestamp() uses dt.replace(tzinfo=timezone.utc) which changes the instant for inputs with a non-UTC offset (e.g. ...-05:00). This will compute the wrong epoch and can cause incorrect incremental start_time/bookmark behavior. Convert with dt.astimezone(timezone.utc) (or dt.timestamp() on an aware datetime) instead of replacing tzinfo.

Suggested change
return int(dt.replace(tzinfo=timezone.utc).timestamp())
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
else:
dt = dt.astimezone(timezone.utc)
return int(dt.timestamp())

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

@mittal-tushar mittal-tushar Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and outdated

Comment on lines +229 to +258
bookmark = self.get_start_value(state)
if self.cursor_type == "unix" and isinstance(bookmark, str):
bookmark = to_unix_timestamp(bookmark)
current_max = bookmark
record_count = 0

with metrics.record_counter(self.tap_stream_id) as counter:
params = self.get_params_for_sync(bookmark)
for record in self.get_records(params=params, parent_obj=parent_obj):
transformed_record = transformer.transform(
copy.deepcopy(record), self.schema, self.metadata
)
record_value = self.normalize_record_cursor(transformed_record)
if record_value is None or record_value < bookmark:
continue

if self.is_selected():
write_record(self.tap_stream_id, transformed_record)
counter.increment()
record_count += 1

current_max = max(current_max, record_value)
for child in self.child_to_sync:
child.sync(state=state, transformer=transformer, parent_obj=record)

if self.cursor_type == "unix":
bk_str = datetime.fromtimestamp(current_max, tz=timezone.utc).isoformat()
else:
bk_str = str(current_max)
write_bookmark(state, self.tap_stream_id, self.replication_keys[0], bk_str)
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cursor_type == "datetime", bookmark filtering and current_max = max(...) rely on raw string comparison (record_value < bookmark). RFC3339 strings are not safely orderable lexicographically across equivalent-but-different representations (e.g. Z vs +00:00, different offsets), which can lead to skipped or duplicated records and incorrect bookmarks. Parse both bookmark and record cursor to timezone-aware datetime (or normalize to a canonical UTC string) before comparisons/max.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed:

def normalize_record_cursor(self, record: Dict[str, Any]) -> Any:

Comment on lines +14 to +22
bookmark_format = "%Y-%m-%dT%H:%M:%S%z"
initial_bookmarks = {
"bookmarks": {
"blocks": {"created": "2020-01-01T00:00:00.000000Z"},
"bounces": {"created": "2020-01-01T00:00:00.000000Z"},
"spam_reports": {"created": "2020-01-01T00:00:00.000000Z"},
"invalid_emails": {"created": "2020-01-01T00:00:00.000000Z"},
"global_suppressions":{"created": "2020-01-01T00:00:00.000000Z"},
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bookmark_format is %Y-%m-%dT%H:%M:%S%z, but initial_bookmarks values include fractional seconds and a trailing Z (e.g. 2020-01-01T00:00:00.000000Z). This mismatch is likely to fail tap-tester parsing/validation. Either adjust the values to match the declared format (e.g. ...+0000 without microseconds) or update bookmark_format to match the actual timestamp shape used.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed:

"blocks": {"created": "2020-01-01T00:00:00Z"},

Copy link
Copy Markdown

@RushiT0122 RushiT0122 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some request changes:

  • In schema files last lines are missing
  • Fix truncated changelog

Since your base branch is non-master branch, not able to identify breaking changes. Please highlight breaking changes and whether we need major/minor release.

@mittal-tushar
Copy link
Copy Markdown
Author

mittal-tushar commented Mar 31, 2026

Some request changes:

  • In schema files last lines are missing
  • Fix truncated changelog

Since your base branch is non-master branch, not able to identify breaking changes. Please highlight breaking changes and whether we need major/minor release.

Fixed:
a933890
and
3548424

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants