source-apple-app-store: finalize implementation#3216
Conversation
Introduces a new utility that leverages the built-in `zlib` library and provides the ability to stream decompressed chunks from a GZIP file without loading the entire archive into memory.
Report instance segments are signed URLs with URL parameters set for authentication. Setting authorization headers on top of this causes HTTP 400 errors. This update ensures we do not use the JWT token in this request. Additionally, the response of the request is a Gzip encoded file with the `Content-Type` header set to `application/gzip` which cannot be automatically decoded by `aiohttp`. The `GunzipStream` is set up to handle the decompression instead.
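The `GunzipStream` source isn't shown in this thread; a minimal sketch of streaming gzip decompression with the stdlib `zlib` module (the general approach the PR describes, not the CDK's actual implementation) could look like:

```python
import gzip
import zlib

def gunzip_stream(chunks):
    # Illustrative sketch: incrementally inflate gzip-encoded chunks
    # without buffering the whole archive in memory.
    # wbits=MAX_WBITS | 16 tells zlib to expect a gzip header/trailer.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    for chunk in chunks:
        yield decompressor.decompress(chunk)
    # Emit any bytes still buffered inside the decompressor.
    yield decompressor.flush()

# Round-trip a payload through small 5-byte chunks.
payload = b"line1\nline2\n" * 1000
compressed = gzip.compress(payload)
chunks = [compressed[i:i + 5] for i in range(0, len(compressed), 5)]
assert b"".join(gunzip_stream(chunks)) == payload
```

In an `aiohttp` response this would be fed from the response's chunked byte iterator rather than a list.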
…nd use detailed analytics reports

Uses `filename` and `row_number` for the document keys for Analytics Reports. This has proven more reliable than using columns from the analytics report files themselves. This also updates the models to use the "detailed" reports, which offer more information than the "standard" ones.
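The filename/`row_number` keying could be sketched roughly like this (the helper name and CSV handling here are assumptions for illustration, not the connector's actual code):

```python
import csv
import io

def keyed_rows(filename, csv_text):
    # Hypothetical sketch: pair each parsed CSV row with the report's
    # filename and a synthetic row_number so the two together form a
    # unique, stable document key even when row fields are empty/null.
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        yield {**row, "filename": filename, "row_number": row_number}
```

For example, `keyed_rows("reports/app/daily.csv.gz", "Date,Installs\n2024-01-01,10\n")` yields one document with `row_number` 0 and the source filename attached.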
Force-pushed 67003ec to 590e602
Alex-Bair left a comment:
LGTM, modulo a question around backfill completion when there's no ongoing report. I'm hazy on the various delays Apple can have in creating various reports, so I may be missing a reason for the backfill to finish early if no ongoing report exists.
```python
params: dict[str, Any] | None = None,
json: dict[str, Any] | None = None,
form: dict[str, Any] | None = None,
_with_token: bool = True,  # Unstable internal API.
```
nit: Since _with_token is now part of the public interface, could the leading underscore be removed so it's just with_token?
```python
if not chunk:
    continue
```
nit: When would chunk be falsy and cause us to hit continue? From the GunzipStream implementation, it looks like it always yields a chunk of some kind.
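For context on this question: zlib's incremental decompressor can legitimately return empty bytes when a fed input chunk completes no output (for example, a chunk containing only the gzip header), so an empty-chunk guard is plausible depending on how `GunzipStream` feeds the decompressor. A quick demonstration with the stdlib:

```python
import gzip
import zlib

# Feeding only the 10-byte gzip header produces no decompressed output,
# so a streaming decompressor can yield empty bytes for some inputs.
compressed = gzip.compress(b"hello world")
decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
assert decompressor.decompress(compressed[:10]) == b""
```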
```python
@model_validator(mode="after")
def extract_filename_from_url(self):
    from urllib.parse import urlparse
```
nit: What was the motivation to import urlparse within the function rather than at the top of the file? IMO it's easier to track imports when they're grouped together at the top.
```python
import csv
import os
```
nit: It looks like the csv and os imports are unused in this file.
```python
app_id: str
record_date: date = Field(..., alias="Date")
filename: str = Field(default="")
```
nit: Does filename need to have a default value in the model? The _add_row_metadata before model validator looks like it always adds filename before validation and the default value is never used.
```python
if not ongoing_report_exists:
    log.warning(
        f"Skipping backfill for {model.report_name} since no ONGOING report request exists. "
        "Backfill requires an existing ONGOING report request to ensure proper data continuity.",
        {
            "app_id": app_id,
            "report_name": model.report_name,
        },
    )
    return
```
If there's not an ongoing report, does this mean the backfill completes early/doesn't capture any data? From the code, I think that's what's happening. Completing the backfill without capturing any data sounds like it'd be difficult to determine whether a completed backfill means "we got all historical data" vs. "we skipped the backfill & someone would need to re-backfill later to get historical data". Instead, should the connector wait somehow for an ongoing report to exist, then capture data for the backfill?
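One possible shape for the suggested wait, sketched with hypothetical names (none of these are from the connector): poll until an ONGOING report request exists before starting the backfill, rather than completing the backfill with no data captured.

```python
import time

def wait_for_ongoing_report(check_exists, poll_interval=60.0, max_attempts=10):
    # Hypothetical helper: repeatedly check whether an ONGOING report
    # request exists, sleeping between attempts. Returns True once one
    # is found, or False after max_attempts checks.
    for _ in range(max_attempts):
        if check_exists():
            return True
        time.sleep(poll_interval)
    return False
```

In the connector this would more likely be an async task checkpoint than a blocking sleep, but the shape of the wait is the same.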
Force-pushed 54a5faa to d2b4cd5
…nd processing strategy

Minor refactors to use the updated models, fields, and report processing flow. Instead of a complicated paging strategy that waits some number of days based on a completeness lag, reports are processed as long as there is a report with a processing date <= `cutoff`.
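The cutoff rule described above could be sketched as follows (field names and data structures are assumed for illustration, not taken from the connector):

```python
from datetime import date

def reports_to_process(instances, cutoff):
    # Illustrative sketch: yield report instances in chronological
    # order while their processing date is on or before the cutoff;
    # anything newer is left for a later sweep.
    for instance in sorted(instances, key=lambda i: i["processingDate"]):
        if instance["processingDate"] > cutoff:
            break
        yield instance
```

This replaces the older lag-based paging: the loop simply stops at the first report past the cutoff.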
Force-pushed d2b4cd5 to e8927c4
Description:
This PR finalizes the implementation of the Apple App Store capture connector after observing it in use and learning more about the quirks of Apple's API and its asynchronous Analytics Report endpoints.
Introduces a `GunzipStream` class in the CDK for streaming uncompressed bytes from a Gzip encoded file returned from APIs with the `Content-Type` header set to `application/gzip`. This is not automatically decompressed by `aiohttp` and requires manual decompression, which this class performs via the built-in `zlib` Python library.

Additionally, this PR simplifies the analytics report incremental streams' page and log cursors to only change when we process a report. I.e., we assume (and have observed this behavior thus far) that the analytics reports are processed and provided via the API in chronological / monotonically increasing order. Upon seeing a report with a `processingDate` within the window we are backfilling or incrementally capturing, we process that report and then yield that date as the next cursor. Backfill will stop when the last processed date is equal to the cutoff.

Lastly, the collection key for analytics report documents is the filename path provided by Apple in the AWS-Signed URL when finding report instances to process, combined with a synthetic `row_number` as an offset within that particular file. For example: `{..., "filename": "reports/<app_id>/discovery_and_engagement_detail/daily/snapshot/<date>/<some_id>/<some_unique_filename>.csv.gz", "row_number": 42936}`. This ensures we have a unique and distinguishable collection key. Note: when working with the data, it was observed that certain fields can be `null` when they were expected to be valid string values required for a proper composite key built from the file's data. For that reason, I have selected to use the filename/`row_number` pair as stated.

Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
New documentation: estuary/flow#2364
Notes for reviewers:
(anything that might help someone review this PR)