source-google-play: new connector#3091

Merged
Alex-Bair merged 3 commits into main from bair/source-google-play
Jul 28, 2025
Conversation

@Alex-Bair
Member

@Alex-Bair Alex-Bair commented Jul 24, 2025

Description:

This PR's scope includes:

  • Add support for Google service accounts to the CDK.
  • Update most CDK-based connectors' poetry.lock because of ☝️.
  • Introduce a minimal & still in development version of source-google-play.
    • We do not have any Google Play credentials yet; once someone sets up a production capture and we receive some, development on this connector can be completed. There's a best-effort implementation in place, but it's based on possibly incorrect/incomplete documentation, and there are a number of outstanding questions listed in TODO.md that need to be answered before this connector is production ready.

See individual commits for more details.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

Documentation for the source-google-play connector should be created. I'd plan to hold off on creating that documentation until we're able to complete development.

Notes for reviewers:

Couldn't test on a local stack since we don't have any credentials. I tested to confirm general GCS-related interactions work as intended, but I couldn't confirm anything related to how Google Play CSVs are structured or what data they contain.



@Alex-Bair
Member Author

I'll need to update the poetry.locks for CDK connectors since I'm adding google-auth as a dependency to the CDK. I'll tackle that tomorrow morning.

Comment on lines +11 to +14
# Braintree removed the `business_details`, `funding_details`, and `individual_details` for `merchant_accounts` in version 4.35.0 of their
# Python SDK, and I could not find any explanation for why they were removed. Customers still have these fields populated in the API responses,
# so I've pinned the `braintree` package version we're using to version 4.34.0.
braintree = "==4.34.0"
Member Author


For my own future reference, this commit is where these fields were removed from the braintree SDK.

Comment on lines -11 to -13
"enum": [
"OAuth Credentials"
],
Member Author

@Alex-Bair Alex-Bair Jul 25, 2025


The removal of enum from Pydantic's serialization for Literals with a single element was a bug fixed in pydantic/pydantic#10692.

I confirmed the removal of enum here doesn't affect rendering in the UI.

@Alex-Bair Alex-Bair force-pushed the bair/source-google-play branch 2 times, most recently from 87018cf to fb8b5de on July 25, 2025 20:54
@Alex-Bair Alex-Bair marked this pull request as ready for review July 25, 2025 20:54
@Alex-Bair Alex-Bair requested a review from williamhbaker July 25, 2025 20:54
@Alex-Bair Alex-Bair linked an issue Jul 25, 2025 that may be closed by this pull request
Member

@williamhbaker williamhbaker left a comment


LGTM

I think this is a good starting point given the information we have available. Per my comments, we may need to work harder wrt efficiency in object listings and processing, or maybe not. But we can work through that when we have a better idea of what we are dealing with.

return

files: list[GCSFileMetadata] = []
async for file in gcs_client.list_all_files(prefix=model.prefix, globPattern=model.get_glob_pattern()):
Member


I think it may be impractical to do a full file listing every time fetch_resources is invoked. Ideally there would be a way to walk the files in lexical order and read them as they come in from listed pages, emitting checkpoints along the way.

I'm not going to think too hard about that right now since I could be completely wrong and there is a trivial number of files to read & so we might as well not worry about it, but we can revisit later if needed.
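A rough sketch of that resume-as-you-list idea, with a fake in-memory client standing in for the real GCS client. The `list_files` method, the `start_after` parameter, and the file names are all assumptions for illustration, not the actual CDK or GCS client API; the point is just that GCS listings come back in lexical order, so a checkpoint can record the last fully-processed name and a restart can skip everything before it:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FileMeta:
    name: str


class FakeGCSClient:
    """Stand-in for the real GCS client; names here are illustrative."""

    def __init__(self, names):
        self._names = sorted(names)  # GCS listings are lexically ordered

    async def list_files(self, prefix="", start_after=""):
        for name in self._names:
            if name.startswith(prefix) and name > start_after:
                yield FileMeta(name)


async def walk_and_checkpoint(client, prefix, start_after=""):
    """Process files in lexical order, recording a checkpoint (the file
    name) after each file so a restart can resume mid-listing instead of
    re-reading the whole bucket."""
    checkpoints = []
    async for f in client.list_files(prefix=prefix, start_after=start_after):
        # ... read and yield the file's rows here ...
        checkpoints.append(f.name)  # checkpoint: resume after this name
    return checkpoints


client = FakeGCSClient(["stats_202501.csv", "stats_202502.csv", "stats_202503.csv"])
# Resume after the first file, as if restarting from a persisted checkpoint.
done = asyncio.run(walk_and_checkpoint(client, "stats", start_after="stats_202501.csv"))
print(done)  # → ['stats_202502.csv', 'stats_202503.csv']
```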

Member Author


I was unsure if there was some flavor of eventual consistency where files containing data for previous months could be updated after their month has ended. That's why I left it as a full listing here instead of narrowing down the returned files with a globPattern or some other argument. Once I have some credentials, it should be trivial to check if that's possible by comparing the file name's YYYYMM portion to the file's updated metadata. Hopefully files aren't updated after their month has ended and I can use a glob pattern to only get the file for the log_cursor's month.

There's also possible semantic meaning within each CSV - statistics are supposed to have a date field and reviews are supposed to have some kind of "updated_at" field. I'd like to use those to avoid yielding every single row each time we read a file, but it's difficult to do that without knowing what data is actually in these files.
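The consistency check described above could be sketched like this. The `_YYYYMM_` file-name layout is an assumption taken from the Google Play docs, and `updated` stands in for the file's GCS `updated` metadata timestamp:

```python
import re
from datetime import datetime, timezone


def updated_after_its_month(name: str, updated: datetime) -> bool:
    """Return True if a file's GCS `updated` timestamp falls after the
    month encoded in its YYYYMM file-name segment, i.e. the file was
    modified after its month ended. The `_YYYYMM_` layout is an
    assumption based on the Google Play export docs."""
    match = re.search(r"_(\d{4})(\d{2})_", name)
    if not match:
        return False
    year, month = int(match.group(1)), int(match.group(2))
    # First instant of the following month.
    if month == 12:
        month_end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        month_end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return updated >= month_end


# A January file updated in February was touched after its month ended.
print(updated_after_its_month(
    "installs_com.example.app_202501_overview.csv",
    datetime(2025, 2, 3, tzinfo=timezone.utc),
))  # → True
```

If this check never fires against real buckets, narrowing the listing to the log_cursor's month should be safe.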

Member Author


And I'll rename list_all_files to something like list_files so it's clearer that it's not always a full file listing.

return

files: list[GCSFileMetadata] = []
async for file in gcs_client.list_all_files(prefix=model.prefix, globPattern=model.get_glob_pattern(cursor_month)):
Member


Similar to the above, I'm not sure if it will be practical to do a full listing. And an entire month's worth of data sounds like it might be a lot, but also maybe not 🤷

Member Author


The docs say that each CSV contains data for a single month, and they'll have a YYYYMM portion in the file name to indicate which month's data it contains. I'm using that structure here to only get files for a single month with the globPattern. That should return a single file per Google Play app. I don't know how many Google Play apps folks usually have, but I was planning to check that out once we have some credentials.
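A hypothetical sketch of that month-scoped glob. The underscore-delimited `_YYYYMM_` file-name layout is assumed from the docs, not verified against a real bucket, and `month_glob_pattern` is a made-up helper name (the connector's actual method is `get_glob_pattern`):

```python
from fnmatch import fnmatch


def month_glob_pattern(cursor_month: str) -> str:
    """Build a glob that matches only files whose name contains the
    cursor month's YYYYMM segment. File-name layout is an assumption
    from the Google Play export docs."""
    return f"*_{cursor_month}_*"


pattern = month_glob_pattern("202507")
names = [
    "stats/installs/installs_com.example.app_202506_overview.csv",
    "stats/installs/installs_com.example.app_202507_overview.csv",
]
# Only the cursor month's file survives the filter.
matches = [n for n in names if fnmatch(n, pattern)]
print(matches)
```

Server-side, the same pattern could be passed as the listing's glob so only one object per app comes back at all.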


# What hasn't been tested / outstanding questions
- `GCSClient.stream_csv`
- What's the dialect for the CSVs that are in the GCS bucket? Do we need to pass a custom `CSVConfig` into the `IncrementalCSVProcessor`?
Member Author

@Alex-Bair Alex-Bair Jul 25, 2025


A selfish note that answers my own question: I suspect the CSVs use UTF-16 encoding since the docs state:

Tip: If you want to import your reports from Google Cloud Storage into BigQuery, you need to convert the CSV files from UTF-16 to UTF-8.

So a custom CSVConfig likely will be needed since the IncrementalCSVProcessor uses UTF-8 by default.
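A minimal sketch of what that decoding might look like, assuming the exports really are UTF-16. The column names are made up for illustration, and a real fix would pass the encoding through a `CSVConfig` to the `IncrementalCSVProcessor` rather than decode by hand:

```python
import csv
import io


def rows_from_utf16_csv(raw: bytes):
    """Decode a UTF-16 CSV (what the Google Play docs imply the exports
    use) and yield each row as a dict. The "utf-16" codec honors the BOM
    these exports typically start with."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-16")
    yield from csv.DictReader(text)


# Simulated export; real column names are unknown until we have credentials.
raw = "Date,Package Name,Daily Device Installs\n2025-07-01,com.example.app,42\n".encode("utf-16")
print(list(rows_from_utf16_csv(raw)))
```

Decoding the same bytes as UTF-8 would fail on the BOM and the interleaved null bytes, which is exactly why the default `CSVConfig` wouldn't work here.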

Google service account credentials will be needed for the upcoming
`source-google-play` connector. There are also other Google connectors
(like `source-google-analytics-data-api-native` and probably others)
we could update to support service accounts later too.

I considered not relying on the `google-auth` package and manually
exchanging the service account JSON for an access token. But after digging
into how the `google-auth` package handles the process, it's a significant
amount of code that could be pretty fragile if Google decides to tweak
the process in any way.
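For reference, the flow `google-auth` wraps is roughly: build an RS256-signed JWT assertion and POST it to Google's token endpoint. A stdlib sketch of just the claim set below; the RS256 signing and the HTTP exchange, which are the fragile parts `google-auth` handles, are deliberately omitted, and the account email and scope are hypothetical:

```python
import base64
import json
import time


def b64url(data: bytes) -> str:
    # JWTs use unpadded base64url segments.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def build_assertion_parts(client_email: str, scope: str) -> str:
    """Build the unsigned header.claims portion of the JWT assertion that
    the service-account flow POSTs to Google's token endpoint. google-auth
    additionally signs this with the account's RSA private key (RS256),
    which is the part we'd rather not hand-roll."""
    now = int(time.time())
    header = {"alg": "RS256", "typ": "JWT"}
    claims = {
        "iss": client_email,                               # service account email
        "scope": scope,                                    # space-delimited scopes
        "aud": "https://oauth2.googleapis.com/token",      # token endpoint
        "iat": now,
        "exp": now + 3600,  # Google caps assertion lifetime at one hour
    }
    return f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(claims).encode())}"


parts = build_assertion_parts(
    "svc@project.iam.gserviceaccount.com",  # hypothetical account
    "https://www.googleapis.com/auth/devstorage.read_only",
)
```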
Since the `google-auth` package was added to the CDK, all connectors
that use the CDK need their `poetry.lock` updated. Poetry would often
spin forever for native connectors, so I had to delete the existing
`poetry.lock` for most of these and sometimes make the Python constraint
more explicit (like for `source-pendo`).

Some other changes that were required after updating the `poetry.lock`s:
- I updated quite a few snapshots to include an explicit
  `additionalProperties: true` for the `_meta` field.
- I pinned the `braintree` package for `source-braintree-native` to
  version 4.34.0 since their SDK removed some fields in version 4.35.0.
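As an example of the "more explicit Python constraint" change, a hypothetical tightened range in a connector's `pyproject.toml` (the exact bounds are illustrative, not what `source-pendo` actually uses):

```toml
[tool.poetry.dependencies]
# Narrowing an open-ended constraint like ">=3.11" to an explicit range
# helps poetry's resolver terminate instead of exploring endless combinations.
python = ">=3.12,<3.13"
```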
This is a minimal implementation of a connector to capture data from
CSVs in Google Cloud Storage that contain Google Play data.

This connector isn't ready for production yet; we haven't had valid
credentials to develop against. I've made a best-effort attempt at an
implementation based purely on the Google Play docs, but there are
likely aspects of the documentation that are incorrect or incomplete,
and any assumptions we make may not hold. Once we get some credentials
and see what the data actually looks like, we can finish developing
the connector.
@Alex-Bair Alex-Bair force-pushed the bair/source-google-play branch from fb8b5de to 163aedf on July 28, 2025 12:38
@Alex-Bair Alex-Bair merged commit 4db2114 into main Jul 28, 2025
100 of 104 checks passed
@Alex-Bair Alex-Bair deleted the bair/source-google-play branch July 28, 2025 13:48

Development

Successfully merging this pull request may close these issues.

new connector: source-google-play

2 participants