source-google-play: new connector#3091

Merged
Alex-Bair merged 3 commits into main from bair/source-google-play
Jul 28, 2025
Conversation

@Alex-Bair
Member

@Alex-Bair Alex-Bair commented Jul 24, 2025

Description:

This PR's scope includes:

  • Add support for Google service accounts to the CDK.
  • Update most CDK-based connectors' poetry.lock because of ☝️.
  • Introduce a minimal & still in development version of source-google-play.
    • We do not have any Google Play credentials yet; once someone sets up a production capture and we receive some, development on this connector can be completed. There's a best-effort implementation in place, but it's based on possibly incorrect/incomplete documentation, and there are a number of outstanding questions listed in TODO.md that need to be answered before this connector is production ready.

See individual commits for more details.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

Documentation for the source-google-play connector should be created. I'd plan to hold off on creating that documentation until we're able to complete development.

Notes for reviewers:

Couldn't test on a local stack since we don't have any credentials. I tested to confirm general GCS-related interactions work as intended, but I couldn't confirm anything related to how Google Play CSVs are structured or what data they contain.



@Alex-Bair
Member Author

I'll need to update the poetry.locks for CDK connectors since I'm adding google-auth as a dependency to the CDK. I'll tackle that tomorrow morning.

Comment on lines +11 to +14
# Braintree removed the `business_details`, `funding_details`, and `individual_details` for `merchant_accounts` in version 4.35.0 of their
# Python SDK, and I could not find any explanation for why they were removed. Customers still have these fields populated in the API responses,
# so I've pinned the `braintree` package version we're using to version 4.34.0.
braintree = "==4.34.0"
Member Author


For my own future reference, this commit is where these fields were removed from the braintree SDK.

Comment on lines -11 to -13
"enum": [
"OAuth Credentials"
],
Member Author

@Alex-Bair Alex-Bair Jul 25, 2025


The removal of enum from Pydantic's serialization for Literals with a single element was a bug fixed in pydantic/pydantic#10692.

I confirmed the removal of enum here doesn't affect rendering in the UI.

@Alex-Bair Alex-Bair force-pushed the bair/source-google-play branch 2 times, most recently from 87018cf to fb8b5de on July 25, 2025 20:54
@Alex-Bair Alex-Bair marked this pull request as ready for review July 25, 2025 20:54
@Alex-Bair Alex-Bair requested a review from williamhbaker July 25, 2025 20:54
@Alex-Bair Alex-Bair linked an issue Jul 25, 2025 that may be closed by this pull request
Member

@williamhbaker williamhbaker left a comment


LGTM

I think this is a good starting point given the information we have available. Per my comments, we may need to work harder wrt efficiency in object listings and processing, or maybe not. But we can work through that when we have a better idea of what we are dealing with.

return

files: list[GCSFileMetadata] = []
async for file in gcs_client.list_all_files(prefix=model.prefix, globPattern=model.get_glob_pattern()):
Member


I think it may be impractical to do a full file listing every time fetch_resources is invoked. Ideally there would be a way to walk the files in lexical order and read them as they come in from listed pages, emitting checkpoints along the way.

I'm not going to think too hard about that right now since I could be completely wrong and there is a trivial number of files to read & so we might as well not worry about it, but we can revisit later if needed.
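A rough sketch of that resume-as-you-list idea, with a fake in-memory client standing in for the real GCS client. The `list_files` method, the `start_after` parameter, and the file names are all assumptions for illustration, not the actual CDK or GCS client API; the point is just that GCS listings come back in lexical order, so a checkpoint can record the last fully-processed name and a restart can skip everything before it:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FileMeta:
    name: str


class FakeGCSClient:
    """Stand-in for the real GCS client; names here are illustrative."""

    def __init__(self, names):
        self._names = sorted(names)  # GCS listings are lexically ordered

    async def list_files(self, prefix="", start_after=""):
        for name in self._names:
            if name.startswith(prefix) and name > start_after:
                yield FileMeta(name)


async def walk_and_checkpoint(client, prefix, start_after=""):
    """Process files in lexical order, recording a checkpoint (the file
    name) after each file so a restart can resume mid-listing instead of
    re-reading the whole bucket."""
    checkpoints = []
    async for f in client.list_files(prefix=prefix, start_after=start_after):
        # ... read and yield the file's rows here ...
        checkpoints.append(f.name)  # checkpoint: resume after this name
    return checkpoints


client = FakeGCSClient(["stats_202501.csv", "stats_202502.csv", "stats_202503.csv"])
# Resume after the first file, as if restarting from a persisted checkpoint.
done = asyncio.run(walk_and_checkpoint(client, "stats", start_after="stats_202501.csv"))
print(done)  # → ['stats_202502.csv', 'stats_202503.csv']
```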

Member Author


I was unsure if there was some flavor of eventual consistency where files containing data for previous months could be updated after their month has ended. That's why I left it as a full listing here instead of narrowing down the returned files with a globPattern or some other argument. Once I have some credentials, it should be trivial to check if that's possible by comparing the file name's YYYYMM portion to the file's updated metadata. Hopefully files aren't updated after their month has ended and I can use a glob pattern to only get the file for the log_cursor's month.

There's also possible semantic meaning within each CSV - statistics are supposed to have a date field and reviews are supposed to have some kind of "updated_at" field. I'd like to use those to avoid yielding every single row each time we read a file, but it's difficult to do that without knowing what data is actually in these files.
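The consistency check described above could be sketched like this. The `_YYYYMM_` file-name layout is an assumption taken from the Google Play docs, and `updated` stands in for the file's GCS `updated` metadata timestamp:

```python
import re
from datetime import datetime, timezone


def updated_after_its_month(name: str, updated: datetime) -> bool:
    """Return True if a file's GCS `updated` timestamp falls after the
    month encoded in its YYYYMM file-name segment, i.e. the file was
    modified after its month ended. The `_YYYYMM_` layout is an
    assumption based on the Google Play export docs."""
    match = re.search(r"_(\d{4})(\d{2})_", name)
    if not match:
        return False
    year, month = int(match.group(1)), int(match.group(2))
    # First instant of the following month.
    if month == 12:
        month_end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        month_end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return updated >= month_end


# A January file updated in February was touched after its month ended.
print(updated_after_its_month(
    "installs_com.example.app_202501_overview.csv",
    datetime(2025, 2, 3, tzinfo=timezone.utc),
))  # → True
```

If this check never fires against real buckets, narrowing the listing to the log_cursor's month should be safe.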

Member Author


And I'll rename list_all_files to something like list_files so it's clearer that it's not always a full file listing.

return

files: list[GCSFileMetadata] = []
async for file in gcs_client.list_all_files(prefix=model.prefix, globPattern=model.get_glob_pattern(cursor_month)):
Member


Similar to the above, I'm not sure if it will be practical to do a full listing. And an entire month's worth of data sounds like it might be a lot, but also maybe not 🤷

Member Author


The docs say that each CSV contains data for a single month, and they'll have a YYYYMM portion in the file name to indicate which month's data it contains. I'm using that structure here to only get files for a single month with the globPattern. That should return a single file per Google Play app. I don't know how many Google Play apps folks usually have, but I was planning to check that out once we have some credentials.
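A hypothetical sketch of that month-scoped glob. The underscore-delimited `_YYYYMM_` file-name layout is assumed from the docs, not verified against a real bucket, and `month_glob_pattern` is a made-up helper name (the connector's actual method is `get_glob_pattern`):

```python
from fnmatch import fnmatch


def month_glob_pattern(cursor_month: str) -> str:
    """Build a glob that matches only files whose name contains the
    cursor month's YYYYMM segment. File-name layout is an assumption
    from the Google Play export docs."""
    return f"*_{cursor_month}_*"


pattern = month_glob_pattern("202507")
names = [
    "stats/installs/installs_com.example.app_202506_overview.csv",
    "stats/installs/installs_com.example.app_202507_overview.csv",
]
# Only the cursor month's file survives the filter.
matches = [n for n in names if fnmatch(n, pattern)]
print(matches)
```

Server-side, the same pattern could be passed as the listing's glob so only one object per app comes back at all.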


# What hasn't been tested / outstanding questions
- `GCSClient.stream_csv`
- What's the dialect for the CSVs that are in the GCS bucket? Do we need to pass a custom `CSVConfig` into the `IncrementalCSVProcessor`?
Member Author

@Alex-Bair Alex-Bair Jul 25, 2025


A selfish note that answers my own question: I suspect the CSVs use UTF-16 encoding since the docs state:

Tip: If you want to import your reports from Google Cloud Storage into BigQuery, you need to convert the CSV files from UTF-16 to UTF-8.

So a custom CSVConfig likely will be needed since the IncrementalCSVProcessor uses UTF-8 by default.
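A minimal sketch of what that decoding might look like, assuming the exports really are UTF-16. The column names are made up for illustration, and a real fix would pass the encoding through a `CSVConfig` to the `IncrementalCSVProcessor` rather than decode by hand:

```python
import csv
import io


def rows_from_utf16_csv(raw: bytes):
    """Decode a UTF-16 CSV (what the Google Play docs imply the exports
    use) and yield each row as a dict. The "utf-16" codec honors the BOM
    these exports typically start with."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-16")
    yield from csv.DictReader(text)


# Simulated export; real column names are unknown until we have credentials.
raw = "Date,Package Name,Daily Device Installs\n2025-07-01,com.example.app,42\n".encode("utf-16")
print(list(rows_from_utf16_csv(raw)))
```

Decoding the same bytes as UTF-8 would fail on the BOM and the interleaved null bytes, which is exactly why the default `CSVConfig` wouldn't work here.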

Google service account credentials will be needed for the upcoming
`source-google-play` connector. There are also other Google connectors
(like `source-google-analytics-data-api-native` and probably others)
we could update to support service accounts later too.

I considered not relying on the `google-auth` package and manually
exchanging the service account JSON for an access token. But after digging
into how the `google-auth` package handles the process, it's a significant
amount of code that could be pretty fragile if Google decides to tweak
the process in any way.
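For reference, the flow `google-auth` wraps is roughly: build an RS256-signed JWT assertion and POST it to Google's token endpoint. A stdlib sketch of just the claim set below; the RS256 signing and the HTTP exchange, which are the fragile parts `google-auth` handles, are deliberately omitted, and the account email and scope are hypothetical:

```python
import base64
import json
import time


def b64url(data: bytes) -> str:
    # JWTs use unpadded base64url segments.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def build_assertion_parts(client_email: str, scope: str) -> str:
    """Build the unsigned header.claims portion of the JWT assertion that
    the service-account flow POSTs to Google's token endpoint. google-auth
    additionally signs this with the account's RSA private key (RS256),
    which is the part we'd rather not hand-roll."""
    now = int(time.time())
    header = {"alg": "RS256", "typ": "JWT"}
    claims = {
        "iss": client_email,                               # service account email
        "scope": scope,                                    # space-delimited scopes
        "aud": "https://oauth2.googleapis.com/token",      # token endpoint
        "iat": now,
        "exp": now + 3600,  # Google caps assertion lifetime at one hour
    }
    return f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(claims).encode())}"


parts = build_assertion_parts(
    "svc@project.iam.gserviceaccount.com",  # hypothetical account
    "https://www.googleapis.com/auth/devstorage.read_only",
)
```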
Since the `google-auth` package was added to the CDK, all connectors
that use the CDK need their `poetry.lock` updated. Poetry would often
spin forever for native connectors, so I had to delete the existing
`poetry.lock` for most of these and sometimes make the Python constraint
more explicit (like for `source-pendo`).

Some other changes that were required after updating the `poetry.lock`s:
- I updated quite a few snapshots to include an explicit
  `additionalProperties: true` for the `_meta` field.
- I pinned the `braintree` package for `source-braintree-native` to
  version 4.34.0 since their SDK removed some fields in version 4.35.0.
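As an example of the "more explicit Python constraint" change, a hypothetical tightened range in a connector's `pyproject.toml` (the exact bounds are illustrative, not what `source-pendo` actually uses):

```toml
[tool.poetry.dependencies]
# Narrowing an open-ended constraint like ">=3.11" to an explicit range
# helps poetry's resolver terminate instead of exploring endless combinations.
python = ">=3.12,<3.13"
```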
This is a minimal implementation of a connector to capture data from
CSVs in Google Cloud Storage that contain Google Play data.

This connector isn't ready for production yet; we haven't had valid
credentials to develop against. I've made a best-effort attempt at an
implementation based purely on the Google Play docs, but there are
likely aspects of the documentation that are incorrect or incomplete,
and any assumptions we make may not hold. Once we get some credentials
and see what the data actually looks like, we can finish developing
the connector.
@Alex-Bair Alex-Bair force-pushed the bair/source-google-play branch from fb8b5de to 163aedf on July 28, 2025 12:38
@Alex-Bair Alex-Bair merged commit 4db2114 into main Jul 28, 2025
100 of 104 checks passed
@Alex-Bair Alex-Bair deleted the bair/source-google-play branch July 28, 2025 13:48

Development

Successfully merging this pull request may close these issues.

new connector: source-google-play

2 participants