Conversation
Force-pushed 14770c3 to d24b61f
Force-pushed d24b61f to 9677ff0
```toml
# Braintree removed the `business_details`, `funding_details`, and `individual_details` for `merchant_accounts` in version 4.35.0 of their
# Python SDK, and I could not find any explanation for why they were removed. Customers still have these fields populated in the API responses,
# so I've pinned the `braintree` package version we're using to version 4.34.0.
braintree = "==4.34.0"
```
For my own future reference, this commit is where these fields were removed from the braintree SDK.
```json
"enum": [
    "OAuth Credentials"
],
```
Pydantic stopped emitting `enum` when serializing single-element `Literal`s as part of a bug fix in pydantic/pydantic#10692.
I confirmed the removal of enum here doesn't affect rendering in the UI.
Force-pushed 87018cf to fb8b5de
williamhbaker left a comment:
LGTM
I think this is a good starting point given the information we have available. Per my comments, we may need to work harder wrt efficiency in object listings and processing, or maybe not. But we can work through that when we have a better idea of what we are dealing with.
```python
        return

    files: list[GCSFileMetadata] = []
    async for file in gcs_client.list_all_files(prefix=model.prefix, globPattern=model.get_glob_pattern()):
```
I think it may be impractical to do a full file listing every time fetch_resources is invoked. Ideally there would be a way to walk the files in lexical order and read them as they come in from listed pages, emitting checkpoints along the way.
I'm not going to think too hard about that right now since I could be completely wrong and there is a trivial number of files to read & so we might as well not worry about it, but we can revisit later if needed.
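To make the idea concrete, here's a minimal sketch of walking a lexically ordered, paginated listing and checkpointing after each page instead of materializing the full listing up front. The page size, the fake lister, and the checkpoint shape are all assumptions for illustration, not the real connector's API:

```python
import asyncio
from typing import AsyncIterator

PAGE_SIZE = 2  # assumed page size for the sketch; real listings page in the hundreds

async def list_pages(names: list[str]) -> AsyncIterator[list[str]]:
    # Stand-in for a paginated GCS listing: object listings come back in
    # lexicographic order, so sorting here mimics what the service would do.
    ordered = sorted(names)
    for i in range(0, len(ordered), PAGE_SIZE):
        yield ordered[i : i + PAGE_SIZE]

async def process_with_checkpoints(names: list[str]) -> list[str]:
    checkpoints: list[str] = []
    async for page in list_pages(names):
        for name in page:
            pass  # read & emit this file's rows here
        # Checkpoint the last name seen so a restart can resume after it
        # (e.g. via a startOffset-style listing argument).
        checkpoints.append(page[-1])
    return checkpoints

checkpoints = asyncio.run(
    process_with_checkpoints(["stats_202402.csv", "stats_202401.csv", "stats_202403.csv"])
)
print(checkpoints)  # ['stats_202402.csv', 'stats_202403.csv']
```

The upside is bounded memory and resumability mid-listing; the cost is that a checkpointed cursor only works if the listing order is stable across invocations.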
I was unsure if there was some flavor of eventual consistency where files containing data for previous months could be updated after their month has ended. That's why I left it as a full listing here instead of narrowing down the returned files with a globPattern or some other argument. Once I have some credentials, it should be trivial to check if that's possible by comparing the file name's YYYYMM portion to the file's updated metadata. Hopefully files aren't updated after their month has ended and I can use a glob pattern to only get the file for the log_cursor's month.
There's also possible semantic meaning within each CSV - statistics are supposed to have a date field and reviews are supposed to have some kind of "updated_at" field. I'd like to use those to avoid yielding every single row each time we read a file, but it's difficult to do that without knowing what data is actually in these files.
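Once credentials are available, the check described above could look something like this. The file-name shape (`installs_overview_202401.csv`) is an assumption; the real naming may differ:

```python
from datetime import datetime, timezone

def updated_after_month_ended(name: str, updated: datetime) -> bool:
    """Hypothetical check: was this file modified after its YYYYMM month ended?

    Assumes file names embed the month as a six-digit YYYYMM run, e.g.
    'installs_overview_202401.csv'.
    """
    # Pull the YYYYMM portion out of the file name.
    digits = next(
        part for part in name.replace(".", "_").split("_")
        if len(part) == 6 and part.isdigit()
    )
    year, month = int(digits[:4]), int(digits[4:])
    # First instant of the following month, i.e. when this file's month "ended".
    if month == 12:
        month_end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        month_end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return updated >= month_end

# A file for January 2024 updated on Feb 3 would indicate late updates do happen.
print(updated_after_month_ended("installs_overview_202401.csv", datetime(2024, 2, 3, tzinfo=timezone.utc)))  # True
```

If this returns False for every file in the bucket, the month-scoped glob pattern is safe.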
And I'll rename list_all_files to something like list_files so it's clearer that it's not always a full file listing.
```python
        return

    files: list[GCSFileMetadata] = []
    async for file in gcs_client.list_all_files(prefix=model.prefix, globPattern=model.get_glob_pattern(cursor_month)):
```
Similar to the above, I'm not sure if it will be practical to do a full listing. And an entire month's worth of data sounds like it might be a lot, but also maybe not 🤷
The docs say that each CSV contains data for a single month, and they'll have a YYYYMM portion in the file name to indicate which month's data it contains. I'm using that structure here to only get files for a single month with the globPattern. That should return a single file per Google Play app. I don't know how many Google Play apps folks usually have, but I was planning to check that out once we have some credentials.
```markdown
# What hasn't been tested / outstanding questions

- `GCSClient.stream_csv`
  - What's the dialect for the CSVs that are in the GCS bucket? Do we need to pass a custom `CSVConfig` into the `IncrementalCSVProcessor`?
```
A selfish note that answers my own question: I suspect the CSVs use UTF-16 encoding since the docs state:

> Tip: If you want to import your reports from Google Cloud Storage into BigQuery, you need to convert the CSV files from UTF-16 to UTF-8.

So a custom `CSVConfig` will likely be needed since the `IncrementalCSVProcessor` uses UTF-8 by default.
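A quick demonstration of the concern: the same CSV bytes parse cleanly as UTF-16 but fail outright as UTF-8. The column names here are invented for the example, not the real report schema:

```python
import csv
import io

# Encode a small CSV the way the Google Play reports are suspected to be encoded.
raw = "Date,Package Name,Daily Device Installs\n2024-01-01,com.example.app,42\n".encode("utf-16")

# Decoding with the wrong codec fails outright: UTF-16 output starts with a
# BOM (0xFF 0xFE) that is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the right codec parses normally.
rows = list(csv.DictReader(io.StringIO(raw.decode("utf-16"))))
print(utf8_ok, rows[0]["Daily Device Installs"])
```

So whatever `CSVConfig` override ends up being used, it would need to apply the UTF-16 decode before row parsing starts.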
Google service account credentials will be needed for the upcoming `source-google-play` connector. There are also other Google connectors (like `source-google-analytics-data-api-native` and probably others) we could update to support service accounts later too. I considered not relying on the `google-auth` package and instead manually exchanging the service account JSON for an access token. But after digging into how the `google-auth` package handles the process, it's a significant amount of code that could be pretty fragile if Google decides to tweak the process in any way.
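For context on why the manual route was rejected, here's a sketch of just the first step of that exchange: building the JWT signing input from the service account JSON. The RS256 signature and the token POST (the fragile parts that make `google-auth` worth keeping) are deliberately omitted, and the email/scope values are placeholders:

```python
import base64
import json
import time

def b64url(data: bytes) -> str:
    # JWTs use unpadded base64url segments.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def jwt_signing_input(client_email: str, scope: str, token_uri: str, now: int) -> str:
    header = {"alg": "RS256", "typ": "JWT"}
    claims = {
        "iss": client_email,  # the service account identity
        "scope": scope,       # e.g. a GCS read scope
        "aud": token_uri,     # usually https://oauth2.googleapis.com/token
        "iat": now,
        "exp": now + 3600,    # access tokens are capped at one hour
    }
    return f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(claims).encode())}"
    # Omitted: RS256-sign this string with the service account's private key,
    # then POST the signed JWT with
    # grant_type=urn:ietf:params:oauth:grant-type:jwt-bearer to token_uri.

signing_input = jwt_signing_input(
    "sa@project.iam.gserviceaccount.com",  # placeholder service account email
    "https://www.googleapis.com/auth/devstorage.read_only",
    "https://oauth2.googleapis.com/token",
    int(time.time()),
)
```

Even this trimmed version hints at the surface area involved; `google-auth` also handles clock skew, retries, and credential refresh, which is what tipped the decision toward keeping the dependency.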
Since the `google-auth` package was added to the CDK, all connectors that use the CDK need their `poetry.lock` updated. Poetry would often spin forever for native connectors, so I had to delete the existing `poetry.lock` for most of these and sometimes make the Python constraint more explicit (like for `source-pendo`). Some other changes that were required after updating the `poetry.lock`s:

- I updated quite a few snapshots to include an explicit `additionalProperties: true` for the `_meta` field.
- I pinned the `braintree` package for `source-braintree-native` to version 4.34.0 since their SDK removed some fields in version 4.35.0.
This is a minimal implementation of a connector to capture data from CSVs in Google Cloud Storage that contain Google Play data. This connector isn't actually production-ready yet; we haven't had valid credentials to develop against. I've made a best-effort attempt at an implementation purely based on the Google Play docs, but there are likely aspects of the documentation that are incorrect or incomplete, and any assumptions we make likely won't hold. Once we get some credentials and see what the data actually looks like, we can finish developing the connector.
Force-pushed fb8b5de to 163aedf
Description:

This PR's scope includes:

- `poetry.lock` updates because of ☝️.
- `source-google-play`.
- A `TODO.md` of questions that need answered before this connector is production ready.

See individual commits for more details.
Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:

Documentation for the `source-google-play` connector should be created. I'd plan to hold off on creating that documentation until we're able to complete development.

Notes for reviewers:
Couldn't test on a local stack since we don't have any credentials. I tested to confirm general GCS related interactions work as intended, but I couldn't confirm anything related to how Google Play CSVs are structured or what data they contain.