source-google-play: complete initial development by Alex-Bair · Pull Request #3136 · estuary/connectors

Alex-Bair · 2025-08-05T23:35:19Z

Description:

This PR's scope includes:

Adding a convenience BaseCSVRow class in the CDK to make transforming null value representations in CSVs easier and more consistent across connectors.
Completing development for source-google-play that was started in source-google-play: new connector #3091.
- Most of the changes were ones I anticipated on the initial implementation. See the commit message for more details, but the high level summary is:
  - reviews can be incrementally captured within each file since each row has an "updated_at" type field.
  - Column names are inconsistently cased, which would inevitably cause issues downstream. The predominant casing used for fields is Title Case, so I'm transforming all fields to abide by that convention.
  - We don't need to add "Row Number" to statistics since there already exists a sufficient composite primary key within each row.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

Documentation should be created for source-google-play.

Notes for reviewers:

Tested on a local stack. Confirmed:

All expected documents are captured.
Backfills complete as expected.
Incremental tasks capture as expected, albeit I only watched it for an hour or so.
- These CSVs are likely updated once per day, so I'll keep an eye on production captures to make sure their incremental replication works as expected over multiple days.

This change is

I've observed a pattern of source system representing `null` values with empty strings across a couple connectors. Instead of requiring the connector developer to remember to handle that transformation (which I don't immediately remember during development), the `BaseCSVRow` can be used to automatically handle null values.

This commit finishes the initial development for `source-google-play`. Some notable decisions made include: - Title casing all field names. The CSV column headers are not consistently named across files. Although I had hoped to avoid transformations as much as possible, ensuring fields are consistently named makes downstream processing easier for users plus it allows us to reuse more code in the connector (ex: primary keys are the same, model field definitions are simpler, etc.). - The `_overview` suffix is used for statistics files that aren't split on dimensions, while there is no suffix for reviews that aren't split on dimensions. There _are_ other files in the bucket containing data split by certain dimensions, and it's very easy to add another binding to capture these by overriding the `suffix` class variable for a given resource. Those additional bindings aren't needed right now, but they'll be easy to add in the future if someone asks for them later. - Reviews have an "updated_at" type of field that appears to always be present. This means that instead of yielding every row of an updated file, we can instead only yield rows that have been updated since the previous sweep. - The "Row Number" field doesn't need to be part of any `Statistics` primary key since the "Date" and "Package Name" uniquely identify a row already. No such combination of unique identifiers exist for "Reviews", so we still add "Row Number" into those documents.

Since the initial development of source-google-play is complete, these files are no longer needed/weren't needed in the first place.

Documentation for estuary/connectors#3136.

JustinASmith

LGTM % one question

JustinASmith · 2025-08-07T13:24:32Z


+        return pattern
+
+    package_name: str = Field(alias=PACKAGE_NAME_FIELD)


Does the model need the ConfigDict set up with serialize_by_alias=True so that the serialized JSON uses the alias Package Name consistently. https://docs.pydantic.dev/latest/api/config/#pydantic.config.ConfigDict.serialize_by_alias says that version 2 of Pydantic this is False by default, so model.model_dump() would presumably give #> {'package_name': 'some.package.com'}.

As a dummy example I tried the following:

from pydantic import BaseModel, Field, model_validator, ValidationInfo from typing import Any, ClassVar PACKAGE_NAME_FIELD = "Package Name" class Statistics(BaseModel, extra="allow"): suffix: ClassVar[str | None] = "overview" primary_keys: ClassVar[list[str]] = ["/Date", f"/{PACKAGE_NAME_FIELD}"] date: str = Field(alias="Date") @model_validator(mode="before") @classmethod def _normalize_field_names(cls, data: dict[str, Any], info: ValidationInfo) -> dict[str, Any]: normalized_data: dict[str, Any] = {} for key, value in data.items(): normalized_key = key.title() normalized_data[normalized_key] = value return normalized_data s1 = Statistics( Date="2023-10-01", **{PACKAGE_NAME_FIELD: "com.example.app"} ) s1.model_dump() #> {'date': '2023-10-01', 'Package Name': 'com.example.app'}

Scratch that! I just realized the CDK handles this by default: https://github.com/search?q=repo%3Aestuary%2Fconnectors%20path%3A%2F%5Eestuary-cdk%5C%2F%2F%20by_alias&type=code

JustinASmith · 2025-08-07T13:29:07Z

@Alex-Bair Sorry I just thought of one more question. Was there a reason these acmeCo/*.yaml were removed?

If we don't have a capture snapshot test, the various acmeCo/*.yaml files aren't needed since they're only used when we run the flowctl preview command. When we don't have a capture snapshot test, we can just leave bindings: [] (empty) in test.flow.yaml, and the spec & discover snapshot tests still work fine.

I haven't been extremely diligent to avoid adding these files when they're not needed, but I'm trying to keep it in mind more moving forward to avoid some clutter.

Documentation for estuary/connectors#3136.

Alex-Bair added 3 commits August 5, 2025 19:23

source-google-play: remove unused development files

8fdac00

Since the initial development of source-google-play is complete, these files are no longer needed/weren't needed in the first place.

Alex-Bair changed the title ~~Bair/source google play complete initial development~~ source-google-play: complete initial development Aug 5, 2025

Alex-Bair added a commit to estuary/flow that referenced this pull request Aug 6, 2025

docs: new connector source-google-play

376d704

Documentation for estuary/connectors#3136.

Alex-Bair mentioned this pull request Aug 6, 2025

docs: new connector source-google-play estuary/flow#2318

Merged

Alex-Bair marked this pull request as ready for review August 6, 2025 00:06

Alex-Bair requested a review from JustinASmith August 6, 2025 00:06

Alex-Bair added a commit to estuary/flow that referenced this pull request Aug 6, 2025

docs: new connector source-google-play

d100ca5

Documentation for estuary/connectors#3136.

JustinASmith approved these changes Aug 7, 2025

View reviewed changes

JustinASmith reviewed Aug 7, 2025

View reviewed changes

Alex-Bair merged commit 09fdeed into main Aug 7, 2025
97 of 106 checks passed

Alex-Bair deleted the bair/source-google-play-complete-initial-development branch August 7, 2025 14:25

Alex-Bair added a commit to estuary/flow that referenced this pull request Aug 8, 2025

docs: new connector source-google-play

5778b57

Documentation for estuary/connectors#3136.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

source-google-play: complete initial development#3136

source-google-play: complete initial development#3136
Alex-Bair merged 3 commits intomainfrom
bair/source-google-play-complete-initial-development

Alex-Bair commented Aug 5, 2025 •

edited

Loading

Uh oh!

JustinASmith left a comment

Uh oh!

JustinASmith Aug 7, 2025

Uh oh!

JustinASmith Aug 7, 2025

Uh oh!

JustinASmith Aug 7, 2025

Uh oh!

Alex-Bair Aug 7, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		return pattern

		package_name: str = Field(alias=PACKAGE_NAME_FIELD)

Conversation

Alex-Bair commented Aug 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JustinASmith left a comment

Choose a reason for hiding this comment

Uh oh!

JustinASmith Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

JustinASmith Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

JustinASmith Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

Alex-Bair Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Alex-Bair commented Aug 5, 2025 •

edited

Loading