Skip to content

source-google-play: complete initial development#3136

Merged
Alex-Bair merged 3 commits intomainfrom
bair/source-google-play-complete-initial-development
Aug 7, 2025
Merged

source-google-play: complete initial development#3136
Alex-Bair merged 3 commits intomainfrom
bair/source-google-play-complete-initial-development

Conversation

@Alex-Bair
Copy link
Copy Markdown
Member

@Alex-Bair Alex-Bair commented Aug 5, 2025

Description:

This PR's scope includes:

  • Adding a convenience BaseCSVRow class in the CDK to make transforming null value representations in CSVs easier and more consistent across connectors.
  • Completing development for source-google-play that was started in source-google-play: new connector #3091.
    • Most of the changes were ones I anticipated on the initial implementation. See the commit message for more details, but the high level summary is:
      • reviews can be incrementally captured within each file since each row has an "updated_at" type field.
      • Column names are inconsistently cased, which would inevitably cause issues downstream. The predominant casing used for fields is Title Case, so I'm transforming all fields to abide by that convention.
      • We don't need to add "Row Number" to statistics since there already exists a sufficient composite primary key within each row.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

Documentation should be created for source-google-play.

Notes for reviewers:

Tested on a local stack. Confirmed:

  • All expected documents are captured.
  • Backfills complete as expected.
  • Incremental tasks capture as expected, albeit I only watched it for an hour or so.
    • These CSVs are likely updated once per day, so I'll keep an eye on production captures to make sure their incremental replication works as expected over multiple days.

This change is Reviewable

I've observed a pattern of source system representing `null` values with empty strings across a couple connectors. Instead of requiring the connector developer to remember to handle that transformation (which I don't immediately remember during development), the `BaseCSVRow` can be used to automatically handle null values.
This commit finishes the initial development for `source-google-play`. Some notable decisions made include:
- Title casing all field names. The CSV column headers are not
  consistently named across files. Although I had hoped to avoid
  transformations as much as possible, ensuring fields are consistently
  named makes downstream processing easier for users plus it allows us
  to reuse more code in the connector (ex: primary keys are the same,
  model field definitions are simpler, etc.).
- The `_overview` suffix is used for statistics files that aren't split
  on dimensions, while there is no suffix for reviews that aren't split
  on dimensions. There _are_ other files in the bucket containing data
  split by certain dimensions, and it's very easy to add another binding
  to capture these by overriding the `suffix` class variable for a given
  resource. Those additional bindings aren't needed right now, but
  they'll be easy to add in the future if someone asks for them later.
- Reviews have an "updated_at" type of field that appears to always be
  present. This means that instead of yielding every row of an updated
  file, we can instead only yield rows that have been updated since the
  previous sweep.
- The "Row Number" field doesn't need to be part of any `Statistics`
  primary key since the "Date" and "Package Name" uniquely identify a
  row already. No such combination of unique identifiers exist for
  "Reviews", so we still add "Row Number" into those documents.
Since the initial development of source-google-play is complete, these
files are no longer needed/weren't needed in the first place.
@Alex-Bair Alex-Bair changed the title Bair/source google play complete initial development source-google-play: complete initial development Aug 5, 2025
Alex-Bair added a commit to estuary/flow that referenced this pull request Aug 6, 2025
@Alex-Bair Alex-Bair marked this pull request as ready for review August 6, 2025 00:06
@Alex-Bair Alex-Bair requested a review from JustinASmith August 6, 2025 00:06
Alex-Bair added a commit to estuary/flow that referenced this pull request Aug 6, 2025
Copy link
Copy Markdown
Contributor

@JustinASmith JustinASmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % one question


return pattern

package_name: str = Field(alias=PACKAGE_NAME_FIELD)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the model need the ConfigDict set up with serialize_by_alias=True so that the serialized JSON uses the alias Package Name consistently. https://docs.pydantic.dev/latest/api/config/#pydantic.config.ConfigDict.serialize_by_alias says that version 2 of Pydantic this is False by default, so model.model_dump() would presumably give #> {'package_name': 'some.package.com'}.

As a dummy example I tried the following:

from pydantic import BaseModel, Field, model_validator, ValidationInfo
from typing import Any, ClassVar

PACKAGE_NAME_FIELD = "Package Name"

class Statistics(BaseModel, extra="allow"):
    suffix: ClassVar[str | None] = "overview"
    primary_keys: ClassVar[list[str]] = ["/Date", f"/{PACKAGE_NAME_FIELD}"]
    date: str = Field(alias="Date")

    @model_validator(mode="before")
    @classmethod
    def _normalize_field_names(cls, data: dict[str, Any], info: ValidationInfo) -> dict[str, Any]:
        normalized_data: dict[str, Any] = {}
        for key, value in data.items():
            normalized_key = key.title()
            normalized_data[normalized_key] = value

        return normalized_data
        
s1 = Statistics(
    Date="2023-10-01",
    **{PACKAGE_NAME_FIELD: "com.example.app"}
)

s1.model_dump()

#> {'date': '2023-10-01', 'Package Name': 'com.example.app'}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Alex-Bair Sorry I just thought of one more question. Was there a reason these acmeCo/*.yaml were removed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have a capture snapshot test, the various acmeCo/*.yaml files aren't needed since they're only used when we run the flowctl preview command. When we don't have a capture snapshot test, we can just leave bindings: [] (empty) in test.flow.yaml, and the spec & discover snapshot tests still work fine.

I haven't been extremely diligent to avoid adding these files when they're not needed, but I'm trying to keep it in mind more moving forward to avoid some clutter.

@Alex-Bair Alex-Bair merged commit 09fdeed into main Aug 7, 2025
97 of 106 checks passed
@Alex-Bair Alex-Bair deleted the bair/source-google-play-complete-initial-development branch August 7, 2025 14:25
Alex-Bair added a commit to estuary/flow that referenced this pull request Aug 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants