source-google-play: complete initial development#3136
I've observed a pattern of source systems representing `null` values with empty strings across a couple of connectors. Instead of requiring the connector developer to remember to handle that transformation (which I don't immediately remember during development), `BaseCSVRow` can be used to handle null values automatically.
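The real `BaseCSVRow` lives in the CDK and its implementation isn't shown in this thread. As a rough stdlib-only sketch of the idea, assuming hypothetical `normalize_row` and `read_rows` helpers and a made-up `Daily Crashes` column, the empty-string-to-null transformation could look like this:

```python
import csv
import io
from typing import Iterator, Optional


def normalize_row(row: dict[str, str]) -> dict[str, Optional[str]]:
    """Map empty-string cells to None; leave all other values untouched."""
    # Hypothetical helper, not the CDK's actual BaseCSVRow logic.
    return {key: (value if value != "" else None) for key, value in row.items()}


def read_rows(text: str) -> Iterator[dict[str, Optional[str]]]:
    """Parse CSV text and yield rows with empty strings normalized to None."""
    for row in csv.DictReader(io.StringIO(text)):
        yield normalize_row(row)


# Sample data: the last column is empty, which is how the source spells null.
sample = "Date,Package Name,Daily Crashes\n2023-10-01,com.example.app,\n"
rows = list(read_rows(sample))
```

Doing this once in a shared base class means each connector's row models can declare nullable fields without repeating the conversion.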
This commit finishes the initial development for `source-google-play`. Some notable decisions made include:

- Title casing all field names. The CSV column headers are not consistently named across files. Although I had hoped to avoid transformations as much as possible, ensuring fields are consistently named makes downstream processing easier for users, and it allows us to reuse more code in the connector (ex: primary keys are the same, model field definitions are simpler, etc.).
- The `_overview` suffix is used for statistics files that aren't split on dimensions, while there is no suffix for reviews that aren't split on dimensions. There _are_ other files in the bucket containing data split by certain dimensions, and it's very easy to add another binding to capture these by overriding the `suffix` class variable for a given resource. Those additional bindings aren't needed right now, but they'll be easy to add if someone asks for them later.
- Reviews have an "updated_at" type of field that appears to always be present. This means that instead of yielding every row of an updated file, we can yield only the rows that have been updated since the previous sweep.
- The "Row Number" field doesn't need to be part of any `Statistics` primary key since the "Date" and "Package Name" already uniquely identify a row. No such combination of unique identifiers exists for "Reviews", so we still add "Row Number" into those documents.
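To illustrate the incremental handling of reviews described above, here is a minimal sketch. It is not the connector's actual code: the `Updated At` column name, the `rows_updated_since` helper, and the sample rows are all assumptions made for the example.

```python
from datetime import datetime, timezone
from typing import Any, Iterable, Iterator

# Hypothetical column name standing in for the "updated_at" type of field.
UPDATED_AT_FIELD = "Updated At"


def rows_updated_since(
    rows: Iterable[dict[str, Any]], cursor: datetime
) -> Iterator[dict[str, Any]]:
    """Yield only the rows whose update timestamp is newer than the cursor."""
    for row in rows:
        updated_at = datetime.fromisoformat(row[UPDATED_AT_FIELD])
        if updated_at > cursor:
            yield row


# Sample rows from a re-read file; only the second changed after the last sweep.
file_rows = [
    {"Row Number": 0, UPDATED_AT_FIELD: "2023-10-01T00:00:00+00:00"},
    {"Row Number": 1, UPDATED_AT_FIELD: "2023-10-03T12:00:00+00:00"},
]
previous_sweep = datetime(2023, 10, 2, tzinfo=timezone.utc)
updated = list(rows_updated_since(file_rows, previous_sweep))
```

The cursor would then advance to the latest timestamp seen, so the next sweep skips rows that were already emitted.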
Since the initial development of source-google-play is complete, these files are no longer needed/weren't needed in the first place.
Documentation for estuary/connectors#3136.
JustinASmith
left a comment
LGTM % one question
```python
return pattern

package_name: str = Field(alias=PACKAGE_NAME_FIELD)
```
Does the model need `ConfigDict` set up with `serialize_by_alias=True` so that the serialized JSON uses the alias `Package Name` consistently? https://docs.pydantic.dev/latest/api/config/#pydantic.config.ConfigDict.serialize_by_alias says that in Pydantic v2 this is `False` by default, so `model.model_dump()` would presumably give `#> {'package_name': 'some.package.com'}`.
As a dummy example I tried the following:
```python
from typing import Any, ClassVar

from pydantic import BaseModel, Field, ValidationInfo, model_validator

PACKAGE_NAME_FIELD = "Package Name"


class Statistics(BaseModel, extra="allow"):
    suffix: ClassVar[str | None] = "overview"
    primary_keys: ClassVar[list[str]] = ["/Date", f"/{PACKAGE_NAME_FIELD}"]

    date: str = Field(alias="Date")

    @model_validator(mode="before")
    @classmethod
    def _normalize_field_names(cls, data: dict[str, Any], info: ValidationInfo) -> dict[str, Any]:
        # Title-case every incoming key so column headers are consistent.
        normalized_data: dict[str, Any] = {}
        for key, value in data.items():
            normalized_key = key.title()
            normalized_data[normalized_key] = value
        return normalized_data


s1 = Statistics(
    Date="2023-10-01",
    **{PACKAGE_NAME_FIELD: "com.example.app"},
)
s1.model_dump()
#> {'date': '2023-10-01', 'Package Name': 'com.example.app'}
```
Scratch that! I just realized the CDK handles this by default: https://github.com/search?q=repo%3Aestuary%2Fconnectors%20path%3A%2F%5Eestuary-cdk%5C%2F%2F%20by_alias&type=code
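For reference, a small sketch of the difference in question, assuming the CDK serializes documents with `by_alias=True` as the linked search suggests. The trimmed-down `Statistics` model below is illustrative only and is not the connector's real model:

```python
from pydantic import BaseModel, Field

PACKAGE_NAME_FIELD = "Package Name"


class Statistics(BaseModel, extra="allow"):
    # Illustrative model with two aliased fields.
    date: str = Field(alias="Date")
    package_name: str = Field(alias=PACKAGE_NAME_FIELD)


doc = Statistics(**{"Date": "2023-10-01", PACKAGE_NAME_FIELD: "com.example.app"})

# Default dump uses the Python attribute names.
by_name = doc.model_dump()
# Dumping by alias restores the original column headers.
by_alias = doc.model_dump(by_alias=True)
```

With `by_alias=True` the keys match the `/Date` and `/Package Name` pointers used as primary keys, so no `serialize_by_alias` setting is needed on the model itself.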
@Alex-Bair Sorry, I just thought of one more question. Was there a reason these `acmeCo/*.yaml` files were removed?
If we don't have a capture snapshot test, the various `acmeCo/*.yaml` files aren't needed since they're only used when we run the `flowctl preview` command. In that case, we can just leave `bindings: []` (empty) in `test.flow.yaml`, and the spec & discover snapshot tests still work fine.
I haven't been very diligent about leaving these files out when they're not needed, but I'm trying to keep it in mind going forward to avoid some clutter.
Description:
This PR's scope includes:
- Adding a `BaseCSVRow` class in the CDK to make transforming null value representations in CSVs easier and more consistent across connectors.
- Completing the initial development of `source-google-play` that was started in source-google-play: new connector #3091.
- `reviews` can be incrementally captured within each file since each row has an "updated_at" type field.

Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
Documentation should be created for `source-google-play`.

Notes for reviewers:
Tested on a local stack. Confirmed: