Commit 4db2114

source-google-play: new connector
This is a minimal implementation of a connector that captures Google Play data from CSVs in Google Cloud Storage. The connector isn't ready for production: we haven't had valid credentials to develop against. I've made a best-effort implementation based purely on the Google Play docs, but parts of that documentation are likely incorrect or incomplete, and some of the assumptions made here likely won't hold. Once we get credentials and see what the data actually looks like, we can finish developing the connector.
1 parent c4837f3 commit 4db2114

21 files changed, 5444 additions and 0 deletions

source-google-play/VERSION

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
v1
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
---
$defs:
  Meta:
    properties:
      op:
        default: u
        description: "Operation type (c: Create, u: Update, d: Delete)"
        enum:
          - c
          - u
          - d
        title: Op
        type: string
      row_id:
        default: -1
        description: "Row ID of the Document, counting up from zero, or -1 if not known"
        title: Row Id
        type: integer
    title: Meta
    type: object
additionalProperties: true
properties:
  _meta:
    $ref: "#/$defs/Meta"
    default:
      op: u
      row_id: -1
    description: Document metadata
  package_name:
    title: Package Name
    type: string
  row_number:
    title: Row Number
    type: integer
  date:
    title: Date
    type: string
required:
  - package_name
  - row_number
  - date
title: Crashes
type: object
x-infer-schema: true
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
---
collections:
  acmeCo/crashes:
    schema: crashes.schema.yaml
    key:
      - /date
      - /package_name
      - /row_number
  acmeCo/installs:
    schema: installs.schema.yaml
    key:
      - /date
      - /package_name
      - /row_number
  acmeCo/reviews:
    schema: reviews.schema.yaml
    key:
      - /package_name
      - /row_number
      - /year_month
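Each collection key above is a list of JSON pointers into the document. As a rough illustration of how such a key identifies a row, the sketch below resolves the pointers against a sample document. The helper names and the sample values are hypothetical, and the resolver only handles nested objects, not arrays.

```python
def resolve_pointer(doc: dict, pointer: str):
    """Resolve a JSON pointer (RFC 6901) against nested dicts only."""
    if pointer == "":
        return doc
    current = doc
    for token in pointer.lstrip("/").split("/"):
        # Undo the two escape sequences defined by RFC 6901.
        token = token.replace("~1", "/").replace("~0", "~")
        current = current[token]
    return current


def collection_key(doc: dict, pointers: list[str]) -> tuple:
    """Build the composite key for a document from its key pointers."""
    return tuple(resolve_pointer(doc, p) for p in pointers)


# Hypothetical document shaped like the Crashes schema.
crash_doc = {
    "_meta": {"op": "u", "row_id": -1},
    "date": "2025-07-24",
    "package_name": "com.example.app",
    "row_number": 3,
}
key = collection_key(crash_doc, ["/date", "/package_name", "/row_number"])
```

Two documents with the same date, package name, and row number would produce the same key, which is why `row_number` has to be stable for a given CSV row.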
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
---
$defs:
  Meta:
    properties:
      op:
        default: u
        description: "Operation type (c: Create, u: Update, d: Delete)"
        enum:
          - c
          - u
          - d
        title: Op
        type: string
      row_id:
        default: -1
        description: "Row ID of the Document, counting up from zero, or -1 if not known"
        title: Row Id
        type: integer
    title: Meta
    type: object
additionalProperties: true
properties:
  _meta:
    $ref: "#/$defs/Meta"
    default:
      op: u
      row_id: -1
    description: Document metadata
  package_name:
    title: Package Name
    type: string
  row_number:
    title: Row Number
    type: integer
  date:
    title: Date
    type: string
required:
  - package_name
  - row_number
  - date
title: Installs
type: object
x-infer-schema: true
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
---
$defs:
  Meta:
    properties:
      op:
        default: u
        description: "Operation type (c: Create, u: Update, d: Delete)"
        enum:
          - c
          - u
          - d
        title: Op
        type: string
      row_id:
        default: -1
        description: "Row ID of the Document, counting up from zero, or -1 if not known"
        title: Row Id
        type: integer
    title: Meta
    type: object
additionalProperties: true
properties:
  _meta:
    $ref: "#/$defs/Meta"
    default:
      op: u
      row_id: -1
    description: Document metadata
  package_name:
    title: Package Name
    type: string
  row_number:
    title: Row Number
    type: integer
  year_month:
    title: Year Month
    type: string
required:
  - package_name
  - row_number
  - year_month
title: Reviews
type: object
x-infer-schema: true

source-google-play/poetry.lock

Lines changed: 4253 additions & 0 deletions
Some generated files are not rendered by default.

source-google-play/pyproject.toml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
[tool.poetry]
version = "0.1.0"
name = "source_google_play"
description = ""
authors = ["Alex Bair <alexb@estuary.dev>"]

[tool.poetry.dependencies]
estuary-cdk = {path="../estuary-cdk", develop = true}
python = "^3.12"
pydantic = "^2"

[tool.poetry.group.dev.dependencies]
debugpy = "^1.8.0"
mypy = "^1.8.0"
pytest = "^7.4.3"
pytest-insta = "^0.3.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
---
bucket: pubsite_prod_rev_01234567890987654321_my_bucket
start_date: "2025-07-24T00:00:00Z"
credentials:
  credentials_title: Google Service Account
  service_account: "{\n \"type\": \"service_account\",\n \"auth_uri\": \"https://accounts.google.com/o/oauth2/auth\",\n \"token_uri\": \"https://oauth2.googleapis.com/token\",\n \"auth_provider_x509_cert_url\": \"https://www.googleapis.com/oauth2/v1/certs\",\n \"client_x509_cert_url\": \"some_cert_url\",\n \"universe_domain\": \"googleapis.com\"\n}\n"
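Note that `service_account` in the config above is a JSON document stored as a string, so it needs its own decode step after the YAML is loaded. The sketch below shows one way to read such a config using only the standard library. The function name and the trimmed-down sample values are assumptions for illustration; the connector's real `EndpointConfig` model is not shown in this part of the diff.

```python
import json
from datetime import datetime, timezone


def parse_endpoint_config(config: dict) -> tuple[str, datetime, dict]:
    """Return (bucket, start_date, service_account_info) from a config mapping."""
    bucket = config["bucket"]
    # Swap a trailing "Z" for an explicit offset so this also works on Python < 3.11.
    start_date = datetime.fromisoformat(config["start_date"].replace("Z", "+00:00"))
    # The service account key is JSON nested inside a string field.
    service_account = json.loads(config["credentials"]["service_account"])
    if service_account.get("type") != "service_account":
        raise ValueError("expected a service account key")
    return bucket, start_date, service_account


# Hypothetical, heavily trimmed sample; a real key has more fields.
example = {
    "bucket": "pubsite_prod_rev_01234567890987654321_my_bucket",
    "start_date": "2025-07-24T00:00:00Z",
    "credentials": {
        "credentials_title": "Google Service Account",
        "service_account": '{"type": "service_account", "universe_domain": "googleapis.com"}',
    },
}
bucket, start_date, service_account = parse_endpoint_config(example)
```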
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Status

This connector is still in development since we haven't been able to see what the Google Play data looks like without valid credentials.

# What's been tested
- `GCSClient.list_all_files` does list all files in a GCS bucket.
- The `prefix` and `globPattern` inputs work to filter what files are returned.
- The `GoogleServiceAccount` credentials work with a valid service account JSON to make successful requests to GCS.
- A capture can be created. It won't yield any documents, but we'll be able to finish development once we have an active capture created.
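
To make the `prefix` and `globPattern` filtering concrete, here is a small local approximation using the standard library. It is only a sketch: `filter_names` is a hypothetical helper, the object names are made up to resemble the naming in the Google Play docs, and `fnmatch`'s `*` also matches `/`, which may differ from how GCS evaluates glob patterns server-side.

```python
from fnmatch import fnmatchcase


def filter_names(names, prefix="", glob_pattern=None):
    """Keep names that start with `prefix` and, if given, match `glob_pattern`."""
    selected = []
    for name in names:
        if not name.startswith(prefix):
            continue
        if glob_pattern is not None and not fnmatchcase(name, glob_pattern):
            continue
        selected.append(name)
    return selected


# Hypothetical object names; the real bucket layout is unverified.
names = [
    "stats/installs/installs_com.example.app_202507_overview.csv",
    "stats/installs/installs_com.example.app_202507_country.csv",
    "stats/crashes/crashes_com.example.app_202507_overview.csv",
    "reviews/reviews_com.example.app_202507.csv",
]
overview_installs = filter_names(
    names, prefix="stats/installs/", glob_pattern="*_overview.csv"
)
```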


# What hasn't been tested / outstanding questions
- `GCSClient.stream_csv`
  - What's the dialect for the CSVs that are in the GCS bucket? I suspect the CSVs use UTF-16 encoding, so we will need to pass a custom `CSVConfig` into the `IncrementalCSVProcessor`.
- The `_add_row_number` and `_extract_year_month` before model validators.
  - Do these work in general?
  - Are they inserting the correct values into each record?
  - Are there other, undocumented fields already in the records that contain the same data these model validators add?
- Generally, what do the CSVs look like and what fields are always present in each row?
  - Are the CSVs named like the Google Play docs say they are?
  - Is there some nice `id` type field we can use as a unique identifier for a row across all CSVs?
- For `Statistics`, in the CSV for the current month, are rows for previous days no longer updated? Are we able to improve the incremental strategy for these streams by only yielding rows for the same date as the current log cursor?
- For `Reviews`, in the CSV for the current month, is there an always-populated field like `review_last_update_date_and_time` that has fine enough grain that we could use it as a cursor field?
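
Several of the open questions above concern decoding the CSVs and deriving `row_number` and `year_month`. The sketch below shows one way those pieces could fit together with only the standard library. It is not the connector's code: the real logic lives in `GCSClient.stream_csv` and in pydantic validators that are not shown here. The UTF-16 encoding, the file-name pattern, zero-based row numbers, and the `YYYY-MM` output format are all assumptions.

```python
import csv
import io
import re

# Assumed file-name shape: ..._<yyyyMM>.csv or ..._<yyyyMM>_<dimension>.csv.
YEAR_MONTH_RE = re.compile(r"_(\d{4})(\d{2})(?:_[a-z_]+)?\.csv$")


def extract_year_month(file_name: str) -> str:
    """Pull the yyyyMM component out of a file name and return it as YYYY-MM."""
    match = YEAR_MONTH_RE.search(file_name)
    if match is None:
        raise ValueError(f"no yyyyMM component found in {file_name!r}")
    year, month = match.groups()
    return f"{year}-{month}"


def read_rows(raw: bytes, file_name: str, encoding: str = "utf-16") -> list[dict]:
    """Decode CSV bytes and attach row_number and year_month to every row."""
    text = raw.decode(encoding)  # the "utf-16" codec consumes a leading BOM
    reader = csv.DictReader(io.StringIO(text, newline=""))
    year_month = extract_year_month(file_name)
    return [
        {**row, "row_number": row_number, "year_month": year_month}
        for row_number, row in enumerate(reader)  # zero-based, an assumption
    ]


# Made-up sample; real column names are unknown until credentials are available.
sample = "Package Name,Star Rating\r\ncom.example.app,5\r\ncom.example.app,3\r\n".encode("utf-16")
rows = read_rows(sample, "reviews_com.example.app_202507.csv")
```

Because `row_number` is simply the position in the file, it is only a stable key if Google Play never reorders or removes rows in an existing CSV, which is one of the things still to be verified.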
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
from logging import Logger
from typing import Awaitable, Callable

from estuary_cdk.flow import (
    ConnectorSpec,
)
from estuary_cdk.capture import (
    BaseCaptureConnector,
    Request,
    Task,
    common,
    request,
    response,
)

from .resources import all_resources, validate_credentials
from .models import (
    ConnectorState,
    EndpointConfig,
    ResourceConfig,
)


class Connector(
    BaseCaptureConnector[EndpointConfig, ResourceConfig, ConnectorState],
):
    def request_class(self):
        return Request[EndpointConfig, ResourceConfig, ConnectorState]

    async def spec(self, log: Logger, _: request.Spec) -> ConnectorSpec:
        return ConnectorSpec(
            configSchema=EndpointConfig.model_json_schema(),
            oauth2=None,
            documentationUrl="https://go.estuary.dev/source-google-play",
            resourceConfigSchema=ResourceConfig.model_json_schema(),
            resourcePathPointers=ResourceConfig.PATH_POINTERS,
        )

    async def discover(
        self, log: Logger, discover: request.Discover[EndpointConfig]
    ) -> response.Discovered[ResourceConfig]:
        resources = await all_resources(log, self, discover.config)
        return common.discovered(resources)

    async def validate(
        self,
        log: Logger,
        validate: request.Validate[EndpointConfig, ResourceConfig],
    ) -> response.Validated:
        await validate_credentials(log, self, validate.config)
        resources = await all_resources(log, self, validate.config)
        resolved = common.resolve_bindings(validate.bindings, resources)
        return common.validated(resolved)

    async def open(
        self,
        log: Logger,
        open: request.Open[EndpointConfig, ResourceConfig, ConnectorState],
    ) -> tuple[response.Opened, Callable[[Task], Awaitable[None]]]:
        resources = await all_resources(log, self, open.capture.config)
        resolved = common.resolve_bindings(open.capture.bindings, resources)
        return common.open(open, resolved)
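
Both `validate` and `open` rebuild the full resource list and then pair it with the bindings in the request. The toy below illustrates that pairing idea only. It is not `estuary_cdk` code and makes no claim about how `common.resolve_bindings` works internally; the `Resource` and `Binding` classes and the match-by-name rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for a discovered resource."""
    name: str


@dataclass(frozen=True)
class Binding:
    """Hypothetical stand-in for a binding in a validate or open request."""
    resource_name: str


def resolve(bindings: list[Binding], resources: list[Resource]) -> list[tuple[Binding, Resource]]:
    """Pair each binding with the resource it names, failing on unknown names."""
    by_name = {resource.name: resource for resource in resources}
    resolved = []
    for binding in bindings:
        resource = by_name.get(binding.resource_name)
        if resource is None:
            raise ValueError(f"binding refers to unknown resource {binding.resource_name!r}")
        resolved.append((binding, resource))
    return resolved


resources = [Resource("crashes"), Resource("installs"), Resource("reviews")]
pairs = resolve([Binding("reviews"), Binding("crashes")], resources)
```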
