sidequery
diff --git a/README.md b/README.md
Lines changed: 223 additions & 47 deletions
diff --git a/scripts/events-pipeline-schema.json b/scripts/events-pipeline-schema.json
Lines changed: 24 additions & 0 deletions
@@ -2,14 +2,15 @@
 
 <img src="hog.png" alt="Hogflare" width="300">
 
-Hogflare is a Cloudflare Workers ingestion layer for PostHog SDKs. It supports PostHog-style ingestion, stateful persons/groups, and SDK feature flags, then streams events into Cloudflare Pipelines so data lands in R2 as Iceberg/Parquet.
+Hogflare is a Cloudflare Workers ingestion layer for PostHog SDKs. It supports PostHog-style ingestion, stateful persons/groups, and SDK feature flags, then streams events and person snapshots into Cloudflare Pipelines so data lands in R2 as Iceberg/Parquet.
 
 #### What works today
 
 - Ingestion endpoints: `/capture`, `/identify`, `/alias`, `/batch`, `/e`, `/engage`, `/groups`
 - Persons and groups: `$set`, `$set_once`, `$unset`, aliasing, and group properties
 - Feature flags: `/flags` and `/decide` are evaluated in the Worker (used by PostHog SDKs)
 - Request enrichment: Cloudflare IP/geo fields added when missing
+- Queryable people: append-only person snapshots can be written to a separate Iceberg table
 
 ## Architecture
 
@@ -34,8 +35,10 @@ flowchart TB
         PersonsDO -.-> PersonIdDO
     end
 
-    Worker -->|"events"| Pipeline["Cloudflare Pipelines"]
-    Pipeline --> R2["R2 Data Catalog<br/>(Iceberg/Parquet)"]
+    Worker -->|"events"| EventsPipeline["Events Pipeline"]
+    Worker -->|"person snapshots"| PersonsPipeline["Persons Pipeline"]
+    EventsPipeline --> EventsR2["R2 Data Catalog<br/>events table"]
+    PersonsPipeline --> PersonsR2["R2 Data Catalog<br/>persons table"]
 ```
 
 ## Why?
@@ -48,44 +51,70 @@ Admittedly, PostHog does a *lot* more than this package, but some folks really j
 
 ## Quick start (Cloudflare)
 
-1) Create a Pipeline stream and sink in the Cloudflare dashboard or via `wrangler pipelines setup`.
-2) Use the schema below for the stream.
-3) Copy `wrangler.toml.example` to `wrangler.toml` and set variables.
-4) Set Wrangler secrets.
-5) Deploy the Worker.
+1. Create R2 Data Catalog-backed Pipelines resources.
+2. Copy `wrangler.toml.example` to `wrangler.toml` and set the stream endpoints.
+3. Set Wrangler secrets.
+4. Build and deploy the Worker.
+5. Send a capture/identify verification flow and query the Iceberg tables.
 
-### Pipeline schema (JSON)
+The examples below use stable table names for a fresh deployment: `default.hogflare_events` and `default.hogflare_persons`. If you use versioned names during migration, substitute those names consistently in the sink commands and queries.
 
-```json
-{
-  "fields": [
-    { "name": "uuid", "type": "string", "required": true },
-    { "name": "team_id", "type": "int64", "required": false },
-    { "name": "source", "type": "string", "required": true },
-    { "name": "event", "type": "string", "required": true },
-    { "name": "distinct_id", "type": "string", "required": true },
-    { "name": "timestamp", "type": "timestamp", "required": false },
-    { "name": "created_at", "type": "timestamp", "required": true },
-    { "name": "properties", "type": "json", "required": false },
-    { "name": "context", "type": "json", "required": false },
-    { "name": "person_id", "type": "string", "required": false },
-    { "name": "person_created_at", "type": "timestamp", "required": false },
-    { "name": "person_properties", "type": "json", "required": false },
-    { "name": "group0", "type": "string", "required": false },
-    { "name": "group1", "type": "string", "required": false },
-    { "name": "group2", "type": "string", "required": false },
-    { "name": "group3", "type": "string", "required": false },
-    { "name": "group4", "type": "string", "required": false },
-    { "name": "group_properties", "type": "json", "required": false },
-    { "name": "api_key", "type": "string", "required": false },
-    { "name": "extra", "type": "json", "required": false }
-  ]
-}
+### Create Pipelines resources
+
+Set these values before creating sinks:
+
+```bash
+export R2_BUCKET="<bucket-name>"
+export R2_CATALOG_TOKEN="<r2-data-catalog-token>"
 ```
 
+`R2_CATALOG_TOKEN` is the token used by R2 Data Catalog/R2 SQL clients such as DuckDB or PyIceberg. The bucket must have R2 Data Catalog enabled before creating `r2-data-catalog` sinks.
+
+Create the events stream, sink, and pipeline:
+
+```bash
+bunx wrangler pipelines streams create hogflare_events_stream \
+  --schema-file scripts/events-pipeline-schema.json \
+  --http-enabled true \
+  --http-auth true
+
+bunx wrangler pipelines sinks create hogflare_events_sink \
+  --type r2-data-catalog \
+  --bucket "$R2_BUCKET" \
+  --namespace default \
+  --table hogflare_events \
+  --catalog-token "$R2_CATALOG_TOKEN" \
+  --roll-interval 60
+
+bunx wrangler pipelines create hogflare_events_pipeline \
+  --sql "INSERT INTO hogflare_events_sink SELECT * FROM hogflare_events_stream;"
+```
+
+Create the persons stream, sink, and pipeline if you want queryable people in Iceberg:
+
+```bash
+bunx wrangler pipelines streams create hogflare_persons_stream \
+  --schema-file scripts/persons-pipeline-schema.json \
+  --http-enabled true \
+  --http-auth true
+
+bunx wrangler pipelines sinks create hogflare_persons_sink \
+  --type r2-data-catalog \
+  --bucket "$R2_BUCKET" \
+  --namespace default \
+  --table hogflare_persons \
+  --catalog-token "$R2_CATALOG_TOKEN" \
+  --roll-interval 60
+
+bunx wrangler pipelines create hogflare_persons_pipeline \
+  --sql "INSERT INTO hogflare_persons_sink SELECT * FROM hogflare_persons_stream;"
+```
+
+Each stream creation command prints an HTTP endpoint like `https://<stream-id>.ingest.cloudflare.com`. Use those endpoints in `wrangler.toml`.
+
 ### Wrangler config
 
-Copy the example and fill in your stream endpoint:
+Copy the example and fill in the stream endpoints:
 
 ```bash
 cp wrangler.toml.example wrangler.toml
@@ -98,6 +127,7 @@ compatibility_date = "2025-01-09"
 
 [vars]
 CLOUDFLARE_PIPELINE_ENDPOINT = "https://<stream-id>.ingest.cloudflare.com"
+CLOUDFLARE_PERSONS_PIPELINE_ENDPOINT = "https://<persons-stream-id>.ingest.cloudflare.com"
 CLOUDFLARE_PIPELINE_TIMEOUT_SECS = "10"
 
 # Optional
@@ -129,34 +159,120 @@ tag = "v2"
 new_sqlite_classes = ["PersonIdCounterDurableObject", "GroupDurableObject"]
 ```
 
+### Configuration reference
+
+| Setting | Required | Notes |
+| --- | --- | --- |
+| `CLOUDFLARE_PIPELINE_ENDPOINT` | Yes | Events stream HTTP endpoint from `wrangler pipelines streams create`. |
+| `CLOUDFLARE_PIPELINE_AUTH_TOKEN` | Yes, for authenticated streams | Bearer token used for events stream HTTP ingest. |
+| `CLOUDFLARE_PERSONS_PIPELINE_ENDPOINT` | No | Persons stream endpoint. Set this to write person snapshots to Iceberg. |
+| `CLOUDFLARE_PERSONS_PIPELINE_AUTH_TOKEN` | No | Falls back to `CLOUDFLARE_PIPELINE_AUTH_TOKEN` when omitted. |
+| `CLOUDFLARE_PIPELINE_TIMEOUT_SECS` | No | Defaults to 10 seconds. |
+| `POSTHOG_API_KEY` | No | Default project token returned by `/decide` when the request carries no token in its body or headers. |
+| `POSTHOG_TEAM_ID` | No | Optional team id attached to event and person rows. |
+| `POSTHOG_GROUP_TYPE_0..4` | No | Maps PostHog group types to `group0..group4`; set `POSTHOG_GROUP_TYPE_0=company` to populate `group0` for company groups. |
+| `POSTHOG_SESSION_RECORDING_ENDPOINT` | No | Returned in `/decide` session recording config. |
+| `POSTHOG_SIGNING_SECRET` | No | Enables HMAC request signature checks. |
+| `PERSON_DEBUG_TOKEN` | No | Enables `/__debug/person/:id` for deployment verification. |
+| `HOGFLARE_FEATURE_FLAGS` | No | JSON flag config used by `/decide` and `/flags`. |
+
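The fallback and default rules in the table can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the Worker's actual code: `PipelineConfig`, `resolve_config`, and `group_slots` are hypothetical names, and `env` stands in for the Worker's variables and secrets.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the resolved pipeline settings."""
    events_endpoint: str
    events_token: Optional[str]
    persons_endpoint: Optional[str]
    persons_token: Optional[str]
    timeout_secs: int


def resolve_config(env: Mapping[str, str]) -> PipelineConfig:
    """Apply the defaults and fallbacks listed in the configuration table."""
    events_token = env.get("CLOUDFLARE_PIPELINE_AUTH_TOKEN")
    persons_endpoint = env.get("CLOUDFLARE_PERSONS_PIPELINE_ENDPOINT")
    # The persons token falls back to the events token when omitted.
    persons_token = env.get("CLOUDFLARE_PERSONS_PIPELINE_AUTH_TOKEN") or events_token
    return PipelineConfig(
        events_endpoint=env["CLOUDFLARE_PIPELINE_ENDPOINT"],  # required setting
        events_token=events_token,
        persons_endpoint=persons_endpoint,
        # Without a persons endpoint no snapshots are written, so no token is needed.
        persons_token=persons_token if persons_endpoint else None,
        timeout_secs=int(env.get("CLOUDFLARE_PIPELINE_TIMEOUT_SECS", "10")),
    )


def group_slots(env: Mapping[str, str]) -> Dict[str, str]:
    """Map configured group types to their group0..group4 columns."""
    return {
        env[f"POSTHOG_GROUP_TYPE_{i}"]: f"group{i}"
        for i in range(5)
        if f"POSTHOG_GROUP_TYPE_{i}" in env
    }
```

With only `CLOUDFLARE_PIPELINE_AUTH_TOKEN` and both endpoints set, `resolve_config` returns the same token for both streams and a 10 second timeout.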
 ### Secrets
 
+Use a Cloudflare API token that can write to Pipelines for `CLOUDFLARE_PIPELINE_AUTH_TOKEN`. The same token can usually be reused for the persons stream.
+
 ```bash
 bunx wrangler secret put CLOUDFLARE_PIPELINE_AUTH_TOKEN
+# Optional. If omitted, the persons pipeline uses CLOUDFLARE_PIPELINE_AUTH_TOKEN.
+bunx wrangler secret put CLOUDFLARE_PERSONS_PIPELINE_AUTH_TOKEN
+
+# Optional.
 bunx wrangler secret put POSTHOG_SIGNING_SECRET
+bunx wrangler secret put PERSON_DEBUG_TOKEN
+bunx wrangler secret put HOGFLARE_FEATURE_FLAGS
 ```
 
 ### Deploy
 
 ```bash
+worker-build --release
 bunx wrangler deploy
 ```
 
-## Send a test event
+## Verify deployment
 
 ```bash
-curl -X POST https://<your-worker>.workers.dev/capture \
+export HOGFLARE_URL="https://<your-worker>.workers.dev"
+export HOGFLARE_API_KEY="phc_verify_$(date -u +%Y%m%d%H%M%S)"
+export HOGFLARE_ANON_ID="${HOGFLARE_API_KEY}_anon"
+export HOGFLARE_USER_ID="${HOGFLARE_API_KEY}_user"
+```
+
+Send an anonymous capture:
+
+```bash
+curl -X POST "$HOGFLARE_URL/capture" \
   -H "Content-Type: application/json" \
-  -d '[
-    {
-      "api_key": "phc_example",
-      "event": "purchase",
-      "distinct_id": "user_12345",
-      "properties": { "amount": 29.99, "product_id": "widget-001" }
+  -d "{
+    \"api_key\": \"$HOGFLARE_API_KEY\",
+    \"event\": \"verify-anon-capture\",
+    \"distinct_id\": \"$HOGFLARE_ANON_ID\",
+    \"properties\": {
+      \"\$set\": { \"initial_referrer\": \"docs\" },
+      \"\$set_once\": { \"first_seen_source\": \"readme\" }
     }
-  ]'
+  }"
 ```
 
+Identify the user and link the anonymous ID:
+
+```bash
+curl -X POST "$HOGFLARE_URL/identify" \
+  -H "Content-Type: application/json" \
+  -d "{
+    \"api_key\": \"$HOGFLARE_API_KEY\",
+    \"distinct_id\": \"$HOGFLARE_USER_ID\",
+    \"properties\": {
+      \"\$anon_distinct_id\": \"$HOGFLARE_ANON_ID\",
+      \"\$set\": { \"email\": \"verify@example.com\", \"plan\": \"pro\" },
+      \"\$set_once\": { \"signup_source\": \"readme\" }
+    }
+  }"
+```
+
+Send a post-identify capture:
+
+```bash
+curl -X POST "$HOGFLARE_URL/capture" \
+  -H "Content-Type: application/json" \
+  -d "{
+    \"api_key\": \"$HOGFLARE_API_KEY\",
+    \"event\": \"verify-identified-capture\",
+    \"distinct_id\": \"$HOGFLARE_USER_ID\",
+    \"properties\": { \"button\": \"verify\" }
+  }"
+```
+
+Wait for the sink roll interval, then query R2 SQL:
+
+```bash
+export R2_WAREHOUSE="<account-id>_<bucket-name>"
+export WRANGLER_R2_SQL_AUTH_TOKEN="$R2_CATALOG_TOKEN"
+
+bunx wrangler r2 sql query "$R2_WAREHOUSE" \
+  "select event, distinct_id, person_id, person_properties
+   from default.hogflare_events
+   where api_key = '$HOGFLARE_API_KEY'
+   order by created_at asc"
+
+bunx wrangler r2 sql query "$R2_WAREHOUSE" \
+  "select operation, canonical_distinct_id, person_id, distinct_ids, merged_properties
+   from default.hogflare_persons
+   where api_key = '$HOGFLARE_API_KEY'
+   order by updated_at asc"
+```
+
+Expected result: the three event rows share one `person_id`, and the persons table has three snapshots whose `operation` values are `capture`, `identify`, and `capture`, in that order. From the identify snapshot onward, `distinct_ids` should include both the anonymous and identified IDs.
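The expected result can also be checked programmatically. The sketch below assumes the two query results have already been parsed into lists of dictionaries keyed by column name and kept in query order. It makes no assumption about the output format of `wrangler r2 sql query`, and `check_verification` is a hypothetical helper, not part of Hogflare.

```python
from typing import Dict, List, Sequence


def check_verification(
    events: Sequence[Dict],
    persons: Sequence[Dict],
    anon_id: str,
    user_id: str,
) -> List[str]:
    """Return a list of problems; an empty list means the rows look as expected."""
    problems: List[str] = []

    if len(events) != 3:
        problems.append(f"expected 3 event rows, got {len(events)}")
    if len({row["person_id"] for row in events}) != 1:
        problems.append("event rows do not share one person_id")

    operations = [row["operation"] for row in persons]
    if operations != ["capture", "identify", "capture"]:
        problems.append(f"unexpected snapshot operations: {operations}")

    # From the identify snapshot onward, both IDs should be listed.
    seen_identify = False
    for row in persons:
        seen_identify = seen_identify or row["operation"] == "identify"
        if seen_identify and not {anon_id, user_id} <= set(row["distinct_ids"]):
            problems.append(f"snapshot '{row['operation']}' is missing a distinct ID")

    return problems
```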
+
 ## HMAC signing (optional)
 
 If `POSTHOG_SIGNING_SECRET` is set, requests must include a valid signature.
@@ -227,6 +343,7 @@ docker compose up --build -d fake-pipeline
 ```bash
 # .env.local (not committed)
 CLOUDFLARE_PIPELINE_ENDPOINT=http://127.0.0.1:8088/
+CLOUDFLARE_PERSONS_PIPELINE_ENDPOINT=http://127.0.0.1:8088/
 CLOUDFLARE_PIPELINE_TIMEOUT_SECS=5
 ```
 
@@ -252,8 +369,44 @@ ATTACH '<ACCOUNT_ID>_<BUCKET>' AS iceberg_catalog (
   ENDPOINT 'https://catalog.cloudflarestorage.com/<ACCOUNT_ID>/<BUCKET>'
 );
 
-SELECT count(*) FROM iceberg_catalog.default.hogflare;
-SELECT * FROM iceberg_catalog.default.hogflare LIMIT 5;
+SELECT count(*) FROM iceberg_catalog.default.hogflare_events;
+SELECT count(*) FROM iceberg_catalog.default.hogflare_persons;
+SELECT * FROM iceberg_catalog.default.hogflare_persons LIMIT 5;
+```
+
+If you used versioned table names during a migration, substitute those names here.
+
+## Cleanup
+
+Delete Pipelines resources in dependency order: pipelines first, then streams and sinks.
+
+```bash
+bunx wrangler pipelines list
+bunx wrangler pipelines delete <pipeline-id> --force
+
+bunx wrangler pipelines streams list
+bunx wrangler pipelines streams delete <stream-id> --force
+
+bunx wrangler pipelines sinks list
+bunx wrangler pipelines sinks delete <sink-id> --force
+```
+
+`wrangler r2 sql query` is read-only. To drop an Iceberg table from R2 Data Catalog, use the Iceberg catalog API. One local option is PyIceberg:
+
+```bash
+R2_CATALOG_TOKEN="<r2-data-catalog-token>" uv run --with pyiceberg python - <<'PY'
+import os
+from pyiceberg.catalog.rest import RestCatalog
+
+catalog = RestCatalog(
+    name="hogflare",
+    warehouse="<account-id>_<bucket-name>",
+    uri="https://catalog.cloudflarestorage.com/<account-id>/<bucket-name>",
+    token=os.environ["R2_CATALOG_TOKEN"],
+)
+
+catalog.drop_table(("default", "<table-name>"), purge_requested=True)
+PY
 ```
 
 ## PostHog compatibility
@@ -276,7 +429,7 @@ Identify, capture `$set` / `$set_once` / `$unset`, and alias events update a per
 - `person_created_at`
 - `person_properties`
 
-Person DO state is not written to R2. Only event-level snapshots are stored in the pipeline sink.
+The Durable Object is the source of truth for the current person record. When `CLOUDFLARE_PERSONS_PIPELINE_ENDPOINT` is configured, Hogflare also writes append-only person snapshots to the persons pipeline so the state is queryable in Iceberg.
 
 ### Groups
 
@@ -403,3 +556,26 @@ Each row is a `PipelineEvent` with these columns:
 | `group_properties` | JSON (by group type) |
 | `api_key` | string |
 | `extra` | JSON |
+
+## Person shape in R2
+
+Each row is a `PersonPipelineRecord` snapshot with these columns:
+
+| Field | Type / Notes |
+| --- | --- |
+| `uuid` | string (snapshot UUID v4) |
+| `team_id` | int64 (optional) |
+| `source` | string |
+| `operation` | one of `capture`, `identify`, `alias`, `engage`, `session_recording` |
+| `person_id` | string (person UUID) |
+| `person_int_id` | int64 |
+| `canonical_distinct_id` | string |
+| `distinct_ids` | string list / array |
+| `created_at` | person creation timestamp |
+| `updated_at` | snapshot timestamp |
+| `version` | person version |
+| `properties` | JSON `$set` properties |
+| `properties_set_once` | JSON `$set_once` properties |
+| `merged_properties` | JSON merged person properties |
+| `api_key` | string |
+| `source_event_uuid` | event row UUID that produced the snapshot |
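Because the persons table is append-only, a reader who wants the current state of each person has to pick one snapshot per `person_id`. The sketch below does this in Python over rows already loaded as dictionaries. It assumes that `version` increases with every snapshot of the same person, which the table above suggests but does not state. `latest_snapshots` is a hypothetical helper, and merged persons are not handled.

```python
from typing import Dict, Iterable


def latest_snapshots(rows: Iterable[Dict]) -> Dict[str, Dict]:
    """Keep the highest-version snapshot for each person_id.

    Assumption: `version` grows with each snapshot of a given person.
    """
    latest: Dict[str, Dict] = {}
    for row in rows:
        current = latest.get(row["person_id"])
        if current is None or row["version"] > current["version"]:
            latest[row["person_id"]] = row
    return latest
```

If `version` turns out not to be monotonic in a given deployment, comparing `updated_at` instead would be the natural substitute.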
@@ -0,0 +1,24 @@
+{
+  "fields": [
+    { "name": "uuid", "type": "string", "required": true },
+    { "name": "team_id", "type": "int64", "required": false },
+    { "name": "source", "type": "string", "required": true },
+    { "name": "event", "type": "string", "required": true },
+    { "name": "distinct_id", "type": "string", "required": true },
+    { "name": "timestamp", "type": "timestamp", "required": false },
+    { "name": "created_at", "type": "timestamp", "required": true },
+    { "name": "properties", "type": "json", "required": false },
+    { "name": "context", "type": "json", "required": false },
+    { "name": "person_id", "type": "string", "required": false },
+    { "name": "person_created_at", "type": "timestamp", "required": false },
+    { "name": "person_properties", "type": "json", "required": false },
+    { "name": "group0", "type": "string", "required": false },
+    { "name": "group1", "type": "string", "required": false },
+    { "name": "group2", "type": "string", "required": false },
+    { "name": "group3", "type": "string", "required": false },
+    { "name": "group4", "type": "string", "required": false },
+    { "name": "group_properties", "type": "json", "required": false },
+    { "name": "api_key", "type": "string", "required": false },
+    { "name": "extra", "type": "json", "required": false }
+  ]
+}
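As a rough illustration of how the `required` flags in this schema can be used, the following Python sketch checks a record before it is sent to the stream. `SCHEMA_JSON` is an abbreviated copy of the schema above (in practice the file itself would be loaded), and `missing_required` and `unknown_fields` are hypothetical helpers. Whether Pipelines rejects or drops non-conforming rows is not covered here.

```python
import json
from typing import Dict, List

# Abbreviated copy of scripts/events-pipeline-schema.json, for illustration only.
SCHEMA_JSON = """
{
  "fields": [
    { "name": "uuid", "type": "string", "required": true },
    { "name": "team_id", "type": "int64", "required": false },
    { "name": "source", "type": "string", "required": true },
    { "name": "event", "type": "string", "required": true },
    { "name": "distinct_id", "type": "string", "required": true },
    { "name": "created_at", "type": "timestamp", "required": true },
    { "name": "properties", "type": "json", "required": false }
  ]
}
"""


def missing_required(schema: Dict, record: Dict) -> List[str]:
    """List required fields that are absent or null, in schema order."""
    return [
        field["name"]
        for field in schema["fields"]
        if field["required"] and record.get(field["name"]) is None
    ]


def unknown_fields(schema: Dict, record: Dict) -> List[str]:
    """List record keys that the schema does not declare."""
    known = {field["name"] for field in schema["fields"]}
    return sorted(set(record) - known)


schema = json.loads(SCHEMA_JSON)
```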