Commit a71ef52

Add bulk export to bigquery setup guide
1 parent e53040f commit a71ef52

2 files changed: +272 −1 lines changed

src/docs.json

Lines changed: 2 additions & 1 deletion

@@ -395,7 +395,8 @@
        "langsmith/data-export",
        "langsmith/data-export-destinations",
        "langsmith/data-export-monitor",
-       "langsmith/data-export-downstream"
+       "langsmith/data-export-downstream",
+       "langsmith/big-query-bulk-export"
      ]
    }
  ]
Lines changed: 270 additions & 0 deletions
@@ -0,0 +1,270 @@
---
title: Export trace data to BigQuery
sidebarTitle: BigQuery integration
description: Load LangSmith trace data into BigQuery using bulk export to GCS.
---

<Info>
**Plan restrictions apply**

Bulk export is only available on [LangSmith Plus or Enterprise tiers](https://www.langchain.com/pricing-langsmith).
</Info>

LangSmith can export trace data to a Google Cloud Storage (GCS) bucket in Parquet format. From there, you can load it into BigQuery as an external table (queried in place from GCS) or as a native table (copied into BigQuery storage).

This guide covers:

- Setting up a GCS bucket and HMAC credentials for LangSmith.
- Creating a bulk export destination and export job.
- Loading the exported data into BigQuery.

For full details on bulk export configuration options, refer to [Bulk export trace data](/langsmith/data-export) and [Manage bulk export destinations](/langsmith/data-export-destinations).

## Prerequisites

- Data in your LangSmith [Tracing project](https://smith.langchain.com/projects).
- [`gcloud` CLI installed](https://docs.cloud.google.com/sdk/docs/install-sdk). (You can also use the Google Cloud console for setup.)

## 1. Create a GCS bucket

Create a dedicated GCS bucket for LangSmith exports. Using a dedicated bucket makes it easier to grant scoped permissions without affecting other data:

```bash
gcloud storage buckets create gs://YOUR_BUCKET_NAME \
  --location=US \
  --uniform-bucket-level-access
```

Choose a region close to your BigQuery dataset to minimize latency and avoid cross-region egress charges.

## 2. Create a service account and grant access

Create a GCP service account that LangSmith will use to write data to GCS:

```bash
gcloud iam service-accounts create langsmith-bulk-export \
  --display-name="LangSmith Bulk Export"
```

Grant the service account write access to your bucket. The minimum required permission is `storage.objects.create`. Granting `storage.objects.delete` is optional but recommended: LangSmith uses it to clean up a temporary test file created during destination validation. If this permission is absent, a `tmp/` folder may remain in your bucket.

The "Storage Object Admin" predefined role covers all required and recommended permissions:

```bash
gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET_NAME \
  --member="serviceAccount:langsmith-bulk-export@YOUR_PROJECT.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"
```

To use a minimal custom role instead, grant only:

- `storage.objects.create` (required)
- `storage.objects.delete` (optional, for test file cleanup)
- `storage.objects.get` (optional but recommended, for file size verification)
- `storage.multipartUploads.create` (optional but recommended, for large file uploads)

## 3. Generate HMAC keys

LangSmith connects to GCS using the S3-compatible XML API, which requires HMAC keys rather than a service account JSON key.

Generate HMAC keys for your service account:

```bash
gcloud storage hmac create \
  langsmith-bulk-export@YOUR_PROJECT.iam.gserviceaccount.com
```

Save the `accessId` and `secret` from the output. You can also generate HMAC keys in the GCP Console under **Cloud Storage → Settings → Interoperability → Create a key for a service account**.
78+
79+
## 4. Create a bulk export destination
80+
81+
Create a destination in LangSmith pointing to your GCS bucket. Set `endpoint_url` to `https://storage.googleapis.com` to use the GCS S3-compatible API.
82+
83+
You will need your [LangSmith API key](/langsmith/create-account-api-key) and [workspace ID](/langsmith/set-up-hierarchy#set-up-a-workspace).
84+
85+
```bash
86+
curl --request POST \
87+
--url 'https://api.smith.langchain.com/api/v1/bulk-exports/destinations' \
88+
--header 'Content-Type: application/json' \
89+
--header 'X-API-Key: YOUR_API_KEY' \
90+
--header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
91+
--data '{
92+
"destination_type": "s3",
93+
"display_name": "GCS for BigQuery",
94+
"config": {
95+
"bucket_name": "YOUR_BUCKET_NAME",
96+
"prefix": "YOUR_PREFIX",
97+
"endpoint_url": "https://storage.googleapis.com"
98+
},
99+
"credentials": {
100+
"access_key_id": "YOUR_HMAC_ACCESS_ID",
101+
"secret_access_key": "YOUR_HMAC_SECRET"
102+
}
103+
}'
104+
```
105+
106+
`prefix` is a path within the bucket where LangSmith will write exported files. For example, `langsmith-exports` or `data/traces`. Choose any value that works for your bucket layout.
107+
108+
LangSmith validates the credentials by performing a test write before saving the destination. If the request returns a `400` error, refer to [Debug destination errors](/langsmith/data-export-destinations#debug-destination-errors).
109+
110+
Save the `id` from the response; you will need it in the next step.
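
If you prefer scripting this step, the same request can be sketched in Python using only the standard library. The payload shape mirrors the curl call above; `make_destination_payload` and `create_destination` are hypothetical helper names, not part of any LangSmith SDK:

```python
import json
import urllib.request

API_URL = "https://api.smith.langchain.com/api/v1"


def make_destination_payload(bucket: str, prefix: str,
                             access_id: str, secret: str) -> dict:
    """Build the bulk export destination payload for a GCS bucket."""
    return {
        "destination_type": "s3",  # GCS is addressed via its S3-compatible API
        "display_name": "GCS for BigQuery",
        "config": {
            "bucket_name": bucket,
            "prefix": prefix,
            "endpoint_url": "https://storage.googleapis.com",
        },
        "credentials": {
            "access_key_id": access_id,
            "secret_access_key": secret,
        },
    }


def create_destination(payload: dict, api_key: str, workspace_id: str) -> str:
    """POST the destination and return the `id` needed for the export job."""
    req = urllib.request.Request(
        f"{API_URL}/bulk-exports/destinations",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,
            "X-Tenant-Id": workspace_id,
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]
```

Calling `create_destination(make_destination_payload(...), api_key, workspace_id)` returns the destination ID to save for the next step.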

### Temporary validation file

During destination creation (and [credential rotation](#credential-rotation)), LangSmith writes a temporary `.txt` file to `YOUR_PREFIX/tmp/` to verify write access, then attempts to delete it. The deletion is best-effort: if the service account lacks `storage.objects.delete`, the file is not deleted and the `tmp/` folder remains in your bucket.

The `tmp/` folder does not affect exports, but it will be included in broad GCS URI globs (e.g., `gs://YOUR_BUCKET_NAME/YOUR_PREFIX/*`).

## 5. Create a bulk export job

Create an export targeting a specific project. Use `format_version: v2_beta` for BigQuery compatibility: it produces UTC timezone-aware timestamps that BigQuery handles correctly.

You will need the project ID (`session_id`), which you can copy from the project view in the [**Tracing Projects** list](https://smith.langchain.com).

**One-time export:**

```bash
curl --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --data '{
    "bulk_export_destination_id": "YOUR_DESTINATION_ID",
    "session_id": "YOUR_PROJECT_ID",
    "start_time": "2024-01-01T00:00:00Z",
    "end_time": "2024-02-01T00:00:00Z",
    "format_version": "v2_beta",
    "compression": "snappy"
  }'
```

**Scheduled (recurring) export:**

```bash
curl --request POST \
  --url 'https://api.smith.langchain.com/api/v1/bulk-exports' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: YOUR_API_KEY' \
  --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
  --data '{
    "bulk_export_destination_id": "YOUR_DESTINATION_ID",
    "session_id": "YOUR_PROJECT_ID",
    "start_time": "2024-01-01T00:00:00Z",
    "interval_hours": 24,
    "format_version": "v2_beta",
    "compression": "snappy"
  }'
```

Snappy compression is fast and widely supported by BigQuery. For all available options, including field filtering and filter expressions, refer to [Bulk export trace data](/langsmith/data-export#2-create-an-export-job).

### Output file structure

Exported files land in GCS using a Hive-partitioned path structure:

```
gs://YOUR_BUCKET_NAME/YOUR_PREFIX/export_id=<uuid>/tenant_id=<uuid>/session_id=<uuid>/resource=runs/year=<year>/month=<month>/day=<day>/<filename>.parquet
```

The partition columns in the path (`export_id`, `tenant_id`, `session_id`, `resource`, `year`, `month`, `day`) are available as queryable columns in BigQuery when Hive partition detection is enabled.
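
As an illustration, the partition key/value pairs can be recovered from any exported object path with a few lines of Python (`parse_export_path` is a hypothetical helper, not part of LangSmith):

```python
def parse_export_path(path: str) -> dict[str, str]:
    """Extract Hive partition key/value pairs (key=value segments)
    from an exported object path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts
```

For a path like `.../export_id=abc/.../year=2024/month=01/day=15/part-0.parquet`, this yields a dict whose keys match the queryable partition columns listed above.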

## 6. Load data into BigQuery

BigQuery offers two ways to access your exported data. Both require first granting the BigQuery service account read access to your GCS bucket. Choose based on your needs:

- **External table:** data stays in GCS and BigQuery queries it in place. No storage costs in BigQuery, but query performance is slower than native storage. Refer to [Required roles](https://docs.cloud.google.com/bigquery/docs/query-cloud-storage-data#required-roles).
- **Native table:** data is copied into BigQuery storage. Faster queries and full support for BigQuery features, but incurs BigQuery storage costs. Refer to [Required permissions](https://docs.cloud.google.com/bigquery/docs/cloud-storage-transfer#required_permissions).

### Create the table

<Tabs>
<Tab title="External table">
An external table queries data directly from GCS without copying it into BigQuery.

1. In the BigQuery console, expand your project and dataset in the **Explorer** pane.
1. Click the dataset's **Actions** menu (three dots) and select **Create table**.
1. Under **Source**:
   - Set **Create table from** to **Google Cloud Storage**.
   - Set the file path to `gs://YOUR_BUCKET_NAME/YOUR_PREFIX/export_id=*`. Using `export_id=*` scopes BigQuery to Hive-partitioned export directories and excludes the `tmp/` folder that LangSmith writes during destination validation (see [Temporary validation file](#temporary-validation-file)).
   - Set **File format** to **Parquet**.
1. Check **Source data partitioning**, then:
   - Set **Source URI prefix** to `gs://YOUR_BUCKET_NAME/YOUR_PREFIX`.
   - Set **Partition inference mode** to **Automatically infer types**.
1. Under **Destination**:
   - Select your project and dataset.
   - Enter a table name, for example `langsmith_runs`.
   - Set **Table type** to **External table**.
1. Under **Schema**, enable **Auto-detect**.
1. Click **Create table**.

The partition path columns (`export_id`, `tenant_id`, `session_id`, `resource`, `year`, `month`, `day`) are available as queryable columns. Filter on `year`, `month`, or `day` in your queries to enable partition pruning.
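
As a sketch of such a pruned query, here is a hypothetical Python helper that assembles the SQL string, assuming the `langsmith_runs` table name from the steps above (the partition column types depend on the inference mode you selected; adjust the literals if BigQuery inferred them as strings):

```python
def partition_pruned_query(dataset: str, table: str,
                           year: int, month: int, day: int,
                           limit: int = 100) -> str:
    """Build a BigQuery SQL string that filters on the Hive partition
    columns so only the matching day partition is scanned."""
    return (
        f"SELECT * FROM `{dataset}.{table}` "
        f"WHERE year = {year} AND month = {month} AND day = {day} "
        f"LIMIT {limit}"
    )
```

Run the resulting string in the BigQuery console or pass it to your client of choice.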
</Tab>
<Tab title="Native table">
A native table transfers the Parquet data into BigQuery storage for full query performance.

1. Go to the [Data Transfer page](https://console.cloud.google.com/bigquery/transfers) in the Google Cloud console and select **+ Create transfer**.
1. For **Source type**, select **Google Cloud Storage**.
1. Enter a **Transfer name**. You can edit the transfer later if necessary.
1. Select a **Schedule option**. If you do not want a recurring transfer, select **On demand** and trigger the transfer manually.

Then create the destination table:

1. In the BigQuery console, expand your project and dataset in the **Explorer** pane.
1. Click the dataset's **Actions** menu (three dots) and select **Create table**.
1. Under **Source**:
   - Set **Create table from** to **Google Cloud Storage**.
   - Set the file path to `gs://YOUR_BUCKET_NAME/YOUR_PREFIX/export_id=*`. Using `export_id=*` excludes the `tmp/` folder that LangSmith writes during destination validation (see [Temporary validation file](#temporary-validation-file)).
   - Set **File format** to **Parquet**.
1. Check **Source data partitioning**, then:
   - Set **Source URI prefix** to `gs://YOUR_BUCKET_NAME/YOUR_PREFIX`.
   - Set **Partition inference mode** to **Automatically infer types**.
1. Under **Destination**:
   - Select your project and dataset.
   - Enter a table name, for example `langsmith_runs`.
   - Set **Table type** to **Native table**.
1. Under **Advanced options**, set **Write preference** to **Write if empty** for a new table.
1. Click **Create table**.

BigQuery runs a load job to copy the data. The Hive partition columns appear as regular columns in the table. For the full list of available data columns, see [Exportable fields](/langsmith/data-export#exportable-fields).
</Tab>
</Tabs>

## Credential rotation

To rotate your HMAC keys without interrupting active exports:

1. **Generate new HMAC keys** in GCP for the same service account.
2. **Call the PATCH endpoint** with the new credentials:

   ```bash
   curl --request PATCH \
     --url 'https://api.smith.langchain.com/api/v1/bulk-exports/destinations/YOUR_DESTINATION_ID' \
     --header 'Content-Type: application/json' \
     --header 'X-API-Key: YOUR_API_KEY' \
     --header 'X-Tenant-Id: YOUR_WORKSPACE_ID' \
     --data '{
       "credentials": {
         "access_key_id": "NEW_HMAC_ACCESS_ID",
         "secret_access_key": "NEW_HMAC_SECRET"
       }
     }'
   ```

   LangSmith validates the new credentials with a test write before saving. A new `tmp/` file may appear in your bucket during this validation (see [Temporary validation file](#temporary-validation-file)).

3. **Keep the old HMAC keys active** until all in-flight export runs complete. Both credential sets are valid simultaneously during the transition window.
4. **Delete the old HMAC keys** in GCP once you have confirmed no in-flight runs are using them.

For full details, see [Rotate destination credentials](/langsmith/data-export-destinations#rotate-destination-credentials).

## Troubleshooting

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| `400 Access denied` on destination creation | HMAC credentials lack write permission | Verify the service account has `storage.objects.create` on the bucket |
| `400 Key ID you provided does not exist` | HMAC access ID is invalid | Regenerate HMAC keys in GCP |
| `400 Invalid endpoint` | Endpoint URL is malformed | Use exactly `https://storage.googleapis.com` |
| BigQuery table shows no rows | Export not yet complete | Check export status with `GET /api/v1/bulk-exports/{export_id}` |
| BigQuery partition pruning not working | Incorrect source URI prefix | Ensure the source URI prefix ends before the first partition key, e.g. `gs://BUCKET/PREFIX` |
| BigQuery picks up `tmp/` files | Broad file path glob | Use `export_id=*` in your file path instead of `*` |

For additional error codes and export status details, see [Monitor and troubleshoot bulk exports](/langsmith/data-export-monitor).
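
For the "no rows" case in the table above, you can check the status endpoint directly. A minimal Python sketch using only the standard library (`get_export_status` is a hypothetical helper; the endpoint is the one named in the troubleshooting table):

```python
import json
import urllib.request

API_URL = "https://api.smith.langchain.com/api/v1"


def export_status_url(export_id: str) -> str:
    """URL of the bulk export status endpoint for a given export ID."""
    return f"{API_URL}/bulk-exports/{export_id}"


def get_export_status(export_id: str, api_key: str, workspace_id: str) -> dict:
    """GET the export record; inspect its status field before querying BigQuery."""
    req = urllib.request.Request(
        export_status_url(export_id),
        headers={"X-API-Key": api_key, "X-Tenant-Id": workspace_id},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Wait until the export reports completion before expecting rows in your BigQuery table.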
