pgcamus (Contributor) commented Oct 23, 2025

- curl -s https://bigquery.googleapis.com/discovery/v1/apis/bigquery/v2/rest | \
	jq -S >server/resources/discovery.json
- Emptied some fields to minimize the diff: basePath, baseUrl,
  batchPath, mtlsRootUrl, and servicePath
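
For anyone who wants to reproduce the normalization offline, the two steps above (sorting keys as `jq -S` does, then blanking the host-specific fields) might look roughly like the following. This is only a sketch in Python rather than the shell pipeline used in this PR: the function name `normalize_discovery` and the sample document are hypothetical, and it assumes "emptied" means set to an empty string. Only the five field names come from the description.

```python
import json

# Field names taken from the PR description; everything else is illustrative.
FIELDS_TO_EMPTY = ("basePath", "baseUrl", "batchPath", "mtlsRootUrl", "servicePath")


def normalize_discovery(doc: dict) -> str:
    """Blank the host-specific fields and serialize with sorted keys."""
    cleaned = dict(doc)
    for field in FIELDS_TO_EMPTY:
        if field in cleaned:
            # Assumption: "emptied" means replaced by an empty string.
            cleaned[field] = ""
    # sort_keys=True orders keys at every nesting level, similar to `jq -S`.
    return json.dumps(cleaned, indent=2, sort_keys=True) + "\n"


# Tiny stand-in for the real discovery document.
sample = {
    "servicePath": "bigquery/v2/",
    "baseUrl": "https://bigquery.googleapis.com/bigquery/v2/",
    "name": "bigquery",
}
normalized = normalize_discovery(sample)
```

Keeping the output key-sorted means future imports of the discovery document produce diffs that show only real API changes.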
ohaibbq (Contributor) commented Oct 23, 2025

This looks good to me! @goccy should we merge it?

@goccy and I have been discussing moving this repository to a joint organization so that it can become more active. Hopefully that'll happen soon! If not, you're welcome to open this against our fork at Recidiviz/bigquery-emulator, which has many improvements.

pgcamus (Contributor, Author) commented Oct 23, 2025

Thanks, @ohaibbq! I've got a bunch of PRs that I'd like to get merged 😄 We're becoming fairly heavy users of this so it would be great to keep in sync with upstream.

goccy (Owner) commented Oct 24, 2025

Thank you for your contribution! LGTM 👍

goccy merged commit 6d8cfcf into goccy:main on Oct 24, 2025
4 checks passed
pgcamus (Contributor, Author) commented Oct 24, 2025

@ohaibbq @goccy I'm very interested in doing what I can to support future emulator development, including participating in a separate organization. Please ping me at pg [at] camus dot energy if there's anything I can do to help. In the meantime, I'll keep submitting PRs!

ohaibbq (Contributor) commented Oct 24, 2025

Thanks! We also rely on it very heavily: we have a suite of nearly 1,500 tests for thousands of our tables/views. Here's the list of improvements in our fork, which will hopefully be consolidated once we get the organization situation worked out.

BigQuery Emulator and ZetaSQL Forks: Change Summary

This document summarizes all changes made in Recidiviz's forks of the BigQuery Emulator and go-zetasqlite libraries.

Recidiviz/bigquery-emulator

Performance Improvements

PR #32: Detect view schema from changed catalog (Sep 2025)

  • Small perf tweak that removes an extra query that was issued to detect view schema

PR #29: Build and push arm64 images (Aug 2025)

  • Builds and pushes both linux/amd64 and linux/arm64 images
  • Greatly increases performance on emulator tests run on arm64

PR #12: Refactor API data-access pattern (Apr 2024)

  • Greatly improves performance of emulator API endpoints
  • No longer loads entire BigQuery project (jobs, datasets, tables, etc) on each request
  • Utilizes unformatted SQLite queries instead of rewriting with functions like zetasqlite_equals
  • Many endpoints now take tens of microseconds instead of 100+ms
  • Table creation takes ~10ms instead of 300ms
  • Enables running deploy_empty_test_views (1,400 views) in ~35 seconds

PR #11: Return immediately from Job.wait() (Apr 2024)

  • Returns immediately if a Job has a response set
  • Cuts job insertion time substantially from the previous ~101ms

PR #10: Use in-memory storage (Apr 2024)

  • Supports :memory: specification for in-memory database
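
For context, `:memory:` is SQLite's own name for a private in-memory database. The snippet below is only a sketch of that underlying SQLite behavior using Python's standard `sqlite3` module; it does not show the emulator's flag or configuration, and the `demo` table is hypothetical.

```python
import sqlite3

# ":memory:" asks SQLite for a private database that lives only in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO demo (name) VALUES (?)", ("example",))
rows = conn.execute("SELECT id, name FROM demo").fetchall()
conn.close()

# A second connection to ":memory:" gets a fresh, empty database.
other = sqlite3.connect(":memory:")
tables = other.execute("SELECT name FROM sqlite_master").fetchall()
other.close()
```

Nothing is written to disk, which suits short-lived test runs where state should not persist.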

PR #9: Add --data-from-json option (Apr 2024)

  • JSON parser is exponentially faster than YAML parser for large files

Feature Additions

PR #28: Migrate to modernc.org/sqlite + Go 1.24 (May 2025)

  • Updates to latest Go version and SQLite library

PR #27: Support WRITE_TRUNCATE WriteDisposition (Mar 2025)

PR #22: Update go-zetasqlite for row access policy (Jan 2025)

  • Updates go-zetasqlite to no-op row access policy
  • Cherry-picks discovery v2 document fix for bq CLI

PR #21: Allow DROP VIEW and DROP MATERIALIZED VIEW (Nov 2024)

  • The DeleteTables function was only dropping tables, not views or materialized views
  • Moves table types to internal/types and checks table type when dropping

PR #20: Properly unmarshal JSON values (Jul 2024)

PR #8: Serve HTTPS connections to enable JDBC (Mar 2024)

  • Enables connection to BigQuery Emulator via JDBC for data exploration tools like JetBrains DataGrip or DBeaver
  • Includes SSL certificate setup and OAuth flow

Bug Fixes

PR #26: Fix view query end cutset (Feb 2025)

PR #25: Hydrate Table.Schema in tablesInsertRequest.Handle (Feb 2025)

  • Closes Recidiviz/recidiviz-data issue #33979

PR #18: Return duplicate error for duplicate dataset (Jun 2024)

  • Returns duplicate error when creating a duplicate dataset
  • Returns resourceInUse error when deleting dataset

PR #17: Return invalidQuery for invalid view queries (May 2024)

PR #16: Return invalidQuery to avoid retry (May 2024)

PR #15: Fix Table PATCH response (May 2024)

PR #14: Fix job materialization (Apr 2024)

  • Fixes job materialization when destination table does not yet exist

PR #13: Fix view creation with semicolon (Apr 2024)

  • Fixes view creation when query ends in semicolon
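
A minimal sketch of the idea behind this fix (and the related "end cutset" fix in PR #26), assuming the view query is embedded in a larger statement and so must not end in `;`. The helper names are hypothetical and this is not the fork's Go code.

```python
# Characters trimmed from the end of the query: semicolons and whitespace.
TRAILING_CUTSET = "; \t\r\n"


def strip_trailing_semicolon(query: str) -> str:
    """Drop any run of semicolons/whitespace at the end of the query."""
    return query.rstrip(TRAILING_CUTSET)


def create_view_sql(view_name: str, query: str) -> str:
    """Embed the cleaned query in a CREATE VIEW statement (illustrative only)."""
    return f"CREATE VIEW `{view_name}` AS {strip_trailing_semicolon(query)}"
```

Only the tail of the string is touched, so a semicolon inside a string literal earlier in the query is left alone. A trailing comment that follows the semicolon is not handled by this sketch.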

PR #7: Handle formatting timestamps when inserting (Feb 2024)

PR #6: Fix Too many SQL variables error (Feb 2024)

PR #5: Increase request timeout (Feb 2024)

  • Bumps timeout to avoid "failed to add job: job already created" errors
  • Complex queries were causing 15-second timeout and further requests would fail

PR #4: Respect maxResults parameter (Jan 2024)


Recidiviz/go-zetasqlite

Major Performance Improvements

PR #54: Delegate comparison operators to SQLite (Sep 2025)

  • Delegates comparison operators to SQLite when possible instead of calling go-zetasqlite UDFs
  • Calling UDFs is extremely expensive compared to native SQLite operations
  • Reduces function calls by orders of magnitude and allows SQLite query-plan optimizations
  • Profile shows dramatic reduction in functionArgs and convertArgs overhead
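
To illustrate why delegation helps, the sketch below registers a user-defined equality function in SQLite and compares it with the native `=` operator. It uses Python's `sqlite3` module purely for demonstration; `demo_equals` is a hypothetical stand-in, not one of go-zetasqlite's actual UDFs.

```python
import sqlite3

calls = {"udf": 0}


def demo_equals(a, b):
    """Hypothetical stand-in for an application-side equality UDF."""
    calls["udf"] += 1
    if a is None or b is None:
        return None  # SQL NULL propagates through the comparison
    return int(a == b)


conn = sqlite3.connect(":memory:")
conn.create_function("demo_equals", 2, demo_equals)
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 1), (2, 3), (None, 4), (5, 5)])

# Each row crosses from SQLite into application code when the UDF is used.
via_udf = conn.execute("SELECT COUNT(*) FROM t WHERE demo_equals(a, b)").fetchone()[0]
# The native operator is evaluated entirely inside SQLite.
native = conn.execute("SELECT COUNT(*) FROM t WHERE a = b").fetchone()[0]
conn.close()
```

Both queries return the same count, but the UDF version pays one callback per row and hides the predicate from the query planner.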

PR #53: Rewrite formatter to use object-based SQL generation (Sep 2025)

  • Rewrites formatter.go which walks ZetaSQL AST and transforms to SQLite queries
  • Uses object-based generation instead of string-based for greater flexibility
  • Tracks column propagation through AST and rewrites column names where applicable
  • Components: coordinator_extractor.go, transformer_*.go, coordinator_coordinator.go, sqlbuilder_sqlbuilder.go

PR #50: Index zetasqlite_catalog by update time (May 2025)

  • Catalog maintains list of tables, functions, etc and re-syncs per-connection
  • Without index, sync time grows linearly (nearly 10ms by end of view graph)
  • With index, sync takes sub-hundred nanoseconds
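
The effect of such an index can be sketched with a toy catalog table. The table and column names below (`demo_catalog`, `updated_at`) are hypothetical and may not match the real `zetasqlite_catalog` schema; the point is only that a range filter on the update time can use the index instead of scanning every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real catalog columns may differ.
conn.execute(
    "CREATE TABLE demo_catalog (name TEXT PRIMARY KEY, spec TEXT, updated_at INTEGER)"
)
conn.execute("CREATE INDEX demo_catalog_updated_at ON demo_catalog (updated_at)")
conn.executemany(
    "INSERT INTO demo_catalog VALUES (?, ?, ?)",
    [(f"view_{i}", "{}", i) for i in range(1000)],
)

# A per-connection re-sync only needs entries changed since the last sync.
last_synced_at = 997
changed = conn.execute(
    "SELECT name FROM demo_catalog WHERE updated_at > ? ORDER BY updated_at",
    (last_synced_at,),
).fetchall()

# Ask SQLite how it would execute the lookup; the last column is the detail text.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name, spec FROM demo_catalog WHERE updated_at > ?",
    (last_synced_at,),
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
conn.close()
```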

PR #44: Use WITHOUT ROWID table clustering (Apr 2024)

PR #43: Allow disabling query formatting; support named parameters (Apr 2024)

PR #32: Materialize CTEs for faster execution (Mar 2024)

PR #20: Rewrite window function implementation (Feb 2024)

  • Drastically improves performance by using real SQLite windows
  • No longer requires duplicate subqueries; delegates sorting/partitioning to SQLite
  • One ingest view: ~217k chars with 272 SELECT statements → ~117k chars with 78 SELECT
  • Implements numbering functions (RANK, DENSE_RANK, CUME_DIST) similar to SQLite/Postgres internals
  • Uses SQLite window function callbacks: xStep, xFinal, xValue, xInverse
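
As a rough illustration of what "real SQLite windows" provide, the query below lets SQLite do the partitioning and sorting itself. It uses SQLite's built-in `RANK` and `DENSE_RANK` (available in SQLite 3.25 and later) through Python's `sqlite3` module; the `scores` table is hypothetical, and the fork's own Go implementations of these functions are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (team TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 10), ("a", 20), ("a", 20), ("a", 30), ("b", 5)],
)

# One named window definition; SQLite handles PARTITION BY and ORDER BY.
rows = conn.execute(
    """
    SELECT team, score,
           RANK() OVER w AS rnk,
           DENSE_RANK() OVER w AS dense_rnk
    FROM scores
    WINDOW w AS (PARTITION BY team ORDER BY score)
    ORDER BY team, score
    """
).fetchall()
conn.close()
```

Because the window is declared once, no duplicate subquery per analytic function is needed, which is the size reduction the ingest-view figures above describe.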

Feature Additions

PR #51: Migrate to modernc.org/sqlite (May 2025)

  • Removes dependency on forked mattn sql driver
  • First-class support for user-defined window functions
  • Updates to latest SQLite release with many performance improvements

PR #49: Upgrade to Go 1.24 (May 2025)

PR #48: Add no-op mechanism for unsupported statements (Jan 2025)

  • Prevents emulator from returning 500 errors
  • Adds no-op for CreateRowAccessPolicyStmt and DropRowAccessPolicyStmt
  • Does not actually implement the statements

PR #45: Support CONTAINS_SUBSTR function (Apr 2024)

PR #42: Support LOGICAL_OR and LOGICAL_AND windows (Apr 2024)

PR #41: Fix QueryStmtNode output; wrap union statements (Apr 2024)

PR #40: Support sql.Driver's PrepareContext interface (Apr 2024)

PR #38: Enable QUALIFY without GROUP BY/WHERE/HAVING (Mar 2024)

PR #37: Support UNNEST WITH OFFSET (Mar 2024)

PR #36: Support PIVOT/UNPIVOT (Mar 2024)

PR #1: Suppress errors when SQL functions have SAFE. prefix (Apr 2023)

  • zetasql sets error mode to SafeMode for SAFE. prefix functions
  • Calls separate version that suppresses errors for normal functions
  • Throws error for aggregations/analytical functions (safe mode not supported)
  • Addresses upstream goccy/bigquery-emulator issue #149 ("SAFE.PARSE_DATE doesn't return NULL on error")
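
The SAFE. behavior for scalar functions can be sketched as a wrapper that turns errors into NULL. This is a Python illustration of the semantics only; `parse_date` and `safe` are hypothetical names and the fork's Go implementation is structured differently.

```python
from datetime import date, datetime
from typing import Callable, Optional


def parse_date(fmt: str, value: str) -> date:
    """Strict variant: raises ValueError on malformed input."""
    return datetime.strptime(value, fmt).date()


def safe(fn: Callable[..., date]) -> Callable[..., Optional[date]]:
    """Wrap a scalar function so that errors become NULL (None)."""

    def wrapper(*args):
        try:
            return fn(*args)
        except ValueError:
            return None

    return wrapper


# Behaves like SAFE.PARSE_DATE: same result on success, NULL on failure.
safe_parse_date = safe(parse_date)
```

For example, `safe_parse_date("%Y-%m-%d", "not a date")` yields `None`, where the strict `parse_date` would raise.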

Date/Time Improvements

PR #52: Fix DATE_DIFF handling of diff remainder (May 2025)

  • Fixes results when input times differ by duration not evenly divisible by 24 hours but don't cross day boundary
  • Problem: function unconditionally incremented days when there was non-zero remainder
  • Fix: truncates end time to closest day before calculating difference
  • Corresponding PR opened upstream in goccy/go-zetasqlite (#230)
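
The bug and the fix can be sketched as follows. This is an illustration in Python, not the fork's Go code: the function names are hypothetical, and truncating both endpoints to midnight is this sketch's simplification of "truncate before calculating the difference". It relies on the BigQuery rule that a DAY difference counts day boundaries crossed.

```python
from datetime import datetime, timedelta

DAY = timedelta(days=1)


def day_diff_buggy(end: datetime, start: datetime) -> int:
    """Mimics the described bug: any non-zero remainder adds a day."""
    delta = end - start
    days = delta // DAY
    if delta % DAY != timedelta(0):
        days += 1
    return days


def day_diff_fixed(end: datetime, start: datetime) -> int:
    """Truncate both inputs to midnight, then count whole days."""
    end_day = end.replace(hour=0, minute=0, second=0, microsecond=0)
    start_day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    return (end_day - start_day) // DAY


# Same calendar day, 22 hours apart: no day boundary is crossed.
same_day_start = datetime(2024, 1, 1, 1, 0)
same_day_end = datetime(2024, 1, 1, 23, 0)
```

With these inputs the buggy version reports one day while the fixed version reports zero.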

PR #39: Handle QUARTER, WEEK(DAY), ISOWEEK (Apr 2024)

  • Full range of dates 0001-01-01 to 9999-12-31 now supported (previously hit int64 caps)
  • QUARTER support for DATE_ADD
  • WEEK, WEEK(DAY OF WEEK) support for DATE_DIFF, DATE_TRUNC
  • ISOWEEK, ISOYEAR, QUARTER support for DATE_TRUNC
  • De-duplicates date function logic - DATETIME and TIMESTAMP defer to DATE methods

PR #35: Parse julian day of year (Mar 2024)

  • Implements dayOfYearParser
  • Tests assorted parsing from ingest and edge cases like leap year

PR #24: Fix %p 12pm case (Feb 2024)

PR #23: Capture all whitespace in date parser (Feb 2024)

PR #19: Implement %p, improve token composition (Feb 2024)

PR #11: Base date in parsing is start of unix time (Jan 2024)

PR #8: Implement %y year without century parser (Jan 2024)

  • %y: year without century as decimal (00-99) with optional leading zero
  • Years 00-68 are 2000s, 69-99 are 1900s
  • Handles incomplete digits
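
The century pivot described above can be written down directly. This is a small Python sketch with hypothetical function names, not the fork's parser.

```python
def expand_two_digit_year(yy: int) -> int:
    """Map a %y value to a full year: 00-68 -> 2000s, 69-99 -> 1900s."""
    if not 0 <= yy <= 99:
        raise ValueError(f"two-digit year out of range: {yy}")
    return 2000 + yy if yy <= 68 else 1900 + yy


def parse_two_digit_year(text: str) -> int:
    """Accept one or two digits, so the leading zero is optional."""
    if not (1 <= len(text) <= 2 and text.isdigit()):
        raise ValueError(f"not a %y value: {text!r}")
    return expand_two_digit_year(int(text))
```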

PR #7: Use microsecond precision for timestamps (Jan 2024)

  • Previously used nanoseconds which only covered partial range of supported timestamps
  • BigQuery only supports microsecond precision anyway
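
The range argument can be checked with a little arithmetic. The sketch below assumes timestamps are stored as a signed 64-bit offset from the Unix epoch and uses BigQuery's documented TIMESTAMP range of 0001-01-01 to 9999-12-31 23:59:59.999999; the helper names are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def micros_since_epoch(dt: datetime) -> int:
    """Exact integer microseconds between the Unix epoch and dt."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def fits_int64(value: int) -> bool:
    return INT64_MIN <= value <= INT64_MAX


# Endpoints of the supported timestamp range.
min_us = micros_since_epoch(datetime(1, 1, 1))
max_us = micros_since_epoch(datetime(9999, 12, 31, 23, 59, 59, 999999))

micros_fit = fits_int64(min_us) and fits_int64(max_us)
# The same instants expressed in nanoseconds are 1000x larger.
nanos_fit = fits_int64(min_us * 1000) and fits_int64(max_us * 1000)
```

A signed 64-bit nanosecond counter spans only about 292 years on either side of 1970, so the full range fits in microseconds but not in nanoseconds.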

PR #4: Output correct date/time formats when casting to strings (May 2023)

  • When casting date/time to strings, emulator was using display format instead of BigQuery format
  • Updates functionality to match BigQuery behavior
  • Does not yet support format clauses when casting to strings (lower priority)

Bug Fixes - String Functions

PR #47: Trim all whitespace by default, not just spaces (Jun 2024)

PR #30: Do not cast integer/float-like strings to datetime (Feb 2024)

PR #29: Use direct string value from ZetaSQL (Feb 2024)

PR #16: Cast to INT64 should use base-10 parsing (Feb 2024)

PR #14: LIKE properly escapes regexp-characters (Feb 2024)
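
  • Illustration: when a LIKE pattern is translated to a regular expression, every literal character needs escaping so that only `%` and `_` act as wildcards. The Python sketch below shows the idea; the function names are hypothetical, backslash escapes inside the LIKE pattern are not handled, and the fork's Go code is not reproduced here.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a LIKE pattern into an anchored regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # any run of characters
        elif ch == "_":
            parts.append(".")  # exactly one character
        else:
            parts.append(re.escape(ch))  # literal, even if regex-special
    return re.compile("".join(parts), re.DOTALL)


def like(value: str, pattern: str) -> bool:
    """Return True when the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

Without the escaping step, a pattern such as `a.b` would wrongly match `axb` because `.` is a regex wildcard.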

PR #5: Harden string functions when NULL is passed (May 2023)

  • Many string functions failed with "expected STRING or BYTES" when receiving NULL
  • Adds tests for NULL handling and updates implementation to match BigQuery (returns NULL or empty array)
  • Root cause of Recidiviz/recidiviz-data #20740 and many todos in #20752

Bug Fixes - Other

PR #46: Fix control flow execution of IF(), IFNULL() (Jun 2024)

PR #34: Don't crash on nil in LOGICAL_OR and LOGICAL_AND (Mar 2024)

PR #33: Ignore nulls when counting window values (Mar 2024)

PR #31: Return nil for null values in IN(), BETWEEN(), LIKE() (Mar 2024)

PR #28: Use value comparators for LEAST, GREATEST, BETWEEN (Feb 2024)

PR #27: Handle null ARRAY fields (Feb 2024)

PR #26: Support LEFT OUTER/INNER JOIN modes for arrays (Feb 2024)

PR #25: Correctly handle ordering multiple fields in aggregates (Feb 2024)

PR #22: Fix IN() operator return when left-hand side is null (Feb 2024)

PR #21: Fix prepared insert statements (Feb 2024)

PR #18: Fix NULLIF panic on null (Feb 2024)

PR #17: Reset format context between analytic function groups (Feb 2024)

PR #15: Fix syntax error with subselects or QUALIFY (Feb 2024)

PR #13: Fix ordinal boundary indexing for arrays (Feb 2024)

PR #12: Properly return nil for STRING_AGG (Feb 2024)

PR #10: Handle multiple sort expressions in windowing (Jan 2024)

PR #9: Handle null values in partitions (Jan 2024)

  • Inserts placeholder value for partitioning when rows contain null
  • Closes Recidiviz/recidiviz-data issue #20751

Testing Improvements

PR #6: Use UTC timezone in query_test.go (Jan 2024)


Summary

Recidiviz/bigquery-emulator: 34 PRs

  • Performance: 11 major improvements including API refactor, arm64 support, in-memory mode
  • Features: 8 additions including JDBC support, DROP VIEW support, JSON initialization
  • Bug Fixes: 14 fixes for view creation, job materialization, error handling, etc.
  • Dependencies: Regular updates to go-zetasqlite fork

Recidiviz/go-zetasqlite: 54 PRs

  • Performance: 8 major improvements including comparison operator delegation, formatter rewrite, catalog indexing, window function rewrite
  • Features: 11 additions including modernc.org/sqlite migration, Go 1.24, PIVOT/UNPIVOT, CONTAINS_SUBSTR
  • Date/Time: 11 improvements supporting full date ranges, various format specifiers, microsecond precision
  • String Functions: 7 fixes for trimming, casting, NULL handling, LIKE escaping
  • Other Bug Fixes: 16 fixes for window functions, aggregates, operators, control flow
  • Testing: 1 improvement for UTC timezone handling

simi commented Oct 24, 2025

@ohaibbq is there any docker image publicly available to try your version? It looks amazing.

ohaibbq (Contributor) commented Oct 24, 2025

@simi yes- https://github.com/Recidiviz/bigquery-emulator/releases/tag/v0.4.4-recidiviz.25 https://github.com/Recidiviz/bigquery-emulator/pkgs/container/bigquery-emulator

simi commented Oct 24, 2025

I have seen that one, but is it as simple as one docker pull (or an entry in docker-compose) away from testing?

ohaibbq (Contributor) commented Oct 24, 2025

Yes, it can be a drop-in replacement, though it is based on upstream v0.4.4, not the latest v0.6.6.

pgcamus (Contributor, Author) commented Oct 27, 2025

Incredible work @ohaibbq and team! Our list of fixes is quite a bit more modest by comparison:

  • Fix for broken Location headers in the resumable upload response; the bug prevented running behind a NAT or in a standalone container
  • Support for copy jobs
  • Properly quote RECORD field names when creating a table

I would like to figure out how to add support for WITH RECURSIVE to bigquery-emulator / go-zetasqlite, as that is the major missing piece of functionality keeping some of our tests on GCP BigQuery and off the emulator. @ohaibbq if you or someone else on your team has 15 minutes to walk me through some of the relevant codepaths, I can take a stab at a PR for this.

ohaibbq pushed a commit to Recidiviz/bigquery-emulator that referenced this pull request Oct 29, 2025
* Run discovery.json through jq -S

* Import newest discovery document from Google
