
Conversation

@claudevdm (Collaborator) commented Dec 11, 2025

Add a read timestamp precision setting for Storage API reads.

The Storage API allows reading TIMESTAMP(12) columns with MICROS (default), NANOS, or PICOS precision for both the AVRO and ARROW formats.

This PR propagates the read precision setting to the Storage API and adds relevant tests.
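
For reference, a minimal usage sketch (withTimestampPrecision is the option this PR adds; the pipeline setup and table name are hypothetical):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.values.PCollection;

    // Hypothetical example: read a table with a TIMESTAMP(12) column via the
    // Storage Read API, requesting nanosecond precision. TimestampPrecision is
    // the enum introduced by this PR.
    Pipeline pipeline = Pipeline.create();
    PCollection<TableRow> rows =
        pipeline.apply(
            BigQueryIO.readTableRows()
                .from("my-project:my_dataset.events") // hypothetical table
                .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ)
                .withTimestampPrecision(TimestampPrecision.NANOS));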

Known issue:
Arrow readTableRows and readTableRowsWithSchema convert Arrow records to Beam rows via ArrowConversion.java.

ArrowConversion is a generic utility for Arrow -> Beam schema conversion; it does not take the BigQuery schema into account.

Even before this PR, the Arrow format with readTableRows truncated timestamps to millisecond precision, because millis and micros timestamps were historically mapped to FieldType.DATETIME.
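
A small illustration of that truncation (values are made up; Beam's FieldType.DATETIME carries a millisecond-precision Joda instant):

    // Illustration only: narrowing a microsecond epoch value to the
    // millisecond-precision Joda Instant that FieldType.DATETIME holds
    // silently drops the sub-millisecond digits.
    long micros = 1_702_300_000_123_456L;          // ...123456 microseconds
    org.joda.time.Instant truncated =
        new org.joda.time.Instant(micros / 1000L); // keeps ...123 ms; the 456 us are lost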

For Avro readTableRowsWithSchema this is not an issue, because we can map timestamp-micros to the timestamp logical type if the BigQuery schema is TIMESTAMP(12) with read precision MICROS.



@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @claudevdm, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the BigQueryIO connector by introducing explicit control over timestamp precision when reading data using the BigQuery Storage Read API. This allows users to specify whether TIMESTAMP(12) columns should be read with nanosecond or picosecond precision, ensuring data fidelity for high-precision timestamp values. The changes involve adding a new configuration option and integrating it into the BigQuery Storage API request generation, along with thorough testing to cover various data formats and precision levels.

Highlights

  • Timestamp Precision Configuration: Introduced a new withTimestampPrecision option in BigQueryIO.TypedRead to allow users to specify the desired precision for TIMESTAMP(12) columns when reading via the BigQuery Storage Read API. This option supports nanosecond and picosecond precision.
  • BigQuery Storage API Integration: Modified internal BigQuery Storage Source classes (BigQueryStorageQuerySource, BigQueryStorageTableSource, BigQueryStorageSourceBase) to propagate and apply the TimestampPrecision setting to the underlying BigQuery Storage Read Session requests.
  • Serialization Options for Precision: The implementation now explicitly configures ArrowSerializationOptions and AvroSerializationOptions to request TIMESTAMP_PRECISION_NANOS or TIMESTAMP_PRECISION_PICOS from the BigQuery Storage API, ensuring that the specified precision is honored during data retrieval (see the sketch after this list).
  • Comprehensive Testing: Added extensive unit tests to validate the new TimestampPrecision functionality across different data formats (Avro, Arrow) and read methods (readTableRows, readTableRowsWithSchema), covering default, micros, nanos, and picos precision scenarios.
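
To make the integration highlights concrete, here is a rough sketch of how the precision might be applied when building the read session. The PicosTimestampPrecision enum names are quoted from the review comments below; the exact builder wiring is an assumption, not the PR's verbatim code.

    import com.google.cloud.bigquery.storage.v1.ArrowSerializationOptions;
    import com.google.cloud.bigquery.storage.v1.ReadSession;

    // Assumed wiring: ask the Storage Read API for nanosecond timestamps by
    // setting Arrow serialization options on the read session's table options.
    ReadSession.TableReadOptions.Builder readOptions =
        ReadSession.TableReadOptions.newBuilder()
            .setArrowSerializationOptions(
                ArrowSerializationOptions.newBuilder()
                    .setPicosTimestampPrecision( // setter name is an assumption
                        ArrowSerializationOptions.PicosTimestampPrecision
                            .TIMESTAMP_PRECISION_NANOS));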

@claudevdm changed the title from "initial" to "Add timestamp precision option to bigquery storage read for TIMESTAMP(12) columns." Dec 11, 2025
@claudevdm (Collaborator, Author) commented:

/gemini review

@gemini-code-assist bot left a comment:


Code Review

This pull request introduces a new option to specify timestamp precision when reading from BigQuery using the Storage Read API. The changes are well-implemented across the relevant classes, and the addition of comprehensive tests is excellent. I have a couple of suggestions: one to correct the default value mentioned in a Javadoc, and another to refactor a new method to improve its structure and reduce code duplication.

@claudevdm claudevdm marked this pull request as ready for review December 11, 2025 17:57
@claudevdm (Collaborator, Author) commented:

R: @Abacn

@claudevdm (Collaborator, Author) commented:

R: @ahmedabu98

@github-actions (Contributor) commented:

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers


@claudevdm claudevdm requested a review from damccorm December 12, 2025 21:58
@damccorm (Contributor) commented:

/gemini review

@damccorm left a comment:


Mostly LGTM, just had a naming question

SerializableFunction<SchemaAndRecord, T> parseFn,
Coder<T> outputCoder,
BigQueryServices bqServices,
@Nullable TimestampPrecision picosTimestampPrecision) {
@damccorm commented:

Is there a reason we're naming this picosTimestampPrecision? Couldn't it take in non-picos precisions?

@gemini-code-assist bot left a comment:

Code Review

This pull request introduces a valuable feature for controlling timestamp precision during BigQuery storage reads for TIMESTAMP(12) columns. The implementation is well-structured, propagating the new option from the user-facing API down to the storage read session creation. The accompanying tests are comprehensive and cover a wide range of scenarios. I've identified one critical issue regarding serialization that could break portability, along with a few medium-severity suggestions to improve code robustness and maintainability. Overall, this is a solid contribution.

Comment on lines +200 to +201
fieldValues.put(
"timestamp_precision", toByteArray(transform.getDirectReadPicosTimestampPrecision()));

critical

There's a mismatch between the field name used here for serialization (timestamp_precision) and the name defined in the schema and used for deserialization (direct_read_picos_timestamp_precision). This will cause portability issues as the serialized value won't be correctly read back. Please use the correct field name to ensure the transform can be correctly serialized and deserialized across different contexts.

Suggested change
fieldValues.put(
"timestamp_precision", toByteArray(transform.getDirectReadPicosTimestampPrecision()));
fieldValues.put(
"direct_read_picos_timestamp_precision", toByteArray(transform.getDirectReadPicosTimestampPrecision()));

Comment on lines +225 to +237
switch (checkNotNull(picosTimestampPrecision)) {
case MICROS:
precision = ArrowSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_MICROS;
break;
case NANOS:
precision = ArrowSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_NANOS;
break;
case PICOS:
precision = ArrowSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_PICOS;
break;
default:
return;
}

medium

This method can be slightly improved for clarity and robustness:

  1. The checkNotNull(picosTimestampPrecision) is redundant because picosTimestampPrecision is checked for null in the calling method setPicosTimestampPrecision.
  2. The switch statement is not exhaustive for the TimestampPrecision enum (it's missing MILLIS). The default case currently returns silently, which could hide bugs if new enum values are added in the future. It would be safer to explicitly handle MILLIS (if it's intentionally unsupported) and have a default case that throws an exception for any other unhandled values.
    switch (picosTimestampPrecision) {
      case MICROS:
        precision = ArrowSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_MICROS;
        break;
      case NANOS:
        precision = ArrowSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_NANOS;
        break;
      case PICOS:
        precision = ArrowSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_PICOS;
        break;
      case MILLIS:
        // The Storage Read API does not support millisecond precision for TIMESTAMP(12) columns.
        return;
      default:
        throw new IllegalArgumentException(
            "Unsupported timestamp precision for Storage Read API: " + picosTimestampPrecision);
    }

Comment on lines +245 to +257
switch (checkNotNull(picosTimestampPrecision)) {
case MICROS:
precision = AvroSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_MICROS;
break;
case NANOS:
precision = AvroSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_NANOS;
break;
case PICOS:
precision = AvroSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_PICOS;
break;
default:
return;
}

medium

Similar to setArrowTimestampPrecision, this method can be improved by removing the redundant checkNotNull and making the switch statement more robust against future changes to the TimestampPrecision enum by handling all cases explicitly or throwing an exception for unhandled ones.

    switch (picosTimestampPrecision) {
      case MICROS:
        precision = AvroSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_MICROS;
        break;
      case NANOS:
        precision = AvroSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_NANOS;
        break;
      case PICOS:
        precision = AvroSerializationOptions.PicosTimestampPrecision.TIMESTAMP_PRECISION_PICOS;
        break;
      case MILLIS:
        // The Storage Read API does not support millisecond precision for TIMESTAMP(12) columns.
        return;
      default:
        throw new IllegalArgumentException(
            "Unsupported timestamp precision for Storage Read API: " + picosTimestampPrecision);
    }

Comment on lines +2540 to +2548
switch (precision) {
case NANOS:
return ARROW_SCHEMA_TS_NANOS;
case PICOS:
return ARROW_SCHEMA_TS_PICOS;
case MICROS:
default:
return ARROW_SCHEMA_TS_MICROS;
}

medium

The default case here will treat any unhandled TimestampPrecision enum value as MICROS. This could lead to subtle bugs if new precisions are added. It would be safer to make the switch statement exhaustive or throw an exception in the default case to ensure all enum values are explicitly handled. This applies to other similar helper methods in this test file as well (e.g., getAvroSchemaTs, createArrowTsResponse).

    switch (precision) {
      case NANOS:
        return ARROW_SCHEMA_TS_NANOS;
      case PICOS:
        return ARROW_SCHEMA_TS_PICOS;
      case MICROS:
        return ARROW_SCHEMA_TS_MICROS;
      default:
        throw new IllegalArgumentException("Unsupported timestamp precision: " + precision);
    }
