Increase consistency for BQ typed read with avro source #27971

Closed
wants to merge 7 commits into from

RustedBones
Contributor

Fix #26329

  • Add an extra AvroT type parameter to AvroSource to keep track of the materialized Avro record type (the one produced by the Avro reader).
  • Make it possible to combine both an AvroDatumReader and a parseFn in the source.
  • Add parseFn back to BigQueryIO.TypedRead so that it behaves like AvroSource.
  • Require a coder in the API when reading custom types.
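
To make the shape of the change listed above easier to picture, here is a minimal sketch in plain Java. It is not Beam's API: `SketchSource`, `SketchCoder`, `of`, and `withParseFn` are hypothetical stand-ins, and the "datum reader" is modeled as a plain function over a map. It only illustrates the idea of a source that remembers the materialized type (`AvroT`), applies a parse function to produce `T`, and requires a coder whenever the output type changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class TwoStageSourceSketch {

  /** Hypothetical stand-in for a coder; it only names the type it would encode. */
  public interface SketchCoder<T> {
    String describe();
  }

  /** Hypothetical source: AvroT is the materialized record type, T is the output type. */
  public static final class SketchSource<AvroT, T> {
    private final Function<Map<String, Object>, AvroT> datumReader; // raw record -> AvroT
    private final Function<AvroT, T> parseFn; // AvroT -> T
    private final SketchCoder<T> coder; // always describes the output type T

    private SketchSource(
        Function<Map<String, Object>, AvroT> datumReader,
        Function<AvroT, T> parseFn,
        SketchCoder<T> coder) {
      this.datumReader = datumReader;
      this.parseFn = parseFn;
      this.coder = coder;
    }

    /** Without a parse function, the output type equals the materialized type. */
    public static <A> SketchSource<A, A> of(
        Function<Map<String, Object>, A> datumReader, SketchCoder<A> coder) {
      return new SketchSource<>(datumReader, Function.identity(), coder);
    }

    /** Changing the output type requires a coder for the new type. */
    public <X> SketchSource<AvroT, X> withParseFn(Function<AvroT, X> fn, SketchCoder<X> newCoder) {
      return new SketchSource<>(datumReader, fn, newCoder);
    }

    public SketchCoder<T> getCoder() {
      return coder;
    }

    /** Applies the reader first, then the parse function, to every raw record. */
    public List<T> read(List<Map<String, Object>> rawRecords) {
      List<T> out = new ArrayList<>();
      for (Map<String, Object> raw : rawRecords) {
        out.add(parseFn.apply(datumReader.apply(raw)));
      }
      return out;
    }
  }

  public static void main(String[] args) {
    Map<String, Object> raw = Map.of("name", "beam");
    SketchSource<String, Integer> source =
        SketchSource.<String>of(r -> (String) r.get("name"), () -> "StringCoder")
            .withParseFn(String::length, () -> "IntegerCoder");
    System.out.println(source.read(List.of(raw)));
    System.out.println(source.getCoder().describe());
  }
}
```

Because `withParseFn` takes the coder as an argument in this sketch, the coder can never silently lag behind the output type, which is the kind of mismatch the linked issue describes.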

(AvroSource<T>)
    AvroSource.from(file.toString())
        .withSchema(avroSchema)
        .withDatumReaderFactory(factory));
Contributor Author

Here the factory returns a type T, and the coder is not propagated.

@github-actions
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @bvolpato for label java.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Abacn
Contributor

Abacn commented Aug 16, 2023

Thanks for solving this long-standing issue!

The PR looks pretty solid to me. One concern is that changing the signature from AvroSource<T> to AvroSource<AvroT, T> could be a substantial breaking change. Since Java generics do not support default type parameters (e.g. class A<T, V=T>), one way to reduce the signature change is to define AvroSource<T> extends CodableAvroSource<T, T>, or to make CodableAvroSource<AvroT, T> extends AvroSource<T> (I did a little searching: https://www.google.com/search?q=codable&oq=codable&aqs=chrome..69i57j0i512l6j69i64.1163j0j4&sourceid=chrome&ie=UTF-8)

The goal is to keep the change to the code base small, which also reduces the likelihood that users need to rewrite their pipelines.
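
The first option suggested above can be sketched roughly as follows. This is plain Java with hypothetical names (`CodableSource`, `Source`, `convert`), not the actual Beam classes; it only shows how a one-parameter subclass can stand in for the missing "default type parameter" so that existing references to the one-parameter type keep compiling.

```java
import java.util.function.Function;

public class GenericDefaultSketch {

  /** Hypothetical two-parameter base class: AvroT is read, T is emitted. */
  public static class CodableSource<AvroT, T> {
    private final Function<AvroT, T> parseFn;

    public CodableSource(Function<AvroT, T> parseFn) {
      this.parseFn = parseFn;
    }

    public T convert(AvroT record) {
      return parseFn.apply(record);
    }
  }

  /** Keeps the old single-parameter shape by fixing both type arguments to T. */
  public static class Source<T> extends CodableSource<T, T> {
    public Source() {
      super(Function.identity());
    }
  }

  public static void main(String[] args) {
    // Existing call sites keep using the one-parameter type.
    Source<String> legacy = new Source<>();
    System.out.println(legacy.convert("record"));

    // New call sites can opt into distinct input and output types.
    CodableSource<String, Integer> typed = new CodableSource<>(String::length);
    System.out.println(typed.convert("record"));
  }
}
```

The trade-off is an extra class in the hierarchy, and any method that returns the base type from a builder-style call would need to be overridden in the subclass to preserve the narrower return type.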

Comment on lines +272 to 277
if (getMode() == SINGLE_FILE_OR_SUBRANGE) {
// emptyMatchTreatment is unused for mode SINGLE_FILE_OR_SUBRANGE
return this;
}
return new AvroSource<>(
getFileOrPatternSpecProvider(), emptyMatchTreatment, getMinBundleSize(), mode);
Contributor Author

Previously, a new source was created regardless of the mode, which may extract the wrong parameters in the SINGLE_FILE_OR_SUBRANGE case. Looking at the parent constructor, the field is unused in that case.
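
As a rough illustration of the guard discussed in this comment, the sketch below uses plain Java stand-ins rather than Beam's classes: `SketchSource`, its fields, and the string-valued `emptyMatchTreatment` are all hypothetical. It shows an immutable "with" method that returns the same instance when the setting has no effect in the current mode, and only builds a copy otherwise.

```java
public class ModeGuardSketch {

  /** Stand-in for the two modes named in the snippet above. */
  public enum Mode {
    FILEPATTERN,
    SINGLE_FILE_OR_SUBRANGE
  }

  /** Hypothetical immutable source with a mode-dependent setting. */
  public static final class SketchSource {
    private final String spec;
    private final Mode mode;
    private final String emptyMatchTreatment;

    public SketchSource(String spec, Mode mode, String emptyMatchTreatment) {
      this.spec = spec;
      this.mode = mode;
      this.emptyMatchTreatment = emptyMatchTreatment;
    }

    public SketchSource withEmptyMatchTreatment(String treatment) {
      if (mode == Mode.SINGLE_FILE_OR_SUBRANGE) {
        // The setting is unused in this mode, so no copy is needed.
        return this;
      }
      return new SketchSource(spec, mode, treatment);
    }

    public String getEmptyMatchTreatment() {
      return emptyMatchTreatment;
    }
  }

  public static void main(String[] args) {
    SketchSource single = new SketchSource("file.avro", Mode.SINGLE_FILE_OR_SUBRANGE, "DISALLOW");
    System.out.println(single.withEmptyMatchTreatment("ALLOW") == single);

    SketchSource pattern = new SketchSource("*.avro", Mode.FILEPATTERN, "DISALLOW");
    System.out.println(pattern.withEmptyMatchTreatment("ALLOW").getEmptyMatchTreatment());
  }
}
```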

@RustedBones
Contributor Author

I tried to avoid breaking changes as much as possible, but that gets extremely complex or creates a lot of code duplication to work around it.
I was hoping that, since AvroSource has the following javadoc:

Do not use in pipelines directly: most users should use {@link AvroIO.Read}.

a breaking change would be 'fine'.

@Abacn
Contributor

Abacn commented Aug 23, 2023

I see the difficulty here. Thanks again for addressing this. In this case, would you mind sending the proposed change to the Beam dev list to discuss the change to AvroSource?

Also CC: @aromanenko-dev, who has worked heavily on Avro, for thoughts.

@RustedBones
Contributor Author

Will do! In the meantime, can you check the linked PR?
It aims to fix the bug related to BQ/Avro source discrepancies without breaking changes.

@github-actions
Contributor

Reminder, please take a look at this PR: @bvolpato @Abacn

@github-actions
Contributor

github-actions bot commented Sep 5, 2023

Assigning a new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @ahmedabu98 for label io.

@aromanenko-dev
Contributor

Sorry for the delayed response. What is the current state of this PR? Is it blocked by something?

@Abacn
Contributor

Abacn commented Sep 7, 2023

If I understood correctly, the original issue was resolved by another approach in #28143.

@aromanenko-dev
Contributor

Yes, it looks so. Should we close this one then?

@RustedBones
Contributor Author

I have not had time to start the mailing list discussion about the breaking changes and about unifying the API between Avro and the BQ Avro dump. We can close this one; if the proposal gets accepted, I'll open a new PR.

@RustedBones RustedBones closed this Sep 8, 2023
Development

Successfully merging this pull request may close these issues.

[Bug]: BigQuerySourceBase does not propagate a Coder to AvroSource
3 participants