Releases: apache/beam
Beam 2.55.0 release
We are happy to present the new 2.55.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.55.0, check out the detailed release notes.
Highlights
- The Python SDK will now include automatically generated wrappers for external Java transforms! (#29834)
I/Os
- Added support for handling bad records to BigQueryIO (#30081).
- Full Support for Storage Read and Write APIs
- Partial Support for File Loads (Failures writing to files supported, failures loading files to BQ unsupported)
- No Support for Extract or Streaming Inserts
- Added support for handling bad records to PubSubIO (#30372).
- Support is not available for handling schema mismatches, and enabling error handling for writing to Pub/Sub topics with schemas is not recommended
- The --enableBundling pipeline option for BigQueryIO DIRECT_READ is replaced by --enableStorageReadApiV2. Both were considered experimental and subject to change (Java) (#26354).
New Features / Improvements
- Allow writing clustered and not time-partitioned BigQuery tables (Java) (#30094).
- Redis cache support added to RequestResponseIO and Enrichment transform (Python) (#30307)
- Merged sdks/java/fn-execution and runners/core-construction-java into the main SDK. These artifacts were never meant for users, but noting that they no longer exist. These are steps to bring portability into the core SDK alongside all other core functionality.
- Added Vertex AI Feature Store handler for Enrichment transform (Python) (#30388)
Breaking Changes
- Arrow version was bumped to 15.0.0 from 5.0.0 (#30181).
- Go SDK users who build custom worker containers may run into issues with the move to distroless containers as a base (see Security Fixes).
- The issue stems from distroless containers lacking additional tools, which current custom container processes may rely on.
- See https://beam.apache.org/documentation/runtime/environments/#from-scratch-go for instructions on building and using a custom container.
- Python SDK has changed the default value for the --max_cache_memory_usage_mb pipeline option from 100 to 0. This option was first introduced in the 2.52.0 SDK version. This change restores the behavior of the 2.51.0 SDK, which does not use the state cache. If your pipeline uses iterable side input views, consider increasing the cache size by setting the option manually (#30360).
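As a rough sketch of setting the option manually, the helper below appends --max_cache_memory_usage_mb to a list of pipeline arguments unless the caller already supplied it. The helper name with_state_cache and the 100 MB default are illustrative assumptions and not part of Beam; only the flag name comes from the note above.

```python
from typing import List

FLAG = "--max_cache_memory_usage_mb"


def with_state_cache(argv: List[str], size_mb: int = 100) -> List[str]:
    """Return a copy of argv that sets the cache size if it is not already set.

    Hypothetical helper for illustration; only the flag name is taken from
    the release note.
    """
    if size_mb < 0:
        raise ValueError("size_mb must be non-negative")
    # Respect an explicit user choice in "--flag=value" or "--flag value" form.
    already_set = any(arg == FLAG or arg.startswith(FLAG + "=") for arg in argv)
    if already_set:
        return list(argv)
    return list(argv) + [f"{FLAG}={size_mb}"]
```

The returned list can then be handed to whatever builds your pipeline options, for example PipelineOptions in the Python SDK. An explicit value supplied by the user is left untouched.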
Deprecations
- N/A
Bug fixes
- Fixed SpannerIO.readChangeStream to support propagating credentials from pipeline options to the getDialect calls for authenticating with Spanner (Java) (#30361).
- Reduced the number of HTTP requests in GCSIO function calls (Python) (#30205)
Security Fixes
- Go SDK base container image moved to distroless/base-nossl-debian12, reducing vulnerable container surface to kernel and glibc (#30011).
Known Issues
- In Python pipelines, when shutting down inactive bundle processors, shutdown logic can overaggressively hold the lock, blocking acceptance of new work. Symptoms of this issue include slowness or stuckness in long-running jobs. Fixed in 2.56.0 (#30679).
List of Contributors
According to git shortlog, the following people contributed to the 2.55.0 release. Thank you to all contributors!
Ahmed Abualsaud
Anand Inguva
Andrew Crites
Andrey Devyatkin
Arun Pandian
Arvind Ram
Chamikara Jayalath
Chris Gray
Claire McGinty
Damon Douglas
Dan Ellis
Danny McCormick
Daria Bezkorovaina
Dima I
Edward Cui
Ferran Fernández Garrido
GStravinsky
Jan Lukavský
Jason Mitchell
JayajP
Jeff Kinard
Jeffrey Kinard
Kenneth Knowles
Mattie Fu
Michel Davit
Oleh Borysevych
Ritesh Ghorse
Ritesh Tarway
Robert Bradshaw
Robert Burke
Sam Whittle
Scott Strong
Shunping Huang
Steven van Rossum
Svetak Sundhar
Talat UYARER
Ukjae Jeong (Jay)
Vitaly Terentyev
Vlado Djerek
Yi Hu
akashorabek
case-k
clmccart
dengwe1
dhruvdua
hardshah
johnjcasey
liferoad
martin trieu
tvalentyn
Beam 2.54.0 release
We are happy to present the new 2.54.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.54.0, check out the detailed release notes.
Highlights
- Enrichment Transform along with GCP BigTable handler added to Python SDK (#30001).
- Beam Java Batch pipelines run on Google Cloud Dataflow will default to the portable runner (Runner V2, https://cloud.google.com/dataflow/docs/runner-v2) starting with this version. (All other languages are already on Runner V2.)
- This change is still rolling out to the Dataflow service; see the Runner V2 documentation (https://cloud.google.com/dataflow/docs/runner-v2) for how to enable or disable it intentionally.
I/Os
- Added support for writing to BigQuery dynamic destinations with Python's Storage Write API (#30045)
- Adding support for Tuples DataType in ClickHouse (Java) (#29715).
- Added support for handling bad records to FileIO, TextIO, AvroIO (#29670).
- Added support for handling bad records to BigtableIO (#29885).
New Features / Improvements
- Enrichment Transform along with GCP BigTable handler added to Python SDK (#30001).
Breaking Changes
- N/A
Deprecations
- N/A
Bugfixes
- Fixed a memory leak affecting some Go SDK since 2.46.0. (#28142)
Security Fixes
- N/A
Known Issues
- N/A
List of Contributors
According to git shortlog, the following people contributed to the 2.54.0 release. Thank you to all contributors!
Ahmed Abualsaud
Alexey Romanenko
Anand Inguva
Andrew Crites
Arun Pandian
Bruno Volpato
caneff
Chamikara Jayalath
Changyu Li
Cheskel Twersky
Claire McGinty
clmccart
Damon
Danny McCormick
dependabot[bot]
Edward Cheng
Ferran Fernández Garrido
Hai Joey Tran
hugo-syn
Issac
Jack McCluskey
Jan Lukavský
JayajP
Jeffrey Kinard
Jerry Wang
Jing
Joey Tran
johnjcasey
Kenneth Knowles
Knut Olav Løite
liferoad
Marc
Mark Zitnik
martin trieu
Mattie Fu
Naireen Hussain
Neeraj Bansal
Niel Markwick
Oleh Borysevych
pablo rodriguez defino
Rebecca Szper
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Sam Whittle
Shunping Huang
Svetak Sundhar
S. Veyrié
Talat UYARER
tvalentyn
Vlado Djerek
Yi Hu
Zechen Jian
Beam 2.53.0 release
We are happy to present the new 2.53.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.53.0, check out the detailed release notes.
Highlights
- Python streaming users that use 2.47.0 and newer versions of Beam should update to version 2.53.0, which fixes a known issue: (#27330).
I/Os
- TextIO now supports skipping multiple header lines (Java) (#17990).
- Python GCSIO is now implemented with GCP GCS Client instead of apitools (#25676)
- Adding support for LowCardinality DataType in ClickHouse (Java) (#29533).
- Added support for handling bad records to KafkaIO (Java) (#29546)
- Added support for generating text embeddings in MLTransform for Vertex AI and Hugging Face Hub models (#29564).
- NATS IO connector added (Go) (#29000).
New Features / Improvements
- The Python SDK now type checks collections.abc.Collections types properly. Some type hints that were erroneously allowed by the SDK may now fail. (#29272)
- Running multi-language pipelines locally no longer requires Docker. Instead, the same (generally auto-started) subprocess used to perform the expansion can also be used as the cross-language worker.
- Framework for adding Error Handlers to composite transforms added in Java (#29164).
- Python 3.11 images now include google-cloud-profiler (#29561).
Breaking Changes
- Upgraded to go 1.21.5 to build, fixing CVE-2023-45285 and CVE-2023-39326
Deprecations
- Euphoria DSL is deprecated and will be removed in a future release (not before 2.56.0) (#29451)
Bugfixes
- (Python) Fixed sporadic crashes in streaming pipelines that affected some users of 2.47.0 and newer SDKs (#27330).
- (Python) Fixed a bug that caused MLTransform to drop identical elements in the output PCollection (#29600).
List of Contributors
According to git shortlog, the following people contributed to the 2.53.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Alexey Romanenko
Anand Inguva
Arun Pandian
Balázs Németh
Bruno Volpato
Byron Ellis
Calvin Swenson Jr
Chamikara Jayalath
Clay Johnson
Damon
Danny McCormick
Ferran Fernández Garrido
Georgii Zemlianyi
Israel Herraiz
Jack McCluskey
Jacob Tomlinson
Jan Lukavský
JayajP
Jeffrey Kinard
Johanna Öjeling
Julian Braha
Julien Tournay
Kenneth Knowles
Lawrence Qiu
Mark Zitnik
Mattie Fu
Michel Davit
Mike Williamson
Naireen
Naireen Hussain
Niel Markwick
Pablo Estrada
Radosław Stankiewicz
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Sam Rohde
Sam Whittle
Shunping Huang
Svetak Sundhar
Talat UYARER
Tom Stepp
Tony Tang
Vlado Djerek
Yi Hu
Zechen Jiang
clmccart
damccorm
darshan-sj
gabry.wu
johnjcasey
liferoad
lrakla
martin trieu
tvalentyn
Beam 2.52.0 release
We are happy to present the new 2.52.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.52.0, check out the detailed release notes.
Highlights
- Previously deprecated Avro-dependent code (Beam Release 2.46.0) has finally been removed from the Java SDK "core" package. Please use beam-sdks-java-extensions-avro instead. This makes it easy to update the Avro version in user code without potential breaking changes in Beam "core", since the Beam Avro extension already supports the latest Avro versions and should handle this. (#25252).
- Publishing Java 21 SDK container images is now supported as part of the Apache Beam release process. (#28120)
- Direct Runner and Dataflow Runner support running pipelines on Java 21 (experimental until tests are fully set up). For other runners (Flink, Spark, Samza, etc.), support status depends on the runner projects.
New Features / Improvements
- Add UseDataStreamForBatch pipeline option to the Flink runner. When it is set to true, Flink runner will run batch jobs using the DataStream API. By default the option is set to false, so the batch jobs are still executed using the DataSet API.
- upload_graph as one of the Experiments options for DataflowRunner is no longer required when the graph is larger than 10MB for Java SDK (PR#28621).
- State and side input cache has been enabled with a default of 100 MB. Use --max_cache_memory_usage_mb=X to provide cache size for the user state API and side inputs. (Python) (#28770).
- Beam YAML stable release. Beam pipelines can now be written using YAML and leverage the Beam YAML framework, which includes a preliminary set of IOs and turnkey transforms. More information can be found in the YAML root folder and in the README.
Breaking Changes
- org.apache.beam.sdk.io.CountingSource.CounterMark uses a custom CounterMarkCoder as its default coder, since all Avro-dependent classes have finally moved to extensions/avro. If it is still required to use AvroCoder for CounterMark, then, as a workaround, a copy of the "old" CountingSource class should be placed into the project code and used directly (#25252).
- Renamed host to firestoreHost in FirestoreOptions to avoid potential conflict of command line arguments (Java) (#29201).
Bugfixes
- Fixed "Desired bundle size 0 bytes must be greater than 0" in Java SDK's BigtableIO.BigtableSource when you have more cores than bytes to read (Java) #28793.
- The watch_file_pattern arg of RunInference had no effect prior to 2.52.0. To use the behavior of the watch_file_pattern arg prior to 2.52.0, follow the documentation at https://beam.apache.org/documentation/ml/side-input-updates/ and use the WatchFilePattern PTransform as a SideInput. (#28948)
- MLTransform doesn't output artifacts such as min, max and quantiles. Instead, MLTransform will add a feature to output these artifacts in a human-readable format (#29017). For now, to use artifacts such as min and max that were produced by an earlier MLTransform, use read_artifact_location of MLTransform, which reads artifacts that were produced earlier in a different MLTransform (#29016)
- Fixed a memory leak, which affected some long-running Python pipelines: #28246.
Security Fixes
- Fixed CVE-2023-39325 (Java/Python/Go) (#29118).
- Mitigated CVE-2023-47248 (Python) #29392.
List of Contributors
According to git shortlog, the following people contributed to the 2.52.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Aleksandr Dudko
Alexey Romanenko
Anand Inguva
Andrei Gurau
Andrey Devyatkin
BjornPrime
Bruno Volpato
Bulat
Chamikara Jayalath
Damon
Danny McCormick
Devansh Modi
Dominik Dębowczyk
Ferran Fernández Garrido
Hai Joey Tran
Israel Herraiz
Jack McCluskey
Jan Lukavský
JayajP
Jeff Kinard
Jeffrey Kinard
Jiangjie Qin
Jing
Joar Wandborg
Johanna Öjeling
Julien Tournay
Kanishk Karanawat
Kenneth Knowles
Kerry Donny-Clark
Luís Bianchin
Minbo Bae
Pranav Bhandari
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
RyuSA
Shunping Huang
Steven van Rossum
Svetak Sundhar
Tony Tang
Vitaly Terentyev
Vivek Sumanth
Vlado Djerek
Yi Hu
aku019
brucearctor
caneff
damccorm
ddebowczyk92
dependabot[bot]
dpcollins-google
edman124
gabry.wu
illoise
johnjcasey
jonathan-lemos
kennknowles
liferoad
magicgoody
martin trieu
nancyxu123
pablo rodriguez defino
tvalentyn
Beam 2.51.0 release
We are happy to present the new 2.51.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.51.0, check out the detailed release notes.
New Features / Improvements
- In Python, RunInference now supports loading many models in the same transform using a KeyedModelHandler (#27628).
- In Python, the VertexAIModelHandlerJSON now supports passing in inference_args. These will be passed through to the Vertex endpoint as parameters.
- Added support to run mypy on user pipelines (#27906)
Breaking Changes
- Removed fastjson library dependency for Beam SQL. Table property is changed to be based on jackson ObjectNode (Java) (#24154).
- Removed TensorFlow from Beam Python container images PR. If you have been negatively affected by this change, please comment on #20605.
- Removed the parameter t reflect.Type from parquetio.Write. The element type is derived from the input PCollection (Go) (#28490)
- Refactor BeamSqlSeekableTable.setUp adding a parameter joinSubsetType. #28283
Bugfixes
- Fixed exception chaining issue in GCS connector (Python) (#26769).
- Fixed streaming inserts exception handling, GoogleAPICallErrors are now retried according to retry strategy and routed to failed rows where appropriate rather than causing a pipeline error (Python) (#21080).
- Fixed a bug in Python SDK's cross-language Bigtable sink that mishandled records that don't have an explicit timestamp set: #28632.
Security Fixes
- Python containers updated, fixing CVE-2021-30474, CVE-2021-30475, CVE-2021-30473, CVE-2020-36133, CVE-2020-36131, CVE-2020-36130, and CVE-2020-36135
- Used go 1.21.1 to build, fixing CVE-2023-39320
Known Issues
- Python pipelines using BigQuery Storage Read API must pin fastavro dependency to 1.8.3 or earlier: #28811
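One way to catch an unpinned environment early is a small version guard run before the pipeline is built. The sketch below only compares version strings; the function names are hypothetical, and the 1.8.3 bound is the one stated in the note above.

```python
import re
from typing import Tuple

MAX_FASTAVRO = (1, 8, 3)  # upper bound taken from the known-issue note


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release segment, e.g. "1.8.3" -> (1, 8, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def fastavro_is_pinned_low_enough(installed: str) -> bool:
    """True when the given fastavro version is 1.8.3 or earlier."""
    return parse_version(installed) <= MAX_FASTAVRO
```

The installed version string could be obtained with importlib.metadata.version("fastavro"). In a requirements file the equivalent pin would be the specifier fastavro<=1.8.3.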
List of Contributors
According to git shortlog, the following people contributed to the 2.51.0 release. Thank you to all contributors!
Adam Whitmore
Ahmed Abualsaud
Ahmet Altay
Aleksandr Dudko
Alexey Romanenko
Anand Inguva
Andrey Devyatkin
Arvind Ram
Arwin Tio
BjornPrime
Bruno Volpato
Bulat
Celeste Zeng
Chamikara Jayalath
Clay Johnson
Damon
Danny McCormick
David Cavazos
Dip Patel
Hai Joey Tran
Hao Xu
Haruka Abe
Jack Dingilian
Jack McCluskey
Jeff Kinard
Jeffrey Kinard
Joey Tran
Johanna Öjeling
Julien Tournay
Kenneth Knowles
Kerry Donny-Clark
Mattie Fu
Melissa Pashniak
Michel Davit
Moritz Mack
Pranav Bhandari
Rebecca Szper
Reeba Qureshi
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ruwann
Ryan Tam
Sam Rohde
Sereana Seim
Svetak Sundhar
Tim Grein
Udi Meiri
Valentyn Tymofieiev
Vitaly Terentyev
Vlado Djerek
Xinyu Liu
Yi Hu
Zbynek Konecny
Zechen Jiang
bzablocki
caneff
dependabot[bot]
gDuperran
gabry.wu
johnjcasey
kberezin-nshl
kennknowles
liferoad
lostluck
magicgoody
martin trieu
mosche
olalamichelle
tvalentyn
xqhu
Łukasz Spyra
Beam 2.50.0 release
We are happy to present the new 2.50.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.50.0, check out the detailed release notes.
Highlights
- Spark 3.2.2 is used as default version for Spark runner (#23804).
- The Go SDK has a new default local runner, called Prism (#24789).
- All Beam released container images are now multi-arch images that support both x86 and ARM CPU architectures.
I/Os
- Java KafkaIO now supports picking up topics via topicPattern (#26948)
- Support for read from Cosmos DB Core SQL API (#23604)
- Upgraded to HBase 2.5.5 for HBaseIO. (Java) (#27711)
- Added support for GoogleAdsIO source (Java) (#27681).
New Features / Improvements
- The Go SDK now requires Go 1.20 to build. (#27558)
- The Go SDK has a new default local runner, Prism. (#24789).
- Prism is a portable runner that executes each transform independently, ensuring coders.
- At this point it supersedes the Go direct runner in functionality. The Go direct runner is now deprecated.
- See https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/README.md for the goals and features of Prism.
- Hugging Face Model Handler for RunInference added to Python SDK. (#26632)
- Hugging Face Pipelines support for RunInference added to Python SDK. (#27399)
- Vertex AI Model Handler for RunInference now supports private endpoints (#27696)
- MLTransform transform added with support for common ML pre/postprocessing operations (#26795)
- Upgraded the Kryo extension for the Java SDK to Kryo 5.5.0. This brings in bug fixes, performance improvements, and serialization of Java 14 records. (#27635)
- All Beam released container images are now multi-arch images that support both x86 and ARM CPU architectures. (#27674). The multi-arch container images include:
- All versions of Go, Python, Java and Typescript SDK containers.
- All versions of Flink job server containers.
- Java and Python expansion service containers.
- Transform service controller container.
- Spark3 job server container.
- Added support for batched writes to AWS SQS for improved throughput (Java, AWS 2) (#21429).
Breaking Changes
- Python SDK: Legacy runner support removed from Dataflow, all pipelines must use runner v2.
- Python SDK: Dataflow Runner will no longer stage Beam SDK from PyPI in the --staging_location at pipeline submission. Custom container images that are not based on Beam's default image must include Apache Beam installation. (#26996)
Deprecations
- The Go Direct Runner is now Deprecated. It remains available to reduce migration churn.
- Tests can be set back to the direct runner by overriding TestMain:
func TestMain(m *testing.M) { ptest.MainWithDefault(m, "direct") }
- It's recommended to fix issues seen in tests using Prism, as they can also happen on any portable runner.
- Use the generic register package for your pipeline DoFns to ensure pipelines function on portable runners, like prism.
- Do not rely on closures or using package globals for DoFn configuration. They don't function on portable runners.
Bugfixes
- Fixed DirectRunner bug in Python SDK where GroupByKey gets empty PCollection and fails when pipeline option direct_num_workers!=1. (#27373)
- Fixed BigQuery I/O bug when estimating size on queries that utilize row-level security (#27474)
List of Contributors
According to git shortlog, the following people contributed to the 2.50.0 release. Thank you to all contributors!
Abacn
acejune
AdalbertMemSQL
ahmedabu98
Ahmed Abualsaud
al97
Aleksandr Dudko
Alexey Romanenko
Anand Inguva
Andrey Devyatkin
Anton Shalkovich
ArjunGHUB
Bjorn Pedersen
BjornPrime
Brett Morgan
Bruno Volpato
Buqian Zheng
Burke Davison
Byron Ellis
bzablocki
case-k
Celeste Zeng
Chamikara Jayalath
Clay Johnson
Connor Brett
Damon
Damon Douglas
Dan Hansen
Danny McCormick
Darkhan Nausharipov
Dip Patel
Dmytro Sadovnychyi
Florent Biville
Gabriel Lacroix
Hai Joey Tran
Hong Liang Teoh
Jack McCluskey
James Fricker
Jeff Kinard
Jeff Zhang
Jing
johnjcasey
jon esperanza
Josef Šimánek
Kenneth Knowles
Laksh
Liam Miller-Cushon
liferoad
magicgoody
Mahmud Ridwan
Manav Garg
Marco Vela
martin trieu
Mattie Fu
Michel Davit
Moritz Mack
mosche
Peter Sobot
Pranav Bhandari
Reeba Qureshi
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
RyuSA
Saba Sathya
Sam Whittle
Steven Niemitz
Steven van Rossum
Svetak Sundhar
Tony Tang
Valentyn Tymofieiev
Vitaly Terentyev
Vlado Djerek
Yichi Zhang
Yi Hu
Zechen Jiang
Beam 2.49.0 release
We are happy to present the new 2.49.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.49.0, check out the detailed release notes.
I/Os
- Support for Bigtable Change Streams added in Java BigtableIO.ReadChangeStream (#27183).
- Added Bigtable Read and Write cross-language transforms to Python SDK ((#26593), (#27146)).
New Features / Improvements
- Allow prebuilding large images when using --prebuild_sdk_container_engine=cloud_build, like images depending on tensorflow or torch (#27023).
- Disabled pip cache when installing packages on the workers. This reduces the size of prebuilt Python container images (#27035).
- Select dedicated avro datum reader and writer (Java) (#18874).
- Timer API for the Go SDK (Go) (#22737).
Deprecations
- Remove Python 3.7 support. (#26447)
Bugfixes
- Fixed KinesisIO NullPointerException when a progress check is made before the reader is started (IO) (#23868)
Known Issues
List of Contributors
According to git shortlog, the following people contributed to the 2.49.0 release. Thank you to all contributors!
Abzal Tuganbay
AdalbertMemSQL
Ahmed Abualsaud
Ahmet Altay
Alan Zhang
Alexey Romanenko
Anand Inguva
Andrei Gurau
Arwin Tio
Bartosz Zablocki
Bruno Volpato
Burke Davison
Byron Ellis
Chamikara Jayalath
Charles Rothrock
Chris Gavin
Claire McGinty
Clay Johnson
Damon
Daniel Dopierała
Danny McCormick
Darkhan Nausharipov
David Cavazos
Dip Patel
Dmitry Repin
Gavin McDonald
Jack Dingilian
Jack McCluskey
James Fricker
Jan Lukavský
Jasper Van den Bossche
John Casey
John Gill
Joseph Crowley
Kanishk Karanawat
Katie Liu
Kenneth Knowles
Kyle Galloway
Liam Miller-Cushon
MakarkinSAkvelon
Masato Nakamura
Mattie Fu
Michel Davit
Naireen Hussain
Nathaniel Young
Nelson Osacky
Nick Li
Oleh Borysevych
Pablo Estrada
Reeba Qureshi
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Rouslan
Saadat Su
Sam Rohde
Sam Whittle
Sanil Jain
Shunping Huang
Smeet nagda
Svetak Sundhar
Timur Sultanov
Udi Meiri
Valentyn Tymofieiev
Vlado Djerek
WuA
XQ Hu
Xianhua Liu
Xinyu Liu
Yi Hu
Zachary Houfek
alexeyinkin
bigduu
bullet03
bzablocki
jonathan-lemos
jubebo
magicgoody
ruslan-ikhsan
sultanalieva-s
vitaly.terentyev
Beam 2.48.0 release
We are happy to present the new 2.48.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.48.0, check out the detailed release notes.
Note: The release tag for Go SDK for this release is sdks/v2.48.2 instead of sdks/v2.48.0 because of incorrect commit attached to the release tag sdks/v2.48.0.
Highlights
- "Experimental" annotation cleanup: the annotation and concept have been removed from Beam to avoid the misperception of code as "not ready". Any proposed breaking changes will be subject to case-by-case pro/con decision making (and generally avoided) rather than using the "Experimental" annotation to allow them.
I/Os
- Added rename for GCS and copy for local filesystem (Go) (#25779).
- Added support for enhanced fan-out in KinesisIO.Read (Java) (#19967).
- This change is not compatible with Flink savepoints created by Beam 2.46.0 applications which had KinesisIO sources.
- Added textio.ReadWithFilename transform (Go) (#25812).
- Added fileio.MatchContinuously transform (Go) (#26186).
New Features / Improvements
- Allow passing service name for google-cloud-profiler (Python) (#26280).
- Dead letter queue support added to RunInference in Python (#24209).
- Support added for defining pre/postprocessing operations on the RunInference transform (#26308)
- Adds a Docker Compose based transform service that can be used to discover and use portable Beam transforms (#26023).
Breaking Changes
- Passing a tag into MultiProcessShared is now required in the Python SDK (#26168).
- CloudDebuggerOptions is removed (deprecated in Beam v2.47.0) for Dataflow runner as the Google Cloud Debugger service is shutting down. (Java) (#25959).
- AWS 2 client providers (deprecated in Beam v2.38.0) are finally removed (#26681).
- AWS 2 SnsIO.writeAsync (deprecated in Beam v2.37.0 due to risk of data loss) was finally removed (#26710).
- AWS 2 coders (deprecated in Beam v2.43.0 when adding Schema support for AWS Sdk Pojos) are finally removed (#23315).
Bugfixes
- Fixed Java bootloader failing with Too Long Args due to long classpaths, with a pathing jar. (Java) (#25582).
List of Contributors
According to git shortlog, the following people contributed to the 2.48.0 release. Thank you to all contributors!
Abzal Tuganbay
Ahmed Abualsaud
Alexey Romanenko
Anand Inguva
Andrei Gurau
Andrey Devyatkin
Balázs Németh
Bazyli Polednia
Bruno Volpato
Chamikara Jayalath
Clay Johnson
Damon
Daniel Arn
Danny McCormick
Darkhan Nausharipov
Dip Patel
Dmitry Repin
George Novitskiy
Israel Herraiz
Jack Dingilian
Jack McCluskey
Jan Lukavský
Jasper Van den Bossche
Jeff Zhang
Jeremy Edwards
Johanna Öjeling
John Casey
Katie Liu
Kenneth Knowles
Kerry Donny-Clark
Kuba Rauch
Liam Miller-Cushon
MakarkinSAkvelon
Mattie Fu
Michel Davit
Moritz Mack
Nick Li
Oleh Borysevych
Pablo Estrada
Pranav Bhandari
Pranjal Joshi
Rebecca Szper
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Rouslan
RuiLong J
RyujiTamaki
Sam Whittle
Sanil Jain
Svetak Sundhar
Timur Sultanov
Tony Tang
Udi Meiri
Valentyn Tymofieiev
Vishal Bhise
Vitaly Terentyev
Xinyu Liu
Yi Hu
bullet03
darshan-sj
kellen
liferoad
mokamoka03210120
psolomin
Beam 2.47.0 release
We are happy to present the new 2.47.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.47.0, check out the detailed release notes.
Highlights
- Apache Beam adds Python 3.11 support (#23848).
I/Os
- BigQuery Storage Write API is now available in Python SDK via cross-language (#21961).
- Added HbaseIO support for writing RowMutations (ordered by rowkey) to Hbase (Java) (#25830).
- Added fileio transforms MatchFiles, MatchAll and ReadMatches (Go) (#25779).
- Add integration test for JmsIO + fix issue with multiple connections (Java) (#25887).
New Features / Improvements
- The Flink runner now supports Flink 1.16.x (#25046).
- Schema'd PTransforms can now be directly applied to Beam dataframes just like PCollections. (Note that when doing multiple operations, it may be more efficient to explicitly chain the operations like df | (Transform1 | Transform2 | ...) to avoid excessive conversions.)
- The Go SDK adds new transforms periodic.Impulse and periodic.Sequence that extend support for slowly updating side input patterns. (#23106)
- Python SDK now supports protobuf <4.23.0 (#24599)
- Several Google client libraries in Python SDK dependency chain were updated to latest available major versions. (#24599)
Breaking Changes
- If a main session fails to load, the pipeline will now fail at worker startup. (#25401).
- Python pipeline options will now ignore unparsed command line flags prefixed with a single dash. (#25943).
- The SmallestPerKey combiner now requires keyword-only arguments for specifying optional parameters, such as key and reverse. (#25888).
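To illustrate what "keyword-only" means for callers, here is a minimal stand-in written with the standard library. The function smallest is hypothetical and does not reproduce the combiner's implementation or its exact semantics for reverse; it only shows the calling convention, where parameters after the bare * must be named.

```python
import heapq
from typing import Any, Callable, Iterable, List, Optional


def smallest(values: Iterable[Any], n: int, *,
             key: Optional[Callable[[Any], Any]] = None,
             reverse: bool = False) -> List[Any]:
    """Return the n smallest values (or the n largest when reverse=True).

    Stand-in for illustration only: everything after the bare `*` must be
    passed by keyword.
    """
    if reverse:
        return heapq.nlargest(n, values, key=key)
    return heapq.nsmallest(n, values, key=key)
```

With this signature, smallest(words, 1, key=len) works, while smallest(words, 1, len) raises TypeError. Call sites that passed such options positionally would need the same kind of update.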
Deprecations
- Cloud Debugger support and its pipeline options are deprecated and will be removed in the next Beam version, in response to the Google Cloud Debugger service turning down. (Java) (#25959).
Bugfixes
- BigQuery sink in STORAGE_WRITE_API mode in batch pipelines might result in data consistency issues during the handling of other unrelated transient errors for Beam SDKs 2.35.0 - 2.46.0 (inclusive). For more details see: #26521
List of Contributors
According to git shortlog, the following people contributed to the 2.47.0 release. Thank you to all contributors!
Ahmed Abualsaud
Ahmet Altay
Alexey Romanenko
Amir Fayazi
Amrane Ait Zeouay
Anand Inguva
Andrew Pilloud
Andrey Kot
Bjorn Pedersen
Bruno Volpato
Buqian Zheng
Chamikara Jayalath
ChangyuLi28
Damon
Danny McCormick
Dmitry Repin
George Ma
Jack Dingilian
Jack McCluskey
Jasper Van den Bossche
Jeremy Edwards
Jiangjie (Becket) Qin
Johanna Öjeling
Juta Staes
Kenneth Knowles
Kyle Weaver
Mattie Fu
Moritz Mack
Nick Li
Oleh Borysevych
Pablo Estrada
Rebecca Szper
Reuven Lax
Reza Rokni
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Saadat Su
Saifuddin53
Sam Rohde
Shubham Krishna
Svetak Sundhar
Theodore Ni
Thomas Gaddy
Timur Sultanov
Udi Meiri
Valentyn Tymofieiev
Xinyu Liu
Yanan Hao
Yi Hu
Yuvi Panda
andres-vv
bochap
dannikay
darshan-sj
dependabot[bot]
harrisonlimh
hnnsgstfssn
jrmccluskey
liferoad
tvalentyn
xianhualiu
zhangskz
Beam 2.46.0 release
We are happy to present the new 2.46.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
For more information on changes in 2.46.0, check out the detailed release notes.
Highlights
- Java SDK containers migrated to Eclipse Temurin as a base. This change migrates away from the deprecated OpenJDK container. Eclipse Temurin is currently based upon Ubuntu 22.04 while the OpenJDK container was based upon Debian 11.
- RunInference PTransform will accept model paths as SideInputs in Python SDK. (#24042)
- RunInference supports ONNX runtime in Python SDK (#22972)
- Tensorflow Model Handler for RunInference in Python SDK (#25366)
- Java SDK modules migrated to use :sdks:java:extensions:avro (#24748)
I/Os
- Added in JmsIO a retry policy for failed publications (Java) (#24971).
- Support for LZMA compression/decompression of text files added to the Python SDK (#25316)
- Added ReadFrom/WriteTo Csv/Json as top-level transforms to the Python SDK.
New Features / Improvements
- Add UDF metrics support for Samza portable mode.
- Option for SparkRunner to avoid the need of SDF output to fit in memory (#23852). This helps e.g. with ParquetIO reads. Turn the feature on by adding experiment use_bounded_concurrent_output_for_sdf.
- Add WatchFilePattern transform, which can be used as a side input to the RunInference PTransform to watch for model updates using a file pattern. (#24042)
- Add support for loading TorchScript models with PytorchModelHandler. The TorchScript model path can be passed to PytorchModelHandler using torch_script_model_path=<path_to_model>. (#25321)
- The Go SDK now requires Go 1.19 to build. (#25545)
- The Go SDK now has an initial native Go implementation of a portable Beam Runner called Prism. (#24789)
- For more details and current state see https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners/prism.
Breaking Changes
- The deprecated SparkRunner for Spark 2 (see 2.41.0) was removed (#25263).
- Python's BatchElements performs more aggressive batching in some cases, capping at 10 second rather than 1 second batches by default and excluding fixed cost in this computation to better handle cases where the fixed cost is larger than a single second. To get the old behavior, one can pass target_batch_duration_secs_including_fixed_cost=1 to BatchElements.
Deprecations
- Avro related classes are deprecated in module beam-sdks-java-core and will eventually be removed. Please migrate to the new module beam-sdks-java-extensions-avro instead by importing the classes from the org.apache.beam.sdk.extensions.avro package.
For the sake of migration simplicity, the relative package path and the whole class hierarchy of Avro related classes in the new module are preserved as they were before.
For example, import the org.apache.beam.sdk.extensions.avro.coders.AvroCoder class instead of org.apache.beam.sdk.coders.AvroCoder. (#24749).
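Because the relative package path is preserved, the import change is mechanical. The following is a rough sketch of a rewrite helper, assuming you supply the list of Avro-related classes your code imports. Only AvroCoder is named in the note; the helper name migrate_imports and the AVRO_CLASSES set are hypothetical.

```python
import re

OLD_PREFIX = "org.apache.beam.sdk."
NEW_PREFIX = "org.apache.beam.sdk.extensions.avro."

# Assumption: the caller lists the Avro-related classes their code imports.
# Only AvroCoder is named in the note; extend this set for your own project.
AVRO_CLASSES = frozenset({"coders.AvroCoder"})


def migrate_imports(java_source: str, classes=AVRO_CLASSES) -> str:
    """Rewrite references to the listed classes to the Avro extension package."""
    for relative_name in classes:
        old = re.escape(OLD_PREFIX + relative_name)
        # \b keeps longer names such as coders.AvroCoderFoo from being rewritten.
        java_source = re.sub(old + r"\b", NEW_PREFIX + relative_name, java_source)
    return java_source
```

Running the helper twice leaves already migrated imports unchanged, since the old fully qualified name no longer appears in the rewritten text.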
List of Contributors
According to git shortlog, the following people contributed to the 2.46.0 release. Thank you to all contributors!
Ahmet Altay
Alan Zhang
Alexey Romanenko
Amrane Ait Zeouay
Anand Inguva
Andrew Pilloud
Brian Hulette
Bruno Volpato
Byron Ellis
Chamikara Jayalath
Damon
Danny McCormick
Darkhan Nausharipov
David Katz
Dmitry Repin
Doug Judd
Egbert van der Wal
Elizaveta Lomteva
Evan Galpin
Herman Mak
Jack McCluskey
Jan Lukavský
Johanna Öjeling
John Casey
Jozef Vilcek
Junhao Liu
Juta Staes
Katie Liu
Kiley Sok
Liam Miller-Cushon
Luke Cwik
Moritz Mack
Ning Kang
Oleh Borysevych
Pablo E
Pablo Estrada
Reuven Lax
Ritesh Ghorse
Robert Bradshaw
Robert Burke
Ruslan Altynnikov
Ryan Zhang
Sam Rohde
Sam Whittle
Sam sam
Sergei Lilichenko
Shivam
Shubham Krishna
Theodore Ni
Timur Sultanov
Tony Tang
Vachan
Veronica Wasson
Vincent Devillers
Vitaly Terentyev
William Ross Morrow
Xinyu Liu
Yi Hu
ZhengLin Li
Ziqi Ma
ahmedabu98
alexeyinkin
aliftadvantage
bullet03
dannikay
darshan-sj
dependabot[bot]
johnjcasey
kamrankoupayi
kileys
liferoad
nancyxu123
nickuncaged1201
pablo rodriguez defino
tvalentyn
xqhu