Releases: MobileTeleSystems/onetl
0.15.0 (2025-12-08)
Removals
Drop the Teradata connector. It is no longer used in our company and never had proper integration tests.
Breaking Changes
Add mandatory `Iceberg(catalog=..., warehouse=...)` options (#391, #393, #394, #397, #399, #413).
In 0.14.0 we implemented a very basic Iceberg connector configured via a dictionary:
```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    extra={
        "type": "rest",
        "uri": "https://catalog.company.com/rest",
        "rest.auth.type": "oauth2",
        "token": "jwt_token",
        "warehouse": "s3a://mybucket/",
        "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "s3.endpoint": "http://localhost:9010",
        "s3.access-key-id": "access_key",
        "s3.secret-access-key": "secret_key",
        "s3.path-style-access": "true",
        "client.region": "us-east-1",
    },
    spark=spark,
)
```

Now we've implemented wrapper classes that allow configuring various Iceberg catalogs:
```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=Iceberg.RESTCatalog(
        url="https://catalog.company.com/rest",
        auth=Iceberg.RESTCatalog.BearerAuth(
            access_token="jwt_token",
        ),
    ),
    warehouse=...,
)
```

```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=Iceberg.RESTCatalog(
        url="https://catalog.company.com/rest",
        auth=Iceberg.RESTCatalog.OAuth2ClientCredentials(
            client_id="my_client",
            client_secret="my_secret",
            oauth2_token_endpoint="http://keycloak.company.com/realms/my-realm/protocol/openid-connect/token",
            scopes=["catalog"],
        ),
    ),
    warehouse=...,
    spark=spark,
)
```

And also a set of classes to configure warehouses:
```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=...,
    # using Iceberg AWS integration
    warehouse=Iceberg.S3Warehouse(
        path="/",
        bucket="mybucket",
        host="localhost",
        port=9010,
        protocol="http",
        path_style_access=True,
        access_key="access_key",
        secret_key="secret_key",
        region="us-east-1",
    ),
    spark=spark,
)
```

```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=...,
    # delegate warehouse config to the REST Catalog
    warehouse=Iceberg.DelegatedWarehouse(
        warehouse="some-warehouse",
        access_delegation="vended-credentials",
    ),
    spark=spark,
)
```

```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    # store both data and metadata on HadoopFilesystem
    catalog=Iceberg.FilesystemCatalog(),
    warehouse=Iceberg.FilesystemWarehouse(
        path="/some/warehouse",
        connection=SparkHDFS(cluster="dwh"),
    ),
    spark=spark,
)
```

Having classes instead of dicts brings IDE autocompletion, and allows reusing the same catalog connection options for multiple warehouses.
Features
- Added support for `Iceberg.WriteOptions(table_properties={})` (#401). In particular, the table's `"location": "/some/warehouse/mytable"` can now be set.
- Added support for `Hive.WriteOptions(table_properties={})` (#412). In particular, the table's `"auto.purge": "true"` can now be set (see the sketch after this list).
Improvements
- Allow setting `SparkS3(path_style_access=True)` instead of `SparkS3(extra={"path.style.access": True})` (#392). This change improves IDE autocompletion and makes it more explicit that the parameter is important for the connector's functionality.
- Add a runtime warning about missing `S3(region=...)` and `SparkS3(region=...)` params (#418). It is recommended to explicitly pass this parameter to avoid potential access errors. A combined sketch follows this list. Thanks to @yabel.
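A minimal sketch combining both improvements; the host, bucket, credentials and region values are placeholders:

```python
from onetl.connection import SparkS3

spark_s3 = SparkS3(
    host="s3.company.com",
    bucket="mybucket",
    access_key="access_key",
    secret_key="secret_key",
    region="us-east-1",      # passing region explicitly avoids the new runtime warning
    path_style_access=True,  # instead of extra={"path.style.access": True}
    spark=spark,             # existing SparkSession
)
```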
Dependencies
- Update JDBC connectors:
  - MySQL `9.4.0` → `9.5.0`
  - MSSQL `13.2.0` → `13.2.1`
  - Oracle `23.9.0.25.07` → `23.26.0.0.0`
  - Postgres `42.7.7` → `42.7.8`
- Added support for `Clickhouse.get_packages(package_version="0.9.3")` (#407). Versions in the range 0.8.0-0.9.2 are not supported due to issue #2625. Version 0.9.3+ is still not the default because of various compatibility and performance issues; use it at your own risk (see the sketch after this list).
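Opting into the newer driver looks roughly like this; the Spark session setup is illustrative:

```python
from pyspark.sql import SparkSession

from onetl.connection import Clickhouse

# Explicitly request driver 0.9.3 instead of the default version.
maven_packages = Clickhouse.get_packages(package_version="0.9.3")

spark = (
    SparkSession.builder.appName("onetl_clickhouse_demo")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)
```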
Documentation
- Document using the Greenplum connector with Spark on `master=k8s`.
0.14.1 (2025-11-25)
Dependencies
Release `minio==7.2.19` led to a broken S3 connector, with errors like these:

```
TypeError: Minio.fget_object() takes 1 positional argument but 3 were given
TypeError: Minio.fput_object() takes 1 positional argument but 3 were given
```

Fixed.
Added a `minio<8.0` limit to avoid breaking things in the next major release.
0.14.0 (2025-09-08)
Features
- Add Spark 4.0 support. (#297)
- Add `Iceberg` connection support. For now this is an alpha version, and behavior may change in the future. (#378, #386)
Breaking Changes
- Drop Spark 2 support. The minimal supported Spark version is now 3.2. (#383)
  Also dropped: `Greenplum.package_spark_2_3`, `Greenplum.package_spark_2_4`.
- Update DB connectors/drivers to latest versions:
  - MongoDB `10.4.1` → `10.5.0`
  - MySQL `9.2.0` → `9.4.0`
  - MSSQL `12.8.10` → `13.2.0`
  - Oracle `23.7.0.25.01` → `23.9.0.25.07`
  - Postgres `42.7.5` → `42.7.7`
- Update Excel package name from `com.crealytics:spark-excel` to `dev.mauch:spark-excel`. (#382)
- Now the `Excel.get_packages(package_version=...)` parameter is mandatory (see the sketch after this list). (#382)
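A hedged sketch of the new call; the version numbers below are illustrative only:

```python
from onetl.connection import Excel

# package_version is now mandatory; packages are resolved as dev.mauch:spark-excel.
maven_packages = Excel.get_packages(
    spark_version="3.5.1",
    package_version="0.31.0",  # check the dev.mauch:spark-excel releases for real versions
)
```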
Improvements
- Return full file/directory path from `FileConnection.list_dir` and `FileConnection.walk` (see the sketch after this list). (#381) Previously these methods returned only file names.
- Speed up removing S3 and Samba directories with `recursive=True`. (#380)
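A minimal sketch of the new behavior, assuming an existing SFTP connection; host, credentials and paths are placeholders:

```python
from onetl.connection import SFTP

sftp = SFTP(host="sftp.company.com", user="user", password="***")

# Since 0.14.0 entries are full paths like "/remote/dir/file.csv";
# previously only bare names like "file.csv" were returned.
for entry in sftp.list_dir("/remote/dir"):
    print(entry)
```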
Bug fixes
- Treat S3 objects with names ending with a `/` slash as directory markers. (#379)
0.13.5 (2025-04-14)
Bug Fixes
0.13.0 changed the way `Greenplum.check()` is implemented: it began checking DB availability from both the Spark driver and an executor. But due to a typo, SELECT queries were emitted from all available executors. This led to opening too many connections to Greenplum, which was unexpected.
Now only one Spark executor is used to run `Greenplum.check()`.
0.13.4 (2025-03-20)
Doc only Changes
- Prefer `ReadOptions(partitionColumn=..., numPartitions=..., queryTimeout=...)` over `ReadOptions(partition_column=..., num_partitions=..., query_timeout=...)`, to match Spark documentation (see the sketch after this list). (#352)
- Prefer `WriteOptions(if_exists=...)` over `WriteOptions(mode=...)` for IDE suggestions. (#354)
- Document all options of supported file formats. (#355, #356, #357, #358, #359, #360, #361, #362)
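For example, the documented spelling now mirrors Spark's JDBC options; this sketch assumes an existing Postgres connection and an illustrative table:

```python
from onetl.connection import Postgres
from onetl.db import DBReader

reader = DBReader(
    connection=postgres,  # existing Postgres connection
    source="schema.table",
    options=Postgres.ReadOptions(
        partitionColumn="id",  # documented instead of partition_column
        numPartitions=10,      # documented instead of num_partitions
        queryTimeout=300,      # documented instead of query_timeout
    ),
)
```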
0.13.3 (2025-03-11)
Dependencies
Allow using etl-entities 2.6.0.
0.13.1 (2025-03-06)
Bug Fixes
In 0.13.0, using `DBWriter(connection=hive, target="SOMEDB.SOMETABLE")` led to executing `df.write.saveAsTable()` instead of `df.write.insertInto()` if the target table `somedb.sometable` already existed.
This was caused by table name normalization (Hive uses lower-case names), which wasn't properly handled by the method used for checking table existence. (#350)
Warning
0.13.0 release is yanked from PyPI
0.13.0 (2025-02-24)
🎉 3 years since first release 0.1.0 🎉
Warning
0.13.0 release is yanked from PyPI due to a bug. Please upgrade to 0.13.1.
Breaking Changes
- Add Python 3.13 support. (#298)
- Change the logic of `FileConnection.walk` and `FileConnection.list_dir`. (#327) Previously, `limits.stops_at(path) == True` was treated as "return the current file and stop", which could lead to exceeding some limit. Now it means "stop immediately".
- Change the default value of `FileDFWriter.Options(if_exists=...)` from `error` to `append`, to make it consistent with other `.Options()` classes within onETL (see the sketch after this list). (#343)
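To keep the previous behavior, the old default can still be set explicitly; a sketch with an illustrative connection and path:

```python
from onetl.file import FileDFWriter
from onetl.file.format import CSV

writer = FileDFWriter(
    connection=hdfs,  # existing SparkHDFS/SparkS3/SparkLocalFS connection
    format=CSV(),
    target_path="/target/dir",
    options=FileDFWriter.Options(if_exists="error"),  # restore the pre-0.13.0 default
)
```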
Features
- Add support for the `FileModifiedTimeHWM` HWM class (see etl-entities 2.5.0):

```python
from etl_entities.hwm import FileModifiedTimeHWM

from onetl.file import FileDownloader
from onetl.strategy import IncrementalStrategy

downloader = FileDownloader(
    ...,
    hwm=FileModifiedTimeHWM(name="somename"),
)

with IncrementalStrategy():
    downloader.run()
```
- Introduce the `FileSizeRange(min=..., max=...)` filter class. (#325) Now users can set `FileDownloader`/`FileMover` to download/move only files within a specific file size range:

```python
from onetl.file import FileDownloader
from onetl.file.filter import FileSizeRange

downloader = FileDownloader(
    ...,
    filters=[FileSizeRange(min="10KiB", max="1GiB")],
)
```
- Introduce the `TotalFilesSize(...)` limit class. (#326) Now users can set `FileDownloader`/`FileMover` to stop downloading/moving files after reaching a certain amount of data:

```python
from onetl.file import FileDownloader
from onetl.file.limit import TotalFilesSize

downloader = FileDownloader(
    ...,
    limits=[TotalFilesSize("1GiB")],
)
```
- Implement the `FileModifiedTime(since=..., until=...)` file filter. (#330) Now users can set `FileDownloader`/`FileMover` to download/move only files with a specific file modification time:

```python
from datetime import datetime, timedelta

from onetl.file import FileDownloader
from onetl.file.filter import FileModifiedTime

downloader = FileDownloader(
    ...,
    filters=[FileModifiedTime(until=datetime.now() - timedelta(hours=1))],
)
```
- Add `SparkS3.get_exclude_packages()` and `Kafka.get_exclude_packages()` methods. (#341) Using them allows skipping the download of dependencies that are not required by this specific connector, or that are already a part of Spark/PySpark:

```python
from pyspark.sql import SparkSession

from onetl.connection import SparkS3, Kafka

maven_packages = [
    *SparkS3.get_packages(spark_version="3.5.4"),
    *Kafka.get_packages(spark_version="3.5.4"),
]
exclude_packages = SparkS3.get_exclude_packages() + Kafka.get_exclude_packages()

spark = (
    SparkSession.builder.appName("spark_app_onetl_demo")
    .config("spark.jars.packages", ",".join(maven_packages))
    .config("spark.jars.excludes", ",".join(exclude_packages))
    .getOrCreate()
)
```
Improvements
- All DB connections opened by `JDBC.fetch(...)`, `JDBC.execute(...)` or `JDBC.check()` are immediately closed after the statement is executed. (#334)
  Previously, a Spark session with `master=local[3]` actually opened up to 5 connections to the target DB: one for `JDBC.check()`, another for the Spark driver's interaction with the DB to create tables, and one for each Spark executor. Now at most 4 connections are opened, as `JDBC.check()` does not hold an open connection. This is important for RDBMS like Postgres or Greenplum, where the number of connections is strictly limited and the limit is usually quite low.
- Set up `ApplicationName` (client info) for Clickhouse, MongoDB, MSSQL, MySQL and Oracle. (#339, #248)
  Also update the `ApplicationName` format for Greenplum, Postgres, Kafka and SparkS3. Now all connectors have the same `ApplicationName` format: `${spark.applicationId} ${spark.appName} onETL/${onetl.version} Spark/${spark.version}`.
  The only connections not sending `ApplicationName` are Teradata and the FileConnection implementations.
Now
DB.check()will test connection availability not only on Spark driver, but also from some Spark executor. (#346)This allows to fail immediately if Spark driver host has network access to target DB, but Spark executors have not.
Note
Now Greenplum.check() requires the same user grants as DBReader(connection=greenplum):
-- yes, "writable", it's not a mistake
ALTER USER username CREATEEXTTABLE(type = 'writable', protocol = 'gpfdist');
-- for both reading and writing to GP
-- ALTER USER username CREATEEXTTABLE(type = 'readable', protocol = 'gpfdist') CREATEEXTTABLE(type = 'writable', protocol = 'gpfdist');Please ask your Greenplum administrators to provide these grants.
Bug Fixes
- Avoid suppressing Hive Metastore errors while using `DBWriter`. (#329) Previously this was implemented as:

```python
try:
    spark.sql(f"SELECT * FROM {table}")
    table_exists = True
except Exception:
    table_exists = False
```

If the Hive Metastore was overloaded and responded with an exception, the table was considered non-existing, resulting in a full table override instead of appending or overriding only a subset of partitions.
- Fix using onETL to write data to PostgreSQL or Greenplum instances behind pgbouncer with `pool_mode=transaction`. (#336)
  Previously, `Postgres.check()` opened a read-only transaction, pgbouncer changed the entire connection type from read-write to read-only, and `DBWriter.run(df)` then executed in a read-only connection, producing errors like:

```
org.postgresql.util.PSQLException: ERROR: cannot execute INSERT in a read-only transaction
org.postgresql.util.PSQLException: ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
```

Added a workaround by passing `readOnly=True` to JDBC params for read-only connections, so pgbouncer can distinguish read-only and read-write connections properly.
After upgrading to onETL 0.13.x or higher, the same error may still appear if pgbouncer still holds read-only connections and returns them to DBWriter. To fix this, users can manually convert a read-only connection to read-write:

```python
postgres.execute("BEGIN READ WRITE;")  # <-- add this line
DBWriter(...).run()
```

After all connections in the pgbouncer pool have been converted from read-only to read-write and the error is gone, this additional line can be removed.
- Fix `MSSQL.fetch(...)` and `MySQL.fetch(...)` opening a read-write connection instead of a read-only one. (#337)
  Now this is fixed:
  - `MSSQL.fetch(...)` establishes a connection with `ApplicationIntent=ReadOnly`.
  - `MySQL.fetch(...)` calls the `SET SESSION TRANSACTION READ ONLY` statement.
- Fixed passing multiple filters to `FileDownloader` and `FileMover`. (#338) It was caused by sorting the filters list in an internal logging method, while `FileFilter` subclasses are not sortable.
- Fix a false warning about a lot of parallel connections to Greenplum. (#342)
  Creating a Spark session with `.master("local[5]")` may open up to 6 connections to Greenplum (= number of Spark executors + 1 for the driver), but onETL instead used the number of CPU cores on the host as the number of parallel connections. This led to showing a false warning that the number of Greenplum connections is too high, which should actually only be the case if the number of executors is higher than 30.
- Fix MongoDB trying to use the current database name as `authSource`. (#347) Use the default connector value, which is the `admin` database. Previous onETL versions could be fixed by:

```python
from onetl.connection import MongoDB

mongodb = MongoDB(
    ...,
    database="mydb",
    extra={
        "authSource": "admin",
    },
)
```
Dependencies
- Update DB connectors/drivers to latest versions: (#345)
  - Clickhouse `0.6.5` → `0.7.2`
  - MongoDB `10.4.0` → `10.4.1`
  - MySQL `9.0.0` → `9.2.0`
  - Oracle `23.5.0.24.07` → `23.7.0.25.01`
  - Postgres `42.7.4` → `42.7.5`
Doc only Changes
- Split large code examples to tabs. (#344)
0.12.5 (2024-12-03)
Improvements
- Use `sipHash64` instead of `md5` in Clickhouse for reading data with `{"partitioning_mode": "hash"}`, as it is 5 times faster (see the sketch after this list).
- Use `hashtext` instead of `md5` in Postgres for reading data with `{"partitioning_mode": "hash"}`, as it is 3-5 times faster.
- Use `BINARY_CHECKSUM` instead of `HASHBYTES` in MSSQL for reading data with `{"partitioning_mode": "hash"}`, as it is 5 times faster.
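These improvements apply to reads configured with hash partitioning, roughly like this; the connection object, table and column names are placeholders:

```python
from onetl.connection import Clickhouse
from onetl.db import DBReader

reader = DBReader(
    connection=clickhouse,  # existing Clickhouse connection
    source="schema.table",
    options=Clickhouse.ReadOptions(
        partitioning_mode="hash",  # rows are now distributed via sipHash64 instead of md5
        partition_column="user_id",
        num_partitions=10,
    ),
)
df = reader.run()
```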
Bug fixes
- In JDBC sources, wrap `MOD(partitionColumn, numPartitions)` with `ABS(...)` to make all returned values positive. This prevents data skew.
- Fix reading table data from MSSQL using `{"partitioning_mode": "hash"}` with a `partitionColumn` of integer type.
- Fix reading table data from Postgres using `{"partitioning_mode": "hash"}` leading to data skew (all the data was read into one Spark partition).
0.12.4 (2024-11-27)
Bug Fixes
- Fix `DBReader(conn=oracle, options={"partitioning_mode": "hash"})` leading to data skew in the last partition due to wrong `ora_hash` usage. (#319)