Releases: MobileTeleSystems/onetl
0.15.0 (2025-12-08)
Removals
Drop the Teradata connector. It is no longer used in our company and never had proper integration tests.
Breaking Changes
Add mandatory `Iceberg(catalog=..., warehouse=...)` options (#391, #393, #394, #397, #399, #413).
In 0.14.0 we implemented a very basic Iceberg connector configured via a dictionary:
```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    extra={
        "type": "rest",
        "uri": "https://catalog.company.com/rest",
        "rest.auth.type": "oauth2",
        "token": "jwt_token",
        "warehouse": "s3a://mybucket/",
        "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "s3.endpoint": "http://localhost:9010",
        "s3.access-key-id": "access_key",
        "s3.secret-access-key": "secret_key",
        "s3.path-style-access": "true",
        "client.region": "us-east-1",
    },
    spark=spark,
)
```

Now we've implemented wrapper classes that allow configuring various Iceberg catalogs:
```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=Iceberg.RESTCatalog(
        url="https://catalog.company.com/rest",
        auth=Iceberg.RESTCatalog.BearerAuth(
            access_token="jwt_token",
        ),
    ),
    warehouse=...,
)
```

```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=Iceberg.RESTCatalog(
        url="https://catalog.company.com/rest",
        auth=Iceberg.RESTCatalog.OAuth2ClientCredentials(
            client_id="my_client",
            client_secret="my_secret",
            oauth2_token_endpoint="http://keycloak.company.com/realms/my-realm/protocol/openid-connect/token",
            scopes=["catalog"],
        ),
    ),
    warehouse=...,
    spark=spark,
)
```

And also a set of classes to configure warehouses:
```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=...,
    # using Iceberg AWS integration
    warehouse=Iceberg.S3Warehouse(
        path="/",
        bucket="mybucket",
        host="localhost",
        port=9010,
        protocol="http",
        path_style_access=True,
        access_key="access_key",
        secret_key="secret_key",
        region="us-east-1",
    ),
    spark=spark,
)
```

```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=...,
    # delegate warehouse config to the REST Catalog
    warehouse=Iceberg.DelegatedWarehouse(
        warehouse="some-warehouse",
        access_delegation="vended-credentials",
    ),
    spark=spark,
)
```

```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    # store both data and metadata on HadoopFilesystem
    catalog=Iceberg.FilesystemCatalog(),
    warehouse=Iceberg.FilesystemWarehouse(
        path="/some/warehouse",
        connection=SparkHDFS(cluster="dwh"),
    ),
    spark=spark,
)
```

Having classes instead of dicts brings IDE autocompletion, and allows reusing the same catalog connection options for multiple warehouses.
Features
- Added support for `Iceberg.WriteOptions(table_properties={})` (#401). In particular, the table's `"location": "/some/warehouse/mytable"` can now be set.
- Added support for `Hive.WriteOptions(table_properties={})` (#412). In particular, the table's `"auto.purge": "true"` can now be set (see the sketch after this list).
Improvements
- Allow setting `SparkS3(path_style_access=True)` instead of `SparkS3(extra={"path.style.access": True})` (#392). This change improves IDE autocompletion and makes it more explicit that the parameter is important for the connector's functionality.
- Add a runtime warning about missing `S3(region=...)` and `SparkS3(region=...)` params (#418). It is recommended to explicitly pass this parameter to avoid potential access errors. A combined sketch follows this list. Thanks to @yabel.
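A minimal sketch combining both improvements; the host, bucket, credentials and region values are placeholders:

```python
from onetl.connection import SparkS3

spark_s3 = SparkS3(
    host="s3.company.com",
    bucket="mybucket",
    access_key="access_key",
    secret_key="secret_key",
    region="us-east-1",      # passing region explicitly avoids the new runtime warning
    path_style_access=True,  # instead of extra={"path.style.access": True}
    spark=spark,             # existing SparkSession
)
```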
Dependencies
- Update JDBC connectors:
  - MySQL `9.4.0` → `9.5.0`
  - MSSQL `13.2.0` → `13.2.1`
  - Oracle `23.9.0.25.07` → `23.26.0.0.0`
  - Postgres `42.7.7` → `42.7.8`
- Added support for `Clickhouse.get_packages(package_version="0.9.3")` (#407). Versions in the range 0.8.0-0.9.2 are not supported due to issue #2625. Version 0.9.3+ is still not the default because of various compatibility and performance issues; use it at your own risk (see the sketch after this list).
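Opting into the newer driver looks roughly like this; the Spark session setup is illustrative:

```python
from pyspark.sql import SparkSession

from onetl.connection import Clickhouse

# Explicitly request driver 0.9.3 instead of the default version.
maven_packages = Clickhouse.get_packages(package_version="0.9.3")

spark = (
    SparkSession.builder.appName("onetl_clickhouse_demo")
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)
```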
Documentation
- Document using the Greenplum connector with Spark on `master=k8s`.
0.14.1 (2025-11-25)
Dependencies
Release `minio==7.2.19` led to a broken S3 connector, with errors like these:

```
TypeError: Minio.fget_object() takes 1 positional argument but 3 were given
TypeError: Minio.fput_object() takes 1 positional argument but 3 were given
```

Fixed.
Added a `minio<8.0` limit to avoid breaking things in the next major release.
0.14.0 (2025-09-08)
Features
- Add Spark 4.0 support. (#297)
- Add `Iceberg` connection support. For now this is an alpha version, and behavior may change in the future. (#378, #386)
Breaking Changes
- Drop Spark 2 support. The minimal supported Spark version is now 3.2. (#383)
  Also dropped: `Greenplum.package_spark_2_3`, `Greenplum.package_spark_2_4`.
- Update DB connectors/drivers to latest versions:
  - MongoDB `10.4.1` → `10.5.0`
  - MySQL `9.2.0` → `9.4.0`
  - MSSQL `12.8.10` → `13.2.0`
  - Oracle `23.7.0.25.01` → `23.9.0.25.07`
  - Postgres `42.7.5` → `42.7.7`
- Update Excel package name from `com.crealytics:spark-excel` to `dev.mauch:spark-excel`. (#382)
- Now the `Excel.get_packages(package_version=...)` parameter is mandatory (see the sketch after this list). (#382)
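A hedged sketch of the new call; the version numbers below are illustrative only:

```python
from onetl.connection import Excel

# package_version is now mandatory; packages are resolved as dev.mauch:spark-excel.
maven_packages = Excel.get_packages(
    spark_version="3.5.1",
    package_version="0.31.0",  # check the dev.mauch:spark-excel releases for real versions
)
```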
Improvements
- Return full file/directory path from `FileConnection.list_dir` and `FileConnection.walk` (see the sketch after this list). (#381) Previously these methods returned only file names.
- Speed up removing S3 and Samba directories with `recursive=True`. (#380)
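A minimal sketch of the new behavior, assuming an existing SFTP connection; host, credentials and paths are placeholders:

```python
from onetl.connection import SFTP

sftp = SFTP(host="sftp.company.com", user="user", password="***")

# Since 0.14.0 entries are full paths like "/remote/dir/file.csv";
# previously only bare names like "file.csv" were returned.
for entry in sftp.list_dir("/remote/dir"):
    print(entry)
```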
Bug fixes
- Treat S3 objects with names ending with a `/` slash as directory markers. (#379)
0.13.5 (2025-04-14)
Bug Fixes
0.13.0 changed the way `Greenplum.check()` is implemented: it began checking DB availability from both the Spark driver and an executor. But due to a typo, SELECT queries were emitted from all available executors. This led to opening too many connections to Greenplum, which was unexpected.
Now only one Spark executor is used to run `Greenplum.check()`.
0.13.4 (2025-03-20)
Doc only Changes
- Prefer `ReadOptions(partitionColumn=..., numPartitions=..., queryTimeout=...)` over `ReadOptions(partition_column=..., num_partitions=..., query_timeout=...)`, to match Spark documentation (see the sketch after this list). (#352)
- Prefer `WriteOptions(if_exists=...)` over `WriteOptions(mode=...)` for IDE suggestions. (#354)
- Document all options of supported file formats. (#355, #356, #357, #358, #359, #360, #361, #362)
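For example, the documented spelling now mirrors Spark's JDBC options; this sketch assumes an existing Postgres connection and an illustrative table:

```python
from onetl.connection import Postgres
from onetl.db import DBReader

reader = DBReader(
    connection=postgres,  # existing Postgres connection
    source="schema.table",
    options=Postgres.ReadOptions(
        partitionColumn="id",  # documented instead of partition_column
        numPartitions=10,      # documented instead of num_partitions
        queryTimeout=300,      # documented instead of query_timeout
    ),
)
```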
0.13.3 (2025-03-11)
Dependencies
Allow using etl-entities 2.6.0.
0.13.1 (2025-03-06)
Bug Fixes
In 0.13.0, using `DBWriter(connection=hive, target="SOMEDB.SOMETABLE")` led to executing `df.write.saveAsTable()` instead of `df.write.insertInto()` if the target table `somedb.sometable` already existed.
This was caused by table name normalization (Hive uses lower-case names), which wasn't properly handled by the method used for checking table existence. (#350)
Warning
0.13.0 release is yanked from PyPI
0.13.0 (2025-02-24)
🎉 3 years since first release 0.1.0 🎉
Warning
0.13.0 release is yanked from PyPI due to a bug. Please upgrade to 0.13.1.
Breaking Changes
- Add Python 3.13 support. (#298)
- Change the logic of `FileConnection.walk` and `FileConnection.list_dir`. (#327) Previously, `limits.stops_at(path) == True` was treated as "return the current file and stop", which could lead to exceeding some limit. Now it means "stop immediately".
- Change the default value of `FileDFWriter.Options(if_exists=...)` from `error` to `append`, to make it consistent with other `.Options()` classes within onETL (see the sketch after this list). (#343)
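To keep the previous behavior, the old default can still be set explicitly; a sketch with an illustrative connection and path:

```python
from onetl.file import FileDFWriter
from onetl.file.format import CSV

writer = FileDFWriter(
    connection=hdfs,  # existing SparkHDFS/SparkS3/SparkLocalFS connection
    format=CSV(),
    target_path="/target/dir",
    options=FileDFWriter.Options(if_exists="error"),  # restore the pre-0.13.0 default
)
```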
Features
- Add support for the `FileModifiedTimeHWM` HWM class (see etl-entities 2.5.0):

```python
from etl_entities.hwm import FileModifiedTimeHWM

from onetl.file import FileDownloader
from onetl.strategy import IncrementalStrategy

downloader = FileDownloader(
    ...,
    hwm=FileModifiedTimeHWM(name="somename"),
)

with IncrementalStrategy():
    downloader.run()
```
- Introduce the `FileSizeRange(min=..., max=...)` filter class. (#325) Now users can set `FileDownloader`/`FileMover` to download/move only files within a specific file size range:

```python
from onetl.file import FileDownloader
from onetl.file.filter import FileSizeRange

downloader = FileDownloader(
    ...,
    filters=[FileSizeRange(min="10KiB", max="1GiB")],
)
```
- Introduce the `TotalFilesSize(...)` limit class. (#326) Now users can set `FileDownloader`/`FileMover` to stop downloading/moving files after reaching a certain amount of data:

```python
from onetl.file import FileDownloader
from onetl.file.limit import TotalFilesSize

downloader = FileDownloader(
    ...,
    limits=[TotalFilesSize("1GiB")],
)
```
- Implement the `FileModifiedTime(since=..., until=...)` file filter. (#330) Now users can set `FileDownloader`/`FileMover` to download/move only files with a specific file modification time:

```python
from datetime import datetime, timedelta

from onetl.file import FileDownloader
from onetl.file.filter import FileModifiedTime

downloader = FileDownloader(
    ...,
    filters=[FileModifiedTime(until=datetime.now() - timedelta(hours=1))],
)
```
- Add `SparkS3.get_exclude_packages()` and `Kafka.get_exclude_packages()` methods. (#341) Using them allows skipping the download of dependencies that are not required by this specific connector, or that are already a part of Spark/PySpark:

```python
from pyspark.sql import SparkSession

from onetl.connection import SparkS3, Kafka

maven_packages = [
    *SparkS3.get_packages(spark_version="3.5.4"),
    *Kafka.get_packages(spark_version="3.5.4"),
]
exclude_packages = SparkS3.get_exclude_packages() + Kafka.get_exclude_packages()

spark = (
    SparkSession.builder.appName("spark_app_onetl_demo")
    .config("spark.jars.packages", ",".join(maven_packages))
    .config("spark.jars.excludes", ",".join(exclude_packages))
    .getOrCreate()
)
```
Improvements
- All DB connections opened by `JDBC.fetch(...)`, `JDBC.execute(...)` or `JDBC.check()` are immediately closed after the statement is executed. (#334)
  Previously, a Spark session with `master=local[3]` actually opened up to 5 connections to the target DB: one for `JDBC.check()`, another for the Spark driver's interaction with the DB to create tables, and one for each Spark executor. Now at most 4 connections are opened, as `JDBC.check()` does not hold an open connection. This is important for RDBMS like Postgres or Greenplum, where the number of connections is strictly limited and the limit is usually quite low.
- Set up `ApplicationName` (client info) for Clickhouse, MongoDB, MSSQL, MySQL and Oracle. (#339, #248)
  Also update the `ApplicationName` format for Greenplum, Postgres, Kafka and SparkS3. Now all connectors have the same `ApplicationName` format: `${spark.applicationId} ${spark.appName} onETL/${onetl.version} Spark/${spark.version}`.
  The only connections not sending `ApplicationName` are Teradata and the FileConnection implementations.
Now
DB.check()will test connection availability not only on Spark driver, but also from some Spark executor. (#346)This allows to fail immediately if Spark driver host has network access to target DB, but Spark executors have not.
Note
Now Greenplum.check() requires the same user grants as DBReader(connection=greenplum):
-- yes, "writable", it's not a mistake
ALTER USER username CREATEEXTTABLE(type = 'writable', protocol = 'gpfdist');
-- for both reading and writing to GP
-- ALTER USER username CREATEEXTTABLE(type = 'readable', protocol = 'gpfdist') CREATEEXTTABLE(type = 'writable', protocol = 'gpfdist');Please ask your Greenplum administrators to provide these grants.
Bug Fixes
- Avoid suppressing Hive Metastore errors while using `DBWriter`. (#329) Previously this was implemented as:

```python
try:
    spark.sql(f"SELECT * FROM {table}")
    table_exists = True
except Exception:
    table_exists = False
```

If the Hive Metastore was overloaded and responded with an exception, the table was considered non-existing, resulting in a full table override instead of appending or overriding only a subset of partitions.
- Fix using onETL to write data to PostgreSQL or Greenplum instances behind pgbouncer with `pool_mode=transaction`. (#336)
  Previously, `Postgres.check()` opened a read-only transaction, pgbouncer changed the entire connection type from read-write to read-only, and `DBWriter.run(df)` then executed in a read-only connection, producing errors like:

```
org.postgresql.util.PSQLException: ERROR: cannot execute INSERT in a read-only transaction
org.postgresql.util.PSQLException: ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
```

Added a workaround by passing `readOnly=True` to JDBC params for read-only connections, so pgbouncer can distinguish read-only and read-write connections properly.
After upgrading to onETL 0.13.x or higher, the same error may still appear if pgbouncer still holds read-only connections and returns them to DBWriter. To fix this, users can manually convert a read-only connection to read-write:

```python
postgres.execute("BEGIN READ WRITE;")  # <-- add this line
DBWriter(...).run()
```

After all connections in the pgbouncer pool have been converted from read-only to read-write and the error is gone, this additional line can be removed.
- Fix `MSSQL.fetch(...)` and `MySQL.fetch(...)` opening a read-write connection instead of a read-only one. (#337)
  Now this is fixed:
  - `MSSQL.fetch(...)` establishes a connection with `ApplicationIntent=ReadOnly`.
  - `MySQL.fetch(...)` calls the `SET SESSION TRANSACTION READ ONLY` statement.
- Fixed passing multiple filters to `FileDownloader` and `FileMover`. (#338) It was caused by sorting the filters list in an internal logging method, while `FileFilter` subclasses are not sortable.
- Fix a false warning about a lot of parallel connections to Greenplum. (#342)
  Creating a Spark session with `.master("local[5]")` may open up to 6 connections to Greenplum (= number of Spark executors + 1 for the driver), but onETL instead used the number of CPU cores on the host as the number of parallel connections. This led to showing a false warning that the number of Greenplum connections is too high, which should actually only be the case if the number of executors is higher than 30.
- Fix MongoDB trying to use the current database name as `authSource`. (#347) Use the default connector value, which is the `admin` database. Previous onETL versions could be fixed by:

```python
from onetl.connection import MongoDB

mongodb = MongoDB(
    ...,
    database="mydb",
    extra={
        "authSource": "admin",
    },
)
```
Dependencies
- Update DB connectors/drivers to latest versions: (#345)
  - Clickhouse `0.6.5` → `0.7.2`
  - MongoDB `10.4.0` → `10.4.1`
  - MySQL `9.0.0` → `9.2.0`
  - Oracle `23.5.0.24.07` → `23.7.0.25.01`
  - Postgres `42.7.4` → `42.7.5`
Doc only Changes
- Split large code examples to tabs. (#344)
0.12.5 (2024-12-03)
Improvements
- Use `sipHash64` instead of `md5` in Clickhouse for reading data with `{"partitioning_mode": "hash"}`, as it is 5 times faster (see the sketch after this list).
- Use `hashtext` instead of `md5` in Postgres for reading data with `{"partitioning_mode": "hash"}`, as it is 3-5 times faster.
- Use `BINARY_CHECKSUM` instead of `HASHBYTES` in MSSQL for reading data with `{"partitioning_mode": "hash"}`, as it is 5 times faster.
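These improvements apply to reads configured with hash partitioning, roughly like this; the connection object, table and column names are placeholders:

```python
from onetl.connection import Clickhouse
from onetl.db import DBReader

reader = DBReader(
    connection=clickhouse,  # existing Clickhouse connection
    source="schema.table",
    options=Clickhouse.ReadOptions(
        partitioning_mode="hash",  # rows are now distributed via sipHash64 instead of md5
        partition_column="user_id",
        num_partitions=10,
    ),
)
df = reader.run()
```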
Bug fixes
- In JDBC sources, wrap `MOD(partitionColumn, numPartitions)` with `ABS(...)` to make all returned values positive. This prevents data skew.
- Fix reading table data from MSSQL using `{"partitioning_mode": "hash"}` with a `partitionColumn` of integer type.
- Fix reading table data from Postgres using `{"partitioning_mode": "hash"}` leading to data skew (all the data was read into one Spark partition).
0.12.4 (2024-11-27)
Bug Fixes
- Fix `DBReader(conn=oracle, options={"partitioning_mode": "hash"})` leading to data skew in the last partition due to wrong `ora_hash` usage. (#319)