Skip to content

Rust Engine write_deltalake Schema #2200

@tastyfrankfurt

Description

@tastyfrankfurt

Environment

Ubuntu 22.04, python3, reading events from Azure eventhubs and writing out to Azure blob storage.
Delta-rs version:
pip list returns version 0.15.3 for delta lake

Binding:

Environment:

  • loud provider: Azure
  • OS: Ubuntu
  • Other: python

Bug

What happened:
When using schema option on the write_deltalake function with the rust engine it fails to write out to delta lake on azure storage.

If i change to pyarrow engine then it works as expected.

What you expected to happen:
When using schema option on the write_deltalake function with the rust engine it writes out to delta lake on azure storage.

How to reproduce it:
Use the code sample as stated below,

Apologies for my code snippet if there are any errors.

import pyarrow as pa
from azure.storage.blob import ContainerClient
from deltalake import write_deltalake,WriterProperties,DeltaTable,Schema,Field
from deltalake.schema import PrimitiveType

events = [{"appname":"AnthraX","geoip":{"long": 123,"lat":456},"facility":"ftp","hostname":"we.eus","message":"You\'re not gonna believe what just happened","msgid":"ID675","procid":5599,"severity":"notice","timestamp":"2024-02-19T12:01:40.777Z","version":1},
{"appname":"AnthraX","geoip":{"long": 123,"lat":456},"facility":"ntp","hostname":"random.rent","message":"You\'re not gonna believe what just happened","msgid":"ID898","procid":1335,"severity":"notice","timestamp":"2024-02-19T12:01:41.777Z","version":2},
{"appname":"shaneIxD","geoip":{"long": 123,"lat":456},"facility":"authpriv","hostname":"we.frogans","message":"There\'s a breach in the warp core, captain","msgid":"ID356","procid":3449,"severity":"warning","timestamp":"2024-02-19T12:01:42.778Z","version":2},
{"appname":"jesseddy","geoip":{"long": 123,"lat":456},"facility":"user","hostname":"for.is","message":"We\'re gonna need a bigger boat","msgid":"ID516","procid":445,"severity":"notice","timestamp":"2024-02-19T12:01:43.777Z","version":2},
{"appname":"shaneIxD","geoip":{"long": 123,"lat":456},"facility":"audit","hostname":"make.dad","message":"Great Scott! We\'re never gonna reach 88 mph with the flux capacitor in its current state!","msgid":"ID304","procid":2692,"severity":"err","timestamp":"2024-02-19T12:01:44.777Z","version":2},
{"appname":"b0rnc0nfused","geoip":{"long": 123,"lat":456},"facility":"local1","hostname":"up.realtor","message":"Maybe we just shouldn\'t use computers","msgid":"ID952","procid":8756,"severity":"emerg","timestamp":"2024-02-19T12:01:45.777Z","version":2},
{"appname":"KarimMove","geoip":{"long": 123,"lat":456},"facility":"local4","hostname":"some.store","message":"Maybe we just shouldn\'t use computers","msgid":"ID415","procid":9398,"severity":"crit","timestamp":"2024-02-19T12:01:46.778Z","version":2},
{"appname":"ahmadajmi","geoip":{"long": 123,"lat":456},"facility":"local1","hostname":"random.tushu","message":"Pretty pretty pretty good","msgid":"ID127","procid":8555,"severity":"info","timestamp":"2024-02-19T12:01:47.777Z","version":1},
{"appname":"b0rnc0nfused","geoip":{"long": 123,"lat":456},"facility":"alert","hostname":"names.tips","message":"There\'s a breach in the warp core, captain","msgid":"ID129","procid":5473,"severity":"debug","timestamp":"2024-02-19T12:01:48.777Z","version":2}
]
schema_json = '''{
    "type": "struct",
    "fields": [
            {"name":"timestamp","type":"datetime","nullable":false,"metadata":{}},
            {"name":"appname","type":"string","nullable":true,"metadata": {}},
            {"name":"facility","type":"string","nullable":true,"metadata": {}},
            {"name":"hostname","type":"string","nullable":true,"metadata": {}},
            {"name":"message","type":"string","nullable":true,"metadata": {}},
            {"name":"procid","type":"string","nullable":true,"metadata": {}},
            {"name":"msgid","type":"string","nullable":true,"metadata": {}},
            {"name":"severity","type":"string","nullable":true,"metadata": {}}
        ]
    }'''
schema = Schema.from_json(schema_json )
data = pa.Table.from_pylist(events)
delta_path = 'abfss://container@account_name.dfs.core.windows.net/tablepath
delta_partition = ["severity"]
deltalake_writeprops = WriterProperties(compression='LZ4')
delta_storage_options = {
            "AZURE_STORAGE_ACCOUNT_NAME": mystorageaccountname,
            "AZURE_STORAGE_ACCOUNT_KEY": mysupersecretkey
        }
write_deltalake(
            delta_path,
            data,
            partition_by=delta_partition,
            storage_options=delta_storage_options,
            writer_properties=deltalake_writeprops,
            schema=schema,
            overwrite_schema=True,
            engine="rust",
            mode="append")

More details:
BACKTRACEFULL.txt

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions