
Performance issue of azure.datalake.store.core.AzureDLFile.write() for Cosmos #319

Open
@zjw0304

Description


I ran a benchmark comparing the performance of uploading data to Cosmos via AzureDLFile.write() against the write API of the native libhdfs.so. The results show a significant gap: writing the same amount of data to Cosmos takes more than twice as long with azure-datalake-store as with HDFS. I also checked the network throughput: with HDFS we can push it to about 4 Gbps, while with ADL it only reaches about 1.3 Gbps.

In my testing, I used multiple threads to write the data; each thread creates its own file and writes data into it. I tried increasing both the thread count and the buffer size, but neither helped improve the performance. A rough sketch of the setup is below.
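For reference, a minimal sketch of the kind of benchmark described above, assuming service-principal authentication; the tenant/client credentials, store name, paths, chunk size, and thread count are all placeholder values, not the actual test configuration:

```python
# Minimal sketch of a multithreaded write benchmark; all credentials,
# names, and sizes below are placeholders, not the real test setup.
import time
from concurrent.futures import ThreadPoolExecutor

from azure.datalake.store import core, lib

# Hypothetical service-principal credentials and store name.
token = lib.auth(tenant_id='TENANT_ID',
                 client_id='CLIENT_ID',
                 client_secret='CLIENT_SECRET')
adl = core.AzureDLFileSystem(token, store_name='STORE_NAME')

payload = b'x' * (4 * 2**20)   # 4 MB per write() call (arbitrary)
chunks_per_file = 256          # ~1 GB per file (arbitrary)

def write_one_file(i):
    # Each thread writes its own file, as in the benchmark above.
    with adl.open('/bench/file_%d.bin' % i, mode='wb') as f:
        for _ in range(chunks_per_file):
            f.write(payload)

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(write_one_file, range(8)))
print('elapsed: %.1fs' % (time.time() - start))
```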

My questions are:

  1. Is this performance gap expected, given that Azure Data Lake Store is based on a REST API?
  2. Is there any advanced API or parameter I can try to improve the throughput (see the sketch after this list)? For my scenario, we have to use the streaming write API to upload the data.
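One candidate parameter is the blocksize argument of AzureDLFileSystem.open(), which controls how much data AzureDLFile buffers before each flush to the service, so a larger value means fewer, larger REST calls. A sketch, with 64 MB chosen arbitrarily for illustration (adl is the filesystem object from the benchmark sketch above):

```python
# blocksize sets the write buffer AzureDLFile fills before flushing;
# 64 MB here is an arbitrary illustrative value, not a recommendation.
with adl.open('/bench/tuned.bin', mode='wb', blocksize=64 * 2**20) as f:
    f.write(b'x' * (4 * 2**20))
```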

Environment summary

SDK Version (pip show azure-datalake-store): the latest.

Python Version: 3.6.9, 64-bit.

OS Version: Ubuntu 18.04.
