Transferring a dataframe through gRPC efficiently
- grpcio
- grpcio-tools
- flask
- requests
- numpy
- pandas
- ujson
- orjson
- datatable
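
All of the above can be installed with pip, e.g. `pip install grpcio grpcio-tools flask requests numpy pandas ujson orjson datatable` (assuming a Python 3 environment).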
- run `gen.py` to generate 2K, 2M and 200M files into the `data/` folder
- run `python3 -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. df.proto` to generate the gRPC Python code
- run `server.py` to start a gRPC server
- run `client.py` to get the results (a sketch of the server and client follows this list)
- run `flask_server.py` and `flask_client.py`
- run `json_server.py` and `json_client.py`
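
Since the contents of `df.proto` are not reproduced here, the following is only a minimal sketch of what the chunked streaming server and client might look like. The service name `DfServer`, the method `GetDf`, the messages `DfRequest`/`DataChunk`, the `payload` field, the port and the data file path are all assumptions; the generated `df_pb2`/`df_pb2_grpc` modules from the real `df.proto` may define different names.

```python
# A minimal sketch, assuming df.proto defines a server-streaming RPC such as:
#   service DfServer { rpc GetDf(DfRequest) returns (stream DataChunk); }
# The names DfServer, GetDf, DfRequest, DataChunk and payload are hypothetical.
import io
from concurrent import futures

import grpc
import pandas as pd

import df_pb2
import df_pb2_grpc

CHUNK_SIZE = 1024 * 1024  # 1 MB per message; illustrative, not the repo's value


class DfServicer(df_pb2_grpc.DfServerServicer):
    def GetDf(self, request, context):
        # Chunked strategy: encode the whole dataframe once,
        # then stream the bytes in fixed-size pieces.
        df = pd.read_csv("data/2M.csv")  # hypothetical path under data/
        encoded = df.to_csv(index=False).encode()
        for i in range(0, len(encoded), CHUNK_SIZE):
            yield df_pb2.DataChunk(payload=encoded[i:i + CHUNK_SIZE])


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    df_pb2_grpc.add_DfServerServicer_to_server(DfServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()


def fetch_df():
    # Client side: reassemble the streamed chunks, then decode once.
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = df_pb2_grpc.DfServerStub(channel)
        buf = b"".join(chunk.payload for chunk in stub.GetDf(df_pb2.DfRequest()))
    return pd.read_csv(io.BytesIO(buf))
```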
There are two kinds of transfer strategies (a sketch of both follows this list):
- Split the whole dataframe into small pieces, row by row.
  - Encode each small piece into a data piece and transfer them.
  - Process: pd.DataFrame --> row-by-row data --> encoded row data --> transfer --> encoded row data --> row-by-row data --> whole data
- Encode the whole data using one encoding strategy.
  - Split the whole encoded data into small data chunks and transfer them.
  - Process: pd.DataFrame --> encoded data --> chunked data --> transfer --> chunked data --> encoded whole data --> whole data
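
A minimal sketch of the two encode paths, assuming orjson for the row-by-row strategy and pandas CSV for the chunked one; the function names and the chunk size are illustrative:

```python
# A minimal sketch of the two strategies; names and chunk size are illustrative.
import orjson
import pandas as pd

CHUNK_SIZE = 1024 * 1024  # 1 MB


def row_by_row(df: pd.DataFrame):
    # Strategy 1: split into rows first, then encode every row separately,
    # so each yielded message is independently decodable.
    for row in df.to_dict(orient="records"):
        yield orjson.dumps(row)


def chunked(df: pd.DataFrame):
    # Strategy 2: encode the whole dataframe once, then split the bytes;
    # the receiver must concatenate all chunks before decoding.
    encoded = df.to_csv(index=False).encode()
    for i in range(0, len(encoded), CHUNK_SIZE):
        yield encoded[i:i + CHUNK_SIZE]
```

The key difference is where the encoding cost is paid: once per message in the first case, once up front in the second.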
We also try several different encoding packages. You can change these implementations in the `client.py` file.
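
For instance, the encoders could be organized as a table of encode/decode pairs; the hook names below are hypothetical, and the actual structure of `client.py` may differ:

```python
# Hypothetical encoder/decoder table; the real hooks in client.py may differ.
import io

import datatable as dt
import orjson
import pandas as pd
import ujson

ENCODERS = {
    "json":   lambda df: ujson.dumps(df.to_dict(orient="records")).encode(),
    "orjson": lambda df: orjson.dumps(df.to_dict(orient="records")),
    "csv":    lambda df: df.to_csv(index=False).encode(),
    "dtcsv":  lambda df: dt.Frame(df).to_csv().encode(),  # datatable's fast CSV writer
}

DECODERS = {
    "json":   lambda raw: pd.DataFrame(ujson.loads(raw.decode())),
    "orjson": lambda raw: pd.DataFrame(orjson.loads(raw)),
    "csv":    lambda raw: pd.read_csv(io.BytesIO(raw)),
    "dtcsv":  lambda raw: dt.fread(text=raw.decode()).to_pandas(),
}
```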
As before, every method is run 5 times and the results averaged (3 runs for the 200M file). Only the charts of the average total elapsed time (total_mu_t) are shown here; the detailed numbers can be found in the attached .xlsx file. A minimal timing sketch follows below.
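
A minimal sketch of that averaging, assuming a callable that performs one full transfer (the names are illustrative):

```python
# A minimal sketch of the averaging; total_mu_t mirrors the metric's name.
import statistics
import time


def total_mu_t(transfer_fn, runs=5):
    """Run one full transfer `runs` times and return the mean total time."""
    totals = []
    for _ in range(runs):
        start = time.perf_counter()
        transfer_fn()  # e.g. fetch_df() from the sketch above; use runs=3 for 200M
        totals.append(time.perf_counter() - start)
    return statistics.mean(totals)
```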
- orjson performs outstandingly, taking only 20% of the csv time, or 50% of the json time.
- As the number of messages grows, the chunked methods start to outperform row-by-row, and dtCSV catches up with orjson.
- Ultimately the chunked methods beat row-by-row across the board; dtCSV now does better than orjson, and chunked dtCSV becomes the fastest method.