This repository transfers Parquet files stored on HDFS to another server over the network using Apache Arrow Flight. It includes server and client components that handle data transfer, embedding generation, and retrieval of the generated embeddings.
- Apache Arrow Integration: Transfers data as Arrow tables, utilizing efficient in-memory data representation.
- RPC Communication: Uses `PutRPC` to send data from the client to the server, followed by embedding generation on the server. `GetRPC` enables the client to retrieve processed embeddings after generation.
- Embedding Generation: Embedding generation starts on the server after receiving data via `PutRPC`.
- `server/`: Contains code to receive data from the client, read Parquet files as Arrow tables, and perform embedding generation after data transfer.
- `client/`: Initiates `PutRPC` to transfer data to the server and executes `GetRPC` to retrieve embeddings once generation is complete on the server.
- Client Setup
  - Navigate to `client/`.
  - Run the client to send data to the server using `PutRPC`.
- Server Setup
  - Navigate to `server/`.
  - The server receives the data, performs embedding generation, and makes processed data available for retrieval.
- Data Retrieval
  - The client executes `GetRPC` to retrieve the generated embeddings from the server.
- Apache Arrow Flight
- HDFS
- Parquet libraries