# Parquet File Load and Read from MinIO, S3, or Local Storage
This repository helps you read Parquet files when training models or analyzing big data.
Install from PyPI:
```shell
pip install parquet-loader
```
Or install from source:
```shell
git clone https://github.com/Keunyoung-Jung/ParquetLoader
cd ParquetLoader
pip install -e .
```
ParquetLoader helps you read large Parquet files.
It is built on top of pandas and fastparquet, which makes it useful in situations where a Spark cluster is not available.
It loads data into memory in chunks of a configurable size and returns each chunk as a pandas DataFrame.
If your files are stored locally, you can load the data this way:
```python
from ParquetLoader import DataLoader

dl = DataLoader(
    folder='parquet_data',
    shuffle=False
)

for df in dl:
    print(df.head())
```
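Because each iteration yields an ordinary pandas DataFrame, the loader pairs naturally with incremental (out-of-core) training. Below is a minimal sketch, assuming a dataset with a `label` column and the placeholder feature columns shown; it uses scikit-learn's `partial_fit`, which is not a dependency of this repository:
```python
from ParquetLoader import DataLoader
from sklearn.linear_model import SGDClassifier

# Hypothetical column names -- adjust to your own schema.
FEATURES = ['age', 'height', 'weight']
LABEL = 'label'

model = SGDClassifier()
dl = DataLoader(folder='parquet_data', shuffle=True)

for df in dl:
    # Each chunk is a plain pandas DataFrame, so the model
    # can be updated chunk by chunk without loading the
    # whole dataset into memory.
    model.partial_fit(df[FEATURES], df[LABEL], classes=[0, 1])
```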
If your files are stored in S3 or MinIO, you have to set the following environment variables:
```shell
export AWS_ACCESS_KEY_ID=my-access-key
export AWS_SECRET_ACCESS_KEY=my-secret-key
export AWS_DEFAULT_REGION=ap-northeast-2
```
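Alternatively, you can set the same variables from Python before constructing the loader (a sketch assuming the loader reads the standard AWS environment variables at runtime):
```python
import os

# Same variables as the shell exports above; values are placeholders.
os.environ['AWS_ACCESS_KEY_ID'] = 'my-access-key'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'my-secret-key'
os.environ['AWS_DEFAULT_REGION'] = 'ap-northeast-2'
```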
Once these are set, you can load data like this:
```python
from ParquetLoader import S3Loader

sl = S3Loader(
    bucket='mysterico-feature-store',
    folder='mongo-sentence-token-feature',
    depth=2
)

for df in sl:
    print(df.head())
```
ParquetLoader's reading behavior can be controlled with the parameters below. The only difference between `S3Loader` and `DataLoader` is that `S3Loader` takes a `bucket` parameter in place of `root_path`. A combined example follows the parameter descriptions.
```python
dl = DataLoader(
    chunk_size: int = 100000,
    root_path: str = '.',  # S3Loader uses "bucket" instead
    folder: str = 'data',
    shuffle: bool = True,
    random_seed: int = int((time() - int(time())) * 100000),
    columns: list = None,
    depth: int = 0,
    std_out: bool = True
)
```
`chunk_size`
- default: 100,000 rows
- Controls the number of rows loaded into memory per chunk when reading data.

`root_path` or `bucket`
- default: current path
- Specifies the project path or datastore path.

`folder`
- default: `"data"`
- Specifies the folder that actually contains the Parquet files.

`shuffle`
- default: `True`
- Whether to shuffle the data.

`random_seed`
- default: `int((time() - int(time())) * 100000)`
- Pass a fixed seed to make the shuffled order reproducible.

`columns`
- default: `None`
- Selects which columns to read.

`depth`
- default: `0`
- Used when the Parquet files in the folder are partitioned into nested subfolders.

`std_out`
- default: `True`
- Set to `False` to suppress log output.
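Putting several of these parameters together, here is a small sketch (the folder name and column names are placeholders) that reads only two columns in 50,000-row chunks with a reproducible shuffle order:
```python
from ParquetLoader import DataLoader

dl = DataLoader(
    folder='parquet_data',    # placeholder folder name
    chunk_size=50000,         # load 50,000 rows per chunk
    shuffle=True,
    random_seed=42,           # fixed seed -> reproducible chunk order
    columns=['name', 'age'],  # read only these columns
    std_out=False             # silence progress output
)

total_rows = 0
for df in dl:
    total_rows += len(df)
print(total_rows)
```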
The `columns` parameter takes a list:
```python
dl = DataLoader(
    folder='parquet_data',
    columns=['name', 'age', 'gender']
)
```
Use `depth` if your Parquet files are partitioned and the directory tree is nested.
Example:
```
📦 data
 ┣ 📦 Year=2020
 ┃ ┣ 📜 part-0000-example.snappy.parquet
 ┃ ┗ 📜 part-0001-example.snappy.parquet
 ┗ 📦 Year=2021
   ┣ 📜 part-0002-example.snappy.parquet
   ┗ 📜 part-0003-example.snappy.parquet
```
The data path in this example has a depth of 1:
```python
dl = DataLoader(
    folder='parquet_data',
    depth=1
)
```
`DataLoader` objects also expose metadata about your Parquet files:
```python
print(data_loader.schema)   # get data schema
print(data_loader.columns)  # get data columns
print(data_loader.count)    # get total row count
print(data_loader.info)     # get metadata information
```
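These properties can be used to inspect a dataset before iterating over it. A small sketch, assuming `data_loader` was constructed as in the examples above and that `columns` returns a list of column names:
```python
from ParquetLoader import DataLoader

data_loader = DataLoader(folder='parquet_data', std_out=False)

# Size up the work before reading any chunks.
print(f'{data_loader.count} rows, columns: {data_loader.columns}')

# Only iterate if the column we need is actually present.
if 'age' in data_loader.columns:
    for df in data_loader:
        print(df['age'].mean())
```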