# ParquetLoader

<a href="https://github.com/Keunyoung-Jung/ParquetLoader/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/Keunyoung-Jung/ParquetLoader.svg" /></a>
<a href="https://github.com/Keunyoung-Jung/ParquetLoader/issues"><img alt="Issues" src="https://img.shields.io/github/issues/Keunyoung-Jung/ParquetLoader"/></a>

Load and read Parquet files from Minio & S3 or local storage.
This repository helps you read Parquet files when training a model or doing analysis on big data.

## 1. Installation
### 1.1. Install from pip
```shell
pip install parquet-loader
```
### 1.2. Install from source
```shell
git clone https://github.com/Keunyoung-Jung/ParquetLoader
cd ParquetLoader
pip install -e .
```
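Either way, a quick import confirms the installation (the module name matches the imports used in the examples below):
```python
# Minimal post-install check: the import should succeed without errors.
import ParquetLoader
print('ParquetLoader imported successfully')
```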
## 2. Introduction
**ParquetLoader** helps you read large Parquet files.
**ParquetLoader** is built on top of pandas and fastparquet, which helps in situations where a Spark cluster is not available.

It loads data into memory chunk by chunk, based on the configured chunk size,
and returns each chunk as a pandas DataFrame.
## 3. Quick Start
### 3.1. Local Path
If your files are located on the `local` filesystem, you can load the data this way.
```python
from ParquetLoader import DataLoader

dl = DataLoader(
    folder='parquet_data',
    shuffle=False
    )
for df in dl:
    print(df.head())
```
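Each chunk is an ordinary pandas DataFrame, so regular pandas operations apply per chunk. If the whole dataset fits in memory, you can also stitch the chunks back together; a minimal sketch, using the same placeholder folder as above:
```python
import pandas as pd
from ParquetLoader import DataLoader

dl = DataLoader(folder='parquet_data', shuffle=False)

# Collecting every chunk only makes sense when the full dataset
# fits in memory; otherwise, process each chunk inside the loop.
full_df = pd.concat(list(dl), ignore_index=True)
print(full_df.shape)
```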
### 3.2. S3 Path
If your files are located in `S3` or `Minio`, you have to set the following
environment variables.
```shell
export AWS_ACCESS_KEY_ID=my-access-key
export AWS_SECRET_ACCESS_KEY=my-secret-key
export AWS_DEFAULT_REGION=ap-northeast-2
```
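If you would rather configure credentials from Python (for example in a notebook), setting the same variables with the standard library before constructing the loader should work as well, since they only need to be present in the process environment:
```python
import os

# Same variables as the shell exports above, set before creating the loader.
os.environ['AWS_ACCESS_KEY_ID'] = 'my-access-key'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'my-secret-key'
os.environ['AWS_DEFAULT_REGION'] = 'ap-northeast-2'
```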
Once the variables are set, you can load data this way.
```python
from ParquetLoader import S3Loader

sl = S3Loader(
    bucket='mysterico-feature-store',
    folder='mongo-sentence-token-feature',
    depth=2)

for df in sl:
    print(df.head())
```

## 4. Parameters
`ParquetLoader` exposes parameters that control how data is read.
The only difference between `S3Loader` and `DataLoader` is that `S3Loader` takes `bucket` in place of the `root_path` parameter.
```python
dl = DataLoader(
    chunk_size : int = 100000,
    root_path : str = '.', # S3Loader uses "bucket" instead
    folder : str = 'data',
    shuffle : bool = True,
    random_seed : int = int((time() - int(time()))*100000),
    columns : list = None,
    depth : int = 0,
    std_out : bool = True
    )
```
* `chunk_size`
    * default : 100,000 rows
    * Controls the number of rows loaded into memory at a time when reading data.
* `root_path` or `bucket`
    * default : current path
    * Specifies the project path or datastore path.
* `folder`
    * default : "data"
    * Specifies the folder in which the Parquet files are gathered.
* `shuffle`
    * default : True
    * Whether to shuffle the data.
* `random_seed`
    * default : `int((time() - int(time()))*100000)`
    * Pass a fixed seed to make the shuffle order reproducible (see the sketch after this list).
* `columns`
    * default : None
    * Lets you select which columns to read.
* `depth`
    * default : 0
    * Used when the Parquet files in the folder are partitioned into nested subfolders.
* `std_out`
    * default : True
    * Set to False to suppress console output.
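For instance, a fixed seed makes two runs over the same folder reproducible; a minimal sketch, with `parquet_data` as the placeholder folder from the quick start:
```python
from ParquetLoader import DataLoader

SEED = 42  # any fixed integer

# Both loaders shuffle with the same seed, so they should yield
# their chunks in the same order across runs.
dl_a = DataLoader(folder='parquet_data', shuffle=True, random_seed=SEED)
dl_b = DataLoader(folder='parquet_data', shuffle=True, random_seed=SEED)
```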

### 4.1. Select Columns
The `columns` parameter takes a list of column names.
```python
dl = DataLoader(
    folder='parquet_data',
    columns=['name','age','gender']
    )
```
### 4.2. Setting depth
Use `depth` if your Parquet files are partitioned into nested folders.
**Example**
```
📦 data
 ┣ 📦 Year=2020
 ┃ ┣ 📜 part-0000-example.snappy.parquet
 ┃ ┗ 📜 part-0001-example.snappy.parquet
 ┗ 📦 Year=2021
 ┃ ┣ 📜 part-0002-example.snappy.parquet
 ┃ ┗ 📜 part-0003-example.snappy.parquet
```
The data path in this example has a depth of `1`.
```python
dl = DataLoader(
    folder='parquet_data',
    depth=1
    )
```
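If you are unsure how deep the partitioning goes, a small standard-library sketch (assuming local files and the placeholder folder above) can count the folder levels for you:
```python
import glob
import os

# Folder levels between the data folder and the parquet files equal
# the value to pass as `depth` (1 for the Year=YYYY layout above).
root = 'parquet_data'
files = glob.glob(os.path.join(root, '**', '*.parquet'), recursive=True)
depths = {os.path.relpath(f, root).count(os.sep) for f in files}
print(depths)  # e.g. {1}
```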

## 5. Get Metadata
A `DataLoader` object can also report metadata about your Parquet files.
```python
print(data_loader.schema)  # get data schema
print(data_loader.columns) # get data columns
print(data_loader.count)   # get total row count
print(data_loader.info)    # get metadata information
```
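`count` combined with the chunk size gives a rough progress estimate while iterating. A minimal sketch, assuming a loader built as in the quick start (the chunk size is restated locally rather than read back from the loader, since that attribute isn't documented):
```python
import math
from ParquetLoader import DataLoader

chunk_size = 100000
dl = DataLoader(folder='parquet_data', chunk_size=chunk_size)

# Total chunks = ceil(total rows / rows per chunk).
total_chunks = math.ceil(dl.count / chunk_size)
for i, df in enumerate(dl, start=1):
    print(f'chunk {i}/{total_chunks}: {len(df)} rows')
```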