
Commit 1303a93

πŸ“ Add : Add Readme
1 parent 46b0345 commit 1303a93

File tree

1 file changed: +132 βˆ’1 lines changed

README.md
# ParquetLoader
<a href="https://github.com/Keunyoung-Jung/ParquetLoader/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/Keunyoung-Jung/ParquetLoader.svg" /></a>
<a href="https://github.com/Keunyoung-Jung/ParquetLoader/issues"><img alt="Issues" src="https://img.shields.io/github/issues/Keunyoung-Jung/ParquetLoader"/></a>

Parquet file Load and Read from minio & S3 or Local
This repository helps you read parquet files when training a model or doing analysis on big data.

## 1. Installation
### 1.1. Install from pip
```shell
pip install parquet-loader
```
### 1.2. Install from source
```shell
git clone https://github.com/Keunyoung-Jung/ParquetLoader
cd ParquetLoader
pip install -e .
```
18+
## 2. Introduction
**ParquetLoader** helps you read large parquet files.
**ParquetLoader** is built on top of pandas and fastparquet, which helps in situations where a Spark cluster is not available.

It loads data into memory chunk by chunk, based on a configurable chunk size, and returns each chunk as a pandas DataFrame.
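Conceptually, the chunked loading works like the sketch below: rows are accumulated from many files until the chunk size is reached, then handed back as one DataFrame. This is a simplified illustration, not the library's actual internals; `iter_chunks` is a hypothetical helper.

```python
import pandas as pd

def iter_chunks(frames, chunk_size):
    """Yield DataFrames of at most chunk_size rows from an iterable of frames."""
    buffer = []
    buffered = 0
    for frame in frames:
        buffer.append(frame)
        buffered += len(frame)
        while buffered >= chunk_size:
            # Enough rows buffered: emit exactly one chunk, keep the remainder.
            merged = pd.concat(buffer, ignore_index=True)
            yield merged.iloc[:chunk_size]
            rest = merged.iloc[chunk_size:]
            buffer, buffered = [rest], len(rest)
    if buffered:
        # Emit whatever is left as a final, smaller chunk.
        yield pd.concat(buffer, ignore_index=True)

# Three small frames (9 rows total) re-chunked into batches of 4 rows.
frames = [pd.DataFrame({'x': range(i, i + 3)}) for i in (0, 3, 6)]
sizes = [len(chunk) for chunk in iter_chunks(frames, chunk_size=4)]  # [4, 4, 1]
```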
## 3. Quick Start
### 3.1. Local Path
If your files are stored locally, you can load the data this way.
```python
from ParquetLoader import DataLoader

dl = DataLoader(
    folder='parquet_data',
    shuffle=False
)
for df in dl:
    print(df.head())
```
### 3.2. S3 Path
If your files are stored in `S3` or `Minio`, you have to set the following environment variables.
```shell
export AWS_ACCESS_KEY_ID=my-access-key
export AWS_SECRET_ACCESS_KEY=my-secret-key
export AWS_DEFAULT_REGION=ap-northeast-2
```
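If you prefer, the same credentials can be set from Python before constructing the loader; this is an equivalent alternative using only the standard library, with placeholder values.

```python
import os

# Equivalent to the shell exports above; replace the values with your own.
os.environ['AWS_ACCESS_KEY_ID'] = 'my-access-key'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'my-secret-key'
os.environ['AWS_DEFAULT_REGION'] = 'ap-northeast-2'
```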
When you have finished setting them, you can load data this way.
```python
from ParquetLoader import S3Loader

sl = S3Loader(
    bucket='mysterico-feature-store',
    folder='mongo-sentence-token-feature',
    depth=2
)
for df in sl:
    print(df.head())
```

## 4. Parameters
`ParquetLoader` controls how data is read via the following parameters.
The only difference between `S3Loader` and `DataLoader` is the `root_path` parameter.
```python
dl = DataLoader(
    chunk_size : int = 100000,
    root_path : str = '.',   # S3Loader uses "bucket" instead
    folder : str = 'data',
    shuffle : bool = True,
    random_seed : int = int((time() - int(time()))*100000),
    columns : list = None,
    depth : int = 0,
    std_out : bool = True
)
```
* `chunk_size`
    * default : 100,000 rows
    * Controls the number of rows loaded into memory at a time when reading data.
* `root_path` or `bucket`
    * default : current path
    * Specifies the project path or datastore path.
* `folder`
    * default : "data"
    * Specifies the folder in which the parquet files are gathered.
* `shuffle`
    * default : True
    * Whether to shuffle the data.
* `random_seed`
    * default : `int((time() - int(time()))*100000)`
    * You can fix the order of the shuffled data by giving a fixed random seed.
* `columns`
    * default : None
    * Lets you select which columns to read.
* `depth`
    * default : 0
    * Used when the parquet files in the folder are partitioned and have directory depth.
* `std_out`
    * default : True
    * Set to False to turn off log output.

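The default `random_seed` expression takes the fractional part of the current Unix timestamp and scales it to an integer; this snippet just evaluates the same expression shown in the parameter list above.

```python
from time import time

# Fractional part of the timestamp, scaled to an integer in [0, 100000).
# Passing your own fixed integer instead makes shuffling reproducible.
seed = int((time() - int(time())) * 100000)
```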
### 4.1. Select Columns
The `columns` parameter takes a list.
```python
dl = DataLoader(
    folder='parquet_data',
    columns=['name','age','gender']
)
```
### 4.2. Setting depth
Use this if your parquet files are partitioned and have directory depth.
**Example**
```
πŸ“¦ data
 ┣ πŸ“¦ Year=2020
 ┃ ┣ πŸ“œ part-0000-example.snappy.parquet
 ┃ β”— πŸ“œ part-0001-example.snappy.parquet
 β”— πŸ“¦ Year=2021
   ┣ πŸ“œ part-0002-example.snappy.parquet
   β”— πŸ“œ part-0003-example.snappy.parquet
```
The data path in this example has a depth of `1`.
```python
dl = DataLoader(
    folder='parquet_data',
    depth=1
)
```

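In practice, each unit of depth corresponds to one extra directory level between the root folder and the parquet files. The helper below is a hypothetical illustration of that idea, not the library's API.

```python
def parquet_pattern(folder: str, depth: int) -> str:
    """Illustrative only: each unit of depth adds one '*/' level
    to the glob pattern used to find parquet files."""
    return folder + '/' + '*/' * depth + '*.parquet'

print(parquet_pattern('data', 0))  # data/*.parquet
print(parquet_pattern('data', 1))  # data/*/*.parquet
```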
## 5. Get Metadata
A `DataLoader` object exposes metadata about your parquet files.
```python
print(data_loader.schema)   # get the data schema
print(data_loader.columns)  # get the data columns
print(data_loader.count)    # get the total row count
print(data_loader.info)     # get metadata information
```
