Skip to content

Commit cb126e2

Browse files
committed
Merge pull request #23 from civitaspo/v0.2.0
V0.2.0
2 parents 837120e + faafe4b commit cb126e2

14 files changed

+568
-319
lines changed

Diff for: CHENGELOG.md

+3
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
0.2.0 (2016-02-xx)
2+
==================
3+
- [Add] `decompression` option

Diff for: README.md

+18-15
Original file line numberDiff line numberDiff line change
@@ -14,11 +14,12 @@ Read files on Hdfs.
1414

1515
- **config_files** list of paths to Hadoop's configuration files (array of strings, default: `[]`)
1616
- **config** overwrites configuration parameters (hash, default: `{}`)
17-
- **path** file path on Hdfs. you can use glob and Date format like `%Y%m%d/%s`.
18-
- **rewind_seconds** When you use Date format in input_path property, the format is executed by using the time which is Now minus this property.
19-
- **partition** when this is true, partition input files and increase task count. (default: `true`)
20-
- **num_partitions** number of partitions. (default: `Runtime.getRuntime().availableProcessors()`)
21-
- **skip_header_lines** Skip this number of lines first. Set 1 if the file has header line. (default: `0`)
17+
- **path** file path on Hdfs. you can use glob and Date format like `%Y%m%d/%s` (string, required).
18+
- **rewind_seconds** When you use Date format in input_path property, the format is executed by using the time which is Now minus this property. (long, default: `0`)
19+
- **partition** when this is true, partition input files and increase task count. (boolean, default: `true`)
20+
- **num_partitions** number of partitions. (long, default: `Runtime.getRuntime().availableProcessors()`)
21+
- **skip_header_lines** Skip this number of lines first. Set 1 if the file has header line. (long, default: `0`)
22+
- **decompression** Decompress compressed files by hadoop compression codec api. (boolean. default: `false`)
2223

2324
## Example
2425

@@ -77,18 +78,20 @@ int partitionSizeByOneTask = totalFileLength / approximateNumPartitions;
7778
...
7879
*/
7980

80-
int numPartitions;
81-
if (path.toString().endsWith(".gz") || path.toString().endsWith(".bz2") || path.toString().endsWith(".lzo")) {
82-
// if the file is compressed, skip partitioning.
83-
numPartitions = 1;
81+
long numPartitions;
82+
if (task.getPartition()) {
83+
if (file.canDecompress()) {
84+
numPartitions = ((fileLength - 1) / partitionSizeByOneTask) + 1;
85+
}
86+
else if (file.getCodec() != null) { // if not null, the file is compressed.
87+
numPartitions = 1;
88+
}
89+
else {
90+
numPartitions = ((fileLength - 1) / partitionSizeByOneTask) + 1;
91+
}
8492
}
85-
else if (!task.getPartition()) {
86-
// if no partition mode, skip partitioning.
87-
numPartitions = 1;
88-
}
8993
else {
90-
// equalize the file size per task as much as possible.
91-
numPartitions = ((fileLength - 1) / partitionSizeByOneTask) + 1;
94+
numPartitions = 1;
9295
}
9396

9497
/*

Diff for: build.gradle

+1-1
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@ configurations {
1515
provided
1616
}
1717

18-
version = "0.1.9"
18+
version = "0.2.0"
1919

2020
sourceCompatibility = 1.7
2121
targetCompatibility = 1.7

0 commit comments

Comments
 (0)