7 changes: 7 additions & 0 deletions package.xml
@@ -203,6 +203,13 @@
</includes>
<outputDirectory>datax</outputDirectory>
</fileSet>
<fileSet>
<directory>s3writer/target/datax/</directory>
<includes>
<include>**/*.*</include>
</includes>
<outputDirectory>datax</outputDirectory>
</fileSet>
<fileSet>
<directory>ftpwriter/target/datax/</directory>
<includes>
@@ -35,4 +35,13 @@ public class Key {

// writer file type suffix, like .txt .csv
public static final String SUFFIX = "suffix";

public static final String S3_BUCKET = "s3Bucket";

public static final String S3_ACCESS_KEY = "s3AccessKey";

public static final String S3_SECRET_KEY = "s3SecretKey";

public static final String S3_ENDPOINT = "s3Endpoint";

}
1 change: 1 addition & 0 deletions pom.xml
@@ -82,6 +82,7 @@
<module>rdbmswriter</module>
<module>hbase11xwriter</module>
<module>hbase094xwriter</module>
<module>s3writer</module>

<!-- some support module -->
<module>plugin-rdbms-util</module>
216 changes: 216 additions & 0 deletions s3writer/doc/s3writer.md
@@ -0,0 +1,216 @@
# DataX S3Writer Guide


------------

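## 1 Quick Introduction

S3Writer writes one or more table files in a CSV-like format to S3.

**The content written to an S3 file is logically a two-dimensional table, for example text in CSV format.**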


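## 2 Features and Limitations

S3Writer converts records from the DataX protocol into TXT files on S3. S3 objects are unstructured storage, so S3Writer follows these conventions:

1. Only TXT files can be written, and the schema of the TXT content must be a two-dimensional table.

2. CSV-like files with a custom delimiter are supported.

3. Text compression is supported; the available formats are gzip and bzip2.

4. Multi-threaded writing is supported; each thread writes a different sub-file.

5. File rolling, that is, switching to a new file once the current one exceeds a size or row-count threshold. [Not yet supported]

What S3Writer cannot do:

1. A single file cannot be written concurrently.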


## 3 Function Description


### 3.1 Sample Configuration

```json
{
    "setting": {},
    "job": {
        "setting": {
            "speed": {
                "channel": 2
            }
        },
        "content": [
            {
                "reader": {
                    "name": "txtfilereader",
                    "parameter": {
                        "path": ["/home/haiwei.luo/case00/data"],
                        "encoding": "UTF-8",
                        "column": [
                            {
                                "index": 0,
                                "type": "long"
                            },
                            {
                                "index": 1,
                                "type": "boolean"
                            },
                            {
                                "index": 2,
                                "type": "double"
                            },
                            {
                                "index": 3,
                                "type": "string"
                            },
                            {
                                "index": 4,
                                "type": "date",
                                "format": "yyyy.MM.dd"
                            }
                        ],
                        "fieldDelimiter": ","
                    }
                },
                "writer": {
                    "name": "txtfilewriter",
                    "parameter": {
                        "path": "/home/haiwei.luo/case00/result",
                        "fileName": "luohw",
                        "writeMode": "truncate",
                        "dateFormat": "yyyy-MM-dd"
                    }
                }
            }
        ]
    }
}
```
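
The sample above uses `txtfilewriter` as the writer and a local path. The `Key` class touched by this change adds the names `s3Bucket`, `s3AccessKey`, `s3SecretKey` and `s3Endpoint`. The following is a minimal sketch of checking a writer `parameter` block for those names. It assumes they are read from the writer's `parameter` block and that all four are required; neither point is confirmed here. The class `S3ParameterCheck` is hypothetical, and a plain `Map` stands in for the DataX configuration object.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of DataX: checks a writer "parameter" block
// (modelled as a plain Map) for the S3 key names found in the Key class.
class S3ParameterCheck {
    // Key names copied from the Key constants added in this change.
    static final String[] S3_KEYS = {"s3Bucket", "s3AccessKey", "s3SecretKey", "s3Endpoint"};

    // Returns the S3 keys that are absent or blank, in declaration order.
    // Treating all four as required is an assumption of this sketch.
    static List<String> missingKeys(Map<String, String> parameter) {
        List<String> missing = new ArrayList<String>();
        for (String key : S3_KEYS) {
            String value = parameter.get(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> parameter = new HashMap<String, String>();
        parameter.put("s3Bucket", "example-bucket");       // placeholder value
        parameter.put("s3Endpoint", "https://s3.example"); // placeholder value
        System.out.println("missing: " + missingKeys(parameter));
    }
}
```

Reporting every missing key at once, rather than failing on the first one, lets a user fix the whole configuration in one pass.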

### 3.2 Parameters

* **path**

    * Description: The path on the S3 file system. S3Writer writes multiple files under this path. <br />

    * Required: yes <br />

    * Default: none <br />

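* **fileName**

    * Description: The name of the file S3Writer writes. A random suffix is appended to this name to form the actual file name written by each thread. <br />

    * Required: yes <br />

    * Default: none <br />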
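
One way to picture the per-thread file names is the sketch below. It is only an illustration: the `__` separator, the UUID-based suffix and the class `TaskFileNames` are assumptions of this sketch, since the actual suffix scheme is not described here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch: one distinct file name per writer thread.
class TaskFileNames {
    // Builds `count` distinct names that all start with fileName. The "__"
    // separator and the UUID-based suffix are assumptions of this sketch.
    static List<String> forTasks(String fileName, int count) {
        Set<String> seen = new HashSet<String>();
        List<String> names = new ArrayList<String>();
        while (names.size() < count) {
            String suffix = UUID.randomUUID().toString().replace("-", "_");
            String candidate = fileName + "__" + suffix;
            if (seen.add(candidate)) { // retry on the (unlikely) collision
                names.add(candidate);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : forTasks("luohw", 2)) {
            System.out.println(name);
        }
    }
}
```

Because every name keeps `fileName` as its prefix, prefix-based cleanup such as the `truncate` write mode can still find all files of a previous run.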

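* **writeMode**

    * Description: How S3Writer cleans up existing data before writing: <br />

        * truncate: before writing, delete all files under the directory whose names start with fileName.
        * append: do no cleanup before writing; DataX S3Writer writes using fileName directly and guarantees that file names do not conflict.
        * nonConflict: if the directory already contains files whose names start with fileName, fail with an error.

    * Required: yes <br />

    * Default: none <br />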
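
The three modes can be read as a small decision over the names that already exist under the path. The sketch below works on a plain list of names instead of a real S3 listing; the class `WriteModeSketch` and the exception types are hypothetical choices, not the plugin's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the writeMode rules, applied to a list of existing names.
class WriteModeSketch {
    // Returns the existing names that have to be deleted before writing.
    static List<String> filesToDelete(String writeMode, String fileName, List<String> existing) {
        List<String> matching = new ArrayList<String>();
        for (String name : existing) {
            if (name.startsWith(fileName)) {
                matching.add(name);
            }
        }
        if ("truncate".equals(writeMode)) {
            return matching; // clean up everything with the fileName prefix
        }
        if ("append".equals(writeMode)) {
            return Collections.<String>emptyList(); // no cleanup at all
        }
        if ("nonConflict".equals(writeMode)) {
            if (!matching.isEmpty()) {
                throw new IllegalStateException("files with prefix already exist: " + matching);
            }
            return Collections.<String>emptyList();
        }
        throw new IllegalArgumentException("unsupported writeMode: " + writeMode);
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList("luohw__a", "other", "luohw__b");
        System.out.println(filesToDelete("truncate", "luohw", existing));
    }
}
```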

* **fieldDelimiter**

    * Description: The field delimiter used when writing. <br />

    * Required: no <br />

    * Default: , <br />

* **compress**

    * Description: The text compression type. Leaving it unset means no compression. Supported compression types are gzip and bzip2. <br />

    * Required: no <br />

    * Default: no compression <br />
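
As an illustration of what gzip compression does to the written text, the sketch below compresses a line in memory with the JDK's `java.util.zip` classes and reads it back. The class `GzipSketch` is hypothetical and does not show how the plugin wires compression into its output stream; bzip2 is left out because the JDK has no built-in bzip2 codec.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: gzip round trip for a block of text.
class GzipSketch {
    static byte[] gzip(String text, String encoding) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(buffer);
        try {
            gz.write(text.getBytes(encoding));
        } finally {
            gz.close(); // closing writes the gzip trailer
        }
        return buffer.toByteArray();
    }

    static String gunzip(byte[] data, String encoding) throws IOException {
        GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } finally {
            in.close();
        }
        return new String(out.toByteArray(), encoding);
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = gzip("1,true,3.14\n", "UTF-8");
        System.out.print(gunzip(packed, "UTF-8"));
    }
}
```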

* **encoding**

    * Description: The encoding of the written files. <br />

    * Required: no <br />

    * Default: utf-8 <br />


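* **nullFormat**

    * Description: A text file has no standard string for null (a null pointer), so DataX provides nullFormat to define which string represents null. <br />

      For example, if the user configures nullFormat="\N", then source data equal to "\N" is treated by DataX as a null field.

    * Required: no <br />

    * Default: \N <br />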
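
The mapping between null and its text marker can be sketched in both directions. This is a minimal illustration with a hypothetical class `NullFormatSketch`; writing a null field as the `nullFormat` string is the assumed writer-side counterpart of the reading rule described above.

```java
// Hypothetical sketch: null <-> text marker, driven by nullFormat.
class NullFormatSketch {
    // Writer side (assumed): a null field is written as the nullFormat string.
    static String toText(String value, String nullFormat) {
        return value == null ? nullFormat : value;
    }

    // Reader side: text equal to nullFormat is treated as a null field.
    static String fromText(String text, String nullFormat) {
        return nullFormat.equals(text) ? null : text;
    }

    public static void main(String[] args) {
        String marker = "\\N"; // the two characters backslash and N
        System.out.println(toText(null, marker));
        System.out.println(fromText(marker, marker) == null);
    }
}
```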

* **dateFormat**

    * Description: The format used when date values are serialized to the file, for example "dateFormat": "yyyy-MM-dd". <br />

    * Required: no <br />

    * Default: none <br />
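
The pattern syntax in the example is the one used by `java.text.SimpleDateFormat`. The sketch below formats a date with such a pattern; the class `DateFormatSketch` is hypothetical, and the fixed UTC time zone and English locale are assumptions made only so the output is reproducible.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch: serializing a date with a dateFormat pattern.
class DateFormatSketch {
    static String format(Date value, String dateFormat) {
        SimpleDateFormat formatter = new SimpleDateFormat(dateFormat, Locale.ENGLISH);
        // Fixed zone so the result does not depend on the machine (an assumption of this sketch).
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        return formatter.format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(new Date(0L), "yyyy-MM-dd"));
    }
}
```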

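* **fileFormat**

    * Description: The output file format, either csv (http://zh.wikipedia.org/wiki/%E9%80%97%E5%8F%B7%E5%88%86%E9%9A%94%E5%80%BC) or text. csv is strict CSV: if the data to write contains the column delimiter, it is escaped according to CSV escaping rules, with the double quote (") as the escape character. text simply joins the data with the column delimiter and does no escaping when the data contains the delimiter. <br />

    * Required: no <br />

    * Default: text <br />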
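
The difference between the two formats shows up only when a field contains the delimiter. The sketch below follows the common CSV quoting convention (wrap the field in double quotes and double any inner quote); it is an illustration with a hypothetical class `LineFormatSketch`, not the CSV library the plugin actually uses.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: joining fields as "text" versus "csv".
class LineFormatSketch {
    // text: plain join, no escaping.
    static String textLine(List<String> fields, char delimiter) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append(delimiter);
            }
            line.append(fields.get(i));
        }
        return line.toString();
    }

    // csv: fields that need it are quoted before joining.
    static String csvLine(List<String> fields, char delimiter) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append(delimiter);
            }
            line.append(escapeCsv(fields.get(i), delimiter));
        }
        return line.toString();
    }

    // Quotes a field that contains the delimiter, a double quote or a line break.
    static String escapeCsv(String field, char delimiter) {
        boolean needsQuotes = field.indexOf(delimiter) >= 0 || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0 || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("a,b", "c");
        System.out.println(textLine(fields, ','));
        System.out.println(csvLine(fields, ','));
    }
}
```

With text, a reader can no longer tell that `a,b` was one field, which is why csv is the safer choice when the data may contain the delimiter.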

* **header**

    * Description: The header row written to the txt output, for example ['id', 'name', 'age']. <br />

    * Required: no <br />

    * Default: none <br />

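### 3.3 Type Conversion


S3 files themselves carry no data types; the types below are defined by DataX S3Writer:

| DataX internal type | S3 file data type |
| -------- | ----- |
| Long | Long |
| Double | Double |
| String | String |
| Boolean | Boolean |
| Date | Date |

Where:

* S3 file Long means the string representation of an integer in the file text, for example "19901219".
* S3 file Double means the string representation of a Double in the file text, for example "3.1415".
* S3 file Boolean means the string representation of a Boolean in the file text, for example "true" or "false". Case is ignored.
* S3 file Date means the string representation of a Date in the file text, for example "2014-12-31". A format can be specified for Date.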
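
The string forms listed above match what the JDK's standard conversions produce. The sketch below is a small illustration with a hypothetical class `TypeTextSketch`; it does not claim to be the conversion code the plugin uses.

```java
// Hypothetical sketch: string forms of Long, Double and Boolean values.
class TypeTextSketch {
    static String longText(long value) {
        return Long.toString(value);
    }

    static String doubleText(double value) {
        return Double.toString(value);
    }

    static String booleanText(boolean value) {
        return Boolean.toString(value);
    }

    // Case-insensitive parsing, as described for the Boolean text form.
    static boolean parseBoolean(String text) {
        if ("true".equalsIgnoreCase(text)) {
            return true;
        }
        if ("false".equalsIgnoreCase(text)) {
            return false;
        }
        throw new IllegalArgumentException("not a boolean: " + text);
    }

    public static void main(String[] args) {
        System.out.println(longText(19901219L));
        System.out.println(doubleText(3.1415));
        System.out.println(booleanText(parseBoolean("TRUE")));
    }
}
```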


## 4 Performance Report


## 5 Constraints and Limitations


## 6 FAQ



88 changes: 88 additions & 0 deletions s3writer/pom.xml
@@ -0,0 +1,88 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-all</artifactId>
<version>0.0.1-SNAPSHOT</version>
</parent>

<artifactId>s3writer</artifactId>
<name>s3writer</name>
<description>S3Writer provides local TEXT writing; recommended for development and test environments.</description>
<packaging>jar</packaging>

<dependencies>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-common</artifactId>
<version>${datax-project-version}</version>
<exclusions>
<exclusion>
<artifactId>slf4j-log4j12</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>plugin-unstructured-storage-util</artifactId>
<version>${datax-project-version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>16.0.1</version>
Review comment (Member): This version seems a bit too low.

</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-core</artifactId>
<version>1.11.52</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-s3</artifactId>
<version>1.11.52</version>
</dependency>
</dependencies>

<build>
<plugins>
<!-- compiler plugin -->
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.6</source>
<target>1.6</target>
<encoding>${project-sourceEncoding}</encoding>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptors>
<descriptor>src/main/assembly/package.xml</descriptor>
</descriptors>
<finalName>datax</finalName>
</configuration>
<executions>
<execution>
<id>dwzip</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
35 changes: 35 additions & 0 deletions s3writer/src/main/assembly/package.xml
@@ -0,0 +1,35 @@
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
<id></id>
<formats>
<format>dir</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>src/main/resources</directory>
<includes>
<include>plugin.json</include>
<include>plugin_job_template.json</include>
</includes>
<outputDirectory>plugin/writer/s3writer</outputDirectory>
</fileSet>
<fileSet>
<directory>target/</directory>
<includes>
<include>s3writer-0.0.1-SNAPSHOT.jar</include>
</includes>
<outputDirectory>plugin/writer/s3writer</outputDirectory>
</fileSet>
</fileSets>

<dependencySets>
<dependencySet>
<useProjectArtifact>false</useProjectArtifact>
<outputDirectory>plugin/writer/s3writer/libs</outputDirectory>
<scope>runtime</scope>
</dependencySet>
</dependencySets>
</assembly>
@@ -0,0 +1,6 @@
package com.alibaba.datax.plugin.writer.s3writer;

public class Key {
// must have
public static final String PATH = "path";
}