A thin wrapper around Flume's native HDFSEventSink
that makes it possible to use an empty inUseSuffix.
HDFSEventSink was designed to write data into HDFS. When Flume writes data into HDFS, using a temporary suffix makes sense, because a client has to be able to distinguish final data from data that is still "in-progress".
S3 doesn't support the "append" operation, so Flume follows this workflow:
- creates a temporary file on the agent machine and writes new events to it
- when the file is ready, Flume copies it to S3 with inUseSuffix at the end
- finally, Flume renames the file by removing inUseSuffix
Renaming a file on S3 is essentially two operations: "copy to a new file" and "remove the old one", so the in-use suffix costs an extra copy and delete for every file.
I tried to raise the question on the Flume user list,
but without success. Flume doesn't allow you to specify an empty inUseSuffix because of this code:
https://github.com/apache/flume/blob/flume-1.6/flume-ng-configuration/src/main/java/org/apache/flume/conf/FlumeConfiguration.java#L155
To build the jar file for the sink (tested with gradle 2.2.1):
brew install gradle
gradle build
To use the sink:
- Place the jar into the plugins directory:
mkdir -p $FLUME_HOME/plugins.d/flume-s3-sink/lib
cp build/libs/flume-s3-sink-1.0.jar $FLUME_HOME/plugins.d/flume-s3-sink/lib
- Configure the sink:
agent.sinks.my_s3sink.type = org.apache.flume.sink.s3.S3Sink
# Other options are the same as for https://flume.apache.org/FlumeUserGuide.html#hdfs-sink
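As a rough illustration of how the remaining options might fit together, here is a minimal sketch of a sink definition. The agent, sink, channel, and bucket names are hypothetical, and the `s3n://` scheme is an assumption that depends on your Hadoop libraries and credentials setup. The `hdfs.*` keys are the ones documented for the standard HDFS sink. The sketch does not set `hdfs.inUseSuffix`, on the assumption that the wrapper applies the empty suffix itself, since an empty value cannot be passed through the Flume configuration.

```properties
# Hypothetical names: agent, my_s3sink, my_channel, my-bucket.
agent.sinks = my_s3sink
agent.sinks.my_s3sink.type = org.apache.flume.sink.s3.S3Sink
agent.sinks.my_s3sink.channel = my_channel

# Assumed target location; the date escapes need a timestamp,
# supplied here by hdfs.useLocalTimeStamp.
agent.sinks.my_s3sink.hdfs.path = s3n://my-bucket/events/%Y-%m-%d
agent.sinks.my_s3sink.hdfs.useLocalTimeStamp = true

# Write raw events rather than SequenceFiles.
agent.sinks.my_s3sink.hdfs.fileType = DataStream

# Roll a new file every 300 seconds; disable size- and count-based rolling.
agent.sinks.my_s3sink.hdfs.rollInterval = 300
agent.sinks.my_s3sink.hdfs.rollSize = 0
agent.sinks.my_s3sink.hdfs.rollCount = 0
```

The roll settings here are only example values; pick them according to how large you want the resulting S3 objects to be.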