
Commit 6b03a6d

Merge pull request #82 from mmolimar/develop
Release version 1.3.0
2 parents 98c5641 + 79f6e4f commit 6b03a6d

27 files changed: +1082 -50 lines

Dockerfile (+1 -1)

@@ -1,4 +1,4 @@
-FROM confluentinc/cp-kafka-connect-base:5.5.1
+FROM confluentinc/cp-kafka-connect-base:6.1.0
 
 ARG PROJECT_VERSION
 ENV CONNECT_PLUGIN_PATH="/usr/share/java,/usr/share/confluent-hub-components"

docker-compose.yml (+4 -4)

@@ -1,7 +1,7 @@
 version: '3'
 services:
   cp-zookeeper:
-    image: confluentinc/cp-zookeeper:5.5.1
+    image: confluentinc/cp-zookeeper:6.1.0
     hostname: zookeeper
     container_name: zookeeper
     ports:
@@ -11,7 +11,7 @@ services:
       ZOOKEEPER_TICK_TIME: 2000
 
   cp-kafka:
-    image: confluentinc/cp-kafka:5.5.1
+    image: confluentinc/cp-kafka:6.1.0
     hostname: kafka
     container_name: kafka
     depends_on:
@@ -32,7 +32,7 @@ services:
       CONFLUENT_METRICS_ENABLE: 'false'
 
   cp-schema-registry:
-    image: confluentinc/cp-schema-registry:5.5.1
+    image: confluentinc/cp-schema-registry:6.1.0
     hostname: schema-registry
     container_name: schema-registry
     depends_on:
@@ -45,7 +45,7 @@ services:
       SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: 'zookeeper:2181'
 
   connect-fs:
-    image: mmolimar/kafka-connect-fs:1.2.0
+    image: mmolimar/kafka-connect-fs:1.3.0
     container_name: connect
     depends_on:
       - cp-kafka

docs/source/conf.py (+2 -2)

@@ -55,9 +55,9 @@
 # built documents.
 #
 # The short X.Y version.
-version = '1.2'
+version = '1.3'
 # The full version, including alpha/beta/rc tags.
-release = '1.2'
+release = '1.3'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

docs/source/config_options.rst (+93 -3)

@@ -57,8 +57,10 @@ General config properties for this connector.
   If you want to ingest data from S3, you can add credentials with:
   ``policy.fs.fs.s3a.access.key=<ACCESS_KEY>``
   and
-  ``policy.fs.fs.s3a.secret.key=<SECRET_KEY>``
-
+  ``policy.fs.fs.s3a.secret.key=<SECRET_KEY>``.
+  Also, in case you want to configure a custom credentials provider, you should use
+  the ``policy.fs.fs.s3a.aws.credentials.provider=<CLASS>`` property.
+
 ``topic``
   Topic in which copy data to.
 
@@ -224,7 +226,7 @@ HDFS file watcher
 In order to configure custom properties for this policy, the name you must use is ``hdfs_file_watcher``.
 
 ``policy.hdfs_file_watcher.poll``
-  Time to wait until the records retrieved from the file watcher will be sent to the source task.
+  Time to wait (in milliseconds) until the records retrieved from the file watcher will be sent to the source task.
 
   * Type: long
   * Default: ``5000``
@@ -237,6 +239,52 @@ In order to configure custom properties for this policy, the name you must use i
   * Default: ``20000``
   * Importance: medium
 
+.. _config_options-policies-s3events:
+
+S3 event notifications
+--------------------------------------------
+
+In order to configure custom properties for this policy, the name you must use is ``s3_event_notifications``.
+
+``policy.s3_event_notifications.queue``
+  SQS queue name to retrieve messages from.
+
+  * Type: string
+  * Importance: high
+
+``policy.s3_event_notifications.poll``
+  Time to wait (in milliseconds) until the records retrieved from the queue will be sent to the source task.
+
+  * Type: long
+  * Default: ``5000``
+  * Importance: medium
+
+``policy.s3_event_notifications.event_regex``
+  Regular expression to filter events based on their types.
+
+  * Type: string
+  * Default: ``.*``
+  * Importance: medium
+
+``policy.s3_event_notifications.delete_messages``
+  Whether messages from SQS should be removed after reading them.
+
+  * Type: boolean
+  * Default: ``true``
+  * Importance: medium
+
+``policy.s3_event_notifications.max_messages``
+  Maximum number of messages to retrieve at a time (must be between 1 and 10).
+
+  * Type: int
+  * Importance: medium
+
+``policy.s3_event_notifications.visibility_timeout``
+  Duration (in seconds) that the received messages are hidden from subsequent retrieve requests.
+
+  * Type: int
+  * Importance: low
+
 .. _config_options-filereaders:
 
 File readers
@@ -357,6 +405,13 @@ In order to configure custom properties for this reader, the name you must use i
   * Default: ``true``
   * Importance: medium
 
+``file_reader.cobol.reader.is_text``
+  Whether line ending characters (LF / CRLF) will be used as the record separator.
+
+  * Type: boolean
+  * Default: ``false``
+  * Importance: medium
+
 ``file_reader.cobol.reader.ebcdic_code_page``
   Code page to be used for EBCDIC to ASCII / Unicode conversions.
 
@@ -448,6 +503,13 @@ In order to configure custom properties for this reader, the name you must use i
   * Default: ``false``
   * Importance: low
 
+``file_reader.cobol.reader.record_length``
+  Specifies the length of the record, disregarding the copybook record size. It implies the file has a fixed record length.
+
+  * Type: int
+  * Default: ``null``
+  * Importance: low
+
 ``file_reader.cobol.reader.length_field_name``
   The name for a field that contains the record length. If not set, the copybook record length will be used.
 
@@ -539,20 +601,41 @@ In order to configure custom properties for this reader, the name you must use i
   * Default: ``null``
   * Importance: low
 
+``file_reader.cobol.reader.record_extractor``
+  Parser to be used to parse records.
+
+  * Type: string
+  * Default: ``null``
+  * Importance: low
+
 ``file_reader.cobol.reader.rhp_additional_info``
   Extra option to be passed to a custom record header parser.
 
   * Type: string
   * Default: ``null``
   * Importance: low
 
+``file_reader.cobol.reader.re_additional_info``
+  A string provided for the raw record extractor.
+
+  * Type: string
+  * Default: ````
+  * Importance: low
+
 ``file_reader.cobol.reader.input_file_name_column``
   A column name to add to each record containing the input file name.
 
   * Type: string
   * Default: ````
   * Importance: low
 
+.. _config_options-filereaders-binary:
+
+Binary
+--------------------------------------------
+
+There are no extra configuration options for this file reader.
+
 .. _config_options-filereaders-csv:
 
 CSV
@@ -1258,6 +1341,13 @@ To configure custom properties for this reader, the name you must use is ``agnos
   * Default: ``dat``
   * Importance: medium
 
+``file_reader.agnostic.extensions.binary``
+  A comma-separated string list with the accepted extensions for binary files.
+
+  * Type: string[]
+  * Default: ``bin``
+  * Importance: medium
+
 ``file_reader.agnostic.extensions.csv``
   A comma-separated string list with the accepted extensions for CSV files.
docs/source/connector.rst (+1)

@@ -160,6 +160,7 @@ There are several file readers included which can read the following file formats:
 * ORC.
 * SequenceFile.
 * Cobol / EBCDIC.
+* Other binary files.
 * CSV.
 * TSV.
 * Fixed-width.

docs/source/filereaders.rst (+21)

@@ -60,6 +60,26 @@ translate it into a Kafka message with the schema.
 
 More information about properties of this file reader :ref:`here<config_options-filereaders-cobol>`.
 
+Binary
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+All other kinds of binary files can be ingested using this reader.
+
+It just extracts the content plus some metadata such as: path, file owner, file group, length, access time,
+and modification time.
+
+Each message will contain the following schema:
+
+* ``path``: File path (string).
+* ``owner``: Owner of the file (string).
+* ``group``: Group associated with the file (string).
+* ``length``: Length of this file, in bytes (long).
+* ``access_time``: Access time of the file (long).
+* ``modification_time``: Modification time of the file (long).
+* ``content``: Content of the file (bytes).
+
+More information about properties of this file reader :ref:`here<config_options-filereaders-binary>`.
+
 CSV
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -153,6 +173,7 @@ Default extensions for each format (configurable):
 * ORC: ``.orc``
 * SequenceFile: ``.seq``
 * Cobol / EBCDIC: ``.dat``
+* Other binary files: ``.bin``
 * CSV: ``.csv``
 * TSV: ``.tsv``
 * FixedWidth: ``.fixed``
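
Note: the binary reader's message schema listed above maps directly onto a Kafka Connect ``Struct``. A minimal sketch of reading those fields (for example from inside a custom SMT or sink), assuming the field names exactly as documented:

import org.apache.kafka.connect.data.Struct;

public class BinaryRecordInspector {

    // Reads the fields the binary file reader is documented to emit:
    // path, owner, group, length, access_time, modification_time, content.
    public static void inspect(Struct value) {
        System.out.printf("file=%s owner=%s group=%s length=%d bytes%n",
                value.getString("path"),
                value.getString("owner"),
                value.getString("group"),
                value.getInt64("length"));
        byte[] content = value.getBytes("content");
        System.out.printf("accessed=%d modified=%d payload=%d bytes%n",
                value.getInt64("access_time"),
                value.getInt64("modification_time"),
                content == null ? 0 : content.length);
    }
}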

docs/source/policies.rst (+12)

@@ -36,3 +36,15 @@ You can learn more about the properties of this policy :ref:`here<config_options
 .. attention:: The URIs included in the general property ``fs.uris`` will be filtered and only those
                ones which start with the prefix ``hdfs://`` will be watched. Also, this policy
                will only work for Hadoop versions 2.6.0 or higher.
+
+S3 event notifications
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It uses S3 event notifications sent from S3 to process files which have been created or modified in S3.
+These notifications will be read from an AWS SQS queue, and they can be sent to SQS directly from S3 or via
+AWS SNS, either as an SNS notification or as a raw message in the subscription.
+
+Use it when you have S3 URIs; event notifications must be enabled in the S3 bucket towards an SNS topic
+or an SQS queue.
+
+You can learn more about the properties of this policy :ref:`here<config_options-policies-s3events>`.
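
Note: the policy only consumes the notifications; the S3 bucket itself must be configured to publish them. A sketch of enabling ObjectCreated notifications from a bucket to an SQS queue, assuming the AWS SDK for Java v1 is on the classpath and using placeholder bucket/queue names (the queue's access policy must also allow S3 to send messages):

import java.util.EnumSet;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketNotificationConfiguration;
import com.amazonaws.services.s3.model.QueueConfiguration;
import com.amazonaws.services.s3.model.S3Event;

public class EnableS3EventNotificationsSketch {

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Publish all ObjectCreated events for the bucket to the SQS queue
        // identified by its ARN (placeholder values below).
        BucketNotificationConfiguration config = new BucketNotificationConfiguration()
                .addConfiguration("connect-fs-events",
                        new QueueConfiguration("arn:aws:sqs:us-east-1:123456789012:my-s3-events-queue",
                                EnumSet.of(S3Event.ObjectCreated)));

        s3.setBucketNotificationConfiguration("my-bucket", config);
    }
}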

pom.xml (+30 -17)

@@ -4,7 +4,7 @@
 
     <groupId>com.github.mmolimar.kafka.connect</groupId>
     <artifactId>kafka-connect-fs</artifactId>
-    <version>1.2.0</version>
+    <version>1.3.0</version>
     <packaging>jar</packaging>
 
     <name>kafka-connect-fs</name>
@@ -46,31 +46,31 @@
 
     <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
-       <kafka.version>2.6.0</kafka.version>
-       <confluent.version>5.5.1</confluent.version>
+       <kafka.version>2.7.0</kafka.version>
+       <confluent.version>6.1.0</confluent.version>
        <hadoop.version>3.3.0</hadoop.version>
-       <gcs-connector.version>hadoop3-2.1.5</gcs-connector.version>
+       <gcs-connector.version>hadoop3-2.2.0</gcs-connector.version>
        <parquet.version>1.11.1</parquet.version>
-       <orc.version>1.6.3</orc.version>
-       <univocity.version>2.9.0</univocity.version>
+       <orc.version>1.6.7</orc.version>
+       <univocity.version>2.9.1</univocity.version>
        <jackson-dataformat.version>2.10.2</jackson-dataformat.version>
-       <cobrix.version>2.1.1</cobrix.version>
-       <scala.version>2.12.12</scala.version>
-       <cron-utils.version>9.1.1</cron-utils.version>
+       <cobrix.version>2.2.0</cobrix.version>
+       <scala.version>2.12.13</scala.version>
+       <cron-utils.version>9.1.3</cron-utils.version>
        <jsch.version>0.1.55</jsch.version>
-       <junit-jupiter.version>5.7.0</junit-jupiter.version>
+       <junit-jupiter.version>5.7.1</junit-jupiter.version>
        <easymock.version>4.2</easymock.version>
-       <powermock.version>2.0.7</powermock.version>
+       <powermock.version>2.0.9</powermock.version>
        <maven-compiler.source>1.8</maven-compiler.source>
        <maven-compiler.target>${maven-compiler.source}</maven-compiler.target>
        <maven-jar-plugin.version>3.2.0</maven-jar-plugin.version>
        <maven-compiler-plugin.version>3.8.1</maven-compiler-plugin.version>
        <maven-scala-plugin.version>4.4.0</maven-scala-plugin.version>
        <maven-assembly-plugin.version>3.3.0</maven-assembly-plugin.version>
-       <maven-jacoco-plugin.version>0.8.5</maven-jacoco-plugin.version>
+       <maven-jacoco-plugin.version>0.8.6</maven-jacoco-plugin.version>
        <maven-coveralls-plugin.version>4.3.0</maven-coveralls-plugin.version>
        <maven-surfire-plugin.version>3.0.0-M5</maven-surfire-plugin.version>
-       <maven-kafka-connect-plugin.version>0.11.3</maven-kafka-connect-plugin.version>
+       <maven-kafka-connect-plugin.version>0.12.0</maven-kafka-connect-plugin.version>
    </properties>
 
    <dependencies>
@@ -139,6 +139,12 @@
            <groupId>za.co.absa.cobrix</groupId>
            <artifactId>cobol-parser_2.12</artifactId>
            <version>${cobrix.version}</version>
+           <exclusions>
+               <exclusion>
+                   <groupId>org.scala-lang</groupId>
+                   <artifactId>scala-library</artifactId>
+               </exclusion>
+           </exclusions>
        </dependency>
        <dependency>
            <groupId>com.cronutils</groupId>
@@ -150,6 +156,11 @@
            <artifactId>jsch</artifactId>
            <version>${jsch.version}</version>
        </dependency>
+       <dependency>
+           <groupId>org.scala-lang</groupId>
+           <artifactId>scala-library</artifactId>
+           <version>${scala.version}</version>
+       </dependency>
 
        <!-- Test dependencies -->
        <dependency>
@@ -298,11 +309,12 @@
        into Kafka.
 
        The following file types are supported: Parquet, Avro, ORC, SequenceFile,
-       Cobol / EBCDIC, CSV, TSV, Fixed-width, JSON, XML, YAML and Text.
+       Cobol / EBCDIC, other binary files, CSV, TSV, Fixed-width, JSON, XML, YAML and Text.
 
-       Also, the connector has built-in support for file systems such as HDFS, S3,
-       Google Cloud Storage, Azure Blob Storage, Azure Data Lake Store, FTP, SFTP and
-       local file system, among others.
+       Also, the connector has built-in support for file systems such as HDFS, S3 (directly
+       or via messages from SNS/SQS queues due to S3 event notifications), Google Cloud
+       Storage, Azure Blob Storage, Azure Data Lake Store, FTP, SFTP and local file system,
+       among others.
        ]]></description>
        <sourceUrl>https://github.com/mmolimar/kafka-connect-fs</sourceUrl>
 
@@ -340,6 +352,7 @@
            <tag>orc</tag>
            <tag>sequence</tag>
            <tag>cobol</tag>
+           <tag>binary</tag>
            <tag>csv</tag>
            <tag>tsv</tag>
            <tag>fixed</tag>

src/main/java/com/github/mmolimar/kafka/connect/fs/FsSourceTask.java (+2 -1)

@@ -46,6 +46,7 @@ public String version() {
     }
 
     @Override
+    @SuppressWarnings("unchecked")
     public void start(Map<String, String> properties) {
         log.info("{} Starting FS source task...", this);
         try {
@@ -94,7 +95,7 @@ public List<SourceRecord> poll() {
                 Struct record = reader.next();
                 // TODO change FileReader interface in the next major version
                 boolean hasNext = (reader instanceof AbstractFileReader) ?
-                        ((AbstractFileReader) reader).hasNextBatch() || reader.hasNext() : reader.hasNext();
+                        ((AbstractFileReader<?>) reader).hasNextBatch() || reader.hasNext() : reader.hasNext();
                 records.add(convert(metadata, reader.currentOffset(), !hasNext, record));
             }
         } catch (IOException | ConnectException e) {
