Commit d84ae1c

Updated read me
1 parent 827c50c commit d84ae1c

File tree: 1 file changed (+8, -84 lines)

README.md

Lines changed: 8 additions & 84 deletions
@@ -1,8 +1,6 @@
 # Getting Started
 ## Solace Spark Connector
-Solace Spark Connector is based on DataSourceV2 API provided in Spark. The following are good places to start
-
-* https://levelup.gitconnected.com/easy-guide-to-create-a-custom-read-data-source-in-apache-spark-3-194afdc9627a
+Solace Spark Connector is based on Spark DataSourceV2 API provided in Spark.
 
 The "Getting Started" tutorials will get you up to speed and sending messages with Solace technology as quickly as possible. There are three ways you can get started:
 
@@ -12,94 +10,20 @@ The "Getting Started" tutorials will get you up to speed and sending messages wi
 
 # Contents
 
-This repository contains code that connects to specified Solace service and inserts data into Spark Internal row. Please note that this connector only support Spark read operation.
-
-## Features
-
-* The current connector enables user to configure Solace service details as required
-* It supports processing of specific number of messages in a partition(BatchSize option can be used to configure the number of messages per partition)
-* Message are acknowledged to Solace once commit method is triggered from Spark. Commit is triggered from Spark on successful write operation.
-* Message IDs are persisted in checkpoint location once commit is successfull. This helps in deduplication of messages and guaranteed processing.
+This repository contains code that can stream messages from Solace to Spark Internal row and also publish messages to Solace from Spark.
 
-## Architecture
+## User Guide
 
-<p align="center">
-<img width="700" height="400" src="https://user-images.githubusercontent.com/83568543/154064846-c1b5025f-9898-4190-90b5-84f05bd7fba9.png"/>
-</p>
+For complete user guide, please check the connector page on [Solace Integration Hub](https://solace.com/integration-hub/apache-spark/).
 
-# Prerequisites
+## Maven Central
 
-Apache Spark 3.5.1, Scala 2.12
+The connector is available in maven central as [pubsubplus-connector-spark](https://mvnrepository.com/artifact/com.solacecoe.connectors/pubsubplus-connector-spark)
 
 # Build the connector
 
-`mvn clean install`
-
-# Running the connector
-
-## Databricks environment
-
-Create a Databricks cluster with Runtime with above Spark version and follow the steps below
-
-1. In the libraries section upload the pubsubplus-connector-spark jar and also add the following maven dependencies using install new option available in Databricks cluster libraries section
-
-<p align="center">
-<img src="https://github.com/SolaceTechCOE/spark-connector-v3/assets/83568543/b2bc0f68-d80e-422c-8fd3-af89034cdaca"/>
-</p>
-
-## Running as a Job
-If you are running Spark Connector as a job, use the jar files provided as part of distribution to configure your job. In case of thin jar you need to provide dependencies as in above screenshot for the job. For class name and other connector configuration please refer sample scripts and configuration option sections
-
-## Thin Jar vs Fat Jar
-Thin jar is light weight jar where only Spark Streaming API related dependencies are included. In this case Solace and Log4j related dependencies should be added during configuration of Spark Job. Spark supports adding these dependencies via maven or actual jars.
-
-```com.solacesystems:sol-jcsmp:10.21.0``` & ```log4j:log4j:1.2.17```
-
-Fat jar includes all the dependencies and it can be used directly without any additional dependencies.
-
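As an aside, a minimal sketch of pointing Spark at the thin-jar dependencies named above via the `spark.jars.packages` property; the app name is a placeholder, and depending on how the job is launched these coordinates may instead need to be supplied at submit time (for example via `spark-submit --packages` or the cluster's library settings).

```scala
import org.apache.spark.sql.SparkSession

// Sketch: declare the thin jar's external dependencies as Maven coordinates.
// The coordinates are the ones listed in the section above; adjust versions as needed.
val spark = SparkSession.builder()
  .appName("solace-spark-job") // placeholder application name
  .config("spark.jars.packages",
    "com.solacesystems:sol-jcsmp:10.21.0," +
    "log4j:log4j:1.2.17")
  .getOrCreate()
```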
-## Solace Spark Schema
-
-Solace Spark Connector transforms the incoming message to Spark row with below schema.
-
-| Column Name | Column Type |
-|-------------|---------------------|
-| Id | String |
-| Payload | Binary |
-| Topic | String |
-| TimeStamp | Timestamp |
-| Headers | Map<string, binary> |
-
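For reference, the table above corresponds to roughly this Spark `StructType` (field nullability is an assumption):

```scala
import org.apache.spark.sql.types._

// Spark schema mirroring the Solace Spark Connector row layout described above.
val solaceSchema = StructType(Seq(
  StructField("Id", StringType),                          // message identifier
  StructField("Payload", BinaryType),                     // raw message payload
  StructField("Topic", StringType),                       // topic the message was published to
  StructField("TimeStamp", TimestampType),                // message timestamp
  StructField("Headers", MapType(StringType, BinaryType)) // message/user property headers
))
```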
-## Using Sample Script
-
-### Databricks
-
-1. Solace_Read_Stream_Script.txt
-
-Create a new notebook in Databricks environment and set the language to Scala. Copy the Script to notebook and provide the required details. Once started, script reads data from Solace Queue and writes to parquet files.
-
-2. Read_Parquet_Script.txt
-This scripts reads the data from parquet and displays the count of records. The output should match the number of records available in Solace queue before processing.
-
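A rough sketch of what those two scripts boil down to, assuming `input` is the streaming DataFrame obtained from the connector (see the configuration sketch after the options table below) and the paths are placeholders:

```scala
// Sketch of Solace_Read_Stream_Script: stream messages from the Solace queue into parquet files.
// `input` stands for the streaming DataFrame produced by the connector; paths are placeholders.
val query = input.writeStream
  .format("parquet")
  .option("path", "/mnt/solace/output")
  .option("checkpointLocation", "/mnt/solace/checkpoint")
  .start()

// Sketch of Read_Parquet_Script: read the parquet output back and count the records;
// the count should match the number of messages that were in the queue before processing.
val recordCount = spark.read.parquet("/mnt/solace/output").count()
println(s"Record count: $recordCount")
```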
-## Spark Job or Spark Submit
-
-Spark Job or Spark Submit requires jar as input. You can convert above Scala scripts to jar and provide it as input and add pubsubplus-connector-solace jar(thin or fat) as dependency. In case of thin jar please note that additional dependencies need to be configured as mentioned in Thin Jar vs Fat Jar section.
-
-## Configuration Options
-| Config Option | Type | Valid Values | Default Value | Description |
-|---------------|------|--------------|---------------|-------------|
-| host | String | tcp(s)://hostname:port | Empty | Fully Qualified Solace Hostname with protocol and port number |
-| vpn | String | | Empty | Solace VPN Name |
-| username | String | | Empty | Solace Client Username |
-| password | String | | Empty | Solace Client Password |
-| queue | String | | Empty | Solace Queue name |
-| batchSize | Integer | | 1 | Set number of messages to be processed in batch. Default is set to 1 |
-| ackLastProcessedMessages | Boolean | | false | Set this value to true if connector needs to determine processed messages in last run. The connector purely depends on offset file generated during Spark commit. In some cases connector may acknowledge the message based on offset but same may not be available in your downstream systems. In general we recommend leaving this false and handle process/reprocess of messages in downstream systems |
-| skipDuplicates | Boolean | | false | Set this value to true if connector needs check for duplicates before adding to Spark row. This scenario occurs when the tasks are executing more than expected time and message is not acknowledged before the start of next task. In such cases the message will be added again to Spark row. |
-| offsetIndicator | String | | MESSAGE_ID | Set this value if your Solace Message has unique ID in message header. Supported Values are <ul><li> MESSAGE_ID </li><li> CORRELATION_ID </li> <li> APPLICATION_MESSAGE_ID </li> <li> CUSTOM_USER_PROPERTY </li> </ul> CUSTOM_USER_PROPERTY refers to one of headers in user properties header. <br/> <br/> Note: Default value uses replication group message ID property as offset indicator. <br/> <br/> A replication group message ID is an attribute of Solace messages, assigned by the event broker delivering the messages to the queue and topic endpoints, that uniquely identifies a message on a particular queue or topic endpoint within a high availability (HA) group and replication group of event brokers. |
-| includeHeaders | Boolean | | false | Set this value to true if message headers need to be included in output |
-| partitions | Integer | | 1 | Sets the number of consumers for configured queue. Equal number of partitions are created to process data received from consumers |
-| createFlowsOnSameSession | Boolean | | false | If enabled consumer flows are enabled on same session. The number of consumer flows is equal to number of partitions configured. This is helpful when users want to optimize on number of connections created from Spark. By default the connector creates a new connection for each consumer. |
-
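To show how these options are wired up, a hedged example of a streaming read in Scala; the `solace` format short name is an assumption (check the user guide for the exact source identifier), and the host, credentials and queue values are placeholders:

```scala
// Sketch: configure the Solace source using option keys from the table above.
// The format name and all connection values are placeholders/assumptions.
val input = spark.readStream
  .format("solace") // assumed short name for the connector's data source
  .option("host", "tcps://broker.example.com:55443")
  .option("vpn", "my-vpn")
  .option("username", "spark-client")
  .option("password", "<client-password>")
  .option("queue", "spark-queue")
  .option("batchSize", "100")       // messages per partition/batch (default 1)
  .option("includeHeaders", "true") // emit the Headers column
  .load()
```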
+`mvn clean install` to build connector with integration tests
+`mvn clean install -DskipTests` to build connector without integration tests.
 
 # Exploring the Code using IDE
 