Description
[Disclaimer: I am fairly new to Pulsar, so I may not understand all of its details, but I have been using Spark for a while now.]
I am using an Apache Spark consumer to consume data from Pulsar on AWS EMR, via the StreamNative pulsar-spark connector.
My version stack looks like this:
Spark version: 3.4.1
Pulsar version: 2.10.0.7
StreamNative connector: pulsar-spark-connector_2.12-3.4.0.3.jar
I created a new Pulsar topic and started a fresh Spark consumer on it. The consumer connects to the topic and consumes messages correctly. The only issue I have is with the backlog numbers displayed in the Pulsar admin UI.
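To cross-check the number shown in the UI, the backlog can also be read from the topic stats endpoint of the Pulsar admin REST API, where it appears per subscription as msgBacklog. The following is only a minimal sketch: the tenant "public", the namespace "default", and the http:// admin address are assumptions for illustration, not values from my setup, and it assumes the admin endpoint needs no authentication.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object BacklogCheck {
  // Builds the admin REST URL for the stats of a persistent topic.
  // adminBase, tenant and namespace are placeholders (assumptions).
  def statsUrl(adminBase: String, tenant: String, namespace: String, topic: String): String =
    s"${adminBase.stripSuffix("/")}/admin/v2/persistent/$tenant/$namespace/$topic/stats"

  // Fetches the raw stats JSON. The per-subscription backlog is reported
  // under subscriptions.<subscription-name>.msgBacklog in the response.
  def fetchStats(url: String): String = {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(URI.create(url)).GET().build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    val url = statsUrl("http://pulsar-admin.url:8080", "public", "default", "topic-name")
    println(url)
    // Uncomment once the admin endpoint is reachable from this machine:
    // println(fetchStats(url))
  }
}
```

Comparing msgBacklog for the subscription used by the Spark job against the UI would show whether the UI or the subscription cursor is the source of the odd numbers.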
To Reproduce
Steps to reproduce the behavior:
Create a Spark consumer using the following code:
import org.apache.spark.sql.SparkSession
import scala.collection.mutable

val spark = SparkSession.builder
  .appName("pulsar_streaming_test_app")
  .enableHiveSupport()
  .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

// Options passed to the Pulsar source (URLs, topic and subscription are placeholders)
val optionsMap: mutable.Map[String, String] = mutable.Map[String, String]()
optionsMap.put("service.url", "pulsar://pulsar-service.url:6650")
optionsMap.put("admin.url", "pulsar://pulsar-admin.url:8080")
optionsMap.put("pulsar.producer.batchingEnabled", "false")
optionsMap.put("topic", "topic-name")
optionsMap.put("predefinedSubscription", "existing-subscription-name")
optionsMap.put("subscriptionType", "Exclusive/Shared") // placeholder: either "Exclusive" or "Shared"
optionsMap.put("startingOffsets", "latest")

// Read from Pulsar as a stream and write it out as Parquet
val data = spark.readStream.format("pulsar").options(optionsMap).load()
data.writeStream
  .format("parquet")
  .option("checkpointLocation", "checkpoint/path")
  .option("path", "output/path")
  .start()
  .awaitTermination()
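Since "Exclusive/Shared" in the snippet stands for one value per run, the options for a single run could be built with a small helper like the one below. This is only a sketch: the object and method names are made up for illustration, the option keys are the same ones used in the code above, and the URLs are passed in rather than assumed.

```scala
object PulsarSourceOptions {
  // The two subscription types referred to by "Exclusive/Shared".
  val SupportedTypes: Set[String] = Set("Exclusive", "Shared")

  // Builds the source options for one run; fails fast if the
  // subscription type is not exactly one of the supported values.
  def build(
      serviceUrl: String,
      adminUrl: String,
      topic: String,
      subscription: String,
      subscriptionType: String): Map[String, String] = {
    require(
      SupportedTypes.contains(subscriptionType),
      s"unexpected subscriptionType: $subscriptionType")
    Map(
      "service.url" -> serviceUrl,
      "admin.url" -> adminUrl,
      "topic" -> topic,
      "predefinedSubscription" -> subscription,
      "subscriptionType" -> subscriptionType,
      "startingOffsets" -> "latest"
    )
  }
}
```

The resulting map can be passed to spark.readStream.format("pulsar").options(...) in place of optionsMap.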
There is also a side problem, not very important: Spark does not seem to create a new subscription on its own, and the job keeps failing with:
Caused by: org.apache.pulsar.client.api.PulsarClientException: {"errorMsg":"Subscription does not exist","reqId":1663032428812969942, "remote":"pulsar-broker-21/172.31.203.70:6650", "local":"/ip:46010"}
The only way I can make it work is by creating a subscription manually on the Pulsar side and using the predefinedSubscription option in Spark to latch on to that subscription.
I tried passing pulsar.reader.subscriptionName, pulsar.consumer.subscriptionName, and subscriptionName when running the job, but it failed with the same error.
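For reference, one way the manual subscription could be created is through the Pulsar Java admin client. This is a minimal sketch, not necessarily how it has to be done: it assumes the pulsar-client-admin library is on the classpath, and the http:// admin address and the "public/default" tenant and namespace are placeholders.

```scala
import org.apache.pulsar.client.admin.PulsarAdmin
import org.apache.pulsar.client.api.MessageId

object CreateSubscription {
  // Fully qualified name of a persistent topic (tenant and namespace are assumptions).
  def topicName(tenant: String, namespace: String, topic: String): String =
    s"persistent://$tenant/$namespace/$topic"

  def main(args: Array[String]): Unit = {
    val admin = PulsarAdmin.builder()
      .serviceHttpUrl("http://pulsar-admin.url:8080") // placeholder admin address
      .build()
    try {
      // Create the subscription positioned at the latest message, so the
      // Spark job can attach to it through predefinedSubscription.
      admin.topics().createSubscription(
        topicName("public", "default", "topic-name"),
        "existing-subscription-name",
        MessageId.latest)
    } finally {
      admin.close()
    }
  }
}
```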
Any help would be much appreciated.