[BUG] Seeing some backlog pending on Pulsar UI even if Spark consumer has consumed all data. #176

Open
@akshay-habbu

Description

[Disclaimer: I am fairly new to Pulsar, so I might not understand all the Pulsar details, but I have been using Spark for a while now.]
I am using an Apache Spark consumer to read data from Pulsar on AWS EMR, via the StreamNative pulsar-spark connector.
My version stack:
Spark version: 3.4.1
Pulsar version: 2.10.0.7
StreamNative connector: pulsar-spark-connector_2.12-3.4.0.3.jar

I have created a new Pulsar topic and started a fresh Spark consumer on that topic. The consumer is able to connect to the topic and consume messages correctly; the only issue is the backlog numbers displayed on the Pulsar admin UI.
[Screenshot (2024-04-03): Pulsar admin UI showing pending backlog for the topic]
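
For reference, the same backlog numbers can also be read outside the UI with the Pulsar admin client. A rough sketch (the admin host and the tenant/namespace/topic name below are placeholders for my setup):

import org.apache.pulsar.client.admin.PulsarAdmin
import scala.collection.JavaConverters._

// Print the per-subscription backlog for the topic via the admin HTTP endpoint
val admin = PulsarAdmin.builder()
  .serviceHttpUrl("http://pulsar-admin.url:8080")
  .build()

val stats = admin.topics().getStats("persistent://tenant/namespace/topic-name")
stats.getSubscriptions.asScala.foreach { case (sub, subStats) =>
  println(s"subscription=$sub msgBacklog=${subStats.getMsgBacklog}")
}
admin.close()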

To Reproduce
Steps to reproduce the behavior:
Create a Spark consumer using the following code:

import scala.collection.mutable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("pulsar_streaming_test_app")
  .enableHiveSupport()
  .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

// Connector options: binary protocol for the service URL, HTTP for the admin URL
val optionsMap: mutable.Map[String, String] = mutable.Map[String, String]()
optionsMap.put("service.url", "pulsar://pulsar-service.url:6650")
optionsMap.put("admin.url", "http://pulsar-admin.url:8080") // admin endpoint is served over HTTP on 8080, not pulsar://
optionsMap.put("pulsar.producer.batchingEnabled", "false")
optionsMap.put("topic", "topic-name")
optionsMap.put("predefinedSubscription", "existing-subscription-name")
optionsMap.put("subscriptionType", "Exclusive/Shared") // tried both Exclusive and Shared
optionsMap.put("startingOffsets", "latest")

val data = spark.readStream.format("pulsar").options(optionsMap).load()

data.writeStream
  .format("parquet")
  .option("checkpointLocation", "checkpoint/path")
  .option("path", "output/path")
  .start()
  .awaitTermination()
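
To double-check that the micro-batches really are reading rows from the topic (rather than relying on the UI), a rough sketch that polls the query progress instead of just blocking on awaitTermination():

val query = data.writeStream
  .format("parquet")
  .option("checkpointLocation", "checkpoint/path")
  .option("path", "output/path")
  .start()

// Log how many rows each micro-batch pulled from Pulsar
while (query.isActive) {
  Option(query.lastProgress).foreach { p =>
    println(s"batchId=${p.batchId} numInputRows=${p.numInputRows}")
  }
  Thread.sleep(10000)
}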

There is also a side problem, not very important: it seems Spark does not create a new subscription on its own, and the job keeps failing with

Caused by: org.apache.pulsar.client.api.PulsarClientException: {"errorMsg":"Subscription does not exist","reqId":1663032428812969942, "remote":"pulsar-broker-21/172.31.203.70:6650", "local":"/ip:46010"}

The only way I can make it work is by creating the subscription manually on the Pulsar side and using the predefinedSubscription option in Spark to latch onto that subscription.
I tried passing pulsar.reader.subscriptionName, pulsar.consumer.subscriptionName, and subscriptionName while running the job, but it failed with the same error.
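
For completeness, this is roughly the manual step I do with the Pulsar admin client before starting the job (the admin host, topic and subscription names are placeholders):

import org.apache.pulsar.client.admin.PulsarAdmin
import org.apache.pulsar.client.api.MessageId

// Pre-create the subscription at the latest position so the connector can attach to it
val admin = PulsarAdmin.builder()
  .serviceHttpUrl("http://pulsar-admin.url:8080")
  .build()

admin.topics().createSubscription(
  "persistent://tenant/namespace/topic-name",
  "existing-subscription-name",
  MessageId.latest)
admin.close()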

Any help would be much appreciated.
