Incorrect invocation order of job execution listener in remote partitioning #4133

Open
@kianjavadi

Bug description
I'm using remote partitioning with Kafka as the middleware. I have one manager and three workers. Accordingly, the manager's input topic has one partition and the workers' input topic has three partitions.

The manager takes a job, creates multiple ExecutionContexts, and sends them over Kafka. The workers process their respective steps and send a message when they finish. The manager aggregates the workers' results and completes the job once all workers are done. So far so good.

Now assume I first run a long-running job and then a small job that finishes quickly. Not surprisingly, the second job finishes sooner and sends a completed signal; the manager consumes this message and continues the process. I even checked AggregatingMessageHandler: judging by the jobExecutionId, the completed message relates only to the second (short-running) job.

Now the problem: I have a JobListener with an afterJob method. This method is run against the first job (the long-running one that is still being processed by the workers), not the second one (the short-running one for which the completed signal was sent)! I can tell by looking at the jobExecutionId. It's really weird, because I never saw a completion signal for the first job in the logs.

After some time, when the first long-running job finishes, the final worker sends a completed message and the manager decides to finish the job. Now the JobListener is run against the second job (the short-running one)!

I can't work out what is going wrong. I would like to assume it's a misconfiguration, but from debugging the code and checking AggregatingMessageHandler and the TRACE logs in the workers and the manager, I can clearly see that the messages are being sent correctly and there's nothing wrong with them.

Environment
I'm using Spring Batch with Spring Boot 2.5.6, Java 11, MySQL 8, and Kafka 3.

Steps to reproduce
Here is a sample project that you can run; instructions on how to run a job are in the readme file.
First, run a long job: POST http://localhost:28080/job?minId=10&maxId=28
Then run a short job: POST http://localhost:28080/job?minId=28&maxId=30
The shorter job finishes sooner, but although the older job is still running in the workers, the manager's logs show the JobListener being triggered for the older job.
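
For convenience, the two requests above can be issued back to back from a small Java 11 client. This is only a sketch: the URLs are the ones listed in the steps, while the class name, the `--send` flag, the empty request body, and the assumption that the endpoint returns as soon as the job is launched (rather than blocking until it completes) are mine and may not match the sample project.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of the sample project.
public class ReproduceListenerOrder {

    // Base URL taken from the reproduction steps.
    static final String BASE_URL = "http://localhost:28080/job";

    // Builds the POST request that launches a job for the given id range.
    // Assumption: the endpoint reads only the query parameters, so no body is sent.
    static HttpRequest launchRequest(int minId, int maxId) {
        URI uri = URI.create(BASE_URL + "?minId=" + minId + "&maxId=" + maxId);
        return HttpRequest.newBuilder(uri)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Long job first, then the short one, as in the steps.
        HttpRequest[] requests = { launchRequest(10, 28), launchRequest(28, 30) };
        boolean send = args.length > 0 && "--send".equals(args[0]);
        HttpClient client = HttpClient.newHttpClient();
        for (HttpRequest request : requests) {
            if (send) {
                // Requires the manager from the sample project to be running.
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(request.method() + " " + request.uri()
                        + " -> HTTP " + response.statusCode());
            } else {
                // Dry run: only show what would be sent.
                System.out.println(request.method() + " " + request.uri());
            }
        }
    }
}
```

Run it with `--send` while the manager and workers are up, then watch the manager's logs for the jobExecutionId reported by the JobListener.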

Expected behavior
Since the completion signal belongs to the second job, the JobListener must also be triggered for the second job, not the older one.
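
To make the expectation concrete, here is a plain-Java model of it. It is not Spring Batch code and says nothing about how the framework is implemented; every name in it (`CompletionCorrelationSketch`, `register`, `onCompletionSignal`) is hypothetical. It only shows the behavior I expect: a completion signal carrying a given jobExecutionId triggers the callback for that execution and no other.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongConsumer;

// Hypothetical model of the expected behavior; not Spring Batch code.
public class CompletionCorrelationSketch {

    // afterJob-style callbacks of running jobs, keyed by jobExecutionId.
    private final Map<Long, LongConsumer> pending = new HashMap<>();

    // Registers the callback to run when the given execution completes.
    void register(long jobExecutionId, LongConsumer afterJob) {
        pending.put(jobExecutionId, afterJob);
    }

    // Fires only the callback whose id matches the completion signal.
    // Returns false if no running job has that id.
    boolean onCompletionSignal(long jobExecutionId) {
        LongConsumer afterJob = pending.remove(jobExecutionId);
        if (afterJob == null) {
            return false;
        }
        afterJob.accept(jobExecutionId);
        return true;
    }

    public static void main(String[] args) {
        CompletionCorrelationSketch manager = new CompletionCorrelationSketch();
        List<Long> finished = new ArrayList<>();
        manager.register(1L, finished::add); // long-running job, started first
        manager.register(2L, finished::add); // short job, started second

        manager.onCompletionSignal(2L);      // the short job's workers finish first
        System.out.println("afterJob ran for: " + finished);

        manager.onCompletionSignal(1L);      // the long job finishes later
        System.out.println("afterJob ran for: " + finished);
    }
}
```

In the behavior I actually observe, the two ids come out swapped: the signal for the second job triggers afterJob for the first, and vice versa.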

(I also asked the same question on StackOverflow a couple of days ago. Since there has been no response there, I decided to raise it here.)
