HttpRemoteTaskRunner enhancements #18851
base: master
Conversation
Force-pushed d6dc9a2 to 6cc5303
Thanks for creating this PR, @jtuglu1! The patch seems much simpler now.
Force-pushed 6cc5303 to 0deca3a
if (taskItem.getState() == HttpRemoteTaskRunnerWorkItem.State.PENDING_WORKER_ASSIGN) {
  taskItem.revertStateFromPendingWorkerAssignToPending();
}
if (!runTaskOnWorker(taskId, workerHost)) {
A quick note: @kfaraz I was tempted to place this inside the above .compute() to avoid the race, but given it could be holding the lock for a while (and concurrent hashmap locks a range of keys, not necessarily just the key), I opted to unlock (similar to what the existing runner does). Additionally, if something like a removeWorker callback occurs, I'd prefer to update the state right away, rather than wait for an assignment timeout, etc.
Thanks! Could you add a 1-line comment on why it is okay to do this.
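The trade-off discussed above can be sketched as follows. This is a minimal illustration only, not the PR's actual code: `ComputeScope`, `revertToPendingIfNeeded`, and the stubbed assignment flag are hypothetical names. The point is that `ConcurrentHashMap.compute()` holds a bin-level lock for the duration of the lambda, so cheap state transitions go inside it while slow work (the HTTP assign call) stays outside.

```java
import java.util.concurrent.ConcurrentHashMap;

public class ComputeScope
{
  public enum State { PENDING, PENDING_WORKER_ASSIGN, RUNNING }

  // Cheap, atomic transition done inside compute(); the lock is held briefly.
  public static State revertToPendingIfNeeded(ConcurrentHashMap<String, State> tasks, String taskId)
  {
    return tasks.compute(
        taskId,
        (id, state) -> state == State.PENDING_WORKER_ASSIGN ? State.PENDING : state
    );
  }

  public static void main(String[] args)
  {
    ConcurrentHashMap<String, State> tasks = new ConcurrentHashMap<>();
    tasks.put("task-1", State.PENDING_WORKER_ASSIGN);

    // 1. Atomic state change inside the map's own locking.
    revertToPendingIfNeeded(tasks, "task-1");

    // 2. Slow operation (the real HTTP assignment) would run here, outside the lock.
    boolean assigned = false; // stubbed: no real network call in this sketch
    System.out.println(tasks.get("task-1") + " assigned=" + assigned);
  }
}
```

This mirrors the choice the author describes: unlocking before the network call, at the cost of a possible race that is then handled by reverting state in callbacks.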
throw new IAE("Invalid host and port: [%s]", colonIndex);
}
workerHost = hostAndPort.substring(0, colonIndex);
workerPort = Integer.parseInt(hostAndPort.substring(colonIndex + 1));
Code scanning / CodeQL notice: Missing catch of NumberFormatException
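One way to address the CodeQL note is to catch the `NumberFormatException` and rethrow it as a clear argument error. This is a hypothetical sketch (`HostPortParse` and `parsePort` are made-up names, and it uses `IllegalArgumentException` in place of Druid's `IAE`):

```java
public class HostPortParse
{
  // Parse the port from a "host:port" string, converting a malformed port
  // into an IllegalArgumentException instead of an uncaught NumberFormatException.
  public static int parsePort(String hostAndPort)
  {
    int colonIndex = hostAndPort.indexOf(':');
    if (colonIndex < 0) {
      throw new IllegalArgumentException("Invalid host and port: [" + hostAndPort + "]");
    }
    try {
      return Integer.parseInt(hostAndPort.substring(colonIndex + 1));
    }
    catch (NumberFormatException e) {
      throw new IllegalArgumentException("Invalid port in: [" + hostAndPort + "]", e);
    }
  }

  public static void main(String[] args)
  {
    System.out.println(parsePort("localhost:8091"));
  }
}
```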
Force-pushed 0deca3a to 006a079
Force-pushed 006a079 to f1b210a
kfaraz left a comment
Leaving a partial review, will try to finish going through the rest of the changes today.
* ZK_CLEANUP_TODO : As of 0.11.1, it is required to cleanup task status paths from ZK which are created by the
* workers to support deprecated RemoteTaskRunner. So a method "scheduleCompletedTaskStatusCleanupFromZk()" is added'
* which should be removed in the release that removes RemoteTaskRunner legacy ZK updation WorkerTaskMonitor class.
* HTTP-based distributed task scheduler that manages assignment of tasks to slots on workers (MiddleManagers).
Suggested change:
- * HTTP-based distributed task scheduler that manages assignment of tasks to slots on workers (MiddleManagers).
+ * HTTP-based distributed task scheduler that manages assignment of tasks to slots on workers (MiddleManagers or Indexers).
log.debug(
-   "Worker[%s] wrote [%s] status for task [%s] on [%s]",
+   "Worker[%s] wrote status[%s] for task[%s] on [%s]",
Suggested change:
- "Worker[%s] wrote status[%s] for task[%s] on [%s]",
+ "Worker[%s] wrote status[%s] for task[%s] on location[%s]",
private final DruidNodeDiscoveryProvider druidNodeDiscoveryProvider;
private final HttpClient httpClient;
- private final ObjectMapper smileMapper;
+ private final ObjectMapper objectMapper;
Looking at the factory class, we are still passing @Smile annotated mapper here. Let's retain the original name of this field to avoid ambiguity with the other mapper (typically injected without an annotation or with @Json).
);
}

private boolean runTaskOnWorker(
Please add a javadoc to this method, which clarifies what the return value/exception of this method means.
- true: successfully assigned
- false: failed to assign, retry later
- exception: failed to assign, mark task as completed
I wonder if this tri-state should be captured in the return value itself rather than throwing varying exceptions like ISE, IllegalState (using Preconditions). You could return a value which has a success boolean, a retry boolean and a String failureReason, which can be used for alerting/task completion message.
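The suggested tri-state return value could look something like the sketch below. This is a hypothetical illustration of the reviewer's idea (`AssignResult` and its factory methods are invented names), not code from the PR:

```java
// A result object capturing the three outcomes of a worker assignment attempt:
// success, retryable failure, or fatal failure with a reason for alerting.
public class AssignResult
{
  public final boolean success;
  public final boolean retry;
  public final String failureReason;

  private AssignResult(boolean success, boolean retry, String failureReason)
  {
    this.success = success;
    this.retry = retry;
    this.failureReason = failureReason;
  }

  public static AssignResult ok()
  {
    return new AssignResult(true, false, null);
  }

  public static AssignResult retryLater(String reason)
  {
    return new AssignResult(false, true, reason);
  }

  public static AssignResult fail(String reason)
  {
    return new AssignResult(false, false, reason);
  }

  public static void main(String[] args)
  {
    AssignResult r = AssignResult.retryLater("worker has no free slots");
    System.out.println(r.success + " " + r.retry + " " + r.failureReason);
  }
}
```

The caller can then branch on `retry` to re-queue the task and use `failureReason` in the task-completion message, instead of distinguishing exception types.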
* Task state machine is as follows:
* 1. PENDING – Task has been submitted to the scheduler.
* 2. PENDING_ASSIGN – Task has been assigned to a worker, but has not started running yet.
* 3. EXECUTING – Task is running on a worker.
* 4. COMPLETE – Task has completed (success/fail).
* Worker state machine is as follows:
* 1. READY – Worker is online and ready to receive new tasks.
* 2. PENDING_ASSIGN – A task has been submitted to this worker, but has not started running yet.
* 3. BLACKLISTED – Worker has too many failed tasks.
* 4. LAZY – Worker has no more tasks running and has been marked as reapable by the worker auto-scaler.
This is not needed here and should be moved to the respective enum values.
Just link them here using @see WorkerHolder.State and @see HttpRemoteTaskRunnerWorkItem.State
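Moving the descriptions onto the enum values themselves could look like this sketch. It is a hypothetical simplification (`TaskStates` is an invented container class, and the state names are taken from the javadoc quoted above), showing each value carrying its own javadoc so the class-level comment can just use `@see`:

```java
public class TaskStates
{
  /**
   * States of a task in the runner. Class-level docs elsewhere can link here
   * with a {@code @see} tag instead of duplicating the state machine.
   */
  public enum State
  {
    /** Task has been submitted to the scheduler. */
    PENDING,
    /** Task has been assigned to a worker, but has not started running yet. */
    PENDING_WORKER_ASSIGN,
    /** Task is running on a worker. */
    EXECUTING,
    /** Task has completed (success or failure). */
    COMPLETE
  }

  public static void main(String[] args)
  {
    System.out.println(State.values().length);
  }
}
```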
"The state of the new task complete event is different from its last known state. "
+ "New state[%s], last known state[%s]",
Suggested change:
- "The state of the new task complete event is different from its last known state. "
- + "New state[%s], last known state[%s]",
+ "Ignoring update to status[%s] as task[%s] has already completed with status[%s].",
log.info(
    "Worker[%s] completed task[%s] with status[%s]",
    workerHolder.getWorker().getHost(),
    taskStatus.getId(),
    taskStatus.getStatusCode()
);
Nit: Can be 1-lined, I feel.
Suggested change:
- log.info(
-     "Worker[%s] completed task[%s] with status[%s]",
-     workerHolder.getWorker().getHost(),
-     taskStatus.getId(),
-     taskStatus.getStatusCode()
- );
+ log.info("Worker[%s] completed task[%s] with status[%s].", workerHost, taskId, taskStatus);
Also, should we keep this log line outside the compute? It doesn't need to be inside the compute block.
While I think that would help moderately with performance, since we're using key-based locking it's not likely to help that much. I can do it, but it's also a nice way (in the logs) to validate that nothing else was modifying this simultaneously.
workerHolder.setLastCompletedTaskTime(DateTimes.nowUtc());
blacklistWorkerIfNeeded(taskStatus, workerHolder);
} else {
  log.warn("Could not find worker[%s]", workerHost);
Is this really a warning? It is possible that the task finished on the worker and then the worker went away.
I guess it's not unexpected behavior, but it's certainly not normal behavior (these are MMs, not Peons). I can change to info.
Thanks! Could you also update the log message to include more info e.g. Could not find worker[%s] while marking task[%s] as complete?
- if (workerHolder != null) {
-   blacklistWorkerIfNeeded(taskStatus, workerHolder);
+ if (workerHost != null) {
+   synchronized (workerStateLock) {
Doesn't seem like we need to acquire this lock.
I use the workerStateLock as a global synchronization point for the worker state. This ensures that the scheduling loop doesn't accidentally schedule a task on a worker that is being killed, has just been scheduled, etc. It also helps notify sleeping scheduling threads that potential slots are ready due to a worker state change. In this case, a worker has completed a task; this frees up a slot, so we should notify interested parties.
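The wait/notify pattern described here can be sketched as follows. This is a minimal, hypothetical illustration (the `SlotNotify` class, `freeSlots` counter, and single scheduler thread are invented for the example; the real runner tracks richer worker state):

```java
public class SlotNotify
{
  // Global synchronization point for worker state, as described above.
  public static final Object workerStateLock = new Object();
  public static int freeSlots = 0;

  public static void main(String[] args) throws InterruptedException
  {
    // A scheduling thread sleeps on the lock until a slot frees up.
    Thread scheduler = new Thread(() -> {
      synchronized (workerStateLock) {
        while (freeSlots == 0) {
          try {
            workerStateLock.wait();
          }
          catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
        freeSlots--; // claim the slot while still holding the lock
      }
      System.out.println("scheduled");
    });
    scheduler.start();

    // A task completes on some worker, freeing a slot: notify interested parties.
    synchronized (workerStateLock) {
      freeSlots++;
      workerStateLock.notifyAll();
    }
    scheduler.join();
  }
}
```

The `while` loop (rather than a bare `if`) guards against spurious wakeups, and the slot count is only read or written while holding the lock.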
workers.forEach((workerHost, workerEntry) -> {
  try {
    workerEntry.waitForInitialization();
  }
  catch (InterruptedException e) {
    Thread.currentThread().interrupt();
  }
});
log.info("Workers have synced state successfully.");
I don't think changing from the regular for loop to a forEach makes much of a difference. It is still possible for new items to get added to the workers map after we started the forEach iteration.
Actually, with this one, by acquiring a key reference while iterating through the map (instead of its keySet), we also get a synchronized scope on the value to do what we want. I found this to be a better model than looping through a keySet and then indexing with those (potentially stale) keys.
Oh, I see.
Can that ever cause a deadlock?
Say a worker is yet to be initialized and a thread is waiting on waitForInitialization() while holding the lock for the corresponding key in the map. Will any other thread be able to mark this worker as "initialized" without acquiring that lock?
> I found this to be a better model than looping through a keySet, and then indexing with those (potentially stale) keys.
Hmm, I was hoping we use .keySet() followed by .compute() itself and move the .compute() part into a separate method (similar to how TaskQueue does it). We could use that method wherever we add/update the entry for a worker so that we always perform actions using a compute() ensuring thread safety.
What do you think?
Also, .forEach() doesn't seem to be mutually exclusive with .compute() on the same key.
Could you please double check this?
Finished going through the bulk of the changes.
On the whole, the patch looks good. I have these major suggestions:
- For the time being, it would be cleaner to use workerStateLock consistently whenever accessing the workers map. We can try to improve this later.
- Avoid use of .forEach() and use .compute() instead, preferably encasing it in an addOrUpdate method similar to TaskQueue.
- Do not perform any heavy operation like metadata store access, metric emission, listener notification, etc. inside the .compute() lambda.
- Avoid throwing exceptions inside the lambda, if they are just to be caught back in the same method/loop. Instead, log an error and continue with the loop.
- Remove the priority scheduling changes for now.
- Reduce debug logging.
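The TaskQueue-style `addOrUpdate` suggestion could look roughly like this sketch. It is a hypothetical illustration (`WorkerRegistry`, `addOrUpdateWorker`, and the simplified `WorkerHolder` are invented names): all mutations of the workers map funnel through one method that uses `compute()`, so updates to a given key are serialized by the map itself.

```java
import java.util.concurrent.ConcurrentHashMap;

public class WorkerRegistry
{
  public static class WorkerHolder
  {
    public volatile boolean initialized;
  }

  private final ConcurrentHashMap<String, WorkerHolder> workers = new ConcurrentHashMap<>();

  // Single entry point for adding or updating a worker entry; all per-worker
  // state changes happen inside the compute() lambda for thread safety.
  public WorkerHolder addOrUpdateWorker(String host)
  {
    return workers.compute(host, (key, existing) -> {
      WorkerHolder holder = existing == null ? new WorkerHolder() : existing;
      holder.initialized = true;
      return holder;
    });
  }

  public static void main(String[] args)
  {
    WorkerRegistry registry = new WorkerRegistry();
    WorkerHolder h1 = registry.addOrUpdateWorker("host:8091");
    WorkerHolder h2 = registry.addOrUpdateWorker("host:8091");
    System.out.println((h1 == h2) + " " + h2.initialized);
  }
}
```

Per the review's other points, anything slow (metadata store access, metric emission, listener notification) would still be done outside the lambda.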
cancelWorkerCleanup(worker.getHost());

// There cannot be any new tasks assigned to this worker as the entry has not been published yet.
// That being said, there can be callbacks in taskAddedOrUpdated() where some task suddenly begins running
Suggested change:
- // That being said, there can be callbacks in taskAddedOrUpdated() where some task suddenly begins running
+ // However, there can be callbacks in taskAddedOrUpdated() where some task suddenly begins running
log.info("Adding worker[%s]", worker.getHost());
synchronized (workerStateLock) {
  workers.compute(
      worker.getHost(), (key, workerEntry) -> {
Nit:
Suggested change:
- worker.getHost(), (key, workerEntry) -> {
+ worker.getHost(),
+ (key, workerEntry) -> {
// It might be a worker that existed before, temporarily went away and came back. We might have a set of
// tasks that we think are running on this worker. Provide that information to WorkerHolder that
// manages the task syncing with that worker.
tasks.forEach((taskId, taskEntry) -> {
Iterating through all tasks here seems problematic. The original code was doing this too, but now with the introduction of two separate locks and two separate concurrent hash maps, we need to take extra care that we don't end up in a deadlock.
Maybe build the expected announcements before acquiring the workerStateLock, since the announcements deal with only task state and not worker state.
// All discovered workers, "host:port" -> WorkerHolder
- private final ConcurrentMap<String, WorkerHolder> workers = new ConcurrentHashMap<>();
+ private final ConcurrentHashMap<String, WorkerHolder> workers = new ConcurrentHashMap<>();
This should probably be @GuardedBy("workerStateLock") since it is always meant to be accessed within that lock.
tasksToFail.add(e.getValue());
}
final Set<HttpRemoteTaskRunnerWorkItem> tasksToFail = new HashSet<>();
tasks.forEach((taskId, taskEntry) -> {
Note that .forEach() is not mutually exclusive with .compute() and essentially just provides iteration over a snapshot. But I suppose it is okay here?
setStateUnconditionally(state);
}

public void revertStateFromPendingWorkerAssignToPending()
I suppose we don't need this anymore since we simply add the task item back to the queue, right?
}
if (!runTaskOnWorker(taskId, workerHost)) {
  log.warn("Failed to assign task[%s] to worker[%s]. Sending to back of queue", taskId, workerHost);
  pendingTasks.put(taskItem.withFreshSequenceNumber());
The task entry in the tasks map would now have state PENDING_WORKER_ASSIGN. Do we need to revert it back to PENDING (same as old code)?
tasks.compute(
    taskId,
    (key, taskEntry) -> {
      if (taskEntry == null) {
Can we move this entire lambda into a new method?
(key, taskEntry) -> {
  if (taskEntry == null) {
    // Try to find information about it in the TaskStorage
    Optional<TaskStatus> knownStatusInStorage = taskStorage.getStatus(taskId);
Maybe perform this before the .compute()?
final ServiceMetricEvent.Builder metricBuilder = new ServiceMetricEvent.Builder();
IndexTaskUtils.setTaskDimensions(metricBuilder, taskEntry.getTask());
emitter.emit(metricBuilder.setMetric(
Let's do metric emission, listener notification, etc. outside the .compute() block.
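One way to move emission outside the lambda is to collect the events inside `compute()` and emit them after the lock is released. This is a hypothetical sketch (`DeferredEmit`, `completeTask`, and the string-based state are invented; the real code uses `ServiceMetricEvent` and an emitter):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class DeferredEmit
{
  // Transition the task to COMPLETE inside compute(), but only record what
  // needs emitting; the (potentially slow) emission happens outside the lock.
  public static List<String> completeTask(ConcurrentHashMap<String, String> tasks, String taskId)
  {
    List<String> eventsToEmit = new ArrayList<>();
    tasks.compute(taskId, (id, state) -> {
      eventsToEmit.add("task/run/time for " + id); // cheap bookkeeping only
      return "COMPLETE";
    });
    return eventsToEmit;
  }

  public static void main(String[] args)
  {
    ConcurrentHashMap<String, String> tasks = new ConcurrentHashMap<>();
    tasks.put("task-1", "RUNNING");

    // Emission runs after compute() has returned and released its lock.
    for (String event : completeTask(tasks, "task-1")) {
      System.out.println("emit: " + event);
    }
  }
}
```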
Description
Draft. Clone of #18729 but merged into current runner per @kfaraz request.
I've seen the giant lock in HttpRemoteTaskRunner cause severe performance degradation under heavy load (200-500ms per acquisition with 1000s of active tasks can slow down the startPendingTasks loop in TaskQueue). This leads to scheduling delays, which leads to more lag, which auto-scales more tasks, etc. The runner also has a few (un)documented races abundant in the code. This overhead also slows down query tasks under load (e.g. MSQE and others) which utilize the scheduler for execution. I'm attempting a rewrite of this class to optimize for throughput and safety.
Apart from the performance improvements/bug fixes, this will also include some new features:
I would ultimately like to make this the default HttpRemoteTaskRunner and have it run in all tests/production clusters, etc., as I think that would help catch more bugs/issues.

Performance Testing
Test results thus far have shown ~100-300ms speedup per task runner operation (add(), etc.). Over 1000s of tasks, this amounts to minutes of delay saved.

Release note
Key changed/added classes in this PR

MyFoo
OurBar
TheirBaz

This PR has: