Optimize Process list by batching #6831
Another optimization to inspect in general: when filtering by task name and state, the constructed query involves joining a potentially very large task table:
SELECT process
FROM Process AS process
INNER JOIN process.tasks AS task
WITH task.processingStatus = :queryObject
AND task.title = :userFilter2
WHERE process.project.client.id = :sessionClientId
AND process.id NOT IN (:id)
AND process.id IN (:userFilter1query1)
AND process.id IN (:userFilter1query2)
AND (process.sortHelperStatus IS NULL OR process.sortHelperStatus != :completedState)
AND process.project.id IN (:projectIDs)
ORDER BY process.id ASC
This query is built by the logic defined in kitodo-production/Kitodo/src/main/java/org/kitodo/production/services/data/FilterField.java (lines 45 to 48 in 75ed87a).
I think we can employ EXISTS queries for tasks as well, which are more efficient. We only want to answer the question of whether a process has tasks with those attributes or not, so the query could be something like this:
SELECT process
FROM Process AS process
WHERE process.project.client.id = :sessionClientId
AND process.id NOT IN (:id)
AND process.id IN (:userFilter1query1)
AND process.id IN (:userFilter1query2)
AND (process.sortHelperStatus IS NULL
OR process.sortHelperStatus != :completedState)
AND process.project.id IN (:projectIDs)
AND EXISTS (
SELECT 1
FROM Task task
WHERE task.process = process
AND task.processingStatus = :queryObject
AND task.title = :userFilter2
)
ORDER BY process.id ASC
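To illustrate only the semantic difference between the two query shapes (not how a database would plan or execute them), here is a minimal in-memory sketch. The class `ExistsVsJoin`, its `Task` type, and both method names are hypothetical and not part of Kitodo. A join yields one row per matching (process, task) pair, while an EXISTS-style check yields each process at most once and can stop at the first matching task.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration; none of these names exist in Kitodo.
class ExistsVsJoin {

    // Simplified stand-in for a task row.
    static final class Task {
        final int processId;
        final String title;
        final String status;

        Task(int processId, String title, String status) {
            this.processId = processId;
            this.title = title;
            this.status = status;
        }

        boolean matches(int id, String wantedTitle, String wantedStatus) {
            return processId == id && title.equals(wantedTitle) && status.equals(wantedStatus);
        }
    }

    // Join semantics: one result per matching (process, task) pair,
    // so a process with two matching tasks appears twice.
    static List<Integer> joinFilter(List<Integer> processIds, List<Task> tasks,
                                    String title, String status) {
        return processIds.stream()
                .flatMap(id -> tasks.stream()
                        .filter(t -> t.matches(id, title, status))
                        .map(t -> id))
                .collect(Collectors.toList());
    }

    // EXISTS semantics: each process appears at most once;
    // anyMatch stops scanning at the first matching task.
    static List<Integer> existsFilter(List<Integer> processIds, List<Task> tasks,
                                      String title, String status) {
        return processIds.stream()
                .filter(id -> tasks.stream().anyMatch(t -> t.matches(id, title, status)))
                .collect(Collectors.toList());
    }
}
```

Whether the EXISTS form is actually faster for the real query depends on the database, its indexes, and the data, so it would still need to be measured.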
Selecting or deselecting rows also triggers a lot of queries. The more processes are selected, the more queries are triggered. Maybe we can also cache the row data, which is retrieved anew (for all selected rows) whenever a row selection is made:
@Override
public Object getRowData() {
    Stopwatch stopwatch = new Stopwatch(this, "getRowData");
    List<Object> data = getWrappedData();
    if (isRowAvailable()) {
        return stopwatch.stop(data.get(getRowIndex()));
    } else {
        return stopwatch.stop(null);
    }
}
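The caching idea could look roughly like the following. This is a minimal sketch under assumptions: `RowDataCache` and its loader function are hypothetical stand-ins for the expensive row lookup and are not part of Kitodo or the UI framework. It also assumes a row can be identified by its index and that the cache is cleared whenever the underlying page of data changes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical helper: memoizes row data by row index so that repeated
// selection events do not trigger the expensive lookup again.
class RowDataCache<T> {
    private final Map<Integer, T> cache = new HashMap<>();
    private final IntFunction<T> loader;
    private int loadCount = 0;

    RowDataCache(IntFunction<T> loader) {
        this.loader = loader;
    }

    // Returns the cached row, loading it only on first access.
    // containsKey is used so that a null row is cached as well.
    T get(int rowIndex) {
        if (!cache.containsKey(rowIndex)) {
            loadCount++;
            cache.put(rowIndex, loader.apply(rowIndex));
        }
        return cache.get(rowIndex);
    }

    // Must be called when the page, filter, or sort order changes.
    void invalidate() {
        cache.clear();
    }

    // Number of times the loader was actually invoked (for measurement).
    int getLoadCount() {
        return loadCount;
    }
}
```

The important design question is invalidation: a stale cache would show outdated rows, so it has to be reset whenever the list is reloaded.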
This pull request tries to further optimize the process list in Kitodo.Production. The changes address bottlenecks that were identified earlier, while preserving existing behavior (see #6649 (comment)).
The linked issue identified the following bottlenecks, which all stem from executing SQL logic once for each process in the list (100 times for a list of maximum size):
tasks0_ ... WHERE process_id=?
t.processingStatus ... WITH RECURSIVE process_children
process0_.id ... parent_id=?
comments0_ ... JOIN user
batches0_.process_id IN (...)
The queries identified there can all be made more efficient by executing them only once for all processes, caching the result, and reusing the cached result for the view.
The first optimization extends an idea introduced in #5360 (see esp. #5360 (comment)). In order to recursively calculate the progress for all processes in the list (including parents) we rely on native SQL queries which are now supported by current versions of MySQL and MariaDB. The changes here go one step further and recursively calculate the progress for all processes in the list at once.
The second optimization is directed at the calculation of the task title of open/in work tasks of a process, which is used in a tooltip in the list. We can use default HQL to retrieve the information for all processes at once and cache it for reuse in the view. The same is true for identifying all processes with children, which can also be done in one batch query.
The same general pattern has also been applied in another PR to optimize the user list (#6803): calculate the values for all processes in the derived LazyBeanModel for this view and store them in a HashMap that serves as a cache and is accessed by the view.
To assess whether this actually improves performance, maybe @solth or @henning-gerhardt can give it a try.
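The batch-and-cache pattern described above can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `BatchValueCache`, its `load` and `get` methods, and the `batchQuery` function are hypothetical names, and the function stands in for a single HQL or native SQL query that returns values for all process IDs on the current page.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the batch-and-cache pattern; names are illustrative.
class BatchValueCache<V> {
    private final Map<Integer, V> valuesByProcessId = new HashMap<>();
    private final Function<List<Integer>, Map<Integer, V>> batchQuery;
    private final V defaultValue;

    BatchValueCache(Function<List<Integer>, Map<Integer, V>> batchQuery, V defaultValue) {
        this.batchQuery = batchQuery;
        this.defaultValue = defaultValue;
    }

    // Called once when a page of the list is loaded:
    // one query for all process IDs instead of one query per row.
    void load(List<Integer> processIds) {
        valuesByProcessId.clear();
        valuesByProcessId.putAll(batchQuery.apply(processIds));
    }

    // Called by the view for each row: a map lookup, no query.
    V get(int processId) {
        return valuesByProcessId.getOrDefault(processId, defaultValue);
    }
}
```

With this shape, rendering a page of 100 processes costs one call to `batchQuery` per cached value type rather than 100 separate queries.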