Collaborator

@BartChris BartChris commented Dec 19, 2025

This pull request tries to further optimize the process list in Kitodo.Production. The changes address bottlenecks that were identified earlier, while preserving existing behavior (see #6649 (comment)).

The linked issue identified the following bottlenecks, all of which stem from executing SQL logic separately for each process in the list (100 times for a list of maximum size).

| Query (simplified) | Executions | Avg (ms) | Max (ms) | Total (s) | % of DB time |
| --- | --- | --- | --- | --- | --- |
| tasks0_ ... WHERE process_id=? | 100 | 0.45 | 0.95 | 0.045 | 42% |
| t.processingStatus ... WITH RECURSIVE process_children | 100 | 0.27 | 0.49 | 0.027 | 25% |
| process0_.id ... parent_id=? | 100 | 0.19 | 0.36 | 0.019 | 18% |
| comments0_ ... JOIN user | 3 | 0.59 | 0.81 | 0.0018 | 1.7% |
| batches0_.process_id IN (...) | 3 | 0.19 | 0.27 | 0.0006 | 0.5% |
| Other queries | ~10 | | | ~0.015 | ~13% |

The queries identified there can all be made more efficient by executing them only once for all processes, caching the result and reusing the cached result for the view.

The first optimization extends an idea introduced in #5360 (see esp. #5360 (comment)). To recursively calculate the progress for all processes in the list (including parents), we rely on native SQL queries with recursive common table expressions (WITH RECURSIVE), which current versions of MySQL and MariaDB support. The changes here go one step further and calculate the progress recursively for all processes in the list in a single query.
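As a rough illustration of the "all processes at once" idea, the sketch below builds a single recursive query that is seeded with every process ID on the page and carries the originating ID (`root_id`) through the recursion, so that task counts can be grouped per listed process. The table names `process` and `task`, the class name `BatchProgressQuery`, and the choice to count a process's own tasks together with those of its descendants are assumptions for illustration; the query in this PR may differ.

```java
import java.util.Collections;

final class BatchProgressQuery {

    /**
     * Builds one native SQL statement that walks the process hierarchy for
     * all listed processes at once and counts tasks per processing status.
     * Table names and the result shape are assumptions, not the PR's query.
     */
    static String build(int rootCount) {
        if (rootCount < 1) {
            throw new IllegalArgumentException("at least one process id is required");
        }
        // One "?" placeholder per process shown on the current page.
        String placeholders = String.join(", ", Collections.nCopies(rootCount, "?"));
        return "WITH RECURSIVE process_children (root_id, id) AS ("
            // Anchor: every listed process is the root of its own subtree.
            + " SELECT p.id, p.id FROM process p WHERE p.id IN (" + placeholders + ")"
            + " UNION ALL"
            // Recursive step: add children, remembering which root they belong to.
            + " SELECT pc.root_id, c.id FROM process c"
            + " JOIN process_children pc ON c.parent_id = pc.id"
            + ") "
            + "SELECT pc.root_id, t.processingStatus, COUNT(*) AS task_count"
            + " FROM process_children pc"
            + " JOIN task t ON t.process_id = pc.id"
            + " GROUP BY pc.root_id, t.processingStatus";
    }
}
```

The result has one row per listed process and status, which can be turned into a progress value per process in Java without any further queries.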

The second optimization targets the calculation of the titles of a process's open and in-work tasks, which are shown in a tooltip in the list. Plain HQL is enough to retrieve this information for all processes at once and cache it for reuse in the view. The same holds for identifying all processes with children, which can also be done in one batch query.
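A minimal sketch of how such a batch result could be grouped per process is shown here. The HQL string, the class name `OpenTaskTitles`, and the parameter names `:processIds` and `:statuses` are hypothetical; the sketch only assumes that the query returns rows of the form `[processId, title]`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OpenTaskTitles {

    // Hypothetical HQL: one query for every process on the page instead of
    // one query per process. Parameter names are assumptions.
    static final String HQL = "SELECT task.process.id, task.title FROM Task task"
        + " WHERE task.process.id IN (:processIds)"
        + " AND task.processingStatus IN (:statuses)";

    /** Groups result rows of the form [processId, title] by process id. */
    static Map<Integer, List<String>> groupByProcess(List<Object[]> rows) {
        Map<Integer, List<String>> titles = new HashMap<>();
        for (Object[] row : rows) {
            Integer processId = (Integer) row[0];
            String title = (String) row[1];
            // Create the list on first sight of a process, then append.
            titles.computeIfAbsent(processId, id -> new ArrayList<>()).add(title);
        }
        return titles;
    }
}
```

Processes without matching tasks simply have no entry in the map, so the view needs a fallback such as an empty list.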

The same general pattern has also been applied in another PR to optimize the user list (#6803): the values for all processes are calculated in the derived LazyBeanModel for this view and stored in a HashMap, which serves as a cache that the view reads from.
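The pattern can be sketched roughly as follows. `PageValueCache` and its methods are hypothetical names, not classes from Kitodo.Production; the batch loader stands in for whatever single query computes the values for all rows of the current page.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class PageValueCache<K, V> {

    // Stand-in for one batch query that returns a value per key.
    private final Function<Collection<K>, Map<K, V>> batchLoader;
    private final Map<K, V> cache = new HashMap<>();

    PageValueCache(Function<Collection<K>, Map<K, V>> batchLoader) {
        this.batchLoader = batchLoader;
    }

    /** Called once per page load: replaces the cache with values for the given keys. */
    void load(Collection<K> keysOnPage) {
        cache.clear();
        if (!keysOnPage.isEmpty()) {
            cache.putAll(batchLoader.apply(keysOnPage));
        }
    }

    /** Called by the view for each row: a map lookup, no database access. */
    V get(K key, V fallback) {
        return cache.getOrDefault(key, fallback);
    }
}
```

With this shape, the number of loader calls depends on the number of page loads rather than on the number of rows rendered.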

To assess whether this actually improves performance, maybe @solth or @henning-gerhardt could give it a try.

@BartChris BartChris force-pushed the process_list_batching branch 14 times, most recently from c686e82 to 34f2e4f Compare December 21, 2025 02:11

BartChris commented Dec 22, 2025

Another optimization to look into more generally: when filtering for tasks and their state, we join the task table, which is probably not strictly necessary.

When filtering by task name and state, the constructed query joins a potentially very large task table and usually looks like this:

```sql
SELECT process
FROM Process AS process
INNER JOIN process.tasks AS task
  WITH task.processingStatus = :queryObject
  AND task.title = :userFilter2
WHERE process.project.client.id = :sessionClientId
  AND process.id NOT IN (:id)
  AND process.id IN (:userFilter1query1)
  AND process.id IN (:userFilter1query2)
  AND (process.sortHelperStatus IS NULL OR process.sortHelperStatus != :completedState)
  AND process.project.id IN (:projectIDs)
ORDER BY process.id ASC
```

based on the logic defined here.

```java
TASK_READY("tasks AS task WITH task.processingStatus = :queryObject AND task.title",
        "~.processingStatus = :queryObject AND ~.title", LikeSearch.NO,
        "tasks AS task WITH task.processingStatus = :queryObject AND task.id",
        "processingStatus = :queryObject AND id", TaskStatus.OPEN, null, -1),
```

I think we can use EXISTS queries for tasks as well, which are typically more efficient. We only want to answer the question of whether a process has tasks with those attributes or not, so the query could look something like this:

```sql
SELECT process
FROM Process AS process
WHERE process.project.client.id = :sessionClientId
  AND process.id NOT IN (:id)
  AND process.id IN (:userFilter1query1)
  AND process.id IN (:userFilter1query2)
  AND (process.sortHelperStatus IS NULL
       OR process.sortHelperStatus != :completedState)
  AND process.project.id IN (:projectIDs)
  AND EXISTS (
      SELECT 1
      FROM Task task
      WHERE task.process = process
        AND task.processingStatus = :queryObject
        AND task.title = :userFilter2
  )
ORDER BY process.id ASC
```
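To make the difference in semantics concrete, here is a small in-memory sketch in Java. It is not database code, and all names in it (`JoinVersusExists`, the nested `Task` stand-in) are hypothetical. It only illustrates that a join produces one result per matching process/task pair, while an EXISTS check yields each process at most once and can stop at the first matching task.

```java
import java.util.ArrayList;
import java.util.List;

final class JoinVersusExists {

    /** Minimal stand-in for a task row: owning process id, title, status. */
    static final class Task {
        final int processId;
        final String title;
        final String status;

        Task(int processId, String title, String status) {
            this.processId = processId;
            this.title = title;
            this.status = status;
        }

        boolean matches(int id, String wantedTitle, String wantedStatus) {
            return processId == id && title.equals(wantedTitle) && status.equals(wantedStatus);
        }
    }

    /** Join semantics: one result per matching (process, task) pair. */
    static List<Integer> innerJoin(List<Integer> processIds, List<Task> tasks,
            String title, String status) {
        List<Integer> result = new ArrayList<>();
        for (Integer processId : processIds) {
            for (Task task : tasks) {
                if (task.matches(processId, title, status)) {
                    result.add(processId);
                }
            }
        }
        return result;
    }

    /** EXISTS semantics: each process at most once; anyMatch stops at the first hit. */
    static List<Integer> exists(List<Integer> processIds, List<Task> tasks,
            String title, String status) {
        List<Integer> result = new ArrayList<>();
        for (Integer processId : processIds) {
            if (tasks.stream().anyMatch(task -> task.matches(processId, title, status))) {
                result.add(processId);
            }
        }
        return result;
    }
}
```

If a process has several tasks with the same title and status, the join variant returns it several times, whereas the EXISTS variant does not. Whether the database actually executes the EXISTS form faster depends on its query planner and indexes, so this would need to be measured.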

@BartChris BartChris force-pushed the process_list_batching branch 2 times, most recently from f6728a2 to 6c133ff Compare December 23, 2025 10:18
@BartChris BartChris force-pushed the process_list_batching branch from 6c133ff to d52fe8c Compare December 23, 2025 11:21

BartChris commented Dec 29, 2025

Selecting or deselecting rows also triggers a lot of queries. The more processes are selected, the more queries are triggered. Maybe we can also cache the row data, which is retrieved anew (for all selected rows) whenever a row selection is made:

```java
@Override
public Object getRowData() {
    Stopwatch stopwatch = new Stopwatch(this, "getRowData");
    List<Object> data = getWrappedData();
    if (isRowAvailable()) {
        return stopwatch.stop(data.get(getRowIndex()));
    } else {
        return stopwatch.stop(null);
    }
}
```
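One possible shape for such a cache is sketched here. `RowDataCache` is a hypothetical helper, not an existing class, and the loader function stands in for whatever lookup currently triggers the queries. The sketch assumes that rows can be identified by a string key and that the cache is cleared whenever the page or filter changes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class RowDataCache<T> {

    // Stand-in for the (possibly expensive) lookup of a single row.
    private final Function<String, T> loader;
    private final Map<String, T> rowsByKey = new HashMap<>();

    RowDataCache(Function<String, T> loader) {
        this.loader = loader;
    }

    /** Returns the cached row, loading it only on first access. */
    T get(String rowKey) {
        return rowsByKey.computeIfAbsent(rowKey, loader);
    }

    /** Must be called when the underlying page or filter changes. */
    void invalidate() {
        rowsByKey.clear();
    }
}
```

With this approach, repeated selection events for already selected rows would hit the map instead of the database. The trade-off is that stale rows can be shown if `invalidate()` is not called at the right points.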
