[edit: the very last paragraph appears to point in the right direction; see the next comment also]
Investigating this harvesting performance issue in IQSS prod.:
Harvesting a new dataset is generally fast (a fraction of a second per dataset);
Re-harvesting that same dataset is much slower (tens of times worse).
This must somehow be a function of the overall database size, since it's not observable with small test databases.
It is reproducible on the perf. system with a clone of the prod. database (even slower there).
It does not appear to be a function of the number of datasets in the collection (the slowdown is observable even in a collection with very few harvested datasets).
I made a PR a year ago specifically to make re-harvesting cheaper. It is entirely possible that something I did there backfired and made things worse.
One extra confusing detail: it also takes a very long time to delete harvested datasets when deleting a client (it appears to take exactly as long to delete these datasets as it does to re-harvest/update them). Harvested datasets are removed via a cascade in that scenario; I did not touch that part in 10836.
I don't know yet for sure whether the issue is in the database update/delete or in the Solr reindexing. But, FWIW, when deleting a client, the dataset cards disappear from the collection almost instantly, and then it takes a very long time for the dataset objects to disappear from the db.
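One hypothesis consistent with the symptoms above (cost growing with overall database size, even for tiny collections) is an unindexed foreign-key column on one of the child tables involved in the cascade: each parent-row delete then forces a scan of the entire child table. This is not Dataverse code, just a minimal self-contained sketch of that effect using sqlite3; the table names and sizes are made up for illustration:

```python
# Toy demonstration (NOT Dataverse code): cascade deletes against a
# child table whose foreign-key column has no index force a scan of
# the child table per deleted parent row, so total delete time grows
# with overall table size rather than with the number of rows deleted.
import sqlite3
import time


def timed_cascade_delete(index_fk: bool, n_parents: int = 200,
                         n_children: int = 20000) -> float:
    """Delete all parents (cascading to children); return elapsed seconds."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # per-connection in SQLite
    con.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
    con.execute(
        "CREATE TABLE child (id INTEGER PRIMARY KEY, "
        "parent_id INTEGER REFERENCES parent(id) ON DELETE CASCADE)"
    )
    con.executemany("INSERT INTO parent VALUES (?)",
                    [(i,) for i in range(n_parents)])
    con.executemany("INSERT INTO child (parent_id) VALUES (?)",
                    [(i % n_parents,) for i in range(n_children)])
    if index_fk:
        # With this index, each cascade becomes an index lookup
        # instead of a full scan of child.
        con.execute("CREATE INDEX idx_child_parent ON child(parent_id)")
    con.commit()

    start = time.perf_counter()
    con.execute("DELETE FROM parent")  # cascades to all child rows
    con.commit()
    return time.perf_counter() - start


slow = timed_cascade_delete(index_fk=False)
fast = timed_cascade_delete(index_fk=True)
print(f"no FK index: {slow:.4f}s   with FK index: {fast:.4f}s")
```

The unindexed run does on the order of n_parents × n_children row comparisons, while the indexed run touches each child row once; if something similar is happening in prod., it would also explain why the delete takes as long as the re-harvest (both would be paying the same per-parent scan cost).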
I'm looking into this since this is an effective blocker for a few large remote collections that need to be re-harvested.
(To be clear, I have no solid evidence yet that this is unique to harvested objects; it's just that with harvesting you often have real-life cases where you need to update large numbers of datasets - hundreds or even thousands - all at once.)