-
Notifications
You must be signed in to change notification settings - Fork 14
feat(services): Add a handler for orphaned snippets #2688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(services): Add a handler for orphaned snippets #2688
Conversation
Introduce a periodic cleanup mechanism to remove snippets that are no longer associated with any SnippetFindings. This helps prevent unnecessary data accumulation and keeps the database clean. Signed-off-by: Wolfgang Klenk <[email protected]>
The purpose of this feature is to clean up the whole data structures related to snippets. This includes snippet findings that are no longer associated with an ORT run. I assume, in its current form, the task won't delete anything because under normal circumstances, there are no snippets not associated with a snippet finding. Or am I missing something? At least, I cannot find any DELETE ON CASCADE constraints on tables related to snippets. |
Hello @oheger-bosch ,
I don't know if I understand what you mean, but in the Bosch production environment, this will delete > 3.000.000 snippets. SELECT COUNT(snippets.id)
FROM ort_server.snippets
WHERE NOT EXISTS (
SELECT 1
FROM ort_server.snippet_findings_snippets
WHERE snippet_findings_snippets.snippet_id = snippets.id
)
Returns 3372947
SELECT COUNT(s.id)
FROM ort_server.snippets s
LEFT JOIN ort_server.snippet_findings_snippets sfs on sfs.snippet_id = s.id
WHERE sfs.snippet_id IS NULL
Returns 3372947
As the SNIPPET_FINDINGS_SNIPPETS is a n:m intermediate table, entries in this table should be deleted whenever a row from SNIPPET_FINDINGS is deleted. But you cannot "automatically" also delete the SNIPPETS. This is the reasons why these orphans get into existence. But I think I probably got you wrong. How to continue? Do you think there is something missing? |
Hi @wkl3nk, it may well be the case that I miss something. But my assumption was that the whole data structures related to snippets and snippet findings are not yet handled when it comes to data retention. Is this correct? If so, I do not understand why there can be snippets not assigned to snippet findings. Maybe there was some manual cleanup done? But in normal operation, one would have to delete snippet findings first before the snippets can be removed, right? |
@oheger-bosch I see clearer now:
This seems correct: Snippet findings, like copyright findings and license findings act as kind of cache for the scanner and are intentionally not deleted after a period of time.
I agree, there must have taken place a manual clean up in our database.
Yes. Currently the snippet findings are seen as part of the cache. But as long as this cache is not removed after some time, there should come no snippet orphans into existence. Alt least this PR will clean up about 3.000.000 of total around 17.000.000 snippets. |
caching is handled differently between copyright and snippet scanners. The cache is mainly used for results from ScanCode to circumvent the long scanning times of this scanner. For FossID in contrast, scan results are not cached, but the scanner is triggered for each run that is configured to use FossID. (Whether a scanner uses the scan storage for caching or not, is determined by a property in the implementation class.) So, for FossID, this means that corresponding findings could be deleted together with their runs. They are not used or referenced elsewhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with merging this; I think, this can still be part of a solution that fully addresses the problem of orphaned snippet data.
Note, there is still the related PR #2146. Any idea how to continue with this one? |
Introduce a periodic cleanup mechanism to remove Snippets that are no longer associated with any SnippetFindings. This helps prevent unnecessary data accumulation and keeps the database clean. Being able to define a
limit
and achunkSize
allows to make sure that there is only little risk that the memory blows up when orphan snippets are looked up and that database limits are reached in the delete statement concerning the maximal allowed length of a SQL statement and the maximum number of operations allowed in a single database transaction.Looking at the tests, Exposed creates the following SQL statements to find Snippet orphans (tests have a limit setting of 10):
and for deletion (in a chunked way):