Skip to content

feat(services): Add a handler for orphaned snippets #2688

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

wkl3nk
Copy link
Contributor

@wkl3nk wkl3nk commented May 9, 2025

Introduce a periodic cleanup mechanism to remove Snippets that are no longer associated with any SnippetFindings. This helps prevent unnecessary data accumulation and keeps the database clean. Being able to define a limit and a chunkSize allows to make sure that there is only little risk that the memory blows up when orphan snippets are looked up and that database limits are reached in the delete statement concerning the maximal allowed length of a SQL statement and the maximum number of operations allowed in a single database transaction.

Looking at the tests, Exposed creates the following SQL statements to find Snippet orphans (tests have a limit setting of 10):

SELECT snippets.id FROM snippets WHERE NOT EXISTS (
   SELECT 1 FROM snippet_findings_snippets WHERE snippet_findings_snippets.snippet_id = snippets.id
) LIMIT 10"

and for deletion (in a chunked way):


DELETE FROM snippets WHERE snippets.id IN (101, 102, 103, 104, 105)

Introduce a periodic cleanup mechanism to remove snippets that are
no longer associated with any SnippetFindings. This helps prevent
unnecessary data accumulation and keeps the database clean.

Signed-off-by: Wolfgang Klenk <[email protected]>
@oheger-bosch
Copy link
Contributor

The purpose of this feature is to clean up the whole data structures related to snippets. This includes snippet findings that are no longer associated with an ORT run. I assume, in its current form, the task won't delete anything because under normal circumstances, there are no snippets not associated with a snippet finding. Or am I missing something? At least, I cannot find any DELETE ON CASCADE constraints on tables related to snippets.

@wkl3nk
Copy link
Contributor Author

wkl3nk commented May 12, 2025

Hello @oheger-bosch ,

in its current form, the task won't delete anything because under normal circumstances, there are no snippets not associated with a snippet finding.

I don't know if I understand what you mean, but in the Bosch production environment, this will delete > 3.000.000 snippets.
You can cross check if one of these SQL statements:

SELECT COUNT(snippets.id) 
FROM ort_server.snippets 
WHERE NOT EXISTS (
    SELECT 1 
    FROM ort_server.snippet_findings_snippets 
    WHERE snippet_findings_snippets.snippet_id = snippets.id
) 

Returns 3372947

SELECT COUNT(s.id) 
FROM ort_server.snippets s
LEFT JOIN ort_server.snippet_findings_snippets sfs on sfs.snippet_id = s.id
WHERE sfs.snippet_id IS NULL

Returns 3372947

At least, I cannot find any DELETE ON CASCADE constraints on tables related to snippets.

As the SNIPPET_FINDINGS_SNIPPETS is a n:m intermediate table, entries in this table should be deleted whenever a row from SNIPPET_FINDINGS is deleted. But you cannot "automatically" also delete the SNIPPETS. This is the reasons why these orphans get into existence. But I think I probably got you wrong.

How to continue? Do you think there is something missing?

@oheger-bosch
Copy link
Contributor

Hi @wkl3nk,

it may well be the case that I miss something. But my assumption was that the whole data structures related to snippets and snippet findings are not yet handled when it comes to data retention. Is this correct? If so, I do not understand why there can be snippets not assigned to snippet findings. Maybe there was some manual cleanup done?

But in normal operation, one would have to delete snippet findings first before the snippets can be removed, right?

@wkl3nk
Copy link
Contributor Author

wkl3nk commented May 12, 2025

@oheger-bosch I see clearer now:

But my assumption was that the whole data structures related to snippets and snippet findings are not yet handled when it comes to data retention. Is this correct?

This seems correct: Snippet findings, like copyright findings and license findings act as kind of cache for the scanner and are intentionally not deleted after a period of time.

If so, I do not understand why there can be snippets not assigned to snippet findings. Maybe there was some manual cleanup done?

I agree, there must have taken place a manual clean up in our database.

But in normal operation, one would have to delete snippet findings first before the snippets can be removed, right?

Yes. Currently the snippet findings are seen as part of the cache. But as long as this cache is not removed after some time, there should come no snippet orphans into existence.

Alt least this PR will clean up about 3.000.000 of total around 17.000.000 snippets.

@oheger-bosch
Copy link
Contributor

@wkl3nk,

caching is handled differently between copyright and snippet scanners. The cache is mainly used for results from ScanCode to circumvent the long scanning times of this scanner. For FossID in contrast, scan results are not cached, but the scanner is triggered for each run that is configured to use FossID. (Whether a scanner uses the scan storage for caching or not, is determined by a property in the implementation class.)

So, for FossID, this means that corresponding findings could be deleted together with their runs. They are not used or referenced elsewhere.

Copy link
Contributor

@oheger-bosch oheger-bosch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with merging this; I think, this can still be part of a solution that fully addresses the problem of orphaned snippet data.

@oheger-bosch
Copy link
Contributor

Note, there is still the related PR #2146. Any idea how to continue with this one?

@wkl3nk wkl3nk added this pull request to the merge queue May 14, 2025
Merged via the queue into eclipse-apoapsis:main with commit d1ae152 May 14, 2025
30 checks passed
@wkl3nk wkl3nk deleted the wkl3nk/delete-orphan-snippets branch May 14, 2025 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants