When scanning a large repository containing a Maven build with many subprojects, I noticed that the `FileArchiver` was invoked for each subproject. Each time, the repository was cloned, an archive was created, and the archive was written to the storage.
I have created #11483 to prevent the repository from being checked out again and again for each project. However, from looking at the code in `FileArchiver`, it seems that this scenario is handled in a rather strange way:
The `FileArchiver.archive()` function creates an archive from the root of the checked-out repository. The exact content of that archive may depend on the passed-in package ID (the project ID in this case). The archive is then passed to the storage and associated with the provenance. So, for a provenance with multiple packages, IIUC, the archiver creates multiple archives which overwrite each other in the storage. This is even more problematic since the archives may have different content, depending on the package ID.
I assume there should be only a single archive per provenance; otherwise, the package ID would need to be taken into account when storing archives.
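
To illustrate the concern, here is a toy model in Python. It is not ORT code: the storage is a plain dictionary, and `create_archive`, `archive_all`, the provenance string, and the package IDs are all hypothetical names made up for this sketch. It only shows what happens when several package-dependent archives are stored under the same provenance key, compared with a key that also includes the package ID.

```python
# Toy model only: all names are hypothetical and do not mirror ORT's real classes.

def create_archive(repo_files, package_id):
    """Build a pretend archive whose content depends on the package ID."""
    return {"package": package_id, "files": sorted(repo_files)}


def archive_all(storage, provenance, package_ids, repo_files, key_by_package=False):
    """Store one archive per package, keyed by provenance (optionally plus package ID)."""
    for package_id in package_ids:
        key = (provenance, package_id) if key_by_package else provenance
        storage[key] = create_archive(repo_files, package_id)


provenance = "git+https://example.com/repo.git@abc123"
packages = ["Maven:org.example:module-a:1.0", "Maven:org.example:module-b:1.0"]
files = ["pom.xml", "LICENSE"]

# Keyed by provenance only: each write replaces the previous one.
by_provenance = {}
archive_all(by_provenance, provenance, packages, files)

# Keyed by provenance and package ID: every archive survives.
by_provenance_and_package = {}
archive_all(by_provenance_and_package, provenance, packages, files, key_by_package=True)

print(len(by_provenance))              # only the last package's archive remains
print(len(by_provenance_and_package))  # one archive per package
```

In the first case, the archive that survives is whichever package happened to be processed last, which is the behavior I suspect for the real storage.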