Compress data files to save space #94
Thanks!
The zip compression used for wheels is pretty weak indeed.
I wonder if we can do even better using lzma, which has been built in since Python 3.3?
Also, I would prefer avoiding having the compressed json in Git if at all possible... to keep proper diffs and keep the repo as small as can be.
What about this:
- modify the code to accept either the json or lzma compressed input
- add the compressed version to .gitignore
- update the build to use flot https://github.com/aboutcode-org/flot with a small prebuild script that will do the compression as part of the build
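A minimal sketch of the first and third points, assuming the data file keeps its current name and the compressed copy sits next to it with an `.xz` suffix. The function names and the `.xz` naming are hypothetical, not taken from the package, and the flot prebuild hook itself is not shown:

```python
import json
import lzma
from pathlib import Path

# Assumed file name, taken from the PR description; the ".xz" suffix is hypothetical.
INDEX_NAME = "scancode-licensedb-index.json"


def load_license_index(data_dir, basename=INDEX_NAME):
    """Load the index from the lzma-compressed copy if present, else from plain JSON."""
    data_dir = Path(data_dir)
    compressed = data_dir / (basename + ".xz")
    if compressed.exists():
        # lzma.open in text mode decompresses and decodes on the fly.
        with lzma.open(compressed, "rt", encoding="utf-8") as f:
            return json.load(f)
    with open(data_dir / basename, encoding="utf-8") as f:
        return json.load(f)


def compress_license_index(data_dir, basename=INDEX_NAME):
    """Prebuild step: write an lzma-compressed copy next to the plain JSON file."""
    data_dir = Path(data_dir)
    compressed = data_dir / (basename + ".xz")
    raw = (data_dir / basename).read_bytes()
    compressed.write_bytes(lzma.compress(raw, preset=9))
    return compressed
```

With this shape the plain JSON stays in Git for readable diffs, while the `.xz` file is generated at build time and listed in `.gitignore`.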
That all sounds reasonable, but I don't have time at the moment (this version was super easy 😅). I can try to make those changes next week, or someone else can take over.
Actually, in the context of https://discuss.python.org/t/pep-639-round-3-improving-license-clarity-with-better-package-metadata/53020/1 I think we can do better. We can build a minimal license-expression-mini wheel that would contain a subset of the license data ... say, just the essential license keys in a list of tuples with no keys.
It would be down to 23K of compressed data :)
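As a rough illustration of the "list of tuples with no keys" idea, the sketch below projects each index entry onto a fixed set of fields. The field names in `FIELDS` are assumptions about what counts as essential, and `to_minimal_rows` is a hypothetical helper, not part of the package:

```python
# Assumed "essential" fields; the actual selection would be up to the maintainers.
FIELDS = ("license_key", "spdx_license_key", "is_exception")


def to_minimal_rows(index_entries, fields=FIELDS):
    """Turn a list of dict entries into positional tuples, dropping the dict keys."""
    return [tuple(entry.get(name) for name in fields) for entry in index_entries]


# Usage with a made-up entry; real entries carry more fields than this.
sample = [
    {
        "license_key": "mit",
        "spdx_license_key": "MIT",
        "is_exception": False,
        "category": "Permissive",
    },
]
rows = to_minimal_rows(sample)
```

Dropping the repeated keys and the non-essential fields is where most of the size reduction would come from, at the cost of readers needing to know the field order.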
@jamestwebber gentle ping... do you still want to complete this?
I'm afraid not--I'm too busy right now.
@jamestwebber I have pondered this more and more, and there are users that depend on the JSON data being available as-is in the path... not great, but hey. We could instead draft a smaller data file that has just the license symbols and not much else. But then again, this may be an issue for some. The other way could be to move the data to a separate wheel and have a small and a normal version.
This PR was motivated by a discussion about PEP 639 which might recommend using this package in build tools. In that context, package size is a big concern.
The package is about 1.2 MB installed, and the majority of that is due to `scancode-licensedb-index.json`. I just gzipped the data file and modified the code appropriately to save space--the json compresses to <10% of its original size and the tests all pass.
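The gzip approach described above could look roughly like the following. This is a sketch using only the standard library, with hypothetical helper names, and is not the actual diff from this PR:

```python
import gzip
import json


def dump_gzipped_json(data, path):
    """Write `data` as gzip-compressed JSON text."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(data, f)


def load_gzipped_json(path):
    """Read gzip-compressed JSON back into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

Because `gzip.open` in text mode behaves like a regular file object, the loading code only needs to change at the point where the file is opened.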