-
-
Notifications
You must be signed in to change notification settings - Fork 56
User-defined compression level #661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I don't know why CI is frozen. :( |
include/zim/compression_levels.h
Outdated
enum class LZMACompressionLevel: int { | ||
MINIMUM = 0, | ||
DEFAULT = 5, | ||
MAXIMUM = 9 | ||
}; | ||
|
||
enum class ZSTDCompressionLevel: int { | ||
MINIMUM = -21, | ||
DEFAULT = 3, | ||
MAXIMUM = 21 | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need this. User will use the value of the corresponding algorithm. It make more sens as we don't interpret the value (just pass it to the compression library).
If we define something, it should be the default value we will use if the user doesn't set the a compressionLevel explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for zimtools: displaying and checking for correctness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main problem I have with this feature. We should not IMO maintain any kind of "matching" table. Just putting the native compression level is better IMO.
9ed1cc4
to
089f4bc
Compare
Just wondering if other compression methods need support? |
No, we want to keep the supported algorithms set as small as possible to ease the support of the format. |
Ok, got it! |
@data-man Do you plan to fix this PR based on @mgautierfr comments? |
@kelson42 |
9895f91
to
1a52eaf
Compare
@@ -205,6 +205,7 @@ namespace zim | |||
// configuration | |||
bool m_verbose = false; | |||
Compression m_compression = Compression::Zstd; | |||
int m_compressionLevel = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use ZSTDCompressionLevel::DEFAULT
in case a user never call configCompression
if (cl == static_cast<int>(zim::LZMACompressionLevel::MAXIMUM)) { | ||
cl |= LZMA_PRESET_EXTREME; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not change the compression level passed by the user. It should be a simple pass through.
With this, a user would never be able to set a compression level of 9 without LZMA_PRESET_EXTREME
.
@@ -25,6 +25,7 @@ | |||
#include "../debug.h" | |||
#include "../compression.h" | |||
|
|||
#include <zim/zim.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need this include ?
virtual ~Cluster(); | ||
|
||
void setCompression(Compression c) { compression = c; } | ||
void setCompressionLevel(int cl) { compressionLevel = cl; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never use setCompression
. I'm in favor of removing it instead of adding another setCompressionLevel
enum class LZMACompressionLevel: int { | ||
MINIMUM = 0, | ||
MAXIMUM = 9, | ||
DEFAULT = MAXIMUM | ||
}; | ||
|
||
enum class ZSTDCompressionLevel: int { | ||
MINIMUM = -21, | ||
MAXIMUM = 19, | ||
DEFAULT = MAXIMUM | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yet again, I dislike those enums.
19 is not the maximum for zstd algorithm. It is 21 (and it may change in the futur). This is the same for 9 and LZMA.
Those values convey very few information and are pretty useless.
We should keep the default value (as we need to store theme somewhere) but I would go with something more like:
enum class DefaultCompressionLevel: int {
LZMA = 9 | LZMA_PRESET_EXTREME,
ZSTD = 19
};
But we don't want to expose LZMA "symbol" (LZMA_PRESET_EXTREME
) in public zim headers. So we should copy the value and put a explicit comment from where the value come from.
Or we can simply not expose the default value and have two different configCompression
(with and without compression level), the version without could simply set the default value (we would not have to publicly define the default value as we use it internally and never define a default value for a argument)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yet again, I dislike those enums.
@mgautierfr what about something like this (in zim.h
):
int getDefaultCompressionLevel(Compression comp) {
switch(comp) {
case Compression::None:
{
return 0;
break;
}
case Compression::Lzma:
{
return 9 | LZMA_PRESET_EXTREME;
break;
}
case Compression::Zstd:
{
return 19;
break;
}
};
}
And use it in appropriate places and in the zim-tools
.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why putting it in zim.h ? We should use it internally. Either you pass a compression level (and you know which one) or you don't and libzim picks one internally.
And use it in appropriate places and in the zim-tools.
In fact, I'm not sure we want the same thing.
I "only" want to allow the user to pass a custom compression level.
But it seems you want to expose some compression level and use them somehow (in zim-tools).
In this case, we cannot agree as we are not wanted the same thing 🙂
Can you be more explicit on how you want to use the default level in zim-tools ? Maybe you have a branch somewhere you can share.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ zimrecreate
zimrecreate
[ERROR] Not enough Arguments provided
zimrecreate recreates a ZIM file from a existing ZIM.
Usage: zimrecreate ORIGIN_FILE OUTPUT_FILE [Options]
Options:
-v, --version print software version
-j, --withoutFTIndex don't create and add a fulltext index of the content to the ZIM
-J, --threads count of threads to utilize (default: 4)
-cl, --compression_level compression level (default: 19)
The same for zimwriterfs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why putting it in zim.h ?
Because enum class Compression
also defined in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want getDefaultCompressionLevel
being a public API ?
For me not.
We can have two functions is the creator :
# In the header
Creator& configCompression(Compression compression);
Creator& configCompression(Compression compression, int level);
# In the implementation
Creator& configCompression(Compression compression) {
return configCompresssion(compression, getDefaultCompressionLevel(compression));
}
As getDefaultCompressionLevel
is only used internally, we don't need to expose it in the public API.
It could not work if the user code need to know what is the default level. But I don't have such a use case, maybe you have one ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I don't have such a use case, maybe you have one ?
As I mentioned above, getDefaultCompressionLevel
function will be used in zim-tools
for displaying a default value and checking for correctness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do would you display 9|LZMA_PRESET_EXTREME
(2147483657) ? And why we would like to display it ? It is enough to tell that if no value is provided we will take a correct one (without saying with one). And it allow us to change this value in the future without having to change anything as it is a internal thing.
How will you us the default value for checking ? If the user pass a value, we must blindly use it. And there is nothing to check on compression level in zimcheck
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do would you display 9|LZMA_PRESET_EXTREME
Isn't LZMA deprecated?
And why we would like to display it ?
Why not?
$ zstd -h
-1 .. -19 compression level (faster .. better; default: 3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't LZMA deprecated?
LZMA is obsolete yes. But we still support it, so we must have a default value for it.
Why not?
Because, user has to refer to the specific algorithm documentation to know what are the correct level. I don't want to change our code if zstd add a new 20 compression level.
I would agree with a getDefaultCompressionLevel
in the public API. But I have some mandatory constraints:
- No include of
zstd.h
orlzma.h
in libzim's public header. A user should be able to use a precompiled libzim without needing development headers of compression algorithms available. He may want to still include them if he want to specify a level, but it is his choice. - The default level must be dynamic. If we want to change it a libzim level, we should be able to only rebuild the libzim library, without API/ABI break. zimtools (or other) should not have to be recompiled (so no definition of default value in the header)
- Consequentially of the previous constraints (or this is the cause of the previous constraints ?), this default level must not be part of the api. If we change it, it is not a api break. User must not ("cannot" is better) base his code on this value. If we want to define it to
1
, we must be able to do so. It would be less efficient at compression level but we can. (And if user doesn't tell us what value to use, he trusts us to use some value). Any issue like "I was used to dogetDefaultCompressionLevel()-1
and now it is broken" will be refused.
I'm still not convinced we need to publicly define getDefaultCompressionLevel
but if you really think it is needed, I'm ok with that.
@data-man @mgautierfr What is the status here. The PR meanwhile is pretty old and meanwhile does not seem to make progresses. |
Sorry, I've missed the @data-man answer in the review comments. |
@data-man Any feedback? This PR is really moving too slowly. I consider closing it if we don’t have an end in view. |
I'll rework the API after thinking about it, sorry. |
Closes #544.