Skip to content

User-defined compression level #661

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion examples/createZimExample.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
#include <zim/writer/contentProvider.h>
#include <zim/writer/creator.h>
#include <zim/blob.h>
#include <zim/zim.h>

class TestItem : public zim::writer::Item
{
Expand Down Expand Up @@ -75,7 +76,7 @@ int main(int argc, char* argv[])
unsigned max = 16;
try {
zim::writer::Creator c;
c.configVerbose(false).configCompression(zim::Compression::Zstd);
c.configVerbose(false).configCompression(zim::Compression::Zstd, static_cast<int>(zim::ZSTDCompressionLevel::DEFAULT));
c.startZimCreation("foo.zim");
for (unsigned n = 0; n < max; ++n)
{
Expand Down
3 changes: 2 additions & 1 deletion include/zim/writer/creator.h
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ namespace zim
* @param comptype the compression algorithm to use.
* @return a reference to itself.
*/
Creator& configCompression(Compression compression);
Creator& configCompression(Compression compression, int compression_level);

/**
* Set the size of the created clusters.
Expand Down Expand Up @@ -205,6 +205,7 @@ namespace zim
// configuration
bool m_verbose = false;
Compression m_compression = Compression::Zstd;
int m_compressionLevel = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use ZSTDCompressionLevel::DEFAULT in case a user never call configCompression

bool m_withIndex = false;
size_t m_clusterSize;
std::string m_indexingLanguage;
Expand Down
12 changes: 12 additions & 0 deletions include/zim/zim.h
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,18 @@ namespace zim
Zstd = 5
};

enum class LZMACompressionLevel: int {
MINIMUM = 0,
MAXIMUM = 9,
DEFAULT = MAXIMUM
};

enum class ZSTDCompressionLevel: int {
MINIMUM = -21,
MAXIMUM = 19,
DEFAULT = MAXIMUM
};
Comment on lines +63 to +73
Copy link
Collaborator

@mgautierfr mgautierfr Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet again, I dislike those enums.
19 is not the maximum for zstd algorithm. It is 21 (and it may change in the futur). This is the same for 9 and LZMA.
Those values convey very few information and are pretty useless.
We should keep the default value (as we need to store theme somewhere) but I would go with something more like:

enum class DefaultCompressionLevel: int {
     LZMA = 9 | LZMA_PRESET_EXTREME,
     ZSTD = 19
};

But we don't want to expose LZMA "symbol" (LZMA_PRESET_EXTREME) in public zim headers. So we should copy the value and put a explicit comment from where the value come from.
Or we can simply not expose the default value and have two different configCompression (with and without compression level), the version without could simply set the default value (we would not have to publicly define the default value as we use it internally and never define a default value for a argument)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet again, I dislike those enums.

@mgautierfr what about something like this (in zim.h):

int getDefaultCompressionLevel(Compression comp) {
  switch(comp) {
    case Compression::None:
      {
        return 0;
        break;
      }

    case Compression::Lzma:
      {
        return 9 | LZMA_PRESET_EXTREME;
        break;
      }

    case Compression::Zstd:
      {
        return 19;
        break;
      }
  };
}

And use it in appropriate places and in the zim-tools.
?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why putting it in zim.h ? We should use it internally. Either you pass a compression level (and you know which one) or you don't and libzim picks one internally.

And use it in appropriate places and in the zim-tools.

In fact, I'm not sure we want the same thing.
I "only" want to allow the user to pass a custom compression level.
But it seems you want to expose some compression level and use them somehow (in zim-tools).
In this case, we cannot agree as we are not wanted the same thing 🙂

Can you be more explicit on how you want to use the default level in zim-tools ? Maybe you have a branch somewhere you can share.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ zimrecreate

zimrecreate

[ERROR] Not enough Arguments provided

zimrecreate recreates a ZIM file from a existing ZIM.
Usage: zimrecreate ORIGIN_FILE OUTPUT_FILE [Options]

Options:
-v, --version print software version
-j, --withoutFTIndex don't create and add a fulltext index of the content to the ZIM
-J, --threads count of threads to utilize (default: 4)
-cl, --compression_level compression level (default: 19)

The same for zimwriterfs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why putting it in zim.h ?

Because enum class Compression also defined in this file.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want getDefaultCompressionLevel being a public API ?
For me not.
We can have two functions is the creator :

# In the header
Creator& configCompression(Compression compression);
Creator& configCompression(Compression compression, int level);
# In the implementation
Creator& configCompression(Compression compression) {
    return configCompresssion(compression, getDefaultCompressionLevel(compression));
}

As getDefaultCompressionLevel is only used internally, we don't need to expose it in the public API.
It could not work if the user code need to know what is the default level. But I don't have such a use case, maybe you have one ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't have such a use case, maybe you have one ?

As I mentioned above, getDefaultCompressionLevel function will be used in zim-tools for displaying a default value and checking for correctness.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do would you display 9|LZMA_PRESET_EXTREME (2147483657) ? And why we would like to display it ? It is enough to tell that if no value is provided we will take a correct one (without saying with one). And it allow us to change this value in the future without having to change anything as it is a internal thing.
How will you us the default value for checking ? If the user pass a value, we must blindly use it. And there is nothing to check on compression level in zimcheck

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do would you display 9|LZMA_PRESET_EXTREME

Isn't LZMA deprecated?

And why we would like to display it ?

Why not?
$ zstd -h

-1 .. -19 compression level (faster .. better; default: 3)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't LZMA deprecated?

LZMA is obsolete yes. But we still support it, so we must have a default value for it.

Why not?

Because, user has to refer to the specific algorithm documentation to know what are the correct level. I don't want to change our code if zstd add a new 20 compression level.

I would agree with a getDefaultCompressionLevel in the public API. But I have some mandatory constraints:

  • No include of zstd.h or lzma.h in libzim's public header. A user should be able to use a precompiled libzim without needing development headers of compression algorithms available. He may want to still include them if he want to specify a level, but it is his choice.
  • The default level must be dynamic. If we want to change it a libzim level, we should be able to only rebuild the libzim library, without API/ABI break. zimtools (or other) should not have to be recompiled (so no definition of default value in the header)
  • Consequentially of the previous constraints (or this is the cause of the previous constraints ?), this default level must not be part of the api. If we change it, it is not a api break. User must not ("cannot" is better) base his code on this value. If we want to define it to 1, we must be able to do so. It would be less efficient at compression level but we can. (And if user doesn't tell us what value to use, he trusts us to use some value). Any issue like "I was used to do getDefaultCompressionLevel()-1 and now it is broken" will be refused.

I'm still not convinced we need to publicly define getDefaultCompressionLevel but if you really think it is needed, I'm ok with that.


static const char MimeHtmlTemplate[] = "text/x-zim-htmltemplate";

enum class IntegrityCheck
Expand Down
37 changes: 22 additions & 15 deletions src/compression.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -26,22 +26,29 @@
#include <stdexcept>

const std::string LZMA_INFO::name = "lzma";
void LZMA_INFO::init_stream_decoder(stream_t* stream, char* raw_data)

void LZMA_INFO::init_stream_encoder(stream_t* stream, int compression_level, char* raw_data)
{
*stream = LZMA_STREAM_INIT;
unsigned memsize = zim::envMemSize("ZIM_LZMA_MEMORY_SIZE", LZMA_MEMORY_SIZE * 1024 * 1024);
auto errcode = lzma_stream_decoder(stream, memsize, 0);
int cl = compression_level;

if (cl == static_cast<int>(zim::LZMACompressionLevel::MAXIMUM)) {
cl |= LZMA_PRESET_EXTREME;
}
Comment on lines +35 to +37
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not change the compression level passed by the user. It should be a simple pass through.
With this, a user would never be able to set a compression level of 9 without LZMA_PRESET_EXTREME.


auto errcode = lzma_easy_encoder(stream, cl, LZMA_CHECK_CRC32);
if (errcode != LZMA_OK) {
throw std::runtime_error("Impossible to allocated needed memory to uncompress lzma stream");
throw std::runtime_error("Cannot initialize lzma_easy_encoder");
}
}

void LZMA_INFO::init_stream_encoder(stream_t* stream, char* raw_data)
void LZMA_INFO::init_stream_decoder(stream_t* stream, char* raw_data)
{
*stream = LZMA_STREAM_INIT;
auto errcode = lzma_easy_encoder(stream, 9 | LZMA_PRESET_EXTREME, LZMA_CHECK_CRC32);
unsigned memsize = zim::envMemSize("ZIM_LZMA_MEMORY_SIZE", LZMA_MEMORY_SIZE * 1024 * 1024);
auto errcode = lzma_stream_decoder(stream, memsize, 0);
if (errcode != LZMA_OK) {
throw std::runtime_error("Cannot initialize lzma_easy_encoder");
throw std::runtime_error("Impossible to allocated needed memory to uncompress lzma stream");
}
}

Expand Down Expand Up @@ -103,21 +110,21 @@ ZSTD_INFO::stream_t::~stream_t()
::ZSTD_freeDStream(decoder_stream);
}

void ZSTD_INFO::init_stream_decoder(stream_t* stream, char* raw_data)
void ZSTD_INFO::init_stream_encoder(stream_t* stream, int compression_level, char* raw_data)
{
stream->decoder_stream = ::ZSTD_createDStream();
auto ret = ::ZSTD_initDStream(stream->decoder_stream);
stream->encoder_stream = ::ZSTD_createCStream();
auto ret = ::ZSTD_initCStream(stream->encoder_stream, compression_level);
if (::ZSTD_isError(ret)) {
throw std::runtime_error("Failed to initialize Zstd decompression");
throw std::runtime_error("Failed to initialize Zstd compression");
}
}

void ZSTD_INFO::init_stream_encoder(stream_t* stream, char* raw_data)
void ZSTD_INFO::init_stream_decoder(stream_t* stream, char* raw_data)
{
stream->encoder_stream = ::ZSTD_createCStream();
auto ret = ::ZSTD_initCStream(stream->encoder_stream, 19);
stream->decoder_stream = ::ZSTD_createDStream();
auto ret = ::ZSTD_initDStream(stream->decoder_stream);
if (::ZSTD_isError(ret)) {
throw std::runtime_error("Failed to initialize Zstd compression");
throw std::runtime_error("Failed to initialize Zstd decompression");
}
}

Expand Down
8 changes: 4 additions & 4 deletions src/compression.h
Original file line number Diff line number Diff line change
Expand Up @@ -59,8 +59,8 @@ enum class RunnerStatus {
struct LZMA_INFO {
typedef lzma_stream stream_t;
static const std::string name;
static void init_stream_encoder(stream_t* stream, int compression_level, char* raw_data);
static void init_stream_decoder(stream_t* stream, char* raw_data);
static void init_stream_encoder(stream_t* stream, char* raw_data);
static CompStatus stream_run_encode(stream_t* stream, CompStep step);
static CompStatus stream_run_decode(stream_t* stream, CompStep step);
static CompStatus stream_run(stream_t* stream, CompStep step);
Expand Down Expand Up @@ -89,8 +89,8 @@ struct ZSTD_INFO {
};

static const std::string name;
static void init_stream_encoder(stream_t* stream, int compression_level, char* raw_data);
static void init_stream_decoder(stream_t* stream, char* raw_data);
static void init_stream_encoder(stream_t* stream, char* raw_data);
static CompStatus stream_run_encode(stream_t* stream, CompStep step);
static CompStatus stream_run_decode(stream_t* stream, CompStep step);
static void stream_end_encode(stream_t* stream);
Expand Down Expand Up @@ -233,8 +233,8 @@ class Compressor

~Compressor() = default;

void init(char* data) {
INFO::init_stream_encoder(&stream, data);
void init(int compression_level, char * data) {
INFO::init_stream_encoder(&stream, compression_level, data);
stream.next_out = (uint8_t*)ret_data.get();
stream.avail_out = ret_size;
}
Expand Down
6 changes: 4 additions & 2 deletions src/writer/cluster.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
#include "../debug.h"
#include "../compression.h"

#include <zim/zim.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need this include ?

#include <zim/writer/contentProvider.h>

#include <sstream>
Expand All @@ -45,8 +46,9 @@ const zim::size_type MAX_WRITE_SIZE(4UL*1024*1024*1024-1);
namespace zim {
namespace writer {

Cluster::Cluster(Compression compression)
Cluster::Cluster(Compression compression, int compression_level)
: compression(compression),
compressionLevel(compression_level),
isExtended(false),
_size(0)
{
Expand Down Expand Up @@ -152,7 +154,7 @@ void Cluster::_compress()
bool first = true;
auto writer = [&](const Blob& data) -> void {
if (first) {
runner.init((char*)data.data());
runner.init(compressionLevel, (char*)data.data());
first = false;
}
runner.feed(data.data(), data.size());
Expand Down
4 changes: 3 additions & 1 deletion src/writer/cluster.h
Original file line number Diff line number Diff line change
Expand Up @@ -44,10 +44,11 @@ class Cluster {


public:
Cluster(Compression compression);
Cluster(Compression compression, int compression_level);
virtual ~Cluster();

void setCompression(Compression c) { compression = c; }
void setCompressionLevel(int cl) { compressionLevel = cl; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never use setCompression. I'm in favor of removing it instead of adding another setCompressionLevel

Compression getCompression() const { return compression; }

void addContent(std::unique_ptr<ContentProvider> provider);
Expand Down Expand Up @@ -78,6 +79,7 @@ class Cluster {

protected:
Compression compression;
int compressionLevel;
cluster_index_t index;
bool isExtended;
Offsets blobOffsets;
Expand Down
15 changes: 9 additions & 6 deletions src/writer/creator.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -110,14 +110,15 @@ namespace zim
return *this;
}

Creator& Creator::configCompression(Compression compression)
Creator& Creator::configCompression(Compression compression, int compressionLevel)
{
if(compression == Compression::Lzma) {
std::cerr << "WARNING: LZMA compression method is deprecated."
<< " Support for it will be dropped from libzim soon."
<< std::endl;
}
m_compression = compression;
m_compressionLevel = compressionLevel;
return *this;
}

Expand All @@ -143,7 +144,7 @@ namespace zim
void Creator::startZimCreation(const std::string& filepath)
{
data = std::unique_ptr<CreatorData>(
new CreatorData(filepath, m_verbose, m_withIndex, m_indexingLanguage, m_compression, m_clusterSize)
new CreatorData(filepath, m_verbose, m_withIndex, m_indexingLanguage, m_compression, m_compressionLevel, m_clusterSize)
);

for(unsigned i=0; i<m_nbWorkers; i++)
Expand Down Expand Up @@ -394,9 +395,11 @@ namespace zim
bool withIndex,
std::string language,
Compression c,
int compression_level,
size_t clusterSize)
: mainPageDirent(nullptr),
compression(c),
compressionLevel(compression_level),
zimName(fname),
tmpFileName(fname + ".tmp"),
clusterSize(clusterSize),
Expand Down Expand Up @@ -435,8 +438,8 @@ namespace zim
// because we don't know which one will fill up first. We also need
// to track the dirents currently in each, so we can fix up the
// cluster index if the other one ends up written first.
compCluster = new Cluster(compression);
uncompCluster = new Cluster(Compression::None);
compCluster = new Cluster(compression, compression_level);
uncompCluster = new Cluster(Compression::None, 0);

#if defined(ENABLE_XAPIAN)
auto xapianIndexer = std::make_shared<XapianHandler>(this, withIndex);
Expand Down Expand Up @@ -591,9 +594,9 @@ namespace zim

if (compressed)
{
cluster = compCluster = new Cluster(compression);
cluster = compCluster = new Cluster(compression, compressionLevel);
} else {
cluster = uncompCluster = new Cluster(Compression::None);
cluster = uncompCluster = new Cluster(Compression::None, 0);
}
return cluster;
}
Expand Down
2 changes: 2 additions & 0 deletions src/writer/creatordata.h
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ namespace zim
CreatorData(const std::string& fname, bool verbose,
bool withIndex, std::string language,
Compression compression,
int compressionLevel,
size_t clusterSize);
virtual ~CreatorData();

Expand Down Expand Up @@ -102,6 +103,7 @@ namespace zim
ThreadList workerThreads;
std::thread writerThread;
const Compression compression;
int compressionLevel;
std::string zimName;
std::string tmpFileName;
bool isEmpty = true;
Expand Down
14 changes: 7 additions & 7 deletions test/cluster.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ using zim::unittests::write_to_buffer;

TEST(ClusterTest, create_cluster)
{
zim::writer::Cluster cluster(zim::Compression::None);
zim::writer::Cluster cluster(zim::Compression::None, 0);

ASSERT_EQ(cluster.count().v, 0U);

Expand All @@ -86,7 +86,7 @@ TEST(ClusterTest, create_cluster)

TEST(ClusterTest, read_write_cluster)
{
zim::writer::Cluster cluster(zim::Compression::None);
zim::writer::Cluster cluster(zim::Compression::None, 0);

std::string blob0("123456789012345678901234567890");
std::string blob1("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
Expand All @@ -110,7 +110,7 @@ TEST(ClusterTest, read_write_cluster)

TEST(ClusterTest, read_write_no_content)
{
zim::writer::Cluster cluster(zim::Compression::None);
zim::writer::Cluster cluster(zim::Compression::None, 0);

cluster.close();
auto buffer = write_to_buffer(cluster, "\3garbage");
Expand All @@ -123,7 +123,7 @@ TEST(ClusterTest, read_write_no_content)

TEST(ClusterTest, read_write_empty)
{
zim::writer::Cluster cluster(zim::Compression::None);
zim::writer::Cluster cluster(zim::Compression::None, 0);

std::string emptyString;

Expand All @@ -145,7 +145,7 @@ TEST(ClusterTest, read_write_empty)

TEST(ClusterTest, read_write_clusterLzma)
{
zim::writer::Cluster cluster(zim::Compression::Lzma);
zim::writer::Cluster cluster(zim::Compression::Lzma, static_cast<int>(zim::LZMACompressionLevel::DEFAULT));

std::string blob0("123456789012345678901234567890");
std::string blob1("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
Expand All @@ -172,7 +172,7 @@ TEST(ClusterTest, read_write_clusterLzma)

TEST(ClusterTest, read_write_clusterZstd)
{
zim::writer::Cluster cluster(zim::Compression::Zstd);
zim::writer::Cluster cluster(zim::Compression::Zstd, static_cast<int>(zim::ZSTDCompressionLevel::DEFAULT));

std::string blob0("123456789012345678901234567890");
std::string blob1("ABCDEFGHIJKLMNOPQRSTUVWXYZ");
Expand Down Expand Up @@ -244,7 +244,7 @@ TEST(ClusterTest, read_write_extended_cluster)
auto bigProvider = std::unique_ptr<zim::writer::ContentProvider>(new FakeProvider(almost_4g));
std::string blob4("zyxwvutsrqponmlkjihgfedcba");

zim::writer::Cluster cluster(zim::Compression::None);
zim::writer::Cluster cluster(zim::Compression::None, 0);
cluster.addContent(blob0);
cluster.addContent(blob1);
cluster.addContent(blob2);
Expand Down
2 changes: 1 addition & 1 deletion test/compression.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,7 @@ TYPED_TEST(CompressionTest, compress) {
size_t offset = 0;
while (size) {
if (first) {
compressor.init(const_cast<char*>(data.c_str()));
compressor.init(9, const_cast<char*>(data.c_str()));
first = false;
}
auto adjustedChunkSize = std::min(size, chunkSize);
Expand Down
2 changes: 1 addition & 1 deletion test/decoderstreamreader.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ std::string
compress(const std::string& data)
{
zim::Compressor<CompressionInfo> compressor(data.size());
compressor.init(const_cast<char*>(data.c_str()));
compressor.init(1, const_cast<char*>(data.c_str()));
compressor.feed(data.c_str(), data.size());
zim::zsize_t comp_size;
const auto comp_data = compressor.get_data(&comp_size);
Expand Down
4 changes: 2 additions & 2 deletions test/dirent.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,7 @@ TEST(DirentTest, set_get_data_dirent)
TEST(DirentTest, read_write_article_dirent)
{
zim::writer::Dirent dirent(NS::C, "Bar", "Foo", 17);
zim::writer::Cluster cluster(zim::Compression::None);
zim::writer::Cluster cluster(zim::Compression::None, 0);
cluster.addContent(""); // Add a dummy content
cluster.setClusterIndex(zim::cluster_index_t(45));
dirent.setCluster(&cluster);
Expand Down Expand Up @@ -134,7 +134,7 @@ TEST(DirentTest, read_write_article_dirent)
TEST(DirentTest, read_write_article_dirent_unicode)
{
zim::writer::Dirent dirent(NS::C, "L\xc3\xbcliang", "", 17);
zim::writer::Cluster cluster(zim::Compression::None);
zim::writer::Cluster cluster(zim::Compression::None, 0);
cluster.addContent(""); // Add a dummy content
cluster.setClusterIndex(zim::cluster_index_t(45));
dirent.setCluster(&cluster);
Expand Down