Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Put windowFrac in the CCtxParams, not the CParams #4305

Draft
wants to merge 36 commits into
base: dev
Choose a base branch
from

Conversation

felixhandte
Copy link
Contributor

@felixhandte felixhandte commented Feb 20, 2025

This PR is a set of commits stacked on top of #4275.

It refactors much of that PR, by moving where we store the windowFrac parameter from the CParams to instead storing it in the CCtxParams. This resolve the unpleasant part of that PR, which required splitting the CParams (and Params) structs into public and private variants.

This version has its own downsides, which requires that we convert many of the places that we handle and pass CParams around to instead pass around CCtxParams. That includes switching the MatchState to point to a CCtxParams rather than hold a CParams. This also requires that the CDict hold a CCtxParams that its MatchState can point to.

A particularly annoying result of this refactor is that where previously there was often a straightforward path of resolution from compression level -> CParams -> CCtxParams, the CCtxParams now have to be used in two different ways. We have to use the CCtxParams where previously we would transport CParams. But in those places, the CCtxParams are incomplete and are different from the real CCtxParams (because only the CParams part is set), and have to be translated/resolved into the real params before they are actually used. But because they're the same type, this distinction is not easy to discover, and it's not clear which kind of CCtxParams a particular instance is at any given point.

One nice corollary though is that this change does allow us in some internal APIs to remove other parameters that are now redundant now that the function has access to the whole CCtxParams.

This PR is a pure refactor and intends to introduce no functional changes.

I continue to believe we should abandon this PR and merge #4275 as-is.

Even though the `ZSTD_compressionParameters` struct is part of the unstable
API, and therefore we are allowed to modify it, it seems likely that to
actually do so would cause widespread chaos and bloodshed. So we are strongly
incentivized to avoid changing it.

We would, however, like to modify the CParams that are actually passed around
internally. In particular, we want to be able to configure non-power-of-2
window sizes. So we need something more expressive than an integral window
log. And we want this new field to be next to the window log, rather than
hanging out in the CCtxParams, which are not passed to the block compressors.

So, in order to support that, this commit:

1. Introduces a new struct, the `ZSTD_CParams`, that (for the moment) mirrors
   the definition of the public `ZSTD_compressionParameters` struct.

2. Codemods all internal use and storage of `cparams` internally to use the
   new struct definition.

   (The exception to this is the `ZSTD_parameters` struct, which is a public
   definition but is also passed around internally sometimes.)

3. Adds translation functions to convert the public and private struct defs
   to each other.

4. Uses those translation functions at the user API boundary (and when
   handling `ZSTD_parameters`).
This commit extends the refactor done in the previous and applies it to the
`ZSTD_parameters` public struct, introducing an internal variant called
`ZSTD_Params` whose only difference is containing a `ZSTD_CParams` instead of
a `ZSTD_compressionParameters`. This commit similarly introduces conversion
functions and rewrites internal functions to use the new, private definition.

This allows us to clean up some internal conversions of the private `CParams`
back into the public `CParams` in order to put them in a `ZSTD_parameters`
struct to pass into something.
Store it in the CParams. Don't use it yet.
Using a blocksize of 100KB with a 65KB dictionary just *happens* to work with
the current window-sizing logic, because it has to round all the way up to
the next power of two. But it doesn't *have* to work.

So this pre-emptively fixes up this test to use a size that will work after
switching to tighter window sizing, but still uses a block size that is
bigger than the dict size in case that matters for some reason?
We want to turn this on by default immediately after this release. And we
want to thoroughly exercise the underlying code paths handling fractional
windows, because they will be available in this upcoming release via new APIs
(the explicit CCtx param and the constrain window for protocol path).

And it's scary stuff.

We can remove all these flags once we turn this on by default.
I found some slight errors via the exhaustive testing added to the fuzzer.
This commit fixes them.
For inputs smaller than `1 << ZSTD_WINDOWLOG_MIN` bytes, the pre-existing
code in `ZSTD_adjustCParams_internal()` temporarily calculated an illegally
small window log for the purposes of adjusting the hash- and chain-table
sizes, before fixing that bad window log to comply with the minimums. The new
code clamped the window log earlier and therefore didn't shrink the tables as
much. This commit fixes that to mirror the pre-existing behavior.

This resolves the remaining regression test differences.
This also means the CDict needs to store and init a CCtxParams object.
…rates on CCtxParams

Sometimes this is used to validate user-provided CParams, before they're
allowed to affect the CCtxParams. Leave those cases as-is. Otherwise, convert
uses of this function to pass in the whole CCtxParams so that, in the next
commit, when we pull out the `windowFrac` from the CParams, we can still
cross-validate it with the `windowLog` CParam.
@Cyan4973 Cyan4973 self-assigned this Feb 24, 2025
@Cyan4973 Cyan4973 self-requested a review February 24, 2025 21:12
/**
* Fractional window sizes can always be picked by the user explicitly
* setting ZSTD_c_windowFrac. This macro controls whether, when Zstd is
* picking a window size itself, it is allowed to pick a non-power-of-two
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

MEM_STATIC U32 ZSTD_windowSize(const ZSTD_CCtx_params* params) {
return (U32)(((8ull + params->windowFrac) << params->cParams.windowLog) >> 3);
}
MEM_STATIC U32 ZSTD_windowSizeLDM(const ldmParams_t* ldmParams) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: it seems this function is only useful within zstd_ldm.c.
In which case, is there any benefit in sharing it more widely within zstd_compress_internal.h ?

* Calculates the minimum legal window log and fraction that contain the
* provided source size.
*/
MEM_STATIC void ZSTD_setMinimalWindowLogAndFrac(ZSTD_CCtx_params* params, const U32 srcSize, const U32 minWindowLog) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here,
ZSTD_setMinimalWindowLogAndFrac() seems only useful within zstd_compress.c.
What benefit is there to share it more widely within zstd_compress_internal.h ?

Note that ZSTD_windowLogAndFracAreMinimal() above seems to be a dependency of this function.

@@ -1100,10 +1186,12 @@ MEM_STATIC U32 ZSTD_window_canOverflowCorrect(ZSTD_window_t const window,
U32 loadedDictEnd,
void const* src)
{
/* overflow correction only handles power-of-two index moves. */
U32 const roundedMaxDist = 1u << (ZSTD_highbit32(maxDist - 1) + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: is this change fixing some kind of potential bug ?

@@ -1562,6 +1656,11 @@ void ZSTD_resetSeqStore(SeqStore_t* ssPtr);
* as the name implies */
ZSTD_compressionParameters ZSTD_getCParamsFromCDict(const ZSTD_CDict* cdict);

/*! ZSTD_checkCCtxCParams_internal() :
* Checks the CParams in the CCtxParams (including related parameters not
* *actually* stored in the CParams struct). */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe name them ?
It seems to be limited to params->windowFrac at this point.

{
params->windowLog = cParams->windowLog;
params->windowLog = cctxParams->cParams.windowLog;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor:
you could use the same trick you employed in other parsers,
by defining const ZSTD_compressionParameters* const cParams = &cctxParams->cParams; at the top of the function, and then most of the rest of the function would remain identical.

@@ -387,6 +387,18 @@ The list of available _options_:
Note: If `windowLog` is set to larger than 27, `--long=windowLog` or
`--memory=windowSize` needs to be passed to the decompressor.

- `windowFrac`=_wfrac_, `wfrac`=_wfrac_:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -102,6 +102,7 @@ void FUZZ_setRandomParameters(ZSTD_CCtx *cctx, size_t srcSize, FUZZ_dataProducer
{
ZSTD_compressionParameters cParams = FUZZ_randomCParams(srcSize, producer);
set(cctx, ZSTD_c_windowLog, cParams.windowLog);
setRand(cctx, ZSTD_c_windowFrac, 0, 7, producer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
Great way to naturally bring windowFrac within the fuzzer framework.

ZSTD_CCtx_params params; /* Storage for matchState.cctxParams pointer.
* Not all member fields are used. Used fields:
* cParams, compressionLevel, useRowMatchFinder. */
/* A compression level of 0 indicates that advanced API was used to select CDict params */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -311,7 +311,7 @@ static ZSTD_CCtx_params ZSTD_makeCCtxParamsFromCParams(
/* Adjust advanced params according to cParams */
cctxParams.ldmParams.enableLdm = ZSTD_resolveEnableLdm(cctxParams.ldmParams.enableLdm, &cParams);
if (cctxParams.ldmParams.enableLdm == ZSTD_ps_enable) {
ZSTD_ldm_adjustParameters(&cctxParams.ldmParams, &cParams);
ZSTD_ldm_adjustParameters(&cctxParams.ldmParams, &cctxParams);
Copy link
Contributor

@Cyan4973 Cyan4973 Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that cctxParams.ldmParams is within cctxParams,
one has to wonder why it is initialised separately.
It looks to me that this is just a part of adjusting cctxParams, and should be bundled there,
thus reducing risk of forgetting this action in other places of the code.

* @param compressionLevel If params are derived from a compression level then that compression level, otherwise ZSTD_NO_CLEVEL.
*/
static void
ZSTD_CCtxParams_init_internal(ZSTD_CCtx_params* cctxParams,
Copy link
Contributor

@Cyan4973 Cyan4973 Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: API consistency:
above function is named ZSTD_CCtxParams_init_fromCCtxParams(),
so maybe the function below would benefit a rename along the same lines.

static ZSTD_compressionParameters
ZSTD_adjustCParams_internal(ZSTD_compressionParameters cPar,
static void
ZSTD_adjustCParams_internal(ZSTD_CCtx_params* params,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could benefit from a rename: ZSTD_adjustCCtxParams_internal().
In turn, the new name would help in extending the scope of this function beyond the original ZSTD_compressionParameters fields.

mode);

/* fills unset fields in `dst` with values from `defaults` */
static void ZSTD_fillCParams(
Copy link
Contributor

@Cyan4973 Cyan4973 Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming feels weird.
What is being filled is a ZSTD_CCtx_params* object, not CParams,
albeit admittedly on a reduced internal scope which is somewhat associated with CParams,
but is that actually a good idea to keep the same scope as it was before ?

It looks to me that this operation is about setting parameters that are not yet set,
using a list provided in defaults.
Then a few more parameters must still be set via functions.

But is it meaningful to separate these 2 operations, or should they all be combined in the same one ?

}

size_t ZSTD_estimateCCtxSize_usingCParams(ZSTD_compressionParameters cParams)
static size_t ZSTD_estimateCCtxSize_usingCCtxParamsCParams(const ZSTD_CCtx_params* params)
Copy link
Contributor

@Cyan4973 Cyan4973 Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, ZSTD_estimateCCtxSize_usingCCtxParams() feels enough.

Plus, it opens the opportunity to decide CCtx size based on other parameters present in CCtx_params.
For example, a pretty important one would be hasExternalSequenceProducer.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Feb 25, 2025

While reviewing the zstd_compress.c file, I indeed noticed a recurring pattern where the function parameter CParams is being replaced with CCtxParams*, while maintaining the same scope for the function. This approach has the advantage of minimizing changes to the source code structure.

However, this substitution introduces some scope-related issues. Specifically, CCtxParams* is now used to convey only a subset of its content or is updated partially, and in some cases, it is used as a reference to modify other parts of itself. These issues necessitate a reevaluation of the functions' scope, which previously focused solely on CParams. In the context of CCtxParams* being passed around, such a restricted scope may no longer be relevant.

Unfortunately, addressing these issues requires more than a simple mechanical update from one type to another.
I now wonder: it may have been beneficial to first transition from CParams* to CCtxParams* internally before introducing windowFrac. This intermediate step would have simplified the subsequent addition of windowFrac, reducing the complexity of each step and making them more manageable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants