Skip to content

Feat/disk based sorting#3514

Open
drrtuy wants to merge 14 commits into
mariadb-corporation:stable-23.10from
drrtuy:feat/disk-based-sorting
Open

Feat/disk based sorting#3514
drrtuy wants to merge 14 commits into
mariadb-corporation:stable-23.10from
drrtuy:feat/disk-based-sorting

Conversation

@drrtuy
Copy link
Copy Markdown
Collaborator

@drrtuy drrtuy commented Apr 18, 2025

No description provided.

@drrtuy drrtuy force-pushed the feat/disk-based-sorting branch from bddc3f4 to 6f09dec Compare June 17, 2025 17:21
@drrtuy drrtuy force-pushed the feat/disk-based-sorting branch from 6f09dec to 929289e Compare June 17, 2025 19:41
@mariadb-LeonidFedorov
Copy link
Copy Markdown
Collaborator

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a disk-based sorting mechanism, which is a significant feature for handling large datasets. The overall approach of spilling to disk when an OutOfMemoryExcept is caught is sound. The changes are spread across several files, introducing new classes like DiskBasedTopNOrderBy, Dumper, and MemManager, and modifying existing ones like LimitedOrderBy and TupleAnnexStep.

However, the implementation appears to be incomplete. The merge phase for the on-disk sorted runs is stubbed out, and some new functions contain placeholder logic and debugging statements. There are also several critical issues, including a syntax error that will prevent compilation and a potential integer underflow bug. Additionally, there are some code quality issues like public member variables that should be private and risky raw pointer ownership transfers.

I've provided specific comments on these issues. The PR needs further work to complete the feature and address the identified problems before it can be merged.

return res;


} // namespace joblist No newline at end of file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This closing brace appears to close the getGenerationFileNamesNextBatch function, but the comment suggests it's for the joblist namespace. The joblist namespace, opened on line 22, is missing its closing brace. This will cause a compilation error.

Suggested change
} // namespace joblist
}
} // namespace joblist

Comment on lines +57 to +59
auto totalNumberOfFilesYetToMerge = getGenerationFilesNumber() - batchSize;
auto batchSizeOrFilesLeftNumber = std::max(getGenerationFilesNumber(), batchSize);
auto actualBatchSize = std::min(totalNumberOfFilesYetToMerge, batchSizeOrFilesLeftNumber);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This logic for calculating actualBatchSize is confusing and contains a bug. totalNumberOfFilesYetToMerge will underflow if getGenerationFilesNumber() is less than batchSize. Given that getGenerationFilesNumber() is a stub returning 0, this is a definite issue. Please revise this logic for correctness and clarity. A safer way to calculate totalNumberOfFilesYetToMerge would be size_t totalNumberOfFilesYetToMerge = (num_files > batchSize) ? num_files - batchSize : 0;. The overall logic might need simplification depending on the intended behavior.

Comment on lines +728 to +736
while (inputQueuesNumber < fOrderBy->getGenerationFilesNumber())
{
auto fileNames = fOrderBy->getGenerationFileNamesNextBatch(inputQueuesNumber);
auto inputDLs = createInputDLs(fileNames.size());
auto readers = startReaders(inputDLs, fileNames);
// create outputDLs or simplier atomic queues + readers threads
fOrderBy->diskBasedMergePhaseIfNeeded(inputDLs);
jobstepThreadPool.join(readers);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The disk-based merge logic appears to be incomplete. The loop while (inputQueuesNumber < fOrderBy->getGenerationFilesNumber()) will never execute because getGenerationFilesNumber() is a stub that returns 0. The body of the loop also consists of stubs and comments. This logic needs to be fully implemented for the feature to work correctly.

Comment thread utils/rowgroup/dumper.h
Comment on lines +37 to +38
Dumper(const compress::CompressInterface* comp, MemManager* mm)
: fCompressor(comp), fMM(std::unique_ptr<MemManager>(mm)) {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This constructor takes a raw pointer MemManager* mm and creates a std::unique_ptr from it, which implies ownership transfer. This is risky as the caller might not be aware of this and could attempt to delete the pointer, leading to a double free. To make ownership transfer explicit and safer, consider removing this constructor and only using the one that accepts a std::unique_ptr.

Comment on lines +72 to +74
// private:
uint64_t fGenerationCounter{0};
std::queue<uint64_t> fGenerations;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These member variables are currently public, but the commented-out // private: suggests they are intended to be private. For better encapsulation, they should be made private.

Suggested change
// private:
uint64_t fGenerationCounter{0};
std::queue<uint64_t> fGenerations;
private:
uint64_t fGenerationCounter{0};
std::queue<uint64_t> fGenerations;

Comment on lines +429 to +431
std::cout << "getNextRGData rowsToRetrieve " << rowsToRetrieve << " orderedRowsQueue.size() "
<< orderedRowsQueue.size() << std::endl;
std::cout << "getNextRGData rowsToRetrieveFromQueue " << rowsToRetrieveFromQueue << std::endl;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This function contains std::cout statements which appear to be for debugging. Please remove them before merging.

#include <fcntl.h>
#include <string>

#include <string.h>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For C++ code, it's idiomatic to use <cstring> instead of the C-style header <string.h>.

Suggested change
#include <string.h>
#include <cstring>

Comment on lines +45 to +53
void release(ssize_t amount = 0)
{
// in some cases it tries to release more memory than acquired, ie create
// new rowgroup, acquire maximum size (w/o strings), add some rows with
// strings and finally release the actual size of RG with strings
if (amount == 0 || amount > fMemUsed)
amount = fMemUsed;
releaseImpl(amount);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The default argument amount = 0 for release() leads to releasing all memory, which can be surprising. If this is intended, a separate function like releaseAll() would be more explicit and less prone to accidental misuse. If release(0) should be a no-op, the implementation should be adjusted.

Comment thread utils/rowgroup/rowgroup.h
}

private:
// private:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The private: access specifier is commented out, making the following members public. This breaks encapsulation and is likely intended only for debugging. Please restore the private access.

 private:

std::unique_ptr<SortingPQ> fOrderByQueue = nullptr;

protected:
// protected:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The protected: access specifier is commented out, making the following members public. This breaks encapsulation and is likely intended only for debugging. Please restore the protected access.

Suggested change
// protected:
protected:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants