Skip to content

LogRecord and SpanByte changes #1186

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 132 commits into
base: main
Choose a base branch
from
Draft

LogRecord and SpanByte changes #1186

wants to merge 132 commits into from

Conversation

TedHartMS
Copy link
Collaborator

This is an initial draft of a PR to convert Garnet to use a new ObjectAllocator with a revised on-disk format that will use only a single log file, change ISessionFunctions to be (Disk)LogRecord-based and propagate this to operations such as Compaction and Migration, remove TKey and TValue from TsavoriteKV, and many related refactors.

This PR will be long-lived as additional task PRs are merged into storage-v2 before it is ready for merge to main (see "Major Remaining Tasks" below). Currently this branch implements ONLY the in-memory portion; the IO work is not yet implemented.

Highlights:

  • ArgSlice and SpanByte have been merged to a PinnedSpanByte struct with much more limited use. SpanByte remains as a static class providing (ReadOnly)Span<byte> extensions.
  • All keys are now either PinnedSpanByte (mostly at and above the GarnetApi layer) or ReadOnlySpan<byte> at the StorageSession and below, including Tsavorite. There are no longer any byte[] keys. The TKey type argument is gone from Tsavorite and Garnet.
    • Non-SpanByte KeyComparer implementations have been removed.
  • Values are either (ReadOnly)Span<byte> or objects implementing IHeapObject. TValue has been removed from TsavoriteKV.
  • "ref TKey", "ref TValue", and "ref RecordInfo" on ISessionFunctions methods are replaced by one of the LogRecord variants, where they are properties.
  • ETag and Expiration are now properties on LogRecord rather than embedded within Value or IGarnetObject. Internally, they are after the Value and *LogRecord manages shifting as Value shrinks and grows; all "record extra length" management is done by LogRecord and is opaque to ISessionFunctions.
  • GenericAllocator has been replaced by ObjectAllocator; for details, see https://github.com/microsoft/garnet/tree/tedhar/object-allocator/website/docs/dev/tsavorite/logrecord.md. Currently only in-memory operations are implemented.
  • For ObjectAllocator, keys and string values that are past a configurable size can "overflow" to out-of-line byte[], keeping the record size manageable. Tsavorite implements this threshold automatically, and LogRecord manages it internally when returning the Key or ValueSpan.
  • SpanByteScanIterator and GenericScanIterator have been unified into RecordScanIterator
  • Some names have been changed for clarity:
    • "Lockable*" and "Manual*" have become "Transactional*"
    • "Transient" locking has become "Ephemeral"

Major remaining tasks (this will be updated as they are implemented):

  • Complete ObjectAllocator single-LogFile IO.
  • Change object-size tracking from page-level granularity to record-level.
  • Performance evaluation and tuning
  • Fix RENAME and any other remaining areas where the Expiration is not propagated as before, as it is no longer part of Value (including IGarnetObject)
  • Clean up error handling related to the VarLenInputMethods functions (make them consistent with returns from RMWMethods, etc.)
  • Implement JsonObject MemorySize/DiskSize
  • Finish converting and re-enabling all Tsavorite Unit Tests
  • Revise the current Garnet.common to be Garnet.core, then re-create Garnet.common as a common layer between Tsavorite and the Garnet processing layer, containing PinnedSpanByte, epochs, Utility functions (some of which are currently duplicated), and so on.
  • Investigate the performance impact of moving further along the (ReadOnly)Span<byte> path by replacing byte* usage (such as was done for Utility.HashBytes).

- Move CountdownWrapper to its own file
- Rename Lockable* to Transactional*
- Rename Transient* to Ephemeral*
Copy link
Contributor

@kevin-montrose kevin-montrose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked through the first 150 files of this - up to libs/server/Transaction/TransactionManager.cs.

I'll continue with my review tomorrow.

Copy link
Contributor

@kevin-montrose kevin-montrose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another 150 files reviewed, up through libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/UpsertValueSelector.cs. I'll wrap up tomorrow, time allowing.

Copy link
Contributor

@kevin-montrose kevin-montrose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished review.

Mostly nits, some far this is looking very promising.

TedHartMS added 7 commits May 27, 2025 09:20
… for SpanByteAllocator. The AddressType change is a breaking on-disk format change: it shuffles bits around in RecordInfo to add an additional bit adjacent to the old ReadCache bit to mark an address as:

- 00: Reserved
- 11: ReadCache
- 10: InMemory portion of the main log
- 01: On-Disk
@PaulusParssinen
Copy link
Contributor

PaulusParssinen commented Jun 2, 2025

I'll be able to have a look tomorrow at this beast! I have been glancing through it and in general I like the direction 👍

update: this is gonna take a while 😅 Half-way through reading the changes, I'll leave comments/questions for later once I understand the changes a bit better.

Copy link
Contributor

@PaulusParssinen PaulusParssinen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Review marathon mostly over (I skipped test diffs) and I like the changes a lot! A lot of nice consolidation and reducing the complexity of certain parts. Thanks for the useful documentation too, but I won't pretend that I understand all of the new Log changes thoroughly yet.

This PR also seems to do a lot of nice work that is useful if we want to switch (de)serialization logic from Stream (BinaryReader/BinaryWriter) to span based approach in future (I have some local attempts at this and it looks more approachable after this PRs changes).

Comment on lines +232 to +238
var result = 0;
do
{
num >>= 8;
result++;
} while (num > 0);
return result;
Copy link
Contributor

@PaulusParssinen PaulusParssinen Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can turn this branchless :) (given we don't use this for negative values here)

Suggested change
var result = 0;
do
{
num >>= 8;
result++;
} while (num > 0);
return result;
var bitLength = (sizeof(ulong) * 8) - BitOperations.LeadingZeroCount((ulong)(x | 1));
return (bitLength + 7) / 8;

throw new TsavoriteException("SpanByte Keys cannot be mixed with object Values");
if (typeof(TValue) == typeof(SpanByte))
throw new TsavoriteException("SpanByte Values cannot be mixed with object Keys");
freePagePool = new OverflowPool<PageUnit<ObjectPage>>(4, static p => { });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm reading this right (that it is same logic as in my experiment PR), we need to free these overflowing pages.

We could also report these unmanaged allocations for GC if we want to be nice.


public SpanByteAllocatorImpl(AllocatorSettings settings, TStoreFunctions storeFunctions, Func<object, SpanByteAllocator<TStoreFunctions>> wrapperCreator)
: base(settings.LogSettings, storeFunctions, wrapperCreator, settings.evictCallback, settings.epoch, settings.flushCallback, settings.logger)
{
overflowPagePool = new OverflowPool<PageUnit>(4, p => { });
freePagePool = new OverflowPool<PageUnit<Empty>>(4, p => { });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, we should free the unmanaged pool pages.

public int MaxInlineKeySize = 1 << LogSettings.kDefaultMaxInlineKeySizeBits;

/// <summary>
/// Maximum size of a valuie stored inline in the in-memory portion of the main log for <see cref="SpanByteAllocator{TStoreFunctions}"/>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Maximum size of a valuie stored inline in the in-memory portion of the main log for <see cref="SpanByteAllocator{TStoreFunctions}"/>.
/// Maximum size of a value stored inline in the in-memory portion of the main log for <see cref="SpanByteAllocator{TStoreFunctions}"/>.

namespace Tsavorite.core
{
/// <summary>
/// Represents contiguous region of arbitrary _pinned_ memory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This emphasis that I added some time ago with the original comment is now much less useful when the type name contains the information 😅

Suggested change
/// Represents contiguous region of arbitrary _pinned_ memory.
/// Represents contiguous region of arbitrary pinned memory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants