[RFC] Read cache for pg_mooncake on remote storage #69

dentiny · 2025-01-06T02:00:10Z

dentiny
Jan 6, 2025
Collaborator

In the current implementation, pg_mooncake has a write cache:

When a data file is opened for write, a cache file is opened as well:

pg_mooncake/src/columnstore/columnstore_table.cpp

Lines 20 to 40 in 8d3575c

    
    unique_ptr<FileHandle> OpenFile(const string &path, FileOpenFlags flags,
                                    optional_ptr<FileOpener> opener = nullptr) override {
        if (IsRemoteFile(path) && mooncake_enable_local_cache) {
            auto disk_space = fs.GetAvailableDiskSpace(x_mooncake_local_cache);
            if (disk_space.IsValid() && disk_space.GetIndex() > x_min_disk_space) {
                cached_file = fs.OpenFile(cached_file_path, flags, opener);
            }
        }
        return fs.OpenFile(path, flags, opener);
    }
    int64_t Write(FileHandle &handle, void *buffer, int64_t nr_bytes) override {
        if (stream) {
            stream->WriteData(data_ptr_cast(buffer), nr_bytes);
        }
        if (cached_file) {
            int64_t bytes_written = fs.Write(*cached_file, buffer, nr_bytes);
            D_ASSERT(bytes_written == nr_bytes);
        }
        return fs.Write(handle, buffer, nr_bytes);
    }

On a scan operation, the cache file path is included and passed to the parquet reader:

pg_mooncake/src/columnstore/execution/columnstore_scan.cpp

Lines 137 to 165 in 8d3575c

    
    TableFunction ColumnstoreTable::GetScanFunction(ClientContext &context, unique_ptr<FunctionData> &bind_data) {
        auto path = metadata->TablesSearch(oid);
        auto file_names = metadata->DataFilesSearch(oid, &context, &columns);
        auto file_paths = GetFilePaths(path, file_names);
        if (file_paths.empty()) {
            return TableFunction("columnstore_scan", {} /*arguments*/, EmptyColumnstoreScan);
        }
        TableFunction columnstore_scan = GetParquetScan(context);
        columnstore_scan.name = "columnstore_scan";
        columnstore_scan.init_global = ColumnstoreScanInitGlobal;
        columnstore_scan.get_multi_file_reader = ColumnstoreScanMultiFileReader::Create;
        vector<Value> values;
        for (auto &file_path : file_paths) {
            values.push_back(Value(file_path));
        }
        vector<Value> inputs;
        inputs.push_back(Value::LIST(values));
        named_parameter_map_t named_parameters{{"file_row_number", Value(true)}};
        vector<LogicalType> input_table_types;
        vector<string> input_table_names;
        TableFunctionBindInput bind_input(inputs, named_parameters, input_table_types, input_table_names, nullptr /*info*/,
                                          nullptr /*binder*/, columnstore_scan, {} /*ref*/);
        vector<LogicalType> return_types;
        vector<string> names;
        bind_data = columnstore_scan.bind(context, bind_input, return_types, names);
        return columnstore_scan;
    }

But it comes with certain limitations:

We don’t limit the volume of cache files created on the local filesystem, nor do we delete obsolete files, which would eventually lead to out-of-disk issues
Serverless postgres deployments like neondb recycle the local cache, which leads to degraded query performance
It is hard to parallelize the local write and the remote write due to possible failures; for example, if the cache write succeeds while the actual data file write fails, we are left with an untracked / redundant cache file, which requires extra handling
- Sequential writes increase latency as well

Here I propose switching the write cache to a read cache:

Keep a disk-based cache instead of a memory-based one, so we can leverage duckdb’s parquet extension (which loads from files)
- Example code: https://github.com/duckdb/duckdb/blob/4488c61ee780635e67abe1b6164f2cdfadc21b65/extension/parquet/parquet_reader.cpp#L60-L63
Easier failure handling compared with the write cache: if a cache read fails, we just fall back to the actual data file with no inconsistency concern
Better performance in terms of both latency and throughput; unlike write operations, at read time we know the file size in advance, so read operations can be parallelized in chunks
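
To make the fallback point concrete, here is a minimal C++17 sketch. `Reader` and `ReadWithFallback` are hypothetical names for illustration, not pg_mooncake or DuckDB APIs; the two readers are passed in as callables so the sketch stays self-contained.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical reader type: returns the file content, or std::nullopt on failure.
using Reader = std::function<std::optional<std::string>(const std::string &path)>;

// Try the local cache first; on any cache failure, fall back to the remote file.
// The remote object stays the source of truth, so a failed cache read only
// costs latency and never affects correctness.
std::optional<std::string> ReadWithFallback(const std::string &cache_path, const std::string &remote_path,
                                            const Reader &read_cache, const Reader &read_remote) {
    if (auto cached = read_cache(cache_path)) {
        return cached;
    }
    return read_remote(remote_path);
}
```

The write cache has no equivalent shortcut, because a failed remote write after a successful cache write leaves a file behind that somebody has to track down.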

Read cache difficulty:

I don’t see a good way to clean up stale read cache files with absolutely no risk
Duckdb allows multiple instances / processes to read from a database concurrently (https://duckdb.org/docs/connect/concurrency.html), so it’s naturally hard to have a reference count system that accounts for files in use
One easy-to-implement and practical way is to clean up stale cache files at startup or periodically at runtime, based on the cache files’ modification timestamps

Unresolved: corrupted cache file handling:

We cannot tolerate a corrupted cache file for now (i.e. we don’t sync the cache file on close), even if the actual data file on remote storage is fine
If we pass a corrupted cache file to the binding phase, it will fail directly with no fallback option

I made a POC PR for read cache, feel free to leave comments and happy to discuss!
PR: dentiny#1

dpxcc · 2025-01-07T02:24:39Z

dpxcc
Jan 7, 2025
Maintainer

Thanks, I've left some initial comments in the POC PR

Can you be more specific about the inconsistency concerns? In case of cache write success while actual data file write failure, the existing code leaks the cache file, but I don't think it can lead to any data inconsistency

1 reply

dentiny Jan 7, 2025
Collaborator Author

Yeah I mean "untracked files", updated the wording to be more accurate.

dentiny · 2025-01-07T10:59:00Z

dentiny
Jan 7, 2025
Collaborator Author

On reference counting for files in use, one approach I'm thinking of:

We make our own filesystem and register it into the virtual filesystem; it implements our own read logic and performs the caching;
Considering postgres uses a multi-process model, use unlogged tables to keep a reference count for each open file;
- I checked that an unlogged table is visible across multiple postgres sessions (via psql)
Make our own file handle, which increments the reference count at construction and decrements it at destruction.

Pseudocode looks like:

class ReadCacheFileHandle : public FileHandle {
  ReadCacheFileHandle() { /* increase ref count in unlogged table */ }
  ~ReadCacheFileHandle() { /* decrease ref count in unlogged table */ }
};

class ColumnstoreFileSystem : public FileSystem {
  Open() {
    if read operation and open for remote storage
      return ReadCacheFileHandle
    else
      return whatever virtual filesystem gives
  }

  // Implement read-related operations with cache enabled
  // Delegate all other operations to virtual filesystem

  // Has-a virtual filesystem instance to delegate operations.
  VirtualFileSystem vfs_;
};
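
A compilable take on the handle idea, as a C++ sketch only. `RefCountTable` is an in-process stand-in for the unlogged table and `CacheFileRef` plays the role of `ReadCacheFileHandle` without inheriting from DuckDB's `FileHandle`; both names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// In-process stand-in for the unlogged table (assumption for illustration only).
class RefCountTable {
public:
    void Increment(const std::string &file) {
        std::lock_guard<std::mutex> guard(mu_);
        ++counts_[file];
    }
    void Decrement(const std::string &file) {
        std::lock_guard<std::mutex> guard(mu_);
        auto it = counts_.find(file);
        assert(it != counts_.end() && it->second > 0);
        if (--it->second == 0) {
            counts_.erase(it);
        }
    }
    int Get(const std::string &file) const {
        std::lock_guard<std::mutex> guard(mu_);
        auto it = counts_.find(file);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    mutable std::mutex mu_;
    std::map<std::string, int> counts_;
};

// RAII reference: takes a reference at construction, releases it at destruction.
class CacheFileRef {
public:
    CacheFileRef(RefCountTable &table, std::string file) : table_(table), file_(std::move(file)) {
        table_.Increment(file_);
    }
    ~CacheFileRef() { table_.Decrement(file_); }
    CacheFileRef(const CacheFileRef &) = delete;
    CacheFileRef &operator=(const CacheFileRef &) = delete;

private:
    RefCountTable &table_;
    std::string file_;
};
```

Copying is disabled so that one handle always maps to exactly one reference. The destructor only runs on normal scope exit, so this alone does not release references held by a process that dies abruptly.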

2 replies

dpxcc Jan 8, 2025
Maintainer

The overall plan looks good to me. It would be helpful to include more details about where and how files are downloaded and cached.

I just realized that the ref counting might be more tricky - Postgres is process-based, and when a process crashes, it won't run ReadCacheFileHandle::~ReadCacheFileHandle() and will leak ref count in unlogged table.

dentiny Jan 8, 2025
Collaborator Author

I will write up a state machine transition table before the second PR (the one for reference counting).

dentiny · 2025-01-07T11:11:36Z

dentiny
Jan 7, 2025
Collaborator Author

There are some other details worth noting:

Cache all files on disk, vs cache data file on disk and metadata file in memory
Cache the whole file, vs cache only the file blocks that are read, doing the alignment ourselves
Read the whole file, vs read the file in chunks (somewhat related to the above item)
Use a global thread pool to read, vs spawn new threads for every single request

I think for the initial version, it's ok to

Cache all files on disk; later we could cache metadata both on disk and in memory
Read the whole file in one request; an easy follow-up is to read in 2MB chunks, which is recommended by GCS (Google Cloud Storage)
Cache the whole file for simplicity; caching only some of the file blocks requires further investigation of the access pattern
Spawn new threads for every single request; thread-pool-bounded IO and fancy asynchronous IO could be implemented later if we hit resource problems (i.e. either too many TCP connections or too many threads)
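
For the chunked-read follow-up, the only real logic is splitting a known file size into byte ranges. A small sketch, where `ByteRange` and `SplitIntoChunks` are illustrative names and the 2 MiB default just mirrors the 2MB figure above:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One contiguous range of a remote file: [offset, offset + length).
struct ByteRange {
    uint64_t offset;
    uint64_t length;
};

// Split a file of `file_size` bytes into consecutive ranges of at most
// `chunk_size` bytes. The last range may be shorter than the others.
std::vector<ByteRange> SplitIntoChunks(uint64_t file_size, uint64_t chunk_size = 2ULL * 1024 * 1024) {
    assert(chunk_size > 0);
    std::vector<ByteRange> ranges;
    for (uint64_t offset = 0; offset < file_size; offset += chunk_size) {
        ranges.push_back({offset, std::min(chunk_size, file_size - offset)});
    }
    return ranges;
}
```

Each range can then be issued as an independent request, whether from a fresh thread per request or from a bounded pool later on.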

Implementation-wise, I think it could be split into two PRs:

One for filesystem instance, which does the basic read cache with no reclamation;
Another for unreferenced file reclamation.

On testing: for storage / filesystem related code, I think it's better to have C++ tests via gtest rather than only SQL tests.

1 reply

dpxcc Jan 8, 2025
Maintainer

I agree with the initial version's scope and the proposed split of PRs
And yes, gtest-based unit testing seems more suitable for this filesystem

Cache all files on disk, vs cache data file on disk and metadata file in memory

For context, Parquet metadata is already stored on disk in mooncake.data_files's file_metadata: #58
DuckDB has native ways to cache this metadata in memory: https://github.com/duckdb/duckdb/blob/19864453f7d0ed095256d848b46e7b8630989bac/extension/parquet/parquet_reader.cpp#L550 (though we are not enabling it right now)
This logic can be handled orthogonally to this work

dentiny · 2025-01-08T09:36:09Z

dentiny
Jan 8, 2025
Collaborator Author

Thanks for giving me the opportunity!

Timeline-wise, I will try to get a PR out this week; definitely let me know if my progress is slow and there's release pressure.

1 reply

dpxcc Jan 8, 2025
Maintainer

Thank you very much for the contribution!
There’s no rush on this one. We’re releasing v0.1 this week, and the read cache feature will be included in v0.2.

ritwizsinha · 2025-01-17T07:10:24Z

ritwizsinha
Jan 17, 2025

I have an elementary question regarding a small part of the design. In pg_mooncake and in duckdb, why do we create files on disk and then use the filesystem API to read and write them? I know it might be easier and more convenient, but if we instead memory-map the files (mmap), wouldn't writes be more efficient by avoiding switches between user space and kernel space? This may be a textbook idea that is not very efficient in practice, but I think postgres does that as well?

2 replies

dentiny Jan 17, 2025
Collaborator Author

Thanks for the question! There are a few considerations I've thought about:

(major) Andy wrote a paper on several disadvantages of mmap: https://www.cidrdb.org/cidr2022/papers/p13-crotty.pdf
(major) As you mentioned, ease of implementation; using the read/write implementation within the duckdb source code with our own thin cache wrapper is the easiest and lightest way;
- You mentioned pg uses an mmap-based solution; could you please point me to some code references and docs so I could learn more about it?
(major) Somewhat related to my point above, the read cache is meant to provide much better performance than remote access; generally remote access takes 10-200 milliseconds, while disk access takes ~10 microseconds. The focus of the read cache should be how to improve cache efficiency and thus avoid remote access, rather than improving already-good-enough local disk access latency;
- I think there are a number of items we could work on and investigate (see the items I listed above)
- Eg. apart from disk-based cache, one thing I think valuable is whether we could cache frequently-access metadata in memory
- I'm not saying disk access latency is not important, but I would like to defer it until we find disk access to be the bottleneck and have worked through other low-hanging fruit
(minor) duckdb's local filesystem allows direct IO, which I intentionally disabled, so some local file accesses would hit the page cache (but yeah, there are some caveats, like memory overhead)

dpxcc Jan 18, 2025
Maintainer

Yes, @dentiny summarized it well. Andy's point is that databases should implement their own buffer pool rather than relying on mmap. The main drawback of mmap is that munmap() has a mutex and you incur a lot of contention

In the case of Postgres, the mutex issue is likely less problematic because Postgres is process-based. However, it seems that Postgres default Storage Manager, md (Magnetic Disk), still uses preadv() to implement its mdreadv() routine that is used by ReadBuffer()

Eg. apart from disk-based cache, one thing I think valuable is whether we could cache frequently-access metadata in memory

We added a GUC in v0.1.0 to cache Parquet metadata in memory (5fa291a), resulting in ~5% improvement on ClickBench

dentiny · 2025-01-24T10:40:32Z

dentiny
Jan 24, 2025
Collaborator Author

Design proposal for the read cache cleanup.

High-level idea before going into the details:

Attempt to clean up unreferenced read cache files periodically, or when there's not enough space on the local filesystem
Record the reference count for read cache files in use, so that a cache file in use will not be deleted
Failure could happen at any time, for example in the middle of a download; so for all operations we record a start timestamp and apply a reasonable timeout (e.g. several hours) to tell whether the operation is stale and thus invalid

For performance, use a postgres unlogged table to record read cache files; it serves two purposes:

It can be accessed from multiple postgres processes (I confirmed via psql)
It provides much better performance, since writes to an unlogged table skip the WAL

Here's the table schema:

(PK) cache filename: string type, the filename of the local read cache file; it's deterministically derived from the remote filename
An array of cache records being referenced; each record is a struct with a few data members
- cache read ID: a UUID identifying one cache read
- state: either ONGOING (the file download is in progress) or COMPLETED (the file download has completed)
- begin timestamp: the time point when the cache read operation was initiated; it's used to tell whether a reference is still valid based on a reasonable timeout
- temporary read cache filename: to download the read cache without interference from other processes / threads, we first download to a temporary file, then atomically rename it to the target; the temporary filename is recorded here so that if an operation is judged stale and invalid, the cleanup thread is able to delete the temporary file
is deleting: if true, the read cache is being deleted
deletion timestamp: the start timestamp of the deletion; used to tell whether the deletion failed midway and is thus invalid
An array of access timestamps, used for the eviction policy (i.e. LRU-K); every time a new cache operation is created (aka every time a new reference is created), we update the access timestamp array
For easier review, the whole table schema is
- cache filename, array of <cache read ID, state, begin timestamp, temporary filename>, is deleting, deletion timestamp, array of access timestamps
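
For readers who prefer types, the row layout can be written down as plain C++ structs. This is an in-memory illustration only: the field names, the `Timestamp` alias, and the `IsStale` / `RefCount` helpers are my own naming, not an existing pg_mooncake definition.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

using Timestamp = std::chrono::system_clock::time_point;

enum class ReadState { ONGOING, COMPLETED };

// One reference to a cache file, i.e. one cache read operation.
struct CacheReadRecord {
    std::string cache_read_id;  // UUID identifying this cache read
    ReadState state;            // ONGOING while downloading, COMPLETED afterwards
    Timestamp begin_timestamp;  // when the cache read was initiated
    std::string temp_filename;  // download target before the atomic rename
};

// One row of the table, keyed by the local cache filename.
struct CacheFileRow {
    std::string cache_filename;                // primary key
    std::vector<CacheReadRecord> references;   // live cache read operations
    bool is_deleting = false;                  // true while cleanup removes the file
    Timestamp deletion_timestamp{};            // when the deletion started
    std::vector<Timestamp> access_timestamps;  // input for LRU-K eviction
};

// An operation is treated as stale once it has been running longer than `timeout`.
inline bool IsStale(Timestamp begin, Timestamp now, std::chrono::seconds timeout) {
    return now - begin > timeout;
}

// The reference count of a cache file is the number of recorded cache reads.
inline std::size_t RefCount(const CacheFileRow &row) {
    return row.references.size();
}
```

The same staleness check applies to both the begin timestamp of a cache read and the deletion timestamp of a row.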

At the start of a query, when a remote file is accessed, we follow these steps to cache remote files locally:

If is deleting is true and the deletion timestamp is fresh enough, someone else is deleting the local read cache; we don't cache in this case and pay for a remote read
Otherwise, read cache is possible: insert a new read cache operation with state ONGOING, and update the access timestamp array
If the file doesn't exist in the filesystem, download the whole file (via the temporary file recorded in the schema) into the deterministic target file
- Here we assume the remote object file is immutable, thus safe to overwrite
Update the read cache operation entry's state to COMPLETED, which means the file download has succeeded and the reference count is held
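
The download step relies on writing to a temporary file and renaming it into place. A minimal C++17 sketch of that publish step, with the download itself replaced by an in-memory `content` string; `PublishCacheFile` is a hypothetical helper name.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Write `content` to `temp_path`, then rename it to `target_path`.
// On POSIX systems the rename is atomic when both paths are on the same
// filesystem, so readers see either no target file or the complete one.
// Returns false (and removes the temporary file) on any failure.
bool PublishCacheFile(const std::string &content, const fs::path &temp_path, const fs::path &target_path) {
    std::error_code ec;
    std::ofstream out(temp_path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;
    }
    out.write(content.data(), static_cast<std::streamsize>(content.size()));
    out.close();
    if (out.fail()) {
        fs::remove(temp_path, ec);
        return false;
    }
    fs::rename(temp_path, target_path, ec);
    if (ec) {
        std::error_code ignored;
        fs::remove(temp_path, ignored);
        return false;
    }
    return true;
}
```

Because the remote object is assumed immutable, two processes racing on the same target both rename identical content, so whichever rename lands last is harmless.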

At the end of the query, decrement the reference count for all involved local read cache files:

Delete the cache operation by cache read ID, which decreases the reference count for the cache file (which is guaranteed to exist) by 1

Cache cleanup logic is triggered periodically, and whenever there's not enough disk space:

Acquire a table-wide write lock to get (1) cache entries to evict based on the LRU-K algorithm; (2) temporary local files which belong to an invalid cache read operation;
- Only invalid temporary read cache files and unreferenced cache entries can be deleted
- For the files we have decided to delete, set is deleting to true and set the deletion timestamp, which serves as a mutex to prevent concurrent read cache downloads
For each unreferenced cache entry we decide to delete, apply the following operations
- Delete the file to free disk space, clean up invalid read cache operations, or delete the whole row if possible
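
Victim selection for the cleanup could look like the sketch below. It follows the usual LRU-K rule (evict the entry whose K-th most recent access is oldest, treating entries with fewer than K accesses as oldest) and skips anything still referenced. `CacheEntry`, `KthRecentAccess`, and `PickEvictionVictim` are illustrative names, and access times are assumed to be stored in ascending order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>
#include <vector>

struct CacheEntry {
    std::string filename;
    int ref_count;                      // live references; must be 0 to evict
    std::vector<int64_t> access_times;  // ascending, e.g. microseconds since epoch
};

// K-th most recent access time; entries with fewer than K accesses sort first.
int64_t KthRecentAccess(const CacheEntry &entry, std::size_t k) {
    if (entry.access_times.size() < k) {
        return std::numeric_limits<int64_t>::min();
    }
    return entry.access_times[entry.access_times.size() - k];
}

// Pick the unreferenced entry with the oldest K-th most recent access.
// Returns std::nullopt when every entry is still referenced.
std::optional<std::string> PickEvictionVictim(const std::vector<CacheEntry> &entries, std::size_t k) {
    assert(k > 0);
    const CacheEntry *victim = nullptr;
    for (const auto &entry : entries) {
        if (entry.ref_count > 0) {
            continue;  // in use by some query, never evict
        }
        if (victim == nullptr || KthRecentAccess(entry, k) < KthRecentAccess(*victim, k)) {
            victim = &entry;
        }
    }
    if (victim == nullptr) {
        return std::nullopt;
    }
    return victim->filename;
}
```

With K = 1 this degenerates to plain LRU; a larger K keeps a file that was touched once by a one-off scan from pushing out files that are read repeatedly.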

Since we use an unlogged postgres table for performance, its contents do not survive a crash (postgres truncates unlogged tables after a crash or unclean shutdown), so on startup we need to

On pg_mooncake plugin startup, if the table is missing or empty, reconstruct it by loading all cached file entries
- We use the file modification timestamp as a rough estimate of the read cache's access timestamp (i.e. for the LRU-K eviction policy)
Acquire a table-wide write lock during reconstruction, so there are no concurrent accesses during the process

A quick and simple performance analysis:

Use an unlogged table, so table operations skip the WAL
On the read critical path, unlogged table rows are accessed three times

Potential improvements:

The protocol is a classic state machine problem, which might be nasty to debug in production; we should use postgres logging to record state transitions (basically the initial state + input) so devs can debug it later
The overall design is whole-file based; for example, the primary key is the read cache filename. As I mentioned above, a block-based read cache could have better performance, depending on the workload characteristics. In that case, the design could be adapted relatively easily by using a deterministic read cache block name as the primary key, with little code change in the state machine implementation.

Testing:

The hardest part lies in testing; we should use mocks to test every state transition in unit tests, rather than SQL tests

0 replies

YuweiXiao · 2025-02-13T07:08:59Z

YuweiXiao
Feb 13, 2025

Two concerns regarding the RFC:

Does this approach work with a postgres standby instance? Specifically, can we modify the unlogged cache table on the standby?
How do we manage concurrent modifications to the cache table (under high QPS)? I suspect dirty reads might be involved in the read path, which feels somewhat hacky and may introduce unexpected behavior. And letting a query issue DML internally also seems weird.

1 reply

dentiny Feb 13, 2025
Collaborator Author

Thanks for the comment, I've changed the design.

My PR only adds the caching layer, not eviction
The need for a read cache comes from duckdb httpfs not supporting caching, so I wrote my own extension, which evicts based on file access time; any feedback is welcome :)
