Skip to content

Corruption bug: log::Reader does not retry short reads #1285

@robofinch

Description

@robofinch

log::Reader uses the SequentialFile::Read method provided by some Env. It seems that the implementations used by LevelDB do at least retry reads on EINTR, but if an interrupt occurs after some but not all of the bytes were read, then the Read call may succeed but return fewer bytes than requested.

You can see that the below code assumes that a successful short read would only ever occur due to EOF:

leveldb/db/log_reader.cc

Lines 194 to 204 in ac69108

buffer_.clear();
Status status = file_->Read(kBlockSize, &buffer_, backing_store_);
end_of_buffer_offset_ += buffer_.size();
if (!status.ok()) {
buffer_.clear();
ReportDrop(kBlockSize, status);
eof_ = true;
return kEof;
} else if (buffer_.size() < kBlockSize) {
eof_ = true;
}

An implementation of SequentialFile::Read:

leveldb/util/env_posix.cc

Lines 142 to 157 in ac69108

Status Read(size_t n, Slice* result, char* scratch) override {
Status status;
while (true) {
::ssize_t read_size = ::read(fd_, scratch, n);
if (read_size < 0) { // Read error.
if (errno == EINTR) {
continue; // Retry
}
status = PosixError(filename_, errno);
break;
}
*result = Slice(scratch, read_size);
break;
}
return status;
}

This seems very bad. I'm not sure what the chance is of an interrupt actually occurring during a 32 KiB read, but any such interrupt could easily lead to data loss. log::Reader::ReadRecord, which can fall victim to the above bug, is called every time a database is opened, to read the current MANIFEST file (and any .log files).

  • If the bug occurs when opening a .log file, some recently-written data is lost (without even being reported as corruption).
  • If an interrupt happens at just the wrong time when opening the current MANIFEST file, it might be reported as corruption (resulting in the database failing to open) if the first VersionEdit of the MANIFEST failed to be read. This is probably the best-case scenario, since attempting to reopen the database would probably show that no data was actually lost.
  • If any VersionEdit after the first in the current MANIFEST fails to be read, it and any following ones are silently not read, losing who-knows-how-many recently-added table files, and possibly resulting in corruption being reported later when the VersionSet thinks some previously-deleted table files should still exist (because a VersionEdit indicating their deletion was not read).

Ideally, SequentialFile::Read should clearly document that its implementors are required to retry on EINTR or equivalent, and as existing implementations may have successful short reads for non-EOF reasons, log::Reader must read data until either a successful read of length 0 occurs (EOF), the full block is successfully read, or an error occurs.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions