Conversation

@mohamedawnallah
Contributor

Description

This PR improves error handling by failing gracefully when header writes fail. This helps prevent unexpected EOF errors when the partially written headers are later read back.
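
For orientation, here is a minimal sketch of the general pattern being discussed: record where the append starts, write the serialized headers, and if the write fails or is partial, truncate the file back to its previous size so no half-written header is left on disk. The actual appendRaw change in this PR may differ in its details; the package name and error messages below are illustrative only.

package headerfs_sketch

import (
    "fmt"
    "io"
    "os"
)

// appendRaw appends raw header bytes to the end of the header file. On a
// failed or partial write it rolls the file back to its previous size so
// that no truncated header is left behind.
func appendRaw(f *os.File, headers []byte) error {
    // Remember where this write starts so we can roll back on failure.
    start, err := f.Seek(0, io.SeekEnd)
    if err != nil {
        return fmt.Errorf("unable to seek to end of header file: %w", err)
    }

    n, err := f.Write(headers)
    if err != nil {
        // Best-effort rollback: drop whatever partial data was written.
        if truncErr := f.Truncate(start); truncErr != nil {
            return fmt.Errorf("wrote %d of %d header bytes: %v; "+
                "rollback truncate failed: %w", n, len(headers), err,
                truncErr)
        }
        return fmt.Errorf("wrote %d of %d header bytes: %w", n,
            len(headers), err)
    }

    return nil
}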

if err != nil {
    t.Fatalf("Failed to read file: %v", err)
}

Member

Related to my other suggestion: can we also test that if we go down in a partially truncated state, then once we come up we're able to recover? May warrant an entirely distinct change.

Contributor Author

@mohamedawnallah mohamedawnallah May 13, 2025

My current understanding is:

  • A Truncate operation is a fatal error.
  • If a Truncate error occurs, it likely indicates a serious issue.
  • The probability of both a partial write failure and a truncate failure occurring together is low, and that case would typically require manual intervention.

Possible mitigation:

  • Peer Recovery of Invalid/Incomplete Tail Headers:
    • Request the missing or invalid/incomplete tail header from peers upon detection during reads and index it in the store.
  • In-Place Recovery During Reads (see the sketch after this list):
    • If the data read is not a complete header (32 bytes for a filter header, 80 bytes for a block header), remove it from the binary file.
    • The idea is to handle incomplete or invalid entries immediately (i.e., perform a delete operation during the read), so the same issue doesn't recur on the next startup.
    • This serves as a recovery mechanism if the data is recoverable, depending on the mix of operations.
    • However, if the delete operation itself fails, it's unclear how recovery would be handled in that scenario.
  • If we simply choose to ignore the incomplete or invalid entry, we risk leaving the binary file in an inconsistent state: the invalid data would remain in the headers binary file, yet we might still be able to read it successfully.
  • This approach assumes the partial write entry was not fully truncated, since a truncate failure should have already been reported as an error.
  • My concern is that those methods might mask the root cause of the issue, rather than addressing it directly.
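
To make the in-place idea above concrete, here is a minimal sketch of trimming a partially written tail entry, assuming a plain file-backed store where headerSize is 80 for block headers and 32 for filter headers (names are illustrative, not the store's actual API):

package headerfs_sketch

import (
    "fmt"
    "os"
)

// trimPartialTail drops a partially written tail entry by truncating the
// header file back down to the largest multiple of headerSize.
func trimPartialTail(f *os.File, headerSize int64) error {
    fi, err := f.Stat()
    if err != nil {
        return fmt.Errorf("unable to stat header file: %w", err)
    }

    // A size that is an exact multiple of headerSize means the tail entry
    // is complete and there is nothing to recover.
    partial := fi.Size() % headerSize
    if partial == 0 {
        return nil
    }

    // Otherwise drop the incomplete tail entry so the next read starts
    // from the last fully written header.
    return f.Truncate(fi.Size() - partial)
}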

What do you think?

Member

Request the missing or invalid/incomplete tail header from peers upon detection during reads and index it in the store.

I don't think this makes sense. At this point, we have already fetched the headers.

The probability of both a partial write failure and a truncate failure occurring together is low, and that case would typically require manual intervention.

This is precisely what we're attempting to solve. Consider that a user running w/o this patch might have a partial write. Today we require them to go in and delete the file manually. On a mobile platform, this isn't feasible for end users.

If we can make sure both that we write properly and that we can recover from botched writes (rn we fail on read), then we're able to cover all bases.

Contributor Author

@mohamedawnallah mohamedawnallah May 13, 2025

Consider that a user running w/o this patch might have a partial write. Today we require them to go in and delete the file manually. On a mobile platform, this isn't feasible for end users. If we can make sure both that we write properly and that we can recover from botched writes (rn we fail on read), then we're able to cover all bases.

Thanks, that makes sense. I have opened issue #315 so we can track the recovery-on-read for partially written headers. That said, if you want me to approach that recovery on read in a specific way, I'd be happy to hear it.

@mohamedawnallah mohamedawnallah force-pushed the failGracefullyOnHeaderWrite branch 3 times, most recently from 351e6cd to 91bd469 Compare May 13, 2025 07:16
@mohamedawnallah mohamedawnallah requested a review from Roasbeef May 13, 2025 07:33
@Roasbeef
Member

Can we make a basic benchmark here to gauge if this affects write perf in a traditional setup?

@mohamedawnallah
Contributor Author

mohamedawnallah commented May 13, 2025

Can we make a basic benchmark here to gauge if this affects write perf in a traditional setup?

That makes sense, thanks. I have added a BenchmarkHeaderStoreAppendRaw benchmark. Here is the report:

Without this PR (commit afcfeb1)

$ go test -bench=BenchmarkHeaderStoreAppendRaw -benchmem
goos: linux
goarch: amd64
pkg: github.com/lightninglabs/neutrino/headerfs
cpu: AMD EPYC-Rome-v4 Processor (no XSAVES)
BenchmarkHeaderStoreAppendRaw-4          1083670              1214 ns/op               0 B/op          0 allocs/op
PASS
ok      github.com/lightninglabs/neutrino/headerfs      2.521s

With this PR

$ go test -bench=BenchmarkHeaderStoreAppendRaw -benchmem
goos: linux
goarch: amd64
pkg: github.com/lightninglabs/neutrino/headerfs
cpu: AMD EPYC-Rome-v4 Processor (no XSAVES)
BenchmarkHeaderStoreAppendRaw-4           951910              1305 ns/op               0 B/op          0 allocs/op
PASS
ok      github.com/lightninglabs/neutrino/headerfs      2.401s

Benchmarking Report - Generated by Gemini

Based on the benchmark results:

Before the change:

  • 1,083,670 operations
  • 1,214 ns/op
  • 0 B/op (no memory allocations)

After the change:

  • 951,910 operations
  • 1,305 ns/op
  • 0 B/op (no memory allocations)

Analysis:

  1. Performance impact: The new implementation is about 7.5% slower (1,305 ns vs 1,214 ns per operation)
  2. Throughput decrease: The number of operations completed decreased by about 12.2%
  3. Memory usage: Both implementations are equally memory-efficient with zero allocations

Conclusion:
The added error handling with the Seek operation does introduce a small performance overhead of approximately 91 nanoseconds per operation. Given that:

  1. This is a very modest performance impact (~7.5%)
  2. The code now properly handles partial writes and maintains data integrity
  3. The error messages are much more informative
  4. There's still no memory allocation overhead

The tradeoff appears reasonable. The performance cost is minimal while the reliability improvements are significant. For a storage operation where data integrity is critical, this seems like an acceptable performance trade-off to gain much better error handling and recovery capability.
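
For reference, a minimal sketch of what a single-header append benchmark along these lines could look like, reusing the illustrative appendRaw sketch from the description above (the real BenchmarkHeaderStoreAppendRaw in this PR is written against the actual header store and may differ):

package headerfs_sketch

import (
    "os"
    "path/filepath"
    "testing"
)

// BenchmarkAppendRawSingle appends one 80-byte header per iteration, so
// every append pays the full per-call cost (including the extra Seek).
func BenchmarkAppendRawSingle(b *testing.B) {
    path := filepath.Join(b.TempDir(), "headers.bin")
    f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0644)
    if err != nil {
        b.Fatalf("unable to open header file: %v", err)
    }
    defer f.Close()

    header := make([]byte, 80)

    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if err := appendRaw(f, header); err != nil {
            b.Fatalf("unable to append header: %v", err)
        }
    }
}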

@mohamedawnallah
Contributor Author

mohamedawnallah commented May 20, 2025

From the benchmarking results, it appears there's overhead from the seek operation. However, this is somewhat misleading because the benchmark calls appendRaw on only a single header (creating a one-to-one mapping between seek operations and headers). In the actual codebase, this operation is performed on batches, meaning one seek operation handles multiple headers. Therefore, the observed 7.5% performance impact is significantly lower in real-world usage.

neutrino/headerfs/store.go

Lines 402 to 406 in c932ae4

// With all the headers written to the buffer, we'll now write out the
// entire batch in a single write call.
if err := h.appendRaw(headerBuf.Bytes()); err != nil {
    return err
}

neutrino/headerfs/store.go

Lines 867 to 871 in c932ae4

// With all the headers written to the buffer, we'll now write out the
// entire batch in a single write call.
if err := f.appendRaw(headerBuf.Bytes()); err != nil {
    return err
}
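
As a rough back-of-the-envelope check, assuming the ~91 ns/op delta from the single-header benchmark above is a fixed per-call cost, amortizing it over a full 2000-header batch written with one appendRaw call gives:

    91 ns / 2000 headers ≈ 0.05 ns per header

which is why the overhead all but disappears for batched writes.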

I think integrating this code change in the codebase would be fine. What do you think @Roasbeef?

Member

@Roasbeef Roasbeef left a comment

LGTM 🌾

@ziggie1984 ziggie1984 self-requested a review May 21, 2025 08:12
Contributor

@ziggie1984 ziggie1984 left a comment

Nice work ⚡️, had some minor questions; other than that, this is good to go.

Could you also add a linter routine for the 80-char line limit so that we adhere to it across all our projects?


// BenchmarkHeaderStoreAppendRaw measures performance of headerStore.appendRaw
// by writing 80-byte headers to a file and resetting position between writes
// to isolate raw append performance from file size effects.
Contributor

Are we confident in the results of the benchmark? Do we lose some performance now with the enhanced partial-write checking?

Contributor Author

@mohamedawnallah mohamedawnallah May 26, 2025

The performance impact depends on how many headers we receive from getheaders or getcfheaders. The maximum number of headers for both block headers and filter headers is configured as 2000 in btcsuite/btcd, which is also the limit Neutrino uses.

When I benchmarked appendRaw with 1000 headers per call before and after this patch, the performance was nearly the same. This is because the seek happens only once per batch, so its per-header cost is significantly reduced.

However, in the worst case—when receiving only a single header via getheaders or getcfheaders—the performance cost is similar to the benchmark above, and we lose roughly 7% performance. This depends on various factors.

Overall, I think this is okay because we trade off a bit of write performance for improved data integrity during writes.

Batched headers from btcsuite/btcd used in Neutrino:

// MsgHeaders implements the Message interface and represents a bitcoin
// headers message. It delivers block header information in response to a
// getheaders message (MsgGetHeaders). The maximum number of block headers
// per message is currently 2000. See MsgGetHeaders for more details.
type MsgHeaders struct {
    Headers []*BlockHeader
}

// MsgCFHeaders implements the Message interface and represents a bitcoin
// cfheaders message. It delivers committed filter header information in
// response to a getcfheaders message (MsgGetCFHeaders). The maximum number
// of committed filter headers per message is currently 2000. See
// MsgGetCFHeaders for details.
type MsgCFHeaders struct {
    FilterType       FilterType
    StopHash         chainhash.Hash
    PrevFilterHeader chainhash.Hash
    FilterHashes     []*chainhash.Hash
}
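
And a sketch of the batched variant described above, where many headers are written with a single appendRaw call so the per-call overhead is amortized across the batch (again reusing the illustrative appendRaw and the same imports as the earlier benchmark sketch; the 1000-header batch mirrors the ad-hoc test mentioned, not the exact benchmark in the PR):

// BenchmarkAppendRawBatch appends 1000 serialized 80-byte headers per
// appendRaw call, so the extra Seek is paid once per batch rather than
// once per header.
func BenchmarkAppendRawBatch(b *testing.B) {
    path := filepath.Join(b.TempDir(), "headers.bin")
    f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0644)
    if err != nil {
        b.Fatalf("unable to open header file: %v", err)
    }
    defer f.Close()

    // One contiguous buffer holding a full batch of headers, mirroring how
    // the store serializes a batch into a single buffer before writing.
    batch := make([]byte, 1000*80)

    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if err := appendRaw(f, batch); err != nil {
            b.Fatalf("unable to append batch: %v", err)
        }
    }
}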

Contributor

Do you know what the defaults for bitcoin-core are?

Contributor Author

@mohamedawnallah mohamedawnallah Jun 3, 2025

Do you know what the defaults for bitcoin-core are?

It uses the following:

/** Maximum number of compact filters that may be requested with one getcfilters. See BIP 157. */
static constexpr uint32_t MAX_GETCFILTERS_SIZE = 1000;
/** Maximum number of cf hashes that may be requested with one getcfheaders. See BIP 157. */
static constexpr uint32_t MAX_GETCFHEADERS_SIZE = 2000;

@mohamedawnallah mohamedawnallah force-pushed the failGracefullyOnHeaderWrite branch 2 times, most recently from ea1821b to 3f075e0 Compare May 26, 2025 22:59
@mohamedawnallah
Contributor Author

Thanks @ziggie1984 for the feedback. Regarding the linter routine for 80 chars: it looks like this would block adding that check?

neutrino/.golangci.yml

Lines 25 to 27 in c932ae4

# Some lines are over 80 characters on purpose and we don't want to make them
# even longer by marking them as 'nolint'.
- lll

@ziggie1984
Contributor

Added the column linter here: #318

This commit adds `FileInterface`, which makes it easier to mock the behavior of the file header store while testing.
@mohamedawnallah mohamedawnallah force-pushed the failGracefullyOnHeaderWrite branch 3 times, most recently from 9e7798c to 9b69343 Compare June 3, 2025 11:46
Contributor

@ziggie1984 ziggie1984 left a comment

LGTM, pending godoc.

This commit adds a recovery mechanism for failures that may happen during header I/O writes.
@mohamedawnallah mohamedawnallah force-pushed the failGracefullyOnHeaderWrite branch from 9b69343 to b7bf07a Compare June 3, 2025 14:48
@guggero guggero added this pull request to the merge queue Jun 3, 2025
Merged via the queue into lightninglabs:master with commit 27977e1 Jun 3, 2025
4 checks passed