headerfs: fail gracefully on header write #313
Conversation
if err != nil {
	t.Fatalf("Failed to read file: %v", err)
}
Related to my other suggestion: can we also test that if we go down in a partially truncated state, then once we come up we're able to recover? May warrant an entirely distinct change.
My current understanding is:
- A failed Truncate operation is a fatal error.
- If a Truncate error occurs, it likely indicates a serious issue.
- The probability of both a partial write failure and a truncate failure occurring together is low, and that case would typically require manual intervention.
Possible mitigations:
- Peer recovery of invalid/incomplete tail headers: request the missing or invalid/incomplete tail header from peers when it is detected on read, and index it in the store.
- In-place recovery during reads (see the sketch below):
  - If the data read is not a complete header (32 bytes for a filter header, 80 bytes for a block header), remove it from the binary file.
  - The idea is to handle incomplete or invalid entries immediately, i.e. perform a delete as part of the read path, so the same issue doesn't recur on the next startup.
  - This serves as a recovery mechanism when the data is recoverable, depending on the mix of operations.
  - However, if the delete operation itself fails, it's unclear how recovery would be handled in that scenario.
  - If we simply ignore the incomplete or invalid entry, we risk leaving the binary file in an inconsistent state: the invalid data would remain in the headers binary file, yet we might still be able to read it successfully.
  - This approach assumes the partially written entry was not fully truncated, since a truncate failure should have already been reported as an error.

My concern is that these methods might mask the root cause of the issue rather than addressing it directly.
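For illustration only, a rough sketch of the read-time truncation idea; the function and parameter names here are hypothetical and not part of the actual headerfs code:

package headerfs

import "os"

// recoverPartialTail checks whether the header file ends on a partial
// entry and, if so, truncates the file back to the last complete
// header. headerSize would be 80 for block headers and 32 for filter
// headers.
func recoverPartialTail(f *os.File, headerSize int64) error {
	fi, err := f.Stat()
	if err != nil {
		return err
	}

	// Any bytes beyond a multiple of headerSize belong to a partially
	// written tail entry.
	extra := fi.Size() % headerSize
	if extra == 0 {
		return nil
	}

	// Drop the incomplete tail so subsequent reads start from a
	// consistent state. If this fails, the caller has to surface the
	// error, which is the "truncate failure is fatal" case above.
	return f.Truncate(fi.Size() - extra)
}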
What do you think?
Request the missing or invalid/incomplete tail header from peers when it is detected on read, and index it in the store.
I don't think this makes sense. At this point, we have already fetched the headers.
The probability of both a partial write failure and a truncate failure occurring together is low, and that case would typically require manual intervention.
This is precisely what we're attempting to solve. Consider that a user running w/o this patch might have a partial write. Today we require them to go in and delete the file manually. On a mobile platform, this isn't feasible for end users.
If we can make sure both that we write properly and that we can recover from botched writes (right now we fail on read), then we're able to cover all bases.
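As a concrete illustration of the write-side half, a minimal sketch of the rollback idea; appendHeader is a hypothetical helper, not the PR's actual code, and the real appendRaw may do this differently:

package headerfs

import (
	"fmt"
	"io"
	"os"
)

// appendHeader appends one fixed-size serialized header to the store
// file. If the write fails or is short, the file is truncated back to
// its previous size so no partial header is left on disk for readers
// to trip over.
func appendHeader(f *os.File, header []byte) error {
	// Remember where the file ended before this write so we have a
	// rollback point.
	prevSize, err := f.Seek(0, io.SeekEnd)
	if err != nil {
		return fmt.Errorf("unable to seek to end of file: %w", err)
	}

	n, writeErr := f.Write(header)
	if writeErr == nil && n == len(header) {
		return nil
	}
	if writeErr == nil {
		writeErr = io.ErrShortWrite
	}

	// Roll the file back to its pre-write size. If the truncate
	// itself fails, all we can do is report both errors.
	if truncErr := f.Truncate(prevSize); truncErr != nil {
		return fmt.Errorf("header write failed (%v) and rollback "+
			"failed: %w", writeErr, truncErr)
	}

	return fmt.Errorf("partial header write rolled back: %w", writeErr)
}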
Consider that a user running w/o this patch might have a partial write. Today we require them to go in and delete the file manually. On a mobile platform, this isn't feasible for end users. If we can make sure both that we write properly and that we can recover from botched writes (right now we fail on read), then we're able to cover all bases.
Thanks, that makes sense. I have opened issue #315 so we can track recovery on read for partially written headers. That said, if you want me to approach that recovery on read in a specific way, I'd be happy to hear it.
351e6cd to 91bd469
Can we make a basic benchmark here to gauge if this affects write perf in a traditional setup?
8d5bacd to 0743efd
That makes sense, thanks. I have added that.

Benchmarking Report - Generated by Gemini, comparing a run without this PR (commit afcfeb1) against a run with this PR. (The before/after benchmark output and the analysis section were posted as collapsed details.)

Conclusion: The tradeoff appears reasonable. The performance cost is minimal while the reliability improvements are significant. For a storage operation where data integrity is critical, this seems like an acceptable performance trade-off to gain much better error handling and recovery capability.
From the benchmarking results, it appears there's overhead from the seek operation. However, this is somewhat misleading because of how the benchmark exercises the code at lines 402 to 406 and lines 867 to 871 in c932ae4: it resets the position between writes, so the seek runs on every single write, which isn't how headers are written during a normal sync.

I think integrating this code change into the codebase would be fine. What do you think @Roasbeef?
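For comparison, a sketch of how a batch-oriented benchmark could look; it is hypothetical, assumes the appendHeader-style helper sketched earlier in this thread, and is not the benchmark that was actually added:

package headerfs

import (
	"os"
	"path/filepath"
	"testing"
)

// BenchmarkAppendHeaderBatched appends headers back to back without
// resetting the file position between writes, writing a full batch of
// 2000 headers per iteration to mirror the maximum size of a
// getheaders response.
func BenchmarkAppendHeaderBatched(b *testing.B) {
	const batchSize = 2000

	f, err := os.Create(filepath.Join(b.TempDir(), "headers.bin"))
	if err != nil {
		b.Fatal(err)
	}
	defer f.Close()

	header := make([]byte, 80)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for j := 0; j < batchSize; j++ {
			if err := appendHeader(f, header); err != nil {
				b.Fatal(err)
			}
		}
	}
}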
Roasbeef left a comment
LGTM 🌾
ziggie1984 left a comment
Nice work ⚡️, had some minor questions; other than that this is good to go.
Could you also add a linter rule for the 80-character line limit, so that we adhere to it across all our projects?
// BenchmarkHeaderStoreAppendRaw measures performance of headerStore.appendRaw
// by writing 80-byte headers to a file and resetting position between writes
// to isolate raw append performance from file size effects.
Are we confident in the results of the benchmark? Do we lose some performance now with the enhanced partial-write checking?
The performance impact depends on how many headers we receive from getheaders or getcfheaders. The maximum number of headers for both block headers and filter headers is configured as 2000 in btcsuite/btcd, which is also the limit Neutrino uses.
When I benchmarked appendRaw with 1000 rows before and after this patch, the performance was nearly the same. This is because the cost and frequency of the seek operation are significantly reduced.
However, in the worst case—when receiving only a single header via getheaders or getcfheaders—the performance cost is similar to the benchmark above, and we lose roughly 7% performance. This depends on various factors.
Overall, I think this is okay, because we trade off a bit of optimal write performance for improved data integrity during writes (see the batch-level sketch below).
Batched headers from btcsuite/btcd used in Neutrino:

// MsgHeaders implements the Message interface and represents a bitcoin
// headers message. It delivers block header information in response to
// a getheaders message (MsgGetHeaders). The maximum number of block
// headers per message is currently 2000. See MsgGetHeaders for more
// details.
type MsgHeaders struct {
	Headers []*BlockHeader
}

// MsgCFHeaders implements the Message interface and represents a
// bitcoin cfheaders message. It delivers committed filter header
// information in response to a getcfheaders message (MsgGetCFHeaders).
// The maximum number of committed filter headers per message is
// currently 2000. See MsgGetCFHeaders for details.
type MsgCFHeaders struct {
	FilterType       FilterType
	StopHash         chainhash.Hash
	PrevFilterHeader chainhash.Hash
	FilterHashes     []*chainhash.Hash
}
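Since headers arrive in batches like these, the rollback bookkeeping can be paid once per batch rather than once per header. A hypothetical batch-level variant of the earlier appendHeader sketch (not the PR's actual code) might look like this:

package headerfs

import (
	"fmt"
	"io"
	"os"
)

// appendHeaderBatch writes a whole batch of serialized headers with a
// single pre-write seek. If any write fails, the file is truncated
// back to its pre-batch size, so the cost of the extra seek is
// amortized over up to 2000 headers per getheaders response.
func appendHeaderBatch(f *os.File, headers [][]byte) error {
	prevSize, err := f.Seek(0, io.SeekEnd)
	if err != nil {
		return fmt.Errorf("unable to seek to end of file: %w", err)
	}

	for _, h := range headers {
		if _, err := f.Write(h); err != nil {
			// Roll the whole batch back so the file never ends
			// on a partial header.
			if truncErr := f.Truncate(prevSize); truncErr != nil {
				return fmt.Errorf("write failed (%v) and "+
					"rollback failed: %w", err, truncErr)
			}
			return fmt.Errorf("batch write rolled back: %w", err)
		}
	}

	return nil
}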
Do you know what the defaults for bitcoin-core are?
Do you know what the defaults for bitcoin-core are?
It uses the following:
/** Maximum number of compact filters that may be requested with one getcfilters. See BIP 157. */
static constexpr uint32_t MAX_GETCFILTERS_SIZE = 1000;
/** Maximum number of cf hashes that may be requested with one getcfheaders. See BIP 157. */
static constexpr uint32_t MAX_GETCFHEADERS_SIZE = 2000;

ea1821b to 3f075e0
Thanks @ziggie1984 for the feedback. Regarding the linter routine for 80 chars: it looks like this would block adding that check? See lines 25 to 27 in c932ae4.
Added the column linter here: #318
This commit adds `FileInterface`, which makes it easier to mock the behavior of the file header store while testing.
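For illustration, one plausible shape for such an interface; this is a sketch only, and the method set in the actual commit may differ:

package headerfs

import (
	"io"
	"os"
)

// FileInterface abstracts the subset of *os.File used by the header
// store, so tests can substitute a mock that injects write, seek, or
// truncate failures.
type FileInterface interface {
	io.Writer
	io.ReaderAt
	io.Seeker
	io.Closer

	Truncate(size int64) error
	Stat() (os.FileInfo, error)
}

// *os.File satisfies FileInterface, so production code can keep
// passing a real file.
var _ FileInterface = (*os.File)(nil)

With something like this in place, tests can hand the store a mock whose Write or Truncate returns an injected error to exercise the rollback paths.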
9e7798c to 9b69343
LGTM, pending godoc.
This commit adds a recovery mechanism for failures that may happen during header I/O writes.
9b69343 to b7bf07a
Description
This PR improves error handling by failing gracefully when header writes fail. This helps prevent unexpected EOF errors during the process.
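To illustrate the failure mode being guarded against, a standalone sketch (not code from the PR) showing how reading a partially written 80-byte header surfaces an unexpected EOF:

package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	// Simulate a header file whose last entry was only partially
	// written: 80 bytes expected, 50 bytes on disk.
	partial := bytes.NewReader(make([]byte, 50))

	header := make([]byte, 80)
	_, err := io.ReadFull(partial, header)

	// Prints "unexpected EOF": the partial tail would resurface on
	// every startup until it is removed or rolled back.
	fmt.Println(err)
}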