Commit 88d6124

Author: David Bailey
v0.5.6: wear level unification, btrfs fallback, installer dedup
Integration: ATA wear level values inverted to percentage-used scale (matching NVMe). Sensor renamed to Wear Level (% Used). ATA SSD wear added to attention logic with >= 90% threshold.

Agent: btrfs statvfs fallback via btrfs-progs when statvfs returns zero. Parser and tests included.

Installer: mountinfo-based bind mount dedup by (source, fstype, root) composite key. Path unescaping for kernel-escaped mountinfo fields. Bind mount hiding behind y/N prompt. btrfs picker display fallback. Post-install summary now shows the mDNS-advertised interface IP instead of the first IP from hostname -I.

Docs: updated attention-severity-logic, early-warning-attributes, and smart-attribute-name-variants for unified wear semantics. README note for btrfs-progs dependency.

BREAKING: ATA SSD wear sensor values invert (e.g. 99 -> 1 for new drives). See CHANGELOG for migration notes.
1 parent 0edda3e commit 88d6124

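The installer's dedup and unescaping steps described in the commit message can be sketched in Go as follows. This is a minimal illustration, not the installer's actual code, which is not part of the hunks shown here. The names `mountEntry`, `unescapeMountField`, `parseMountinfoLine`, `dedupMounts`, and the sample lines are all hypothetical. The field layout follows the documented `/proc/self/mountinfo` format, where the filesystem type and source come after a lone `-` separator.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// mountEntry holds the mountinfo fields used here (hypothetical type).
type mountEntry struct {
	Root, MountPoint, FSType, Source string
}

// sampleMountinfo is made-up input: two mounts of the whole filesystem
// plus one bind mount of a subdirectory.
var sampleMountinfo = []string{
	`36 25 8:1 / /mnt/data rw,relatime shared:1 - ext4 /dev/sda1 rw`,
	`37 25 8:1 / /media/My\040Disk rw,relatime shared:1 - ext4 /dev/sda1 rw`,
	`38 25 8:1 /sub /srv/sub rw,relatime shared:1 - ext4 /dev/sda1 rw`,
}

// unescapeMountField decodes the kernel's three-digit octal escapes
// (\040 space, \011 tab, \012 newline, \134 backslash).
func unescapeMountField(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] == '\\' && i+3 < len(s) {
			if v, err := strconv.ParseUint(s[i+1:i+4], 8, 8); err == nil {
				b.WriteByte(byte(v))
				i += 3
				continue
			}
		}
		b.WriteByte(s[i])
	}
	return b.String()
}

// parseMountinfoLine extracts root, mount point, fstype, and source.
func parseMountinfoLine(line string) (mountEntry, bool) {
	fields := strings.Fields(line)
	sep := -1
	// Optional fields start at index 6 and end at a lone "-".
	for i := 6; i < len(fields); i++ {
		if fields[i] == "-" {
			sep = i
			break
		}
	}
	if sep < 0 || sep+2 >= len(fields) {
		return mountEntry{}, false
	}
	return mountEntry{
		Root:       unescapeMountField(fields[3]),
		MountPoint: unescapeMountField(fields[4]),
		FSType:     fields[sep+1],
		Source:     unescapeMountField(fields[sep+2]),
	}, true
}

// dedupMounts keeps the first entry per (source, fstype, root) key.
func dedupMounts(entries []mountEntry) []mountEntry {
	type key struct{ source, fstype, root string }
	seen := make(map[key]bool)
	var out []mountEntry
	for _, e := range entries {
		k := key{e.Source, e.FSType, e.Root}
		if seen[k] {
			continue
		}
		seen[k] = true
		out = append(out, e)
	}
	return out
}

func main() {
	var entries []mountEntry
	for _, l := range sampleMountinfo {
		if e, ok := parseMountinfoLine(l); ok {
			entries = append(entries, e)
		}
	}
	for _, e := range dedupMounts(entries) {
		fmt.Printf("%s on %q (%s, root %s)\n", e.Source, e.MountPoint, e.FSType, e.Root)
	}
}
```

Because the root field is part of the key, a bind mount of a subdirectory (root `/sub`) stays distinct from mounts of the whole filesystem (root `/`), which is what lets it be listed separately or hidden behind a prompt.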
12 files changed

Lines changed: 795 additions & 31 deletions

CHANGELOG.md

Lines changed: 23 additions & 0 deletions
@@ -2,6 +2,29 @@

 All notable changes to SMART Sniffer are documented here.

+## v0.5.6 -- 2026-05-02
+
+Agent + integration release. Both components updated.
+
+### Fixed
+- **Wear level sensor now reports consistent "percentage used" across ATA and NVMe drives** -- ATA SSDs report a normalized SMART value where 100 means "new" and 0 means "worn." NVMe drives report `percentage_used` where 0 means "new" and 100 means "worn" -- the opposite scale. Previously, the integration passed both values through as-is, so the same sensor meant opposite things depending on drive protocol. Now ATA values are inverted to match NVMe: 0% = new drive, 100% = fully worn. **Breaking change:** if you have automations based on the ATA wear sensor, your values will invert (e.g., a new Samsung 870 EVO previously showed 99, now shows 1).
+- **Installer filesystem picker: bind mount deduplication** -- the picker now parses `/proc/self/mountinfo` (when available) and deduplicates entries by `(source, fstype, root)`. Previously, bind mounts of the same filesystem appeared as separate entries, tripling the list on systems like ZimaOS. Falls back to `/proc/mounts` and then `mount` on systems where mountinfo is not available.
+- **Installer filesystem picker: path unescaping** -- mount paths containing spaces, tabs, newlines, or backslashes are now displayed correctly. The kernel escapes these characters in mountinfo/proc output (`\040` for space, etc.) and the installer previously showed the raw escaped strings.
+- **Installer summary now shows the correct IP** -- the post-install summary previously showed the first IP from `hostname -I`, which on systems with Docker bridges or virtual interfaces was often an unreachable internal IP (e.g., `172.18.0.1`). It now shows the IP of the mDNS-advertised interface you selected during setup.
+
+### Added
+- **ATA SSD wear now triggers attention warnings** -- ATA SSDs with wear level at 90% or higher (after inversion to "percentage used") now fire a WARNING in the Attention Needed sensor, matching the existing NVMe threshold. Previously only NVMe drives got wear-based attention warnings.
+- **btrfs filesystem fallback** -- when `statvfs` returns zero for a btrfs mount (a known quirk on some multi-device or DUP-profile configurations), the agent falls back to `btrfs filesystem usage --raw` for accurate size/usage data. Requires `btrfs-progs` to be installed (most btrfs systems have it). Without it, the mount reports as `(unknown size)` in the picker and zero-byte usage from the API.
+- **Installer: bind mount hiding** -- bind mounts of subdirectories are hidden by default behind a `[+N bind mounts hidden]` tag with a y/N prompt to reveal them. Reduces noise on systems with many bind mounts.
+
+### Changed
+- **Sensor name:** "Wear Leveling / Percentage Used" renamed to "Wear Level (% Used)" for clarity.
+
+### Upgrade Notes
+- **Both agent and integration should be updated.** Replace the agent binary or re-run the installer. Update the integration via HACS or manually.
+- **Wear sensor breaking change:** ATA SSD wear values are inverted. If you have automations checking wear level, review your thresholds. The sensor now consistently means "percentage of rated life consumed" for both ATA and NVMe. A new drive reads ~0-1%, a heavily worn drive reads 90%+.
+- **btrfs users:** install `btrfs-progs` if not already present for accurate filesystem reporting. The agent works without it but btrfs mounts will show zero usage.
+
 ## v0.5.5.5 -- 2026-04-27

 Installer-only patch. No agent, integration, or config changes.

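The wear-level inversion and the 90% attention threshold from the changelog come down to a small piece of arithmetic, sketched below. The integration's own code is not shown in this commit view, so the function names and the clamping to 0-100 are assumptions made for illustration. Go is used only to match the agent code elsewhere in the commit.

```go
package main

import "fmt"

// wearWarnThreshold is the ">= 90%" rule from the changelog.
const wearWarnThreshold = 90

// ataWearToPercentUsed converts an ATA normalized wear value
// (100 = new, 0 = worn) to the NVMe-style "percentage used" scale
// (0 = new, 100 = worn). Clamping out-of-range input is an assumption
// of this sketch, not documented behavior.
func ataWearToPercentUsed(normalized int) int {
	used := 100 - normalized
	if used < 0 {
		return 0
	}
	if used > 100 {
		return 100
	}
	return used
}

// wearNeedsAttention applies the shared ATA/NVMe warning threshold.
func wearNeedsAttention(percentUsed int) bool {
	return percentUsed >= wearWarnThreshold
}

func main() {
	for _, v := range []int{99, 50, 10, 5} {
		used := ataWearToPercentUsed(v)
		fmt.Printf("normalized=%d -> %d%% used, warning=%t\n", v, used, wearNeedsAttention(used))
	}
}
```

An automation that previously fired when the ATA sensor dropped below 10 would, under this mapping, need to fire when the sensor rises to 90 or above.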
README.md

Lines changed: 2 additions & 0 deletions
@@ -224,6 +224,8 @@ Binaries output to `agent/build/`.

 **Requires:** `smartmontools` **7.0+** on each monitored machine (for JSON output support). The installer handles installation automatically (Homebrew on macOS, apt/dnf/yum on Linux), but some older distros ship smartctl 6.x which does not support the `--json` flag the agent relies on. Run `smartctl --version` to check. If you're on 6.x, install a newer version from the [smartmontools releases page](https://www.smartmontools.org/wiki/Download) or from a backports repository.

+**Optional:** `btrfs-progs` is recommended on systems with btrfs filesystems. The installer's disk-usage picker and the agent's `/api/filesystems` endpoint both fall back to `btrfs filesystem usage --raw` when `statvfs` returns zero on a btrfs mount (a known quirk on some multi-device or near-full configurations). Without `btrfs-progs`, btrfs entries display as `(unknown size)` in the picker and report zero-byte usage from the API. Most distros include `btrfs-progs` by default if any btrfs filesystems exist on the system.
+
 ### 1. Install the agent

 Run on each machine you want to monitor:

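For readers unfamiliar with the `statvfs` path the README refers to, the sketch below shows one way byte totals can be derived from block counts, and the condition under which the btrfs fallback would apply. It is Unix-only and illustrative. `fsUsage`, `usageFromCounts`, `statUsage`, and `needsBtrfsFallback` are hypothetical names, and computing used space from `Bfree` is an assumption about the agent's arithmetic.

```go
package main

import (
	"fmt"
	"syscall"
)

// fsUsage holds byte totals for one mount (hypothetical type).
type fsUsage struct {
	Total, Used, Available uint64
}

// usageFromCounts converts statvfs-style block counts into bytes.
// Used is derived from the free-block count (an assumption), while
// Available uses the count of blocks open to unprivileged users.
func usageFromCounts(blocks, bfree, bavail, bsize uint64) fsUsage {
	total := blocks * bsize
	return fsUsage{
		Total:     total,
		Used:      total - bfree*bsize,
		Available: bavail * bsize,
	}
}

// statUsage queries the kernel for the filesystem containing path.
func statUsage(path string) (fsUsage, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return fsUsage{}, err
	}
	return usageFromCounts(
		uint64(st.Blocks), uint64(st.Bfree), uint64(st.Bavail), uint64(st.Bsize),
	), nil
}

// needsBtrfsFallback reports whether the zero-total quirk applies.
func needsBtrfsFallback(u fsUsage, fstype string) bool {
	return u.Total == 0 && fstype == "btrfs"
}

func main() {
	u, err := statUsage("/")
	if err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	fmt.Printf("total=%d used=%d available=%d\n", u.Total, u.Used, u.Available)
	fmt.Println("fallback needed if btrfs:", needsBtrfsFallback(u, "btrfs"))
}
```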
agent/filesystem_btrfs.go

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
+//go:build !windows
+
+// Phase 1A: btrfs statvfs fallback.
+//
+// Some btrfs configurations (multi-device, certain kernel versions,
+// near-full single-disk) cause syscall.Statfs to return zero values
+// where df-style tools would report real numbers. When that happens,
+// we fall back to parsing `btrfs filesystem usage --raw <path>`.
+//
+// This is a fallback, not the primary source. statvfs is microseconds;
+// a subprocess is milliseconds and forks a child. We only invoke btrfs
+// when statvfs has clearly failed (total==0 on a btrfs mount).
+//
+// Three failure modes, each with a distinct log line so users can
+// diagnose without reading source:
+// - btrfs-progs not installed
+// - subprocess timed out (5s)
+// - output didn't parse
+//
+// All three fall through to the original statvfs values (zero); the
+// /api/filesystems endpoint reports zeros and continues working.
+package main
+
+import (
+	"bytes"
+	"context"
+	"errors"
+	"fmt"
+	"os/exec"
+	"regexp"
+	"strconv"
+	"time"
+)
+
+// btrfsFallbackTimeout is the maximum wall time we'll allow the
+// `btrfs filesystem usage --raw` subprocess. A hung btrfs binary
+// must not block the agent's poll cycle.
+const btrfsFallbackTimeout = 5 * time.Second
+
+// Sentinel errors for the three documented failure modes. Callers
+// distinguish via errors.Is to emit the right log message.
+var (
+	errBtrfsProgsMissing = errors.New("btrfs-progs not installed")
+	errBtrfsTimeout      = errors.New("btrfs filesystem usage timed out")
+	errBtrfsParse        = errors.New("btrfs filesystem usage parse error")
+)
+
+// btrfsUsage holds the three values we extract from --raw output.
+type btrfsUsage struct {
+	Total     uint64
+	Used      uint64
+	Available uint64
+}
+
+// Anchor patterns for the bare lines in the Overall: block of
+// `btrfs filesystem usage --raw` output. Each must end after the
+// digits to avoid colliding with the per-block-group lines such as
+// "Data,single: Size:N, Used:N (62.40%)" -- those have "Used:" mid-line.
+var (
+	reBtrfsDeviceSize = regexp.MustCompile(`(?m)^\s*Device size:\s+(\d+)\s*$`)
+	reBtrfsUsed       = regexp.MustCompile(`(?m)^\s*Used:\s+(\d+)\s*$`)
+	// "Free (estimated):" has an optional trailing "(min: N)" parenthetical.
+	// We only want the first integer; the "min:" value is conservative
+	// scheduling info we don't expose.
+	reBtrfsFreeEst = regexp.MustCompile(`(?m)^\s*Free \(estimated\):\s+(\d+)`)
+)
+
+// tryBtrfsFallback runs `btrfs filesystem usage --raw <path>` and
+// parses the result. Returns the typed error sentinels documented
+// above so the caller can log the three distinct messages.
+func tryBtrfsFallback(path string) (btrfsUsage, error) {
+	// Cheap pre-check: if the binary isn't even on PATH, fail fast
+	// with the specific sentinel. exec.LookPath is microseconds.
+	if _, err := exec.LookPath("btrfs"); err != nil {
+		return btrfsUsage{}, errBtrfsProgsMissing
+	}
+
+	ctx, cancel := context.WithTimeout(context.Background(), btrfsFallbackTimeout)
+	defer cancel()
+
+	cmd := exec.CommandContext(ctx, "btrfs", "filesystem", "usage", "--raw", path)
+	var stdout, stderr bytes.Buffer
+	cmd.Stdout = &stdout
+	cmd.Stderr = &stderr
+	if err := cmd.Run(); err != nil {
+		// Distinguish timeout from other failures. On deadline expiry
+		// CommandContext kills the child; ctx.Err() reports the cause.
+		if ctx.Err() == context.DeadlineExceeded {
+			return btrfsUsage{}, errBtrfsTimeout
+		}
+		// Any other run error (non-zero exit, permission denied,
+		// disappeared mountpoint) is treated as a parse-class failure
+		// from the caller's perspective. Wrap so the caller sees the
+		// underlying cause if they choose to inspect it.
+		return btrfsUsage{}, fmt.Errorf("%w: %v", errBtrfsParse, err)
+	}
+
+	return parseBtrfsUsageRaw(stdout.Bytes())
+}
+
+// parseBtrfsUsageRaw extracts Device size, Used, and Free (estimated)
+// from the --raw output. Exposed (unexported but package-visible) for
+// unit tests so we don't need a real btrfs binary to test parsing.
+func parseBtrfsUsageRaw(out []byte) (btrfsUsage, error) {
+	totalMatch := reBtrfsDeviceSize.FindSubmatch(out)
+	usedMatch := reBtrfsUsed.FindSubmatch(out)
+	freeMatch := reBtrfsFreeEst.FindSubmatch(out)
+
+	if totalMatch == nil || usedMatch == nil {
+		return btrfsUsage{}, fmt.Errorf("%w: missing Device size or Used line", errBtrfsParse)
+	}
+
+	total, err := strconv.ParseUint(string(totalMatch[1]), 10, 64)
+	if err != nil {
+		return btrfsUsage{}, fmt.Errorf("%w: Device size not numeric: %v", errBtrfsParse, err)
+	}
+	used, err := strconv.ParseUint(string(usedMatch[1]), 10, 64)
+	if err != nil {
+		return btrfsUsage{}, fmt.Errorf("%w: Used not numeric: %v", errBtrfsParse, err)
+	}
+
+	// Available is best-effort. If "Free (estimated):" is missing or
+	// non-numeric, derive it from total-used. The endpoint contract
+	// requires a value; an off-by-some on btrfs is acceptable given
+	// btrfs's own Free estimation is itself an estimate.
+	var available uint64
+	if freeMatch != nil {
+		if v, perr := strconv.ParseUint(string(freeMatch[1]), 10, 64); perr == nil {
+			available = v
+		}
+	}
+	if available == 0 && total > used {
+		available = total - used
+	}
+
+	return btrfsUsage{
+		Total:     total,
+		Used:      used,
+		Available: available,
+	}, nil
+}

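One detail of the parser above is that it uses `available == 0` to mean "Free (estimated) was not parsed", so a reported value of exactly 0 is also replaced by `total - used`. Whether that matters in practice is not established here. The sketch below is a hypothetical variant, not part of the commit, that returns the parsed value whenever the line is present and numeric. It reuses the same regular expression as `reBtrfsFreeEst`; the function name `availableBytes` is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Same pattern as reBtrfsFreeEst in filesystem_btrfs.go.
var reFreeEst = regexp.MustCompile(`(?m)^\s*Free \(estimated\):\s+(\d+)`)

// availableBytes returns the Free (estimated) value when the line is
// present and numeric, including a genuine 0. Only when the line is
// missing or unparsable does it derive the value from total - used.
func availableBytes(out []byte, total, used uint64) uint64 {
	if m := reFreeEst.FindSubmatch(out); m != nil {
		if v, err := strconv.ParseUint(string(m[1]), 10, 64); err == nil {
			return v
		}
	}
	if total > used {
		return total - used
	}
	return 0
}

func main() {
	full := []byte("Overall:\n    Free (estimated):\t\t0\t(min: 0)\n")
	missing := []byte("Overall:\n    Device size:\t\t1000\n")
	fmt.Println(availableBytes(full, 1000, 300))
	fmt.Println(availableBytes(missing, 1000, 300))
}
```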
agent/filesystem_btrfs_test.go

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
+//go:build !windows
+
+package main
+
+import (
+	"errors"
+	"os"
+	"strings"
+	"testing"
+)
+
+// Real `btrfs filesystem usage --raw` output captured from David's
+// ZimaOS box (Brookdale NAS, /dev/md0). Kept inline rather than read
+// from disk so the test is hermetic. Mirrors
+// docs/internal/research/test-fixtures/zimaos-btrfs-usage.txt.
+const fixtureBtrfsUsageRaw = `Overall:
+Device size: 2000263643136
+Device allocated: 1968050339840
+Device unallocated: 32213303296
+Device missing: 0
+Device slack: 0
+Used: 1225996673024
+Free (estimated): 769020321792 (min: 752913670144)
+Free (statfs, df): 769019273216
+Data ratio: 1.00
+Metadata ratio: 2.00
+Global reserve: 536870912 (used: 0)
+Multiple profiles: no
+
+Data,single: Size:1959599865856, Used:1222792847360 (62.40%)
+/dev/md0 1959599865856
+
+Metadata,DUP: Size:4216848384, Used:1601634304 (37.98%)
+/dev/md0 8433696768
+
+System,DUP: Size:8388608, Used:278528 (3.32%)
+/dev/md0 16777216
+
+Unallocated:
+/dev/md0 32213303296
+`
+
+func TestParseBtrfsUsageRaw_RealFixture(t *testing.T) {
+	usage, err := parseBtrfsUsageRaw([]byte(fixtureBtrfsUsageRaw))
+	if err != nil {
+		t.Fatalf("expected success, got: %v", err)
+	}
+
+	const (
+		wantTotal     = uint64(2000263643136)
+		wantUsed      = uint64(1225996673024)
+		wantAvailable = uint64(769020321792)
+	)
+	if usage.Total != wantTotal {
+		t.Errorf("Total = %d, want %d", usage.Total, wantTotal)
+	}
+	if usage.Used != wantUsed {
+		t.Errorf("Used = %d, want %d", usage.Used, wantUsed)
+	}
+	if usage.Available != wantAvailable {
+		t.Errorf("Available = %d, want %d", usage.Available, wantAvailable)
+	}
+}
+
+// Regression: the per-block-group lines have "Used:" mid-line (e.g.
+// "Data,single: Size:N, Used:N (62.40%)"). The Overall: parser must
+// only match the bare-line Used:, not these.
+func TestParseBtrfsUsageRaw_InlineUsedRegression(t *testing.T) {
+	// Strip the Overall: block to verify the parser does NOT pick up
+	// the inline Used field as a substitute.
+	overallEnd := strings.Index(fixtureBtrfsUsageRaw, "\nData,single:")
+	if overallEnd < 0 {
+		t.Fatal("test fixture malformed: missing Data,single section marker")
+	}
+	withoutOverall := fixtureBtrfsUsageRaw[overallEnd:]
+
+	_, err := parseBtrfsUsageRaw([]byte(withoutOverall))
+	if err == nil {
+		t.Fatal("expected parse error when Overall: block is missing, got nil")
+	}
+	if !errors.Is(err, errBtrfsParse) {
+		t.Errorf("expected errBtrfsParse, got %v", err)
+	}
+}
+
+func TestParseBtrfsUsageRaw_MissingDeviceSize(t *testing.T) {
+	input := `Overall:
+Used: 1225996673024
+`
+	_, err := parseBtrfsUsageRaw([]byte(input))
+	if !errors.Is(err, errBtrfsParse) {
+		t.Errorf("expected errBtrfsParse, got %v", err)
+	}
+}
+
+func TestParseBtrfsUsageRaw_MissingUsed(t *testing.T) {
+	input := `Overall:
+Device size: 2000263643136
+`
+	_, err := parseBtrfsUsageRaw([]byte(input))
+	if !errors.Is(err, errBtrfsParse) {
+		t.Errorf("expected errBtrfsParse, got %v", err)
+	}
+}
+
+func TestParseBtrfsUsageRaw_EmptyInput(t *testing.T) {
+	_, err := parseBtrfsUsageRaw([]byte(""))
+	if !errors.Is(err, errBtrfsParse) {
+		t.Errorf("expected errBtrfsParse, got %v", err)
+	}
+}
+
+func TestParseBtrfsUsageRaw_NonNumericTotal(t *testing.T) {
+	input := `Overall:
+Device size: NOTANUMBER
+Used: 1225996673024
+`
+	// Regex requires \d+, so a non-numeric value won't even match the
+	// capture group -- this tests that path through the error.
+	_, err := parseBtrfsUsageRaw([]byte(input))
+	if !errors.Is(err, errBtrfsParse) {
+		t.Errorf("expected errBtrfsParse, got %v", err)
+	}
+}
+
+// Available falls back to Total - Used when Free (estimated) is missing.
+func TestParseBtrfsUsageRaw_AvailableFallback(t *testing.T) {
+	input := `Overall:
+Device size: 1000
+Used: 300
+`
+	usage, err := parseBtrfsUsageRaw([]byte(input))
+	if err != nil {
+		t.Fatalf("unexpected error: %v", err)
+	}
+	if usage.Total != 1000 {
+		t.Errorf("Total = %d, want 1000", usage.Total)
+	}
+	if usage.Used != 300 {
+		t.Errorf("Used = %d, want 300", usage.Used)
+	}
+	if usage.Available != 700 {
+		t.Errorf("Available = %d, want 700 (Total - Used fallback)", usage.Available)
+	}
+}
+
+// errBtrfsProgsMissing is returned when btrfs is not on PATH. We
+// simulate this by setting PATH to a directory we know doesn't have
+// btrfs. Skip if the test environment doesn't allow PATH manipulation
+// (very rare but possible).
+func TestTryBtrfsFallback_BinaryMissing(t *testing.T) {
+	origPath := os.Getenv("PATH")
+	t.Cleanup(func() { os.Setenv("PATH", origPath) })
+
+	// Empty PATH guarantees exec.LookPath fails for "btrfs". We don't
+	// need /tmp to be free of a btrfs binary -- empty PATH is enough.
+	if err := os.Setenv("PATH", ""); err != nil {
+		t.Skipf("cannot set PATH for test: %v", err)
+	}
+
+	_, err := tryBtrfsFallback("/")
+	if !errors.Is(err, errBtrfsProgsMissing) {
+		t.Errorf("expected errBtrfsProgsMissing, got %v", err)
+	}
+}
+
+// Note on timeout testing: the timeout path requires a btrfs binary
+// that hangs longer than 5s. Constructing this hermetically would
+// require a test double that injects a fake runner via a package-level
+// hook. The current implementation uses exec.LookPath + exec.CommandContext
+// directly for clarity; if timeout flakiness is reported in production
+// we can refactor to inject a runner. For now the timeout sentinel is
+// covered by code review of the ctx.Err() == context.DeadlineExceeded
+// branch in tryBtrfsFallback.

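The closing note in the test file mentions that hermetic timeout coverage would need an injected runner. One possible shape for that refactor is sketched below. It is not what the commit implements: the `runner` type, `runWithTimeout`, `hangingRunner`, `okRunner`, and `testTimeout` are hypothetical names, and a production runner would wrap `exec.CommandContext`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTimeout stands in for errBtrfsTimeout in this sketch.
var errTimeout = errors.New("btrfs filesystem usage timed out")

// testTimeout is short so the demo finishes quickly.
const testTimeout = 20 * time.Millisecond

// runner is a hypothetical injection point for the subprocess call.
type runner func(ctx context.Context, path string) ([]byte, error)

// runWithTimeout calls run under a deadline and maps a deadline
// expiry to the errTimeout sentinel. Other errors pass through.
func runWithTimeout(run runner, timeout time.Duration, path string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := run(ctx, path)
	if err != nil {
		if errors.Is(ctx.Err(), context.DeadlineExceeded) {
			return nil, errTimeout
		}
		return nil, err
	}
	return out, nil
}

// hangingRunner simulates a hung btrfs binary: it blocks until the
// context is done, then reports the context's error.
func hangingRunner(ctx context.Context, _ string) ([]byte, error) {
	<-ctx.Done()
	return nil, ctx.Err()
}

// okRunner simulates a subprocess that returns promptly.
func okRunner(_ context.Context, _ string) ([]byte, error) {
	return []byte("ok"), nil
}

func main() {
	_, err := runWithTimeout(hangingRunner, testTimeout, "/mnt/pool")
	fmt.Println("hanging runner timed out:", errors.Is(err, errTimeout))

	out, err := runWithTimeout(okRunner, testTimeout, "/mnt/pool")
	fmt.Println("ok runner:", string(out), err)
}
```

With a hook like this, a test can exercise the timeout branch in milliseconds without a real `btrfs` binary, at the cost of one extra indirection in the production path.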
agent/filesystem_unix.go

Lines changed: 29 additions & 0 deletions
@@ -3,6 +3,7 @@
 package main

 import (
+	"errors"
 	"log"
 	"syscall"
 )
@@ -37,6 +38,34 @@ func (fc *FilesystemCache) Refresh() {
 		info.UsedBytes = info.TotalBytes - freeBytes
 		info.AvailableBytes = stat.Bavail * uint64(stat.Bsize)

+		// Phase 1A: btrfs statvfs fallback.
+		//
+		// We trigger fallback only when TotalBytes == 0 on a btrfs mount.
+		// We do NOT broaden the trigger to "implausible non-zero" cases
+		// (e.g. btrfs single-disk near-full overstating free). That would
+		// fork a subprocess on every poll cycle for every btrfs mount,
+		// which is wasteful. The CTO's panel point that btrfs CLI is the
+		// more reliable source still stands -- this is a deliberate
+		// performance/reliability tradeoff. See plan-btrfs-filesystem-
+		// reporting.md for the full reasoning.
+		if info.TotalBytes == 0 && cfg.FSType == "btrfs" {
+			usage, err := tryBtrfsFallback(cfg.Path)
+			switch {
+			case err == nil:
+				info.TotalBytes = usage.Total
+				info.UsedBytes = usage.Used
+				info.AvailableBytes = usage.Available
+				log.Printf("filesystem: using btrfs-progs for %s (statvfs returned zero)", cfg.Path)
+			case errors.Is(err, errBtrfsProgsMissing):
+				log.Printf("filesystem: btrfs-progs not installed, returning statvfs zeros for %s", cfg.Path)
+			case errors.Is(err, errBtrfsTimeout):
+				log.Printf("filesystem: btrfs filesystem usage timed out after 5s for %s", cfg.Path)
+			default:
+				// Wraps errBtrfsParse or an exec error treated as parse-class.
+				log.Printf("filesystem: btrfs filesystem usage parse error for %s: %v", cfg.Path, err)
+			}
+		}
+
 		if info.TotalBytes > 0 {
 			info.UsePercent = float64(info.UsedBytes) / float64(info.TotalBytes) * 100.0
 			// Round to one decimal place.

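The switch in `Refresh()` relies on sentinel errors staying matchable after wrapping. The standalone sketch below shows that mechanism with `fmt.Errorf` and `errors.Is`. The sentinel messages are copied from the commit, while `classify`, `wrapParse`, and `isParse` are hypothetical helpers added for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the three sentinels in filesystem_btrfs.go.
var (
	errProgsMissing = errors.New("btrfs-progs not installed")
	errTimeout      = errors.New("btrfs filesystem usage timed out")
	errParse        = errors.New("btrfs filesystem usage parse error")
)

// classify mirrors the shape of the switch in Refresh(): two specific
// sentinels are checked first, and everything else is parse-class.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, errProgsMissing):
		return "progs-missing"
	case errors.Is(err, errTimeout):
		return "timeout"
	default:
		return "parse"
	}
}

// wrapParse attaches a cause with %v while chaining errParse with %w,
// so only errParse is reachable through errors.Is.
func wrapParse(cause error) error {
	return fmt.Errorf("%w: %v", errParse, cause)
}

// isParse reports whether err chains to the parse sentinel.
func isParse(err error) bool {
	return errors.Is(err, errParse)
}

func main() {
	wrapped := wrapParse(errors.New("exit status 1"))
	fmt.Println(classify(nil), classify(errProgsMissing), classify(errTimeout), classify(wrapped))
	fmt.Println(isParse(wrapped))
	fmt.Println(wrapped)
}
```

Because the cause is formatted with `%v` rather than `%w`, it contributes only text to the message. A wrapped timeout error passed through `wrapParse` would therefore classify as parse-class, not as a timeout.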