Commit 5105c7f

gnailzenh and Michael Hennecke authored
DAOS-17337 doc: update 2.6.5 relnotes (#18410)
Update docs/release/release_notes.md for 2.6.5

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Michael Hennecke <michael.hennecke@hpe.com>
Co-authored-by: Michael Hennecke <michael.hennecke@hpe.com>
1 parent e626ae8 commit 5105c7f

2 files changed

Lines changed: 215 additions & 11 deletions

File tree

docs/release/known-issues-265.md

Lines changed: 0 additions & 11 deletions
This file was deleted.

docs/release/release_notes.md

Lines changed: 215 additions & 0 deletions
@@ -2,6 +2,221 @@
22

33
We are pleased to announce the release of DAOS version 2.6.
44

5+
## DAOS Version 2.6.5 (2026-06-05)
6+
7+
The DAOS 2.6.5 release includes the daos-2.6.5 RPM packages and their prerequisites.
8+
It contains the following updates on top of DAOS 2.6.4:
9+
10+
* Libfabric has been updated to 1.22.0-5.
11+
* Mercury has been updated to version 2.4.1.
12+
* PMDK has been updated to version 2.1.3.
13+
14+
### Bug fixes and improvements
15+
16+
The DAOS 2.6.5 release includes fixes and improvements in the following areas:
17+
18+
#### Mercury and Libfabric
19+
20+
* OFI/Libfabric multi-receive is now enabled for supported providers.
21+
22+
* The libfabric plugin for Mercury is now shipped in a separate `mercury-libfabric` RPM
23+
(in previous releases, it was included in the base `mercury` RPM).
24+
25+
#### Rebuild
26+
27+
* Introduce a centralized migration resource manager per target that enforces global limits on
28+
ULT count and DMA buffer usage across all pools, preventing overallocation during
29+
concurrent multi-pool rebuilds (DAOS-18192).
30+
* Process migrated object IDs directly in main xstreams instead of routing through system
31+
xstreams, eliminating expensive B+ tree operations in the previous round-trip path (DAOS-17928).
32+
* When the PS leader retries rebuild or reclaim on the same pool map version, bump the rebuild
33+
generation so targets can distinguish the new attempt; also abort the object scan sooner
34+
when rebuild is interrupted to allow faster failover (DAOS-18976).
35+
* Delay rebuild scheduling by 5 seconds so rapid sequences of pool map updates (e.g., multi-rank
36+
exclude/drain) are merged into a single rebuild job instead of running serially (DAOS-18425).
37+
* Cache the object open handle per object for the rebuild puller instead of re-opening for
38+
each dkey migration, saving repeated layout computation overhead (DAOS-17444).
39+
* Make pool\_discard non-blocking by spawning a ULT and returning immediately; also throttle
40+
EC rebuild data consumption to one unit to avoid overwhelming targets (DAOS-18487).
41+
* Ensure the rebuild stable epoch is propagated via IV before migration starts,
42+
fixing an assertion failure where mpt\_max\_eph could be zero (DAOS-18747).
43+
* Retry rebuild data fetch indefinitely on ENOMEM instead of failing the rebuild,
44+
allowing the system to recover once memory pressure subsides (DAOS-18326).
45+
* When a single object rebuild fails, abort the entire rebuild early to prevent pool destroy
46+
timeouts; transition the rank domain status from DOWN to DOWNOUT afterward (DAOS-17736).
47+
* Fix use of ec\_agg\_boundary before validity check, and retry failed EC aggregation peer
48+
updates; also fix memory leaks of mo\_csum\_iov and mrones on error paths (DAOS-18368, DAOS-18544).
49+
* Clear stale IV cache entries before reintegration when a target is reintegrated without
50+
reboot, and refine error handling in cont\_agg\_eph\_sync path (DAOS-18154).
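
The 5-second scheduling delay (DAOS-18425) can be pictured as a coalescing window. The following is a minimal sketch, not the DAOS implementation: it assumes the window opens at the first pool map update and stays fixed (rather than being extended by later updates), and the function name `coalesce_updates` is hypothetical.

```python
from typing import List, Optional

MERGE_DELAY_S = 5.0  # delay stated in the release note; fixed-window behavior is an assumption


def coalesce_updates(update_times: List[float],
                     delay: float = MERGE_DELAY_S) -> List[List[float]]:
    """Group pool map update timestamps (seconds) into rebuild jobs.

    A job is scheduled `delay` seconds after its first update. Every update
    arriving before that deadline is merged into the same job.
    """
    jobs: List[List[float]] = []
    deadline: Optional[float] = None
    for t in sorted(update_times):
        if deadline is None or t >= deadline:
            jobs.append([t])        # first update of a new job opens a window
            deadline = t + delay
        else:
            jobs[-1].append(t)      # arrived inside the window: merge
    return jobs


if __name__ == "__main__":
    # Three rank exclusions in quick succession, then one much later.
    print(coalesce_updates([0.0, 1.0, 2.5, 10.0]))
```

With these inputs the first three updates share one job and the last one gets its own, instead of four rebuilds running serially.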
51+
52+
#### Object / Erasure Coding
53+
54+
* When unordered conditional modifications on a DTX non-leader cause epoch conflicts
55+
(DER\_TX\_RESTART), ask the client to retry with a random delay instead of immediately
56+
restarting the DTX, reducing repeated epoch collisions (DAOS-18889).
57+
* On DTX non-leader, detect resent RPCs for transactions already in "prepared" state after
58+
a leader switch, and reply directly to avoid misguiding lower-layer logic (DAOS-18785).
59+
* After engine restart, load the latest aggregation epoch from VOS instead of resetting to
60+
zero, so EC aggregation skips unchanged containers; also use the persisted ec\_agg\_eph\_boundary
61+
as the scan start epoch and prevent bumping the epoch after reset (DAOS-18161).
62+
* Fix two checksum bugs with non-power-of-2 chunk sizes on EC objects: align checksum
63+
computation with VOS-space offsets during rebuild migration, and widen extent offset types
64+
to uint64\_t to preserve the parity indicator bit in aggregation verification (DAOS-18524).
65+
* Yield the CPU more frequently during EC aggregation under space pressure to avoid holding
66+
the xstream for too long; also increase RPC retry latency and send DTX RPCs from the
67+
current ULT to reduce congestion from resent requests (DAOS-18607).
68+
* Fix incorrect object shard ID assembly in CPD RPC handler when punching multiple shards of
69+
the same object on the same VOS target, which caused some shards to leak (DAOS-18641).
70+
* Skip key checksum verification during value enumeration, fixing a crash where a dummy IOD
71+
passed to daos\_csummer\_verify\_iod caused ISA-L SHA-256 update to fail (DAOS-18356).
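
The random retry delay used for DER\_TX\_RESTART (DAOS-18889) follows the usual jitter idea: clients that collided once are unlikely to collide again if each waits a different amount of time. A minimal sketch, assuming a uniform delay with a hypothetical upper bound `max_delay_ms` (the release note does not state the distribution or the bound):

```python
import random


def random_retry_delay_ms(rng: random.Random, max_delay_ms: float = 10.0) -> float:
    """Return a uniformly distributed delay in [0, max_delay_ms].

    `max_delay_ms` is an illustrative value, not a DAOS constant.
    """
    return rng.uniform(0.0, max_delay_ms)


if __name__ == "__main__":
    # Two clients with independent random streams pick different delays,
    # so their retried updates are less likely to hit the same epoch.
    client_a = random.Random(1)
    client_b = random.Random(2)
    print(random_retry_delay_ms(client_a), random_retry_delay_ms(client_b))
```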
72+
73+
#### Placement
74+
75+
* Simplify the jump-map state machine for rebuild: properly set rebuilding/reintegrating flags
76+
for all target transitions and remove the requirement for an extra target when rebuilding
77+
DOWN-to-UP targets (DAOS-18487).
78+
* Remove expensive `_get_target/_get_dom` traversals in layout computation and use byte-level
79+
bitmap skipping instead of bit-by-bit checking, significantly improving performance for
80+
large objects (DAOS-17444).
81+
* Stop layout computation early for rebuild and EC aggregation: generate the layout only for
82+
the requested redundancy group rather than the full object, saving significant CPU (DAOS-18607).
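
The byte-level bitmap skipping mentioned for DAOS-17444 can be illustrated as follows. This is a sketch of the general technique only; the bit order within a byte (least significant bit first) and the function name `set_bits` are assumptions, not taken from the DAOS placement code.

```python
from typing import List


def set_bits(bitmap: bytes) -> List[int]:
    """Return the indices of all set bits, skipping all-zero bytes.

    A zero byte is rejected with one comparison instead of eight
    per-bit checks, which pays off when the bitmap is sparse.
    """
    found: List[int] = []
    for byte_idx, byte in enumerate(bitmap):
        if byte == 0:
            continue                      # skip 8 bits at once
        base = byte_idx * 8
        for bit in range(8):
            if byte & (1 << bit):
                found.append(base + bit)  # LSB-first numbering (assumed)
    return found


if __name__ == "__main__":
    print(set_bits(bytes([0, 0, 0b00000101, 0])))
```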
83+
84+
#### DTX (Distributed Transactions)
85+
86+
* Allow a DTX participant whose target was in rebuild/reint at transaction time to become
87+
the new DTX leader once its status returns to UPIN, resolving stuck transactions after
88+
failover (DAOS-18728).
89+
* Add the ability to discard invalid DTX records discovered by the consistency checker (ddb),
90+
enabling recovery from inconsistent transaction metadata (DAOS-16951).
91+
92+
#### VOS (Versioned Object Store)
93+
94+
* Cap the merged extent size at a defined threshold to prevent runaway memory consumption
95+
during extent coalescing (DAOS-18901).
96+
* Lower the anti-fragmentation system space reservation: reduce min from 2 GB to 600 MB and
97+
max from 10 GB to 6 GB while keeping the 5% ratio; also allow GC to use smaller credits
98+
when encountering ENOSPACE to reclaim space in tighter conditions (DAOS-17345, DAOS-18690).
99+
* Update PMDK to fix a heap\_curr\_allocated accounting underflow that could produce spurious
100+
"pool not closed" messages and incorrect free-space statistics (DAOS-18882).
101+
* Add missing btr\_node\_tx\_add() calls when changing btree node flags and during node splitting,
102+
preventing potential metadata inconsistencies; also stop ignoring umem return codes that signal
103+
a broken transaction, avoiding engine crashes (DAOS-18531, DAOS-17891).
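
The new reservation limits (DAOS-17345) amount to a clamped percentage. A minimal sketch of that arithmetic, assuming binary units (MiB/GiB) and a hypothetical helper name `sys_reserved_bytes`; the scope the limits apply to is not stated in the release note and is left to the caller:

```python
MIB = 1 << 20
GIB = 1 << 30

RESERVE_PERCENT = 5          # 5% ratio, unchanged in 2.6.5
RESERVE_MIN = 600 * MIB      # lowered from 2 GB (binary units assumed)
RESERVE_MAX = 6 * GIB        # lowered from 10 GB (binary units assumed)


def sys_reserved_bytes(total_bytes: int) -> int:
    """Return 5% of `total_bytes`, clamped to [RESERVE_MIN, RESERVE_MAX]."""
    want = total_bytes * RESERVE_PERCENT // 100
    return max(RESERVE_MIN, min(RESERVE_MAX, want))


if __name__ == "__main__":
    for size_gib in (1, 100, 1024):
        print(size_gib, "GiB ->", sys_reserved_bytes(size_gib * GIB) / MIB, "MiB reserved")
```

Small capacities hit the 600 MB floor, very large ones hit the 6 GB ceiling, and everything in between reserves exactly 5%.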
104+
105+
#### Pool / Container
106+
107+
* Populate pool handles into the IV cache during PS leader step-up so that targets can access
108+
handles immediately after all pool services restart (DAOS-17351).
109+
* Retry pool handle IV fetches on transient errors and ensure
110+
ds\_pool\_iv\_conn\_hdls\_update is called even when no handles exist in the DB,
111+
fixing invalid IV entries that returned unexpected DER\_NOTLEADER (DAOS-18613).
112+
* Make the checkpoint ULT always yield when it fails to acquire DMA buffers, preventing it
113+
from blocking the xstream; also tune default checkpoint parameters (DAOS-18691, DAOS-18366).
114+
* Prevent ds\_rsvc\_start from inserting a pool service into the hash table while the ds\_pool
115+
is stopping, fixing a hang during concurrent pool create and pool stop (DAOS-18552).
116+
* Allow a pool to start even when some shards are missing, so pool operations like
117+
`dmg pool list` and `dmg pool query` remain functional during partial outages (DAOS-18036).
118+
* Break the infinite IV retry loop in cont\_track\_eph\_leader\_ult when the ULT needs to exit,
119+
preventing pool service leader hangs after network outages (DAOS-18240).
120+
* When pool self\_heal is set to "exclude" (no rebuild), still allow admin-initiated manual
121+
rebuilds; also support "none" as a valid self\_heal property value (DAOS-15993, DAOS-17973).
122+
* Treat a rank as failed in cont\_agg\_eph\_sync if all its targets are excluded, so the EC
123+
aggregation boundary epoch is correctly propagated to other engines (DAOS-18157).
124+
125+
#### DFS (DAOS File System)
126+
127+
* Harden error handling and fix memory leaks across multiple DFS entry points including
128+
lookup, readdir, and object attribute operations (DAOS-18697).
129+
* If dir-oclass is set to EC on container create, fall back to the default; `daos fs set-attr`
130+
with an EC class now applies only to files while directories use the default (DAOS-18604).
131+
* Correctly display directory and file object classes in `daos fs get-attr`; also add DFS
132+
chunk size selection for RF3 containers with 3-parity EC (DAOS-17523, DAOS-18171).
133+
134+
#### CaRT (Transport)
135+
136+
* Remove the logic that forces tagged unexpected messages for non-CXI/TCP providers, enabling
137+
OFI multi-recv on InfiniBand and improving message throughput (DAOS-18484).
138+
* Add a tunable SWIM\_SUBGROUP\_SIZE environment variable to control the number of indirect-ping
139+
targets; also log new SWIM suspicions at INFO level with origins for easier debugging
140+
(DAOS-17405).
141+
* Remove legacy PSR (Primary Service Rank) code that was only used by CaRT tests and samples;
142+
improve context destroy sequencing to avoid cleanup races (DAOS-17114, DAOS-13887).
143+
* Add resubmitted out-of-quota RPCs to the timeout tracking list so they can be properly
144+
timed out; fix corpc error handling that caused double-reply or refcount leaks on middle
145+
nodes (DAOS-17470, DAOS-17861).
146+
147+
#### BIO (Block I/O)
148+
149+
* Monitor inflight SPDK I/Os and raise a RAS event if any I/O is not completed within a
150+
configurable timeout (120 s default via DAOS\_SPDK\_IO\_TIMEOUT),
151+
marking the device as faulty (DAOS-17607).
152+
* Enable the auto-faulty reaction by default, using the default thresholds; previously it required
153+
explicit opt-in via daos\_nvme.conf (DAOS-18337).
154+
* Flush the WAL header (persisting the last checkpointed ID) before unmapping the
155+
checkpointed region, preventing stale WAL replay if the engine crashes in between
156+
(DAOS-17628).
157+
* Remove the legacy si\_unused\_id rollback on WAL commit failure, which violated the invariant
158+
that new transaction IDs must exceed the last checkpointed ID (DAOS-18615).
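
The inflight I/O monitoring described for DAOS-17607 boils down to comparing each I/O's age against a timeout. The sketch below is illustrative only: it assumes `DAOS_SPDK_IO_TIMEOUT` is expressed in seconds, and the helper names `io_timeout_s` and `stalled_ios` are hypothetical.

```python
import os
from typing import Dict, List, Mapping

DEFAULT_TIMEOUT_S = 120.0  # default stated in the release note


def io_timeout_s(env: Mapping[str, str] = os.environ) -> float:
    """Read the timeout from the environment, falling back to the default.

    Assumes the variable holds a number of seconds; invalid or
    non-positive values fall back to the default.
    """
    raw = env.get("DAOS_SPDK_IO_TIMEOUT")
    if raw is None:
        return DEFAULT_TIMEOUT_S
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_S
    return value if value > 0 else DEFAULT_TIMEOUT_S


def stalled_ios(submitted_at: Dict[str, float], now: float,
                timeout_s: float) -> List[str]:
    """Return the IDs of I/Os that have been inflight longer than `timeout_s`."""
    return sorted(io for io, t in submitted_at.items() if now - t > timeout_s)


if __name__ == "__main__":
    inflight = {"io-a": 0.0, "io-b": 100.0}
    print(stalled_ios(inflight, now=121.0, timeout_s=io_timeout_s({})))
```

In this picture, a non-empty result is the condition under which a RAS event would be raised and the device marked faulty.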
159+
160+
#### Control Plane
161+
162+
* Refuse to start daos\_server when Transparent Huge Pages (THP) are enabled, as THP causes
163+
SPDK hugepage fragmentation and memory accounting issues (DAOS-17468).
164+
* Scale `dmg pool` command timeouts proportionally with the number of ranks in the system to
165+
avoid premature timeouts on large clusters (DAOS-18366).
166+
* Restrict Excluded ranks to only transition to Joined or AdminExcluded states, preventing
167+
confusing state changes when excluded engines SIGKILL themselves (DAOS-17643).
168+
* Remove stale SPDK lockfiles both on engine exit and before/after NVMe local scans to avoid
169+
scan failures on restart (DAOS-17341, DAOS-17935).
170+
* Add ComponentServer gRPC authorization for dmg system drain/reintegrate/self-heal/rebuild
171+
commands so server-to-server calls succeed in certificate mode (DAOS-18198).
172+
* Fix reintegration error handling: add "failout" to MGMT\_TGT\_CREATE CoRPC to avoid leaking
173+
pools on ranks being reintegrated; also fix pool create cleanup logic (DAOS-17600, DAOS-18162).
174+
* Fail protocol query after all engines have been tried instead of retrying indefinitely,
175+
avoiding an infinite flood of errors when engines are offline (DAOS-18167).
176+
* Improve handling of engine start issues: skip stuck bio\_xsctxt\_free loops after SPDK init
177+
failure, restrict nonexistent-child tolerance to CR mode, add a RAS event on pool start
178+
failures, and skip pools during engine start if needed (DAOS-17442, DAOS-17305).
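
Because daos\_server now refuses to start with THP enabled (DAOS-17468), it can be useful to check the setting beforehand. On Linux the active mode is the bracketed token in `/sys/kernel/mm/transparent_hugepage/enabled`, for example `always madvise [never]`. The sketch below parses that text; treating every mode other than `never` as "enabled" is an assumption, since the release note does not say how `madvise` is handled.

```python
def active_thp_mode(sysfs_text: str) -> str:
    """Return the bracketed (active) mode from the THP sysfs file content."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active THP mode found in {sysfs_text!r}")


def thp_enabled(sysfs_text: str) -> bool:
    """Assumption: any mode other than 'never' counts as enabled."""
    return active_thp_mode(sysfs_text) != "never"


if __name__ == "__main__":
    path = "/sys/kernel/mm/transparent_hugepage/enabled"
    try:
        with open(path, encoding="ascii") as f:
            print("THP enabled:", thp_enabled(f.read()))
    except OSError:
        print("THP sysfs file not available on this system")
```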
179+
180+
#### Consistency Checker
181+
182+
* When a dRPC upcall to the control plane fails, properly remove the pending interaction record
183+
from the check instance tree before destroying it, preventing assertion failures; also fix
184+
container label boundary checking for non-null-terminated buffers (DAOS-18587).
185+
* Handle rank death events from both SWIM and CaRT process group notifications to ensure no
186+
event is missed during consistency repair operations (DAOS-18238).
187+
* Destroy the check instance after check cleanup; filter out repeated pools in the check list to
188+
avoid redundant processing (DAOS-18441, DAOS-17822).
189+
190+
#### Utilities / Common
191+
192+
* ddb: Replace device offline, zero-length key fix, interpret key by type (DAOS-17180, DAOS-16963,
193+
DAOS-18625).
194+
* Fix a rare DAV VOS heap chunk metadata state issue during engine restart that could cause
195+
future allocations to abort with an assertion failure (DAOS-18195).
196+
* Redirect PMDK internal error and warning messages to VOS logging instead of stderr,
197+
making them visible in the DAOS log with proper filtering (DAOS-16661).
198+
* Run vos\_pool\_create and vos\_pool\_open in deep-stack ULTs to prevent pmemobj\_create/open
199+
from overflowing the caller's stack (DAOS-18296).
200+
201+
### Known Issues
202+
203+
An issue with memory registration handling in the libfabric cxi provider
204+
may cause DER\_NOMEM errors during rebuild. This issue is fixed in
205+
libfabric PR https://github.com/ofiwg/libfabric/pull/11908,
206+
which has landed in libfabric but is not included in the libfabric
207+
version shipped with the current Slingshot Host Stack (SHS).
208+
The workaround is to install the latest libfabric, which includes
209+
this PR (DAOS-18326).
210+
211+
The default value for NA\_OFI\_UNEXPECTED\_TAG\_MSG was 1 through DAOS 2.6.4
212+
but has been changed to 0 in DAOS 2.6.5.
213+
In environments with different code versions on the clients and servers,
214+
the same value needs to be set to allow clients and servers to communicate.
215+
The preferred value for both is 0, so no action is needed on 2.6.5
216+
but NA\_OFI\_UNEXPECTED\_TAG\_MSG=0 must be explicitly set on 2.6.4
217+
or older (DAOS-18964).
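
The mixed-version rule for NA\_OFI\_UNEXPECTED\_TAG\_MSG can be expressed as a small check. This is a sketch under stated assumptions: the helper names are hypothetical, and the version comparison is only meant for the 2.6.x releases discussed here.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.6.5' into (2, 6, 5) for ordered comparison."""
    return tuple(int(part) for part in version.split("."))


def effective_tag_msg(version: str, explicit: Optional[int] = None) -> int:
    """Return the value in effect: the explicit setting if given, else the default.

    Defaults per the release note: 1 up to 2.6.4, 0 from 2.6.5 on.
    """
    if explicit is not None:
        return explicit
    return 0 if parse_version(version) >= (2, 6, 5) else 1


def settings_match(client_ver: str, server_ver: str,
                   client_env: Optional[int] = None,
                   server_env: Optional[int] = None) -> bool:
    """Clients and servers need the same effective value to communicate."""
    return (effective_tag_msg(client_ver, client_env)
            == effective_tag_msg(server_ver, server_env))


if __name__ == "__main__":
    # 2.6.4 client against a 2.6.5 server: defaults differ.
    print(settings_match("2.6.4", "2.6.5"))
    # Same pair with NA_OFI_UNEXPECTED_TAG_MSG=0 set on the 2.6.4 client.
    print(settings_match("2.6.4", "2.6.5", client_env=0))
```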
218+
219+
5220
## DAOS Version 2.6.4 (2025-10-29, updated 2025-11-04)
6221

7222
The DAOS 2.6.4 release includes the daos-2.6.4-7 RPM packages and their prerequisites.
