Skip to content

Conversation

@devabhishekpal
Copy link

What changes were proposed in this pull request?

Snapshot is not an atomic operation and in some cases there might be corrupted snapshots.
Every snapshot has an associated MD5 file with it, if this file is missing we can skip the snapshot.

This PR adds the check for the existence of MD5 file and skips any snapshot which doesn't have it.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-244

How was this patch tested?

Patch was tested manually by running the unit tests.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devabhishekpal , thanks for working on this! Just have a minor comment inlined.

final Matcher matcher = SNAPSHOT_REGEX.matcher(filename.toString());
if (matcher.matches()) {
// If the file doesn't have an MD5 hash it would doesn't need to be matched as it might be corrupted
if (MD5FileUtil.getDigestFileForFile(filename.toFile()).exists() && matcher.matches()) {
Copy link
Contributor

@szetszwo szetszwo Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's print a warning: -- it is not a good idea since it may print a lot of messages.

Just found that the method is called also by cleanupOldSnapshots(..). So, we need more changes as below:

diff --git a/ratis-server/src/main/java/org/apache/ratis/statemachine/impl/SimpleStateMachineStorage.java b/ratis-server/src/main/java/org/apache/ratis/statemachine/impl/SimpleStateMachineStorage.java
index 0ca6734a0..e20ac9115 100644
--- a/ratis-server/src/main/java/org/apache/ratis/statemachine/impl/SimpleStateMachineStorage.java
+++ b/ratis-server/src/main/java/org/apache/ratis/statemachine/impl/SimpleStateMachineStorage.java
@@ -82,7 +82,7 @@ public class SimpleStateMachineStorage implements StateMachineStorage {
     // TODO
   }
 
-  static List<SingleFileSnapshotInfo> getSingleFileSnapshotInfos(Path dir) throws IOException {
+  static List<SingleFileSnapshotInfo> getSingleFileSnapshotInfos(Path dir, boolean requireMd5) throws IOException {
     final List<SingleFileSnapshotInfo> infos = new ArrayList<>();
     try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
       for (Path path : stream) {
@@ -90,10 +90,14 @@ public class SimpleStateMachineStorage implements StateMachineStorage {
         if (filename != null) {
           final Matcher matcher = SNAPSHOT_REGEX.matcher(filename.toString());
           if (matcher.matches()) {
+            final boolean hasMd5 = MD5FileUtil.getDigestFileForFile(filename.toFile()).exists();
+            if (requireMd5 && !hasMd5) {
+              continue;
+            }
             final long term = Long.parseLong(matcher.group(1));
             final long index = Long.parseLong(matcher.group(2));
             final FileInfo fileInfo = new FileInfo(path, null); //No FileDigest here.
-            infos.add(new SingleFileSnapshotInfo(fileInfo, term, index));
+            infos.add(new SingleFileSnapshotInfo(fileInfo, term, index, hasMd5));
           }
         }
       }
@@ -114,11 +118,23 @@ public class SimpleStateMachineStorage implements StateMachineStorage {
       return;
     }
 
-    final List<SingleFileSnapshotInfo> allSnapshotFiles = getSingleFileSnapshotInfos(stateMachineDir.toPath());
+    final List<SingleFileSnapshotInfo> allSnapshotFiles = getSingleFileSnapshotInfos(stateMachineDir.toPath(), false);
+    allSnapshotFiles.sort(Comparator.comparing(SingleFileSnapshotInfo::getIndex).reversed());
 
-    if (allSnapshotFiles.size() > numSnapshotsRetained) {
-      allSnapshotFiles.sort(Comparator.comparing(SingleFileSnapshotInfo::getIndex).reversed());
-      allSnapshotFiles.subList(numSnapshotsRetained, allSnapshotFiles.size())
+    int numSnapshotsWithMd5 = 0;
+    int deleteIndex = -1;
+    for(int i = 0; i < allSnapshotFiles.size(); i++) {
+      if (allSnapshotFiles.get(i).hasMd5()) {
+        numSnapshotsWithMd5++;
+        if (numSnapshotsWithMd5 == numSnapshotsRetained) {
+          deleteIndex = i+1;
+          break;
+        }
+      }
+    }
+
+    if (deleteIndex > 0) { // keep at least one snapshot
+      allSnapshotFiles.subList(deleteIndex, allSnapshotFiles.size())
           .stream()
           .map(SingleFileSnapshotInfo::getFile)
           .map(FileInfo::getPath)
@@ -182,7 +198,7 @@ public class SimpleStateMachineStorage implements StateMachineStorage {
   }
 
   static SingleFileSnapshotInfo findLatestSnapshot(Path dir) throws IOException {
-    final Iterator<SingleFileSnapshotInfo> i = getSingleFileSnapshotInfos(dir).iterator();
+    final Iterator<SingleFileSnapshotInfo> i = getSingleFileSnapshotInfos(dir, true).iterator();
     if (!i.hasNext()) {
       return null;
     }
@@ -199,7 +215,7 @@ public class SimpleStateMachineStorage implements StateMachineStorage {
     final Path path = latest.getFile().getPath();
     final MD5Hash md5 = MD5FileUtil.readStoredMd5ForFile(path.toFile());
     final FileInfo info = new FileInfo(path, md5);
-    return new SingleFileSnapshotInfo(info, latest.getTerm(), latest.getIndex());
+    return new SingleFileSnapshotInfo(info, latest.getTerm(), latest.getIndex(), true);
   }
 
   public SingleFileSnapshotInfo updateLatestSnapshot(SingleFileSnapshotInfo info) {
diff --git a/ratis-server/src/main/java/org/apache/ratis/statemachine/impl/SingleFileSnapshotInfo.java b/ratis-server/src/main/java/org/apache/ratis/statemachine/impl/SingleFileSnapshotInfo.java
index 14d501a4a..f59d5ba77 100644
--- a/ratis-server/src/main/java/org/apache/ratis/statemachine/impl/SingleFileSnapshotInfo.java
+++ b/ratis-server/src/main/java/org/apache/ratis/statemachine/impl/SingleFileSnapshotInfo.java
@@ -28,12 +28,23 @@ import org.apache.ratis.server.storage.FileInfo;
  * The objects of this class are immutable.
  */
 public class SingleFileSnapshotInfo extends FileListSnapshotInfo {
+  private final Boolean hasMd5;
+
   public SingleFileSnapshotInfo(FileInfo fileInfo, TermIndex termIndex) {
+    this(fileInfo, termIndex, null);
+  }
+
+  public SingleFileSnapshotInfo(FileInfo fileInfo, TermIndex termIndex, Boolean hasMd5) {
     super(Collections.singletonList(fileInfo), termIndex);
+    this.hasMd5 = hasMd5;
+  }
+
+  public SingleFileSnapshotInfo(FileInfo fileInfo, long term, long endIndex, boolean hasMd5) {
+    this(fileInfo, TermIndex.valueOf(term, endIndex), hasMd5);
   }
 
-  public SingleFileSnapshotInfo(FileInfo fileInfo, long term, long endIndex) {
-    this(fileInfo, TermIndex.valueOf(term, endIndex));
+  public boolean hasMd5() {
+    return hasMd5 != null && hasMd5;
   }
 
   /** @return the file associated with the snapshot. */

@szetszwo
Copy link
Contributor

@devabhishekpal , could you also add some tests to TestRaftStorage?

@devabhishekpal devabhishekpal marked this pull request as ready for review November 21, 2025 19:20
@szetszwo
Copy link
Contributor

@devabhishekpal , thanks for the update! The change looks good but there are test failures. Please take a look.

@devabhishekpal
Copy link
Author

It seems the tests are failing because in findLatestSnapshot() we are doing:

getSingleFileSnapshotInfos(dir, true).iterator()

Now if we create an initial snapshot with md5 file, this will return null from the check:

if (!i.hasNext()) {
      return null;
}

We now create a second snapshot with no MD5 hash, this will still return null as the iterator still has one element.
Hence when we run a scenario as below:

try {
      createSnapshot(simpleStateMachineStorage, 1, 100, true);
      simpleStateMachineStorage.loadLatestSnapshot();

      createSnapshot(simpleStateMachineStorage, 1, 200, false);
      simpleStateMachineStorage.loadLatestSnapshot();

      SingleFileSnapshotInfo latest = simpleStateMachineStorage.getLatestSnapshot();
      Assertions.assertNotNull(latest);
}

latest is always returned as null even though there are two snapshots present.
Previous behaviour would have been that it would have returned one snapshot.

This is also causing gRPC test failures as SimpleStateMachineStorage.findLatestSnapshot() filters out every snapshot that lacks an MD5 file as we are passing (requireMd5 == true). So on the follower that just copied the snapshot, findLatestSnapshot now returns null - loadLatestSnapshot() never updates the cache, and the snapshot install is effectively ignored.
Hence follower never recognises the installed snapshot, it fails to catch up and doesn’t participate in the new configuration. setConfiguration therefore keeps retrying and times-out

@szetszwo do you think we might need to change the test case? Or should the behaviour be changed?

@szetszwo
Copy link
Contributor

latest is always returned as null even though there are two snapshots present.

It was a bug previously that it was using filename but not path. Is it still happening after the fix?

@devabhishekpal
Copy link
Author

Yup, we are still seeing the same issue

@szetszwo
Copy link
Contributor

... we are still seeing the same issue

Could you debug it and find out why?

@szetszwo
Copy link
Contributor

createSnapshot(simpleStateMachineStorage, 1, 100, true);

@devabhishekpal , your test creating fake MD5 file is causing exception

2025-11-29 13:43:15,570 [main] WARN  impl.SimpleStateMachineStorage (SimpleStateMachineStorage.java:loadLatestSnapshot(262)) - Failed to updateLatestSnapshot from target/test/data/81dc258c/TestRaftStorage/TestRaftStorage.testCreateSnapshot/sm
java.io.IOException: Invalid MD5 file target/test/data/81dc258c/TestRaftStorage/TestRaftStorage.testCreateSnapshot/sm/snapshot.1_100.md5: the content "null" does not match the pattern ([0-9a-f]{32}) [ *](.+)
	at org.apache.ratis.util.MD5FileUtil.lambda$readStoredMd5ForFile$0(MD5FileUtil.java:99)
	at java.base/java.util.Optional.orElseThrow(Optional.java:408)
	at org.apache.ratis.util.MD5FileUtil.readStoredMd5ForFile(MD5FileUtil.java:98)
	at org.apache.ratis.statemachine.impl.SimpleStateMachineStorage.findLatestSnapshot(SimpleStateMachineStorage.java:223)
	at org.apache.ratis.statemachine.impl.SimpleStateMachineStorage.loadLatestSnapshot(SimpleStateMachineStorage.java:258)
	at org.apache.ratis.server.storage.TestRaftStorage.testCreateSnapshot(TestRaftStorage.java:101)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants