
HBASE-29255: Integrate backup WAL cleanup logic with the delete command #7007


Open
wants to merge 5 commits into base: HBASE-28957

Conversation

vinayakphegde (Contributor)

No description provided.


@kgeisz left a comment:

This looks good to me overall, aside from one nit comment.

One more thing: do you think a lot of these System.err/out.println() statements could be replaced with LOG.info/error()? I know we want to give the user some feedback via the terminal, but it seems like many of these messages should go to the log (like the messages in BackupCommands.updateBackupTableStartTimes(), BackupCommands.deleteOldWALFiles(), etc.).

```java
Configuration conf = getConf() != null ? getConf() : HBaseConfiguration.create();
String backupWalDir = conf.get(CONF_CONTINUOUS_BACKUP_WAL_DIR);

if (backupWalDir == null || backupWalDir.isEmpty()) {
```
A reviewer commented:

nit: You can use Strings.isNullOrEmpty() from org.apache.hbase.thirdparty.com.google.common.base.

Suggested change:

```diff
- if (backupWalDir == null || backupWalDir.isEmpty()) {
+ if (Strings.isNullOrEmpty(backupWalDir)) {
```
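For reference, a minimal sketch of the suggested guard with the shaded import the nit names (assuming the shaded class mirrors Guava's Strings, as the package path suggests; the error message is illustrative):

```java
import org.apache.hbase.thirdparty.com.google.common.base.Strings;

// isNullOrEmpty(s) is true when s is null or the empty string "".
if (Strings.isNullOrEmpty(backupWalDir)) {
  System.err.println(CONF_CONTINUOUS_BACKUP_WAL_DIR + " is not configured"); // illustrative message
  return;
}
```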


@vinayakphegde (Contributor, Author) commented on Jun 3, 2025:

> One more thing: do you think a lot of these System.err/out.println() statements could be replaced with LOG.info/error()? I know we want to give the user some feedback via the terminal, but it seems like many of these messages should go to the log (like the messages in BackupCommands.updateBackupTableStartTimes(), BackupCommands.deleteOldWALFiles(), etc.)

Good point. We have a lot of println lines throughout the backup and restore code; let me create a new Jira to address this.
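For illustration, a minimal sketch of the println-to-logger replacement being discussed (assuming the standard SLF4J pattern used across HBase; the class and messages here are hypothetical):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class WalCleanupLoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(WalCleanupLoggingSketch.class);

  void reportDeletion(String dirPath) {
    // Instead of: System.out.println("Deleting outdated WAL directory: " + dirPath);
    // Parameterized logging avoids string concatenation when the level is disabled.
    LOG.info("Deleting outdated WAL directory: {}", dirPath);
  }

  void reportFailure(String dirPath, Exception cause) {
    // Instead of: System.err.println(...); passing the exception last logs the stack trace.
    LOG.error("Failed to delete WAL directory: {}", dirPath, cause);
  }
}
```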

```java
  return;
}

try (Connection conn = ConnectionFactory.createConnection(conf);
```
A contributor commented:

NIT: avoid using a generic name like conn; use something specific like masterConn.

@abhradeepkundu (Contributor) commented on Jun 4, 2025:

This connection creation is unnecessary, I feel. The superclass already has a connection open; please verify whether you can reuse it.

@vinayakphegde (Contributor, Author) replied:

True, we'll reuse that!

```java
// If WAL files of that day are older than cutoff time, delete them
if (dayStart + ONE_DAY_IN_MILLISECONDS - 1 < cutoffTime) {
  System.out.println("Deleting outdated WAL directory: " + dirPath);
  fs.delete(dirPath, true);
```
A contributor commented:

If there is an API to delete in batches, we should use it. Also, depending on the number of files you are deleting, this method can take a lot of time; maybe we can go asynchronous here. Please give it a thought.

@vinayakphegde (Contributor, Author) replied:

> If there is an API to delete in batches, we should use it.

Yeah, I checked but couldn't find any API that supports batch deletion.

> Also, depending on the number of files you are deleting, this method can take a lot of time; maybe we can go asynchronous here.

About going async: it's a good idea, but it might add some complexity. We'd need to track whether the delete actually finished, retry on failure, and maybe notify the user when it's done.

So we should probably weigh whether the added complexity is worth the gain. Also, right now all our backup and restore commands (full backup, incremental, restore) are synchronous anyway, and those can take hours.

I think async is definitely a good direction — just that it probably makes sense to build a proper framework around it first, so we can handle retries, tracking, and notifications across the board. What do you think?
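To make the trade-off concrete, here is a minimal fire-and-forget sketch of the async variant (hypothetical; it deliberately ignores the retry, tracking, and notification concerns raised above, which is exactly why a proper framework would be needed):

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AsyncWalDeleteSketch {
  // Small dedicated pool so deletes do not block the command thread.
  private final ExecutorService deletePool = Executors.newFixedThreadPool(4);

  CompletableFuture<Boolean> deleteAsync(FileSystem fs, Path dirPath) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        // Same recursive delete as the synchronous code, just off-thread.
        return fs.delete(dirPath, true);
      } catch (IOException e) {
        throw new CompletionException(e);
      }
    }, deletePool);
  }
}
```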

A contributor replied:

Let's build a job coordinator framework with ZooKeeper. We should build that outside the scope of this ticket, of course.

@vinayakphegde (Contributor, Author) replied:

Sure, let me create a Jira for that.

A contributor commented:

Good point, guys, but before going down this rabbit hole, please do some performance tests for justification. Try deleting 100, 10,000, and 1 million files in a single directory and share how much time it takes synchronously. Delete/unlink operations should be relatively quick on any filesystem, but let's see how it works with S3.
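A rough harness for that experiment might look like the following (a sketch under assumptions: plain Hadoop FileSystem API, empty marker files, one recursive delete per directory; point the configuration at an s3a:// URI to test S3):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteBenchmarkSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf); // local FS by default; set fs.defaultFS for S3
    Path dir = new Path("/tmp/delete-bench");

    for (int count : new int[] { 100, 10_000, 1_000_000 }) {
      fs.mkdirs(dir);
      for (int i = 0; i < count; i++) {
        fs.create(new Path(dir, "f" + i)).close(); // empty file
      }
      long start = System.nanoTime();
      fs.delete(dir, true); // synchronous recursive delete
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      System.out.println(count + " files deleted in " + elapsedMs + " ms");
    }
  }
}
```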

@abhradeepkundu (Contributor) left a comment:

One discussion point. One change request.


```java
 */
public void updateContinuousBackupTableSet(Set<TableName> tablesToUpdate, long newStartTimestamp)
  throws IOException {
  try (Table table = connection.getTable(tableName)) {
```
A contributor commented:

NIT: Add a null check for tablesToUpdate
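A minimal version of that guard (a sketch; whether an empty set deserves the same early return as null is a judgment call):

```java
public void updateContinuousBackupTableSet(Set<TableName> tablesToUpdate, long newStartTimestamp)
  throws IOException {
  if (tablesToUpdate == null || tablesToUpdate.isEmpty()) {
    return; // nothing to update
  }
  // ... proceed with the table update as before
}
```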

@abhradeepkundu (Contributor) left a comment:

One more minor comment, but overall LGTM.

@Apache-HBase commented:

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:---|:---|:---|:---|:---|
| +0 🆗 | reexec | 0m 43s | | Docker mode activated. |
| | _Prechecks_ | | | |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | hbaseanti | 0m 0s | | Patch does not have any anti-patterns. |
| | _HBASE-28957 Compile Tests_ | | | |
| +1 💚 | mvninstall | 4m 2s | | HBASE-28957 passed |
| +1 💚 | compile | 0m 33s | | HBASE-28957 passed |
| -0 ⚠️ | checkstyle | 0m 11s | /buildtool-branch-checkstyle-hbase-backup.txt | The patch fails to run checkstyle in hbase-backup |
| +1 💚 | spotbugs | 0m 34s | | HBASE-28957 passed |
| +1 💚 | spotless | 0m 46s | | branch has no errors when running spotless:check. |
| | _Patch Compile Tests_ | | | |
| +1 💚 | mvninstall | 3m 1s | | the patch passed |
| +1 💚 | compile | 0m 30s | | the patch passed |
| -0 ⚠️ | javac | 0m 30s | /results-compile-javac-hbase-backup.txt | hbase-backup generated 4 new + 109 unchanged - 0 fixed = 113 total (was 109) |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 ⚠️ | checkstyle | 0m 9s | /buildtool-patch-checkstyle-hbase-backup.txt | The patch fails to run checkstyle in hbase-backup |
| +1 💚 | spotbugs | 0m 36s | | the patch passed |
| +1 💚 | hadoopcheck | 12m 5s | | Patch does not cause any errors with Hadoop 3.3.6 3.4.0. |
| +1 💚 | spotless | 0m 45s | | patch has no errors when running spotless:check. |
| | _Other Tests_ | | | |
| +1 💚 | asflicense | 0m 10s | | The patch does not generate ASF License warnings. |
| | | 31m 53s | | |

| Subsystem | Report/Notes |
|:---|:---|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7007/5/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | #7007 |
| JIRA Issue | HBASE-29255 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 3f85f005e67d 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | HBASE-28957 / 3655d48 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Max. process+thread count | 84 (vs. ulimit of 30000) |
| modules | C: hbase-backup U: hbase-backup |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7007/5/console |
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase commented:

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:---|:---|:---|:---|:---|
| +0 🆗 | reexec | 0m 48s | | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 4s | | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
| | _Prechecks_ | | | |
| | _HBASE-28957 Compile Tests_ | | | |
| +1 💚 | mvninstall | 4m 27s | | HBASE-28957 passed |
| +1 💚 | compile | 0m 37s | | HBASE-28957 passed |
| +1 💚 | javadoc | 0m 32s | | HBASE-28957 passed |
| +1 💚 | shadedjars | 8m 18s | | branch has no errors when building our shaded downstream artifacts. |
| | _Patch Compile Tests_ | | | |
| +1 💚 | mvninstall | 4m 17s | | the patch passed |
| +1 💚 | compile | 0m 34s | | the patch passed |
| +1 💚 | javac | 0m 34s | | the patch passed |
| +1 💚 | javadoc | 0m 25s | | the patch passed |
| +1 💚 | shadedjars | 8m 34s | | patch has no errors when building our shaded downstream artifacts. |
| | _Other Tests_ | | | |
| +1 💚 | unit | 25m 26s | | hbase-backup in the patch passed. |
| | | 55m 1s | | |

| Subsystem | Report/Notes |
|:---|:---|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7007/5/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile |
| GITHUB PR | #7007 |
| JIRA Issue | HBASE-29255 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux 2cf9fb5e70ef 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | HBASE-28957 / 3655d48 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7007/5/testReport/ |
| Max. process+thread count | 3543 (vs. ulimit of 30000) |
| modules | C: hbase-backup U: hbase-backup |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7007/5/console |
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

@anmolnar (Contributor) left a comment:

Thanks @vinayakphegde, the patch looks good to me. However, I have the same criticism that I mentioned previously: unit tests are missing.

Since all of your helper methods are private, you cannot test them individually, so you need to set up an entire starship in your test case, call the command, and verify the output. This is end-to-end testing: you will get a yes/no answer to the question of whether your function is working. If the answer is yes, we're fine, but if it's no, you'll have no idea where the problem is and you'll have to debug.

Unit testing individual methods gives more detail about what's working and what's not.
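As an illustration of the alternative, a sketch of a focused unit test (hypothetical: it assumes the cutoff comparison from deleteOldWALFiles() is extracted into a package-private helper so the test can reach it):

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class TestWalCutoff {

  // Hypothetical extraction of the day-window check used before fs.delete().
  static boolean isDayOlderThanCutoff(long dayStart, long oneDayMillis, long cutoffTime) {
    return dayStart + oneDayMillis - 1 < cutoffTime;
  }

  @Test
  public void deletesDayEntirelyBeforeCutoff() {
    long oneDay = 24L * 60 * 60 * 1000;
    assertTrue(isDayOlderThanCutoff(0L, oneDay, oneDay + 1));
  }

  @Test
  public void keepsDayOverlappingCutoff() {
    long oneDay = 24L * 60 * 60 * 1000;
    assertFalse(isDayOlderThanCutoff(0L, oneDay, oneDay - 1));
  }
}
```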


Comment on lines +947 to +952:

```java
/**
 * Updates the start time for continuous backups if older than cutoff timestamp.
 * @param sysTable        Backup system table
 * @param cutoffTimestamp Timestamp before which WALs are no longer needed
 */
private void updateBackupTableStartTimes(BackupSystemTable sysTable, long cutoffTimestamp)
```
A reviewer commented:

Hey @vinayakphegde, this is the function that led me to ask for clarification on why we need to update the start times of the continuous backups. Maybe you could add another line or two to the docstring here that elaborates on why we need to do this? That might make it clearer to others in the future.
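One possible elaboration, as a sketch (the rationale is inferred from this PR's cleanup flow and should be verified against the actual design):

```java
/**
 * Updates the start time for continuous backups if older than cutoff timestamp.
 * <p>
 * Once WALs older than the cutoff are deleted, any backup whose recorded start
 * time still points before the cutoff would reference WAL files that no longer
 * exist. Advancing the start time to the cutoff keeps the backup metadata
 * consistent with the WALs that actually remain.
 * @param sysTable        Backup system table
 * @param cutoffTimestamp Timestamp before which WALs are no longer needed
 */
private void updateBackupTableStartTimes(BackupSystemTable sysTable, long cutoffTimestamp)
```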
