HBASE-29255: Integrate backup WAL cleanup logic with the delete command #7007
base: HBASE-28957
Conversation
Force-pushed from c033241 to bdc0de5
This looks good to me overall, aside from one nit comment.

One more thing: do you think a lot of these System.err/out.println() statements could be replaced with LOG.info/error()? I know we want to give some feedback to the user via the terminal, but it seems like many of these messages should go to the log (like the messages in BackupCommands.updateBackupTableStartTimes(), BackupCommands.deleteOldWALFiles(), etc.).
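For illustration, a minimal sketch of the replacement being suggested, using the SLF4J logger HBase already depends on (the logger field is illustrative; the message text is taken from the diff further down):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

private static final Logger LOG = LoggerFactory.getLogger(BackupCommands.class);

// Before: System.out.println("Deleting outdated WAL directory: " + dirPath);
// After: parameterized logging, so the string is only built when INFO is enabled
LOG.info("Deleting outdated WAL directory: {}", dirPath);
```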
Configuration conf = getConf() != null ? getConf() : HBaseConfiguration.create();
String backupWalDir = conf.get(CONF_CONTINUOUS_BACKUP_WAL_DIR);

if (backupWalDir == null || backupWalDir.isEmpty()) {
nit - You can use Strings.isNullOrEmpty() from org.apache.hbase.thirdparty.com.google.common.base:

- if (backupWalDir == null || backupWalDir.isEmpty()) {
+ if (Strings.isNullOrEmpty(backupWalDir)) {
Good point. We have a lot of println lines throughout the backup and restore code; let me create a new Jira to address this.
  return;
}

try (Connection conn = ConnectionFactory.createConnection(conf);
Nit: avoid a generic name like conn; use something more specific, like masterConn.
  return;
}

try (Connection conn = ConnectionFactory.createConnection(conf);
I feel this connection creation is unnecessary. The superclass already has a connection open; please verify whether you can reuse it.
True, we'll reuse that!
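A minimal sketch of that reuse, assuming the superclass exposes its open connection as a field named conn (the field name is an assumption here):

```java
// Instead of opening a second connection via ConnectionFactory.createConnection(conf),
// hand the superclass's existing connection to the backup system table helper.
try (BackupSystemTable sysTable = new BackupSystemTable(conn)) {
  // ... WAL cleanup logic ...
}
```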
// If WAL files of that day are older than cutoff time, delete them
if (dayStart + ONE_DAY_IN_MILLISECONDS - 1 < cutoffTime) {
  System.out.println("Deleting outdated WAL directory: " + dirPath);
  fs.delete(dirPath, true);
If there is an API to delete in batches, we should use it. Also, depending on the number of files you are deleting, this method can take a lot of time; maybe we can make it asynchronous here. Please give it a thought.
> If there is an api to delete in batches, we should use it.

Yeah, I checked but couldn't find any API that supports batch deletion.

> Also based on the nos of the file you are deleting this method can take lot of time. May be we can asynchronous here. Please give a thought

About going async: it's a good idea, but it might add some complexity. We'd need to track whether the delete actually finished, retry on failure, and maybe notify the user when it's done. So we should weigh whether the added complexity is worth the gain. Also, right now all our backup and restore commands (full backup, incremental, restore) are synchronous anyway, and those can take hours.

I think async is definitely a good direction; it just probably makes sense to build a proper framework around it first, so we can handle retries, tracking, and notifications across the board. What do you think?
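For context, a rough sketch of the shape such an async deletion could take with a plain bounded executor (walFs and dayDirs are hypothetical names for the backup filesystem and the list of outdated day directories); it also shows exactly the completion tracking that would still need a real policy:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.fs.Path;

// Submit one recursive delete per day directory, then wait for all of them.
ExecutorService pool = Executors.newFixedThreadPool(4);
List<Future<Boolean>> pending = new ArrayList<>();
for (Path dirPath : dayDirs) {
  pending.add(pool.submit(() -> walFs.delete(dirPath, true)));
}
for (int i = 0; i < pending.size(); i++) {
  // A real implementation would retry and surface failures to the user.
  if (!pending.get(i).get()) {
    System.err.println("Failed to delete " + dayDirs.get(i));
  }
}
pool.shutdown();
```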
Let's build a job coordinator framework with ZooKeeper. We should build that outside the scope of this ticket, of course.
Sure, let me create a Jira for that.
Good point, guys, but before going down this rabbit hole, please do some performance tests for justification. Try deleting 100, 10,000, and 1 million files in a single directory and share how much time it takes synchronously. Delete/unlink operations should be relatively quick in any filesystem, but let's see how it works with S3.
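A minimal sketch of that measurement, assuming a FileSystem handle fs and a test directory pre-populated with the target number of files (the path is hypothetical):

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Time one recursive delete of a directory holding N files.
Path testDir = new Path("/backup/wal-perf-test");
long start = System.nanoTime();
boolean deleted = fs.delete(testDir, true);
long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
System.out.println("deleted=" + deleted + " in " + elapsedMs + " ms");
```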
One discussion point. One change request.
 */
public void updateContinuousBackupTableSet(Set<TableName> tablesToUpdate, long newStartTimestamp)
  throws IOException {
  try (Table table = connection.getTable(tableName)) {
Nit: add a null check for tablesToUpdate.
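One possible form of that guard, purely for illustration:

```java
if (tablesToUpdate == null || tablesToUpdate.isEmpty()) {
  return; // nothing to update
}
```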
One more minor comment, but overall LGTM.
🎊 +1 overall
This message was automatically generated.

🎊 +1 overall
This message was automatically generated.
Thanks @vinayakphegde, the patch looks good to me. However, I have the same criticism that I mentioned previously: unit tests are missing.

Since all of your helper methods are private, you cannot test them individually, so you need to set up an entire starship in your test case, call the command, and verify the output. This is end-to-end testing: you get a yes/no answer to the question of whether the function works. If the answer is yes, we're fine, but if it's no, you'll have no idea where the problem is and you'll have to debug.

Unit testing individual methods gives more detail about what's working and what's not.
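As an illustration of the kind of focused test being asked for, a sketch that exercises just the day-cutoff predicate from the diff above, assuming that check were extracted into a package-private helper (the helper and test names are hypothetical):

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class TestWalCutoffPredicate {
  private static final long ONE_DAY_IN_MILLISECONDS = 24L * 60 * 60 * 1000;

  // Hypothetical extraction of the check used when deleting old WAL directories.
  static boolean dayIsFullyBeforeCutoff(long dayStart, long cutoffTime) {
    return dayStart + ONE_DAY_IN_MILLISECONDS - 1 < cutoffTime;
  }

  @Test
  public void testCutoffBoundary() {
    long dayStart = 0L;
    // Cutoff exactly at the next day's start: the whole day is older, so delete.
    assertTrue(dayIsFullyBeforeCutoff(dayStart, ONE_DAY_IN_MILLISECONDS));
    // Cutoff still inside the day: the day is not fully older, so keep it.
    assertFalse(dayIsFullyBeforeCutoff(dayStart, ONE_DAY_IN_MILLISECONDS - 1));
  }
}
```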
/**
 * Updates the start time for continuous backups if older than cutoff timestamp.
 * @param sysTable        Backup system table
 * @param cutoffTimestamp Timestamp before which WALs are no longer needed
 */
private void updateBackupTableStartTimes(BackupSystemTable sysTable, long cutoffTimestamp)
Hey @vinayakphegde, this is the function that led me to ask for clarification on why we need to update the start times of the continuous backups. Maybe you could add another line or two to the docstring here elaborating on why we need to do this? That may make it clearer to others in the future.
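One possible elaboration, purely as a sketch; the rationale stated here is inferred from this PR's cleanup flow, not confirmed by the author:

```java
/**
 * Updates the start time for continuous backups if older than cutoff timestamp.
 * <p>
 * Once WAL files older than the cutoff have been deleted, a recorded start
 * time before that cutoff would point at WALs that no longer exist. Advancing
 * the start time keeps the backup metadata consistent with the WALs that are
 * actually still available for replay.
 * @param sysTable        Backup system table
 * @param cutoffTimestamp Timestamp before which WALs are no longer needed
 */
```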