[dbsync] Fix open issues in rsync library and make it functional #782

Open
AchuthaSourabhC wants to merge 4 commits into main from update-rsync-staging-logic

AchuthaSourabhC commented Mar 8, 2025

This PR fixes the open issues needed to make the rsync library functional.

Issues being fixed

  1. The first sync failed when the file was not present in the target; on the first sync the file has to be created and copied directly.
  2. An exception was thrown if the file size was less than the minimum file size constant; in this case the file can simply be copied directly.
  3. Staging files (checksum, instruction, staged) were not stored in a proper directory structure, which would cause problems when syncing multiple files.
  4. Once rsync was done and the staged file was created from the instructions, it was not being moved/copied to the target, so the target file was left out of date.
  5. Staging files were not cleaned up at the end.
  6. If the trailing slash was missing from the staging path, invalid staging file paths were created (the slash needs to be added if missing).
  7. Cloud Run jobs were not deleted at the end.
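
Issues 1 and 2 describe the same fallback: skip the checksum/instruction steps and copy the file outright. A minimal sketch of that decision follows. The class name, method name, constant name, and threshold value are all hypothetical and are not taken from the PR, which only says that a minimum file size constant exists.

```java
/** Illustrative only; the class, method, and constant names are not from the PR. */
final class DirectCopyDecisionSketch {
    // Placeholder value: the PR mentions a minimum file size constant but not its value.
    static final long MIN_FILE_SIZE_BYTES = 1024;

    /**
     * Returns true when the sync should bypass checksum/instruction generation
     * and copy the source straight to the target.
     */
    static boolean shouldCopyDirectly(boolean targetExists, long sourceSizeBytes) {
        // First sync: there is no target file to diff against.
        if (!targetExists) {
            return true;
        }
        // Small file: below the minimum size, so no delta is computed.
        return sourceSizeBytes < MIN_FILE_SIZE_BYTES;
    }
}
```

Keeping the check in one place means both cases take the same direct-copy path instead of one of them surfacing as an exception.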

Other changes

  1. Added a param to decide whether staging files should be deleted (keeping them is helpful for debugging and testing).
  2. Fixed formatting and naming issues (many of the updates in this PR are related to this).

Testing:
Tested manually with the following params:

--project=achuthasourabh-test --location=us-west1 --target_file=gs://rsync_target/data/test.csv --source_file=file:///usr/local/google/home/achuthasourabh/Downloads/annual-enterprise-survey-2023-financial-year-provisional-size-bands.csv --staging_bucket=gs://rsync_staging --delete_staging_files=false

Screenshot of staging bucket files
image

Screenshot of target file
image

Note: Proper integ/e2e tests still need to be set up, but for now the aim is to get this working and unblock dependent work.

AchuthaSourabhC force-pushed the update-rsync-staging-logic branch from 156aeb9 to d39671f on March 8, 2025 15:26
AchuthaSourabhC changed the title from "[b/400338195] Fix how staging files are stored and target is updated" to "[dbsync] Fix how staging files are stored and target is updated" on Mar 9, 2025
AchuthaSourabhC changed the title from "[dbsync] Fix how staging files are stored and target is updated" to "[dbsync] Fix open issues in rsync library and make it functional e2e" on Mar 9, 2025
AchuthaSourabhC changed the title from "[dbsync] Fix open issues in rsync library and make it functional e2e" to "[dbsync] Fix open issues in rsync library and make it functional" on Mar 9, 2025
AchuthaSourabhC marked this pull request as ready for review on March 9, 2025 11:31
try {
instruction.writeDelimitedTo(instructionStream);
} catch (IOException e) {
throw new RuntimeException("Failed to write instructions", e);
Contributor

Will we still delete the intermediate files, such as checksum or instructions, if it throws here?

Author

I was thinking about the best way to handle this, because a failure can happen at different stages and some of the files we try to delete may not have been created yet.
One option is to do best-effort cleanup in a finally block at the top level that covers all exceptions, ignoring any error raised while deleting, including file-not-found errors.
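
The best-effort option could look roughly like the sketch below. It uses local files via `java.nio.file` as a stand-in for the GCS-backed staging files, and `BestEffortCleanupSketch`, `runSync`, and `deleteQuietly` are hypothetical names that do not come from the PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Illustrative only: local files stand in for the GCS staging objects. */
final class BestEffortCleanupSketch {

    /** Deletes each file if present; swallows any failure. Returns how many were removed. */
    static int deleteQuietly(List<Path> stagingFiles) {
        int deleted = 0;
        for (Path file : stagingFiles) {
            try {
                // deleteIfExists returns false (no exception) when the file was never created.
                if (Files.deleteIfExists(file)) {
                    deleted++;
                }
            } catch (IOException | RuntimeException e) {
                // Best effort: a failed delete must not mask the original sync failure.
            }
        }
        return deleted;
    }

    /** Runs the sync steps and always attempts cleanup, whether or not they throw. */
    static void runSync(Runnable syncSteps, List<Path> stagingFiles, boolean deleteStagingFiles) {
        try {
            syncSteps.run();
        } finally {
            if (deleteStagingFiles) {
                deleteQuietly(stagingFiles);
            }
        }
    }
}
```

Because the cleanup sits in `finally` and never throws, the exception from the failing stage still propagates to the caller unchanged.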

}

// Reconstruct on GCS
server.reconstruct();

//TODO delete the jobs from cloudrun
server.deleteJob(Mode.GENERATE);
Contributor

Is the goal to have these running as servers instead of as a one time command line program later?

Author

To my knowledge, we wanted to run these as on-demand jobs, since they may not need to execute very often.
I did check whether there was a way to reuse deployed jobs and run them with different inputs instead of deploying new jobs for each file, but for now I couldn't find a good solution for this.

project, location, Mode.GENERATE,
String.format("%s && %s", jarDownloadCommand, generateCommand)
);
project,
Contributor

I'd suggest separating these formatting changes next time. It makes it harder to see what are real changes and what are just style changes.

Author

Yes, will keep that in mind. I had initially set up "format on save" and didn't realize it would produce such a big diff.

// If target files path is gs://target_bucket/dir/target/file.txt
// Staging bucket is gs://target_bucket/
// Then staging files will be stored as gs://target_bucket/dir/target/file.txt/staged
String targetRelativePath = UriUtil.getRelativePath(targetUri);
Contributor

This variable is used only once a line below, you can inline it

Author

Yes will update
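
For reference, the layout described in the code comment above (staging files stored under the target's relative path) can be sketched with plain string handling. This is not the PR's `UriUtil` implementation, whose behavior is not shown here; `StagingPathSketch` and its methods are hypothetical names, and the sketch assumes hierarchical `gs://bucket/path` URIs.

```java
import java.net.URI;

/** Illustrative only; does not reproduce the PR's UriUtil. */
final class StagingPathSketch {

    /** Appends a trailing slash when missing so later concatenation yields valid paths. */
    static String withTrailingSlash(String prefix) {
        return prefix.endsWith("/") ? prefix : prefix + "/";
    }

    /** Object path inside the bucket, without the leading slash. */
    static String relativePath(URI targetUri) {
        String path = targetUri.getPath();
        return path.startsWith("/") ? path.substring(1) : path;
    }

    /** Staging location for one staging file (for example "staged") of a given target. */
    static String stagingUri(String stagingBucket, URI targetUri, String fileName) {
        return withTrailingSlash(stagingBucket) + relativePath(targetUri) + "/" + fileName;
    }
}
```

With the example from the code comment, a staging bucket given as either `gs://target_bucket` or `gs://target_bucket/` and the target `gs://target_bucket/dir/target/file.txt` both resolve to `gs://target_bucket/dir/target/file.txt/staged`, so each target gets its own staging directory.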

// TODO: Switch to simple write to save storage and read by byte size
checksum.writeDelimitedTo(checksumStream);
} catch (IOException e) {
throw new RuntimeException("Failed to generate checksum", e);
Contributor

We won't delete the staging files in this case?

Author

Similar to the other comment: #782 (comment)
We need a good way to clean up regardless of the error.

@@ -55,4 +63,14 @@ public ByteSource getInstructionsByteSource() {
public ByteSink getInstructionsByteSink() {
return storage.newByteSink(instructionUri);
}

@Override
public void moveStagedFileToTarget(boolean deleteStagingFiles) {
Contributor

I think it's better to have a deleteStagedFiles method instead of passing in boolean flag here.

Author

Yes, that makes sense. Will update.
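
One possible shape for the suggested split is sketched below, with an in-memory map standing in for the storage layer. Only `moveStagedFileToTarget` and the proposed `deleteStagedFiles` are named in this thread; `StagingStoreSketch`, `InMemoryStagingStore`, and the key layout are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative interface: moving and deleting are separate operations, no boolean flag. */
interface StagingStoreSketch {
    void moveStagedFileToTarget();
    void deleteStagedFiles();
}

/** In-memory stand-in for the storage-backed implementation. */
final class InMemoryStagingStore implements StagingStoreSketch {
    final Map<String, String> objects = new HashMap<>();
    private final String targetKey;
    private final String stagingPrefix;

    InMemoryStagingStore(String targetKey, String stagingPrefix) {
        this.targetKey = targetKey;
        this.stagingPrefix = stagingPrefix;
    }

    @Override
    public void moveStagedFileToTarget() {
        String staged = objects.get(stagingPrefix + "staged");
        if (staged == null) {
            throw new IllegalStateException("No staged file to move");
        }
        // Copy only; removal of staging files is the caller's separate decision.
        objects.put(targetKey, staged);
    }

    @Override
    public void deleteStagedFiles() {
        // Removes every object under the staging prefix (checksum, instructions, staged).
        objects.keySet().removeIf(key -> key.startsWith(stagingPrefix));
    }
}
```

The caller then decides whether to invoke `deleteStagedFiles()` based on the `--delete_staging_files` param, which keeps each method doing one thing.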

2 participants