HADOOP-16848. Refactoring: initial layering #1839
Draft: steveloughran wants to merge 2 commits into apache:trunk from steveloughran:s3/HADOOP-16848-refactoring-initial-layers
+2,288
−436
💔 -1 overall
This message was automatically generated.
Mockito test failure in async store launch.
Viewing this PR as a PoC, I think the async startup code needlessly complicates life; skipping bucket existence probes is enough to make for a fast launch. Proposed:
First PoC of my planned layout model of the S3A FS.
* There's a raw layer and a guarded layer
* which are instantiated in sequence in a separate executor from S3AFileSystem.initialize
and whose accessors block until completed or rethrow failures.

The layers are being handed all their dependencies from FS.initialize() and we currently block until started.

What I plan to do, in a future iteration, is:
* each layer extracts its own settings from the config and stores them locally (list version, upload size etc)
* have each layer instantiate its internal classes (AWS S3 client, transfer manager) internally
* also async create: metastore, DT binding
* and all startup actions (check bucket, init multipart, ...)

Then
* move ops to the layers, raw* -> rawStore; inner -> S3AStore
* move WriteOperationHelper, SelectBinding, etc, to all work against S3AStore rather than FS.

S3AStore will become where most of the code moves to; S3AFileSystem more of the init and binding to the Hadoop FS API. RawS3A will be the accessor through which all AWS client access goes.

Not going to change: all accessors on S3AFileSystem. Not just tests use them; some external code (cloudstore) needs them to get at low level S3A, etc.

Change-Id: I998c0d61cce2ee7fd0be804bf21da6b68fd69a6f

HADOOP-16583 refactoring RequestFactory and RawS3A

This moves most of the S3 client interaction into RawS3AImpl, mainly just by moving the methods from S3AFileSystem.

One key finding was that we can put all the code to create Request classes for the AWS SDK into its own factory, so ensuring everything is set up consistently and keeping what is mainly housekeeping out of the way of everything else. We can do that rework immediately, as it doesn't require the rest of the layering.

Also, the rawS3A() accessor no longer raises IOEs; it was used in many places where that wasn't allowed. I sort of expected that. There's an awaitQuietly() (todo: change name) method which returns all failures as RTEs rather than IOEs.

For the S3A FS API entry points, we still need to raise the normal startup exceptions, so we will need an extra block/validate there, after which, maybe, we can change the logic in these new getters just to raise an exception if their refs are null, rather than block.

Change-Id: I3d15a46dd3034fc5d34dbaf01aaddba462d63a9d

HADOOP-16583. refactoring - fix regression

I'd pulled an import which wasn't used, but a recent change reinstated its need.

Change-Id: I12b0725a14a6e7bae1ecf93fcbd4d05e1eacb03e

HADOOP-16848. Building against trunk

Change-Id: Id1af0763998dd285bb2476298cdb94edf809b67c
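The accessor behaviour described in these commit messages (start a layer in a separate executor, block in the accessor until it is ready, and offer a quiet variant which rethrows failures as runtime exceptions) could be sketched roughly as below. This is only an illustration under assumptions, not the PR's code: `LayerHolder`, `await()` and the use of `CompletableFuture` are hypothetical; only the name `awaitQuietly()` comes from the commit message.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;

/** Hypothetical holder for a layer which is started asynchronously. */
final class LayerHolder<T> {

  private final CompletableFuture<T> future = new CompletableFuture<>();

  /** Start the layer in the supplied executor. */
  LayerHolder(Executor executor, Callable<T> startup) {
    executor.execute(() -> {
      try {
        future.complete(startup.call());
      } catch (Throwable t) {
        // remember the failure so that every accessor call can rethrow it
        future.completeExceptionally(t);
      }
    });
  }

  /** Block until started; startup failures surface as IOExceptions. */
  T await() throws IOException {
    try {
      return future.join();
    } catch (CompletionException e) {
      Throwable cause = e.getCause();
      if (cause instanceof IOException) {
        throw (IOException) cause;
      }
      throw new IOException("Layer startup failed", cause);
    }
  }

  /** Block until started; all failures surface as RuntimeExceptions. */
  T awaitQuietly() {
    try {
      return await();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In a sketch like this, FS API entry points would call `await()` to get the normal startup exceptions, while internal getters that cannot declare `IOException` would use `awaitQuietly()`.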
RequestFactory adds factory construction and full set of operations.
* Moves all S3 IO into the RawS3A class, takes factory
* S3 select API call goes into the store
* remove attempt at async init. Too complex.

Change-Id: I36bb2884ebe0d7183f99f25860f79e3796a112fb
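The request factory idea (one place which builds every request, so that shared settings are applied consistently) might look something like the sketch below. Everything here is a stand-in for illustration: `PutRequest`, `DeleteRequest` and `SimpleRequestFactory` are hypothetical classes, not the AWS SDK request types and not the `RequestFactory` in this PR, and the ACL and encryption settings are assumed examples of shared configuration.

```java
/** Stand-in for an SDK upload request; NOT the real AWS SDK type. */
final class PutRequest {
  final String bucket;
  final String key;
  final String cannedAcl;
  final String encryption;

  PutRequest(String bucket, String key, String cannedAcl, String encryption) {
    this.bucket = bucket;
    this.key = key;
    this.cannedAcl = cannedAcl;
    this.encryption = encryption;
  }
}

/** Stand-in for an SDK delete request; NOT the real AWS SDK type. */
final class DeleteRequest {
  final String bucket;
  final String key;

  DeleteRequest(String bucket, String key) {
    this.bucket = bucket;
    this.key = key;
  }
}

/** Hypothetical factory: the only place where requests are configured. */
final class SimpleRequestFactory {
  private final String bucket;
  private final String cannedAcl;
  private final String encryption;

  SimpleRequestFactory(String bucket, String cannedAcl, String encryption) {
    this.bucket = bucket;
    this.cannedAcl = cannedAcl;
    this.encryption = encryption;
  }

  /** Every upload request gets the same bucket, ACL and encryption settings. */
  PutRequest newPutRequest(String key) {
    return new PutRequest(bucket, key, cannedAcl, encryption);
  }

  /** Delete requests only need the bucket and key. */
  DeleteRequest newDeleteRequest(String key) {
    return new DeleteRequest(bucket, key);
  }
}
```

Callers then ask the factory for a request instead of constructing and configuring one themselves, which keeps that housekeeping out of the IO code.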
Labels
fs/s3
changes related to hadoop-aws; submitter must declare test endpoint
work in progress
PRs still Work in Progress; reviews not expected but still welcome
A Refactoring of the S3A code structure
Goals: layer better for isolation, maintenance, testability, extra features
The latest version has the RawS3A and request factory design. The store layer, less so.
Given we won't need S3Guard forever, the split between store and raw S3 is potentially simpler. Do we need two layers? I'd say yes, for now.
But looking at the RawS3A code, I think all the code in there doing retry, exception translation etc. is kind of store level.
Proposed
RawS3A is actually object operations, including all retries and exception translation. Other than the factory and retry invoker, it is self-contained.
S3Store is the FS model, but as the internal API, not the Hadoop FS API. Its arguments are generally Paths and FS API operations.
This relates to the idea of a context plugin
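As a rough illustration of the proposed split, the sketch below has an object-level layer which works on keys and does the exception translation, and a store layer which works on paths and delegates downwards. All names (`ObjectOperations`, `InMemoryObjects`, `PathStore`) are hypothetical, paths are plain strings rather than Hadoop `Path` objects, and an in-memory map stands in for S3; retries are omitted.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical object-level API: keys in, IOExceptions out. */
interface ObjectOperations {
  long objectLength(String key) throws IOException;
}

/** In-memory stand-in for the raw layer; translation happens here. */
final class InMemoryObjects implements ObjectOperations {
  private final Map<String, byte[]> objects = new ConcurrentHashMap<>();

  void put(String key, byte[] data) {
    objects.put(key, data);
  }

  @Override
  public long objectLength(String key) throws IOException {
    byte[] data = objects.get(key);
    if (data == null) {
      // translate "no such key" into the exception the FS layer expects
      throw new FileNotFoundException("No object at key " + key);
    }
    return data.length;
  }
}

/** Hypothetical store layer: takes paths, delegates to object operations. */
final class PathStore {
  private final ObjectOperations raw;

  PathStore(ObjectOperations raw) {
    this.raw = raw;
  }

  /** Map an absolute path such as "/dir/file" to the key "dir/file". */
  static String pathToKey(String path) {
    return path.startsWith("/") ? path.substring(1) : path;
  }

  long fileLength(String path) throws IOException {
    return raw.objectLength(pathToKey(path));
  }
}
```

Because the store layer only sees the `ObjectOperations` interface, it can be tested against an in-memory implementation like this one without any S3 endpoint.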
Complications
Maybe every RawS3A API Call takes a RequestContext with this and anything else?
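One possible shape for such a context argument is sketched below. It is purely illustrative: the PR does not define `RequestContext` here, so the fields (an operation name plus free-form attributes) and the `ContextualObjectOperations` interface are assumptions made only to show a context being passed on every call.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical immutable per-call context; the fields are illustrative only. */
final class RequestContext {
  private final String operation;
  private final Map<String, String> attributes;

  RequestContext(String operation, Map<String, String> attributes) {
    this.operation = operation;
    // defensive copy so later changes by the caller are not visible
    this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
  }

  String operation() {
    return operation;
  }

  String attribute(String name, String defVal) {
    return attributes.getOrDefault(name, defVal);
  }
}

/** Hypothetical raw-layer call which takes the context as its first argument. */
interface ContextualObjectOperations {
  boolean deleteObject(RequestContext context, String key);
}
```

Passing one context object keeps method signatures stable if more per-request state needs to be carried later.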