HADOOP-14837 : Support Read Restored Glacier Objects #6407
base: trunk
Conversation
Thanks for contributing, looks good mostly, sharing some initial feedback.
Let's get rid of the acceptor, update documentation, and add an ITest.
Test that for a glacier file, if it's READ_ALL, an error is thrown (I think that is existing behaviour?), and if it's SKIP_ALL_GLACIER then it's skipped. And test restore behaviour too if possible.
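A rough sketch of the kind of ITest checks meant here, assuming the createFiles(filterName) helper and methodPath() from the test code quoted later in this review; the file name, exception type and helper names are assumptions, not settled API:

// Sketch only. Assumes createFiles() writes a glacier-class object under
// methodPath(), and that reading an unrestored glacier object surfaces as
// an IOException subclass (the existing behaviour being pinned down here).
@Test
public void testSkipAllGlacier() throws Throwable {
  try (FileSystem fs = createFiles(S3ObjectStorageClassFilter.SKIP_ALL_GLACIER.name())) {
    Assertions.assertThat(fs.listStatus(methodPath()))
        .describedAs("listing of %s with SKIP_ALL_GLACIER", methodPath())
        .isEmpty();
  }
}

@Test
public void testReadAllFailsOnUnrestoredObject() throws Throwable {
  try (FileSystem fs = createFiles(S3ObjectStorageClassFilter.READ_ALL.name())) {
    // "file1" is a placeholder for the object created by createFiles()
    LambdaTestUtils.intercept(IOException.class, () -> {
      try (FSDataInputStream in = fs.open(new Path(methodPath(), "file1"))) {
        return in.read();
      }
    });
  }
}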
/**
 * S3ObjectStorageClassFilter will filter the S3 files based on the fs.s3a.glacier.read.restored.objects configuration set in S3AFileSystem
 * The config can have 3 values:
 * READ_ALL: This would conform to the current default behavior of not taking into account the storage classes retrieved from S3. This will be done to keep the current behavior for the customers and not changing the experience for them.
Instead of saying current behaviour, can you specify what that current behaviour is. It errors right?
Sure will update
most of my comments are on the basic stuff, especially those test assertions and the need to have a single factored out assertion() method.
Now, can we have the storage class as an attribute in S3AFileStatus? populated in listings and from HEAD requests? included in .toString()? that could be useful in future
@@ -18,6 +18,7 @@
package org.apache.hadoop.fs.s3a;

import org.apache.hadoop.util.Lists;
should go with the rest of the hadoop imports. it's in the wrong place in a lot of files as the move off guava lists was a search and replace without reordering
 * Accept all entries except Glacier objects if the config fs.s3a.glacier.read.restored.objects,
 * is set to SKIP_ALL_GLACIER
 */
public static class GlacierStatusAcceptor implements FileStatusAcceptor {
does this storage class come in on a v2 LIST request? guess it must...
@@ -52,6 +52,8 @@
import java.util.concurrent.atomic.AtomicBoolean;
import javax.annotation.Nullable;

import org.apache.hadoop.fs.s3a.Listing.AcceptAllButSelfAndS3nDirs;
nit, import ordering
@@ -5213,7 +5223,7 @@ private RemoteIterator<S3ALocatedFileStatus> innerListFiles(
    RemoteIterator<S3ALocatedFileStatus> listFilesAssumingDir =
        listing.getListFilesAssumingDir(path,
            recursive,
-           acceptor,
+           acceptors,
nit: restore consistent indentation
/**
 * S3ObjectStorageClassFilter will filter the S3 files based on the fs.s3a.glacier.read.restored.objects configuration set in S3AFileSystem
be good to use <p> on lines and {@code } around formatted code so that javadocs and IDEs render better
    RestoreStatus.builder().isRestoreInProgress(false).build(),
    ObjectStorageClass.GLACIER));

Assert.assertFalse(result);
i require a meaningful error message on all failures. ideally using assertj, whose describedAs() method can do string formatting.
o = getS3ObjectWithStorageClassAndRestoreStatus(...)
Assertions.assertThat(acceptor.accept(...))
.describedAs("accept %s", o)
.isFalse()
just imagine you've seen a test failure: what information would you want in the assertion message to begin debugging it?
    RestoreStatus.builder().isRestoreInProgress(false).build(),
    ObjectStorageClass.DEEP_ARCHIVE));

Assert.assertFalse(result);
same. you could actually have an assertAcceptance(object, outcome) method instead of all this duplication
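For illustration, such a helper might look like this sketch; the accept(Path, S3Object) overload is assumed from the FileStatusAcceptor interface, and `acceptor` is the instance under test:

// Sketch: one shared assertion instead of repeated Assert.assertFalse(result).
private void assertAcceptance(S3Object object, boolean expected) {
  Assertions.assertThat(acceptor.accept(new Path("/test/" + object.key()), object))
      .describedAs("acceptance of %s", object)
      .isEqualTo(expected);
}

// each case above then collapses to:
assertAcceptance(
    getS3ObjectWithStorageClassAndRestoreStatus(
        RestoreStatus.builder().isRestoreInProgress(false).build(),
        ObjectStorageClass.DEEP_ARCHIVE),
    false);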
@@ -576,6 +580,10 @@ public void initialize(URI name, Configuration originalConf)

    s3aInternals = createS3AInternals();

    s3ObjectStorageClassFilter = Optional.ofNullable(conf.get(READ_RESTORED_GLACIER_OBJECTS))
use getTrimmed(key, ""), toUpper() and then do the matching. fail meaningfully if the value isn't recognised.
Am converting the current test into an ITest, will update assertions according to the recommendations
This is something that can be done; do we want this as part of this PR, or a separate one adding the storage class to S3AFileStatus?
💔 -1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
Force-pushed 4a9f038 to f0144a7.
🎊 +1 overall
This message was automatically generated.
Force-pushed f0144a7 to 4e9b673.
🎊 +1 overall
This message was automatically generated.
looks good mostly, but the ITest needs work. will sync with you on that
@@ -581,6 +583,12 @@ public void initialize(URI name, Configuration originalConf)

    s3aInternals = createS3AInternals();

    s3ObjectStorageClassFilter = Optional.ofNullable(conf.get(READ_RESTORED_GLACIER_OBJECTS))
would prefer conf.get(READ_RESTORED_GLACIER_OBJECTS, READ_ALL), meaning READ_ALL is the default; then you can get rid of the orElse()
@ahmarsuhail but doing it the way it is does handle case differences.
I'd go for getTrimmed(READ_RESTORED_GLACIER_OBJECTS, ""); if empty string map to empty optional, otherwise .toupper and valueof. one thing to consider: meaningful failure if the value doesn't map.
I'd change Configuration to do that case mapping if it wasn't such a critical class
or we just go for "upper case is required" and use what you've proposed. more brittle but simpler?
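Putting the two options together, the parse being converged on might look like this sketch (trim, upper-case for case tolerance, and a meaningful failure for unrecognised values; not the final code):

String value = conf.getTrimmed(READ_RESTORED_GLACIER_OBJECTS, "READ_ALL");
try {
  s3ObjectStorageClassFilter =
      S3ObjectStorageClassFilter.valueOf(value.toUpperCase(Locale.ROOT));
} catch (IllegalArgumentException e) {
  // name the key and the offending value in the failure
  throw new IllegalArgumentException(
      "Invalid value of " + READ_RESTORED_GLACIER_OBJECTS + ": " + value, e);
}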
 * <pre>
 * {@link S3ObjectStorageClassFilter} will filter the S3 files based on the {@code fs.s3a.glacier.read.restored.objects} configuration set in {@link S3AFileSystem}
 * The config can have 3 values:
 * {@code READ_ALL}: This would conform to the current default behavior of not taking into account the storage classes retrieved from S3. This will be done to keep the current behavior for the customers and not changing the experience for them.
remove "This will be done to keep the current behavior for the customers and not changing the experience for them." and add something like "Retrieval of Glacier files will fail with xxx", whatever the error currently is.
@@ -411,4 +416,8 @@ public RequestFactory getRequestFactory() {
  public boolean isCSEEnabled() {
    return isCSEEnabled;
  }

  public S3ObjectStorageClassFilter s3ObjectsStorageClassFilter() {
nit: rename to getS3ObjectStorageClassFilter(), and add javadocs for the method
<description>
  The config can have 3 values:

  * READ_ALL: This would conform to the current default behavior of not taking into account the storage classes retrieved from S3. This will be done to keep the current behavior (i.e failing for an unrestored glacier class file) for the customers and not changing the experience for them.
similar to above, remove "This will be done to keep the current behavior for the customers and not changing the experience for them." and add something like "Retrieval of Glacier files will fail with xxx", whatever the error currently is.
}

@Override
protected Configuration createConfiguration() {
move method to the top
newConf.set(STORAGE_CLASS, STORAGE_CLASS_GLACIER); // Create Glacier objects
skipIfStorageClassTestsDisabled(newConf);
disableFilesystemCaching(newConf);
removeBaseAndBucketOverrides(newConf, STORAGE_CLASS, FAST_UPLOAD_BUFFER);
remove lines 103 and 104, don't think they're needed.
@Parameterized.Parameters(name = "fast-upload-buffer-{0}")
public static Collection<Object[]> params() {
  return Arrays.asList(new Object[][]{
You don't need to parameterize here as we already test this in ITestS3AStorageClass. Here we just want to focus on this glacier-specific behaviour.
FileSystem fs = contract.getTestFileSystem();
Path dir = methodPath();
fs.mkdirs(dir);
you don't need to test for this here
@Override
protected Configuration createConfiguration() {
  Configuration newConf = super.createConfiguration();
  newConf.set(STORAGE_CLASS, STORAGE_CLASS_GLACIER); // Create Glacier objects
we should test for Glacier and deep archive though, as that's in your StorageClassFilterMap
Force-pushed 4e9b673 to 8a70267.
💔 -1 overall
This message was automatically generated.
Force-pushed 8a70267 to d035d15.
🎊 +1 overall
This message was automatically generated.
  this.filter = filter;
}

private static boolean isNotGlacierObject(S3Object object) {
add javadocs all the way down here, thanks
Sure will add for methods in the class
@@ -25,6 +25,7 @@
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

import org.apache.hadoop.fs.s3a.S3ObjectStorageClassFilter;
can you move it down to the rest of the org.apache imports? these guava things are in the wrong block due to the big search and replace which created them
Sure
S3Client s3Client = getFileSystem().getS3AInternals().getAmazonS3Client("test");

// Create a restore object request
RestoreObjectRequest requestRestore = RestoreObjectRequest.builder()
prefer this was in the RequestFactory interface and builder, as it'll let us do things like add audit context and anything else in future
    getFilePrefixForListObjects(), "/");
S3Object s3GlacierObject = getS3GlacierObject(s3Client, s3ListRequest);

while ((s3GlacierObject != null && s3GlacierObject.restoreStatus().isRestoreInProgress()) && retryCount < MAX_RETRIES) {
LambdaTestUtils.await() is designed to handle this.
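For example, assuming the await(timeoutMillis, intervalMillis, probe) overload, the hand-rolled loop could become something like this (timeout and interval values are placeholders):

// Sketch: poll until the restore completes instead of counting retries.
LambdaTestUtils.await(120_000, 5_000, () -> {
  S3Object o = getS3GlacierObject(s3Client, s3ListRequest);
  return o == null || !o.restoreStatus().isRestoreInProgress();
});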
skipIfStorageClassTestsDisabled(newConf);
disableFilesystemCaching(newConf);
removeBaseAndBucketOverrides(newConf, STORAGE_CLASS);
newConf.set(REJECT_OUT_OF_SPAN_OPERATIONS, "false");
or better: create an audit span
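A sketch of that alternative, assuming the createSpan() call on S3AFileSystem and that spans are Closeable; the operation name and variables here are illustrative:

// Sketch: run the raw SDK call inside an audit span rather than
// disabling out-of-span rejection for the whole filesystem.
try (AuditSpan span = getFileSystem().createSpan("restore", path.toString(), null)) {
  s3Client.restoreObject(requestRestore);
}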
enum Type { GLACIER_AND_DEEP_ARCHIVE, GLACIER }

@Parameterized.Parameters
- look at other uses of this to see how we generate useful strings for logs
- be aware the pattern is used in the method path, so it mustn't create invalid paths. it's just that text is so much better than [0]
Sure will update
Regarding the above, I will update the config read to getTrimmed and set the default config value to READ_ALL instead of "".
The valueOf method will throw an IllegalArgumentException if an invalid value is set in the config.
Force-pushed d035d15 to e110e2a.
💔 -1 overall
This message was automatically generated.
Build failed due to an OOM error; build passing locally.
Force-pushed e110e2a to c5dcd5c.
🎊 +1 overall
This message was automatically generated.
really close now: just import placement (always a PITA) and some checkstyle line length complaints. Other than that: ready to merge!
@@ -18,6 +18,7 @@
package org.apache.hadoop.fs.s3a;

import org.apache.hadoop.fs.s3a.api.S3ObjectStorageClassFilter;
new org.apache imports MUST go into that block for new classes. set your IDE up for this and life gets simpler for all
@@ -52,6 +52,7 @@
import java.util.concurrent.atomic.AtomicBoolean;
import javax.annotation.Nullable;

import org.apache.hadoop.fs.s3a.api.S3ObjectStorageClassFilter;
same
@@ -37,7 +37,7 @@
import org.apache.hadoop.fs.s3a.S3AFileStatus;
import org.apache.hadoop.fs.s3a.S3AInputPolicy;
import org.apache.hadoop.fs.s3a.S3AStorageStatistics;
-import org.apache.hadoop.fs.s3a.S3ObjectStorageClassFilter;
+import org.apache.hadoop.fs.s3a.api.S3ObjectStorageClassFilter;
this has to move down to L42 now, as it's in a sub package.
import org.apache.hadoop.fs.s3a.S3AFileSystem;
now, this is a brand new class. in which case the L22 import can go into the org.apache block and the S3AFileSystem one with it. the only reason it is in the wrong place in existing code is that the move to repackaged classes was a big search and replace only: no re-ordering
@@ -89,7 +92,7 @@ private FileSystem createFiles(String s3ObjectStorageClassFilter) throws Throwable
  FileSystem fs = contract.getTestFileSystem();
  Path dir = methodPath();
  fs.mkdirs(dir);
skip this for a marginal increase in performance. create() assumes there's a dir and just creates the object in the destination path without checks.
@bpahuja have you tested this with any of: S3 Express, third-party stores, GCS? Just want to make sure things work and the tests are skipped. I'll inevitably do the test runs anyway, but it'd save me time and effort, and as I don't run those regularly enough, multiple commits may be the source of regressions.
Hello @steveloughran, wanted to confirm the requirement here. Is it that you want me to ensure that this test is skipped when run in other environments? If that's the case then IMO it should work as the other integration tests in this package do, as it is inheriting the same test base. Do let me know if something other than this is needed from me :) Thanks
🎊 +1 overall
This message was automatically generated.
Hello @steveloughran, just a gentle reminder to review the PR. Thanks 😄
🎊 +1 overall
This message was automatically generated.
Hello @steveloughran, just a gentle reminder to review the PR. Thanks 😄
I'm doing nothing but helping get Hadoop 3.4.0 out the door this week. No code of my own except related to packaging; no reviews of other people except related to the release at all. Sorry.
While you are waiting for a review on your code from myself or someone else, why not get involved? There is no release engineering team doing this and it is up to all of us developers in the community to get our hands dirty. We all have different deployment environments and we all have different things we want to test. And, given this is the first public release with the AWS v2 SDK, it matters a lot that this thing ships. It will be the way we actually find real-world bugs.
Look on the hadoop common-dev list for the announcement of the next release candidate: once the announcement is made we have five days to test and vote on the RC. That is why we are under so much time pressure here.
Note I've created a project, https://github.com/apache/hadoop-release-support, to assist in qualifying. One thing which would be good would be some extra scripts we could run to help validate storage operations: operations which we can then execute against cloud storage from either the local host or a remote one. I am qualifying the Arm64 build on a Raspberry Pi 5 under my television and would like to have that testing fully automated. Anything you can do here will be very much appreciated.
And like I said: I'm unlikely to be looking at any other code right now.
Hello @steveloughran, just a gentle reminder to review the PR :) Do take a look whenever you get some time. Thanks
I am catching up now, as warned giving priority to people who helped validate the RC. The core project is a community project, and everyone gets to participate.
import software.amazon.awssdk.services.s3.model.S3Object;

import org.apache.hadoop.fs.s3a.S3AFileSystem;
import org.apache.hadoop.thirdparty.com.google.common.collect.Sets;
could you use org.apache.hadoop.util.Sets here. it's part of our attempt to isolate ourselves better from guava changes and the pain that causes downstream
[Amazon S3 Glacier (S3 Glacier)](https://docs.aws.amazon.com/amazonglacier/latest/dev/introduction.html) is a secure and durable service for low-cost data archiving and long-term backup.
With S3 Glacier, you can store your data cost effectively for months, years, or even decades.
replace "you" with something like "it is possible to store data more cost-effectively"; this is the docs for the connector, not marketing...
}

private FileSystem createFiles(String s3ObjectStorageClassFilter) throws Throwable {
  Configuration conf = this.createConfiguration();
sorry, I meant "remove the this. prefix"; just use createConfiguration() directly
try (FileSystem fs = createFiles(S3ObjectStorageClassFilter.SKIP_ALL_GLACIER.name())) {
  Assertions.assertThat(
      fs.listStatus(methodPath()))
      .describedAs("FileStatus List of %s", methodPath()).isEmpty();
can you put the .isEmpty() on the line below; we like to try and split them up as they're very complex to read.
  });
}

private final int maxRetries = 100;
these are static/final, so make constants and upper case.
conf.set(READ_RESTORED_GLACIER_OBJECTS, s3ObjectStorageClassFilter);
// Create Glacier objects: Storage Class: DEEP_ARCHIVE/GLACIER
conf.set(STORAGE_CLASS, glacierClass);
S3AContract contract = (S3AContract) createContract(conf);
you don't need to do this. just do
fs = new S3AFileSystem()
fs.init(conf)
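(For reference, the FileSystem API spells that initialize(); a sketch of the suggestion, where the test.fs.s3a.name test-bucket key is an assumption:)

// Sketch of the suggestion above; initialize(uri, conf) is the actual
// FileSystem entry point, and "test.fs.s3a.name" (the configured test
// bucket) is an assumption here.
S3AFileSystem fs = new S3AFileSystem();
fs.initialize(URI.create(conf.getTrimmed("test.fs.s3a.name")), conf);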
Happy with production code apart from a couple of minor comments; been testing locally now.
Having just tested it, it took 2+ minutes to complete, even while skipping 6 tests. Even as part of a parallel run, that is going to hurt.
Would it be possible to abandon the elegance and structure of parameterised test suites and try and combine them? That is: create one glaciated object and then verify the different filesystem configurations do/don't find it. This will save a lot of set up and tear down overhead. I know it goes against "test one thing per method" and "parameterize", but it helps scale.
if you look at the AbstractSTestS3AHugeFiles tests you can see how, by running the test methods in name order, we can actually retain some of this isolation. however, it complicates life in other ways (we need to keep the test files out of the normal test methodpath, can't run a test on its own anyway, ...), so don't bother with this; just know that sometimes testing just gets complicated.
side issue: is this test going to run up significant charges?
@bpahuja can you address the review comments? Otherwise I'm going to forget about it. Get it done now and we can target 3.4.1
Sure @steveloughran, will prioritize this
Force-pushed cbb8580 to c1c72d9.
@steveloughran, I have pushed the changes. Thanks :)
💔 -1 overall
This message was automatically generated.
sorry, been busy and neglected this.
I've got some minor comments and then once yetus is happy we can merge.
Note that in #6789 I'm adding a case-insensitive way to resolve a list of enum values in a config; for now the work you've done is the only way to be case insensitive... pity.
now, I wonder what is going to break on stores without this feature? what happens on S3 Express stores?
I've proposed how to bind to the existing skipIfStorageClassTestsDisabled(conf) call, which will skip the test during setup.
LOG.warn("Invalid value for the config {} is set. Valid values are:" +
    "READ_ALL, SKIP_ALL_GLACIER, READ_RESTORED_GLACIER_OBJECTS. Defaulting to READ_ALL",
    READ_RESTORED_GLACIER_OBJECTS);
s3ObjectStorageClassFilter = S3ObjectStorageClassFilter.READ_ALL;
lets fall to the default. maybe pull the conf.getTrimmed() out of the try {} so its value can be printed too.
FWIW in #6789 I'm doing a getEnumSet() which is case independent too.
try {
  s3ObjectStorageClassFilter = Optional.of(conf.getTrimmed(READ_RESTORED_GLACIER_OBJECTS,
      DEFAULT_READ_RESTORED_GLACIER_OBJECTS))
      .map(String::toUpperCase)
(Locale.ROOT)
}

/**
 * Returns the filter function set as part of the enum definition
needs a trailing .
@@ -117,6 +117,8 @@ public class StoreContext implements ActiveThreadSpanSource<AuditSpan> {
  /** Is client side encryption enabled? */
  private final boolean isCSEEnabled;

  private final S3ObjectStorageClassFilter s3ObjectStorageClassFilter;
nit: javadocs
@@ -927,8 +927,34 @@ The switch to turn S3A auditing on or off.
    Should auditing of S3A requests be enabled?
  </description>
</property>
```
## <a name="glacier"></a> Glacier Object Support
can you add a newline. thanks
import java.util.Arrays;
import java.util.Collection;

import org.apache.hadoop.fs.s3a.S3AFileSystem;
move to the apache group
private FileSystem createFileSystem(String s3ObjectStorageClassFilter) throws Throwable {
  Configuration conf = createConfiguration();
  conf.set(READ_RESTORED_GLACIER_OBJECTS, s3ObjectStorageClassFilter);
insert the line
skipIfStorageClassTestsDisabled(conf);
and the matching import:
import static org.apache.hadoop.fs.s3a.S3ATestUtils.skipIfStorageClassTestsDisabled;
See ITestS3AStorageClass for an example.
Description of PR
Currently S3A does not distinguish Glacier and Glacier Deep Archive files, as it doesn't examine the storage class or verify whether an object is in the process of restoration from Glacier. Attempting to access an in-progress Glacier file via S3A results in an AmazonS3Exception indicating that the operation is invalid for the object's storage class.
As part of this change, users will be able to successfully read restored Glacier objects from the S3 location of the table using S3A. It will ignore any Glacier-archived files if they are in the process of being restored asynchronously. There is no change to the existing default behavior; additional configuration will be needed to enable the above-mentioned flow.
The config which controls the behavior of the S3AFileSystem with respect to Glacier storage classes is fs.s3a.glacier.read.restored.objects.
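Illustratively, it is set like any other S3A option; a minimal sketch, where the bucket is a placeholder and the value is one of the three options described below:

// Minimal sketch: enable glacier filtering on an S3A filesystem.
Configuration conf = new Configuration();
conf.set("fs.s3a.glacier.read.restored.objects", "SKIP_ALL_GLACIER");
FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
FileStatus[] listing = fs.listStatus(new Path("/data/")); // glacier objects filtered per the setting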
The config can have 3 values:

* READ_ALL: This would conform to the current default behavior of not taking into account the storage classes retrieved from S3. This keeps the current behavior for users without changing their experience.
* SKIP_ALL_GLACIER: If this value is set, any S3 objects tagged with Glacier storage classes are ignored, and the others retrieved.
* READ_RESTORED_GLACIER_OBJECTS: If this value is set, the restored status of the Glacier object is checked: if restored, the objects are read like normal S3 objects, else they are ignored, as the objects have not been retrieved from S3 Glacier. (To check the restored status, the newly introduced RestoreStatus attribute present in the S3Object is used.) This wasn't previously possible, as ListObjects did not return any information about the restore status of an object, only its storage class.

A new FileStatusAcceptor is created which uses the RestoreStatus attribute from the S3Object and filters out or includes the Glacier objects in the listing as defined by the config. FileStatusAcceptor is an interface with 3 overloaded predicates, which filter the files based on the conditions defined in those predicates. The new RestoreStatus attribute from the ListObjects response indicates whether the object is unrestored, restoring, or restored, and also when that restore expires.

How was this patch tested?
Integration Tests (hadoop-aws)
All the integration tests are passing; the tests were run in accordance with https://hadoop.apache.org/docs/current2/hadoop-aws/tools/hadoop-aws/testing.html and executed in the region us-east-1. There were 2 failures observed which seem intermittent and unrelated to the change introduced in this CR, as the default behavior of S3AFileSystem was not changed.
Failures observed:
Manual Testing
Manual testing of the change was done with Spark v3.5
A Parquet table was created using the following in Spark-SQL.
Was able to successfully retrieve the data using the following.
The storage class of the file s3://<bucket>/data/glacier_test/parquet_glacier_test/part-00000-f9cb400e-35b2-41f7-9c39-8e34cd830fed-c000.snappy.parquet was changed from Standard to Glacier Flexible Retrieval (formerly Glacier). When trying to retrieve the data again from the same table, the following exception was observed.
I restarted the spark-sql session by setting the following config.
Trying to access the table now resulted in the following.
The spark-sql session was restarted with the following config.
Trying to access the table now gave the same result as the previous step, as the unrestored Glacier file was ignored when the table was read.
The restore for the file s3://<bucket>/data/glacier_test/parquet_glacier_test/part-00000-f9cb400e-35b2-41f7-9c39-8e34cd830fed-c000.snappy.parquet was initiated using the S3 Console. Trying to access the table now gave the same result as the previous step, as the Glacier file was still being restored and was not available.
On retrying after 5-7 minutes (as it was an expedited retrieval), the following was the result, which is as expected:
For code changes:

* If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?