Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix create worker process timeout #13772

Open
wants to merge 5,193 commits into
base: master-2.x
Choose a base branch
from

Conversation

shenghui361
Copy link

What changes are proposed in this pull request?

Please outline the changes and how this PR fixes the issue.

Why are the changes needed?

Please clarify why the changes are needed. For instance,

  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, describe the bug.

Does this PR introduce any user facing changes?

Please list the user-facing changes introduced by your change, including

  1. change in user-facing APIs
  2. addition or removal of property keys
  3. webui

liupan664021 and others added 30 commits April 26, 2021 20:41
`./dev/scripts/generate-tarballs` is really convinient for alluxio
developers to generate custom alluxio tarballs. But as a custom version,
not all underfs and tool jars under ./lib are necessary to be added to
the tarballs. Maybe a custom `libjars` flag is helpful to slim the
tarball by size when generating custom tarballs.

pr-link: Alluxio#13292
change-id: cid-05e32b43f576e9198ac16b1eb893672c21bacf88
pr-link: Alluxio#13301
change-id: cid-2b370bd10ffb70a8fccecf7698bdebc3c6a6ff4b
MigrateConfig has an overwrite flag that didn't do anything.

This has the side effect where distributedCp and distributedMv are now
going to overwrite and the retries are going to work in more cases (when
destination file is partially written).

pr-link: Alluxio#13262
change-id: cid-343fd1cb344ba26a5c2e44554df971e4298ea7c6
pr-link: Alluxio#13304
change-id: cid-ab69b10959432889cc70b4dd94d3d38fedd6da6d
Before this PR:
I want to run Alluxio in my IDE.But it has such error

![image](https://user-images.githubusercontent.com/31469905/116167803-a5194700-a733-11eb-85f6-7c6f8fa71831.png)
and master.log

`org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor
aroundWriteTo
MessageBodyWriter not found for media type=application/json, type=class
alluxio.wire.MasterWebUIInit, genericType=class
alluxio.wire.MasterWebUIInit`

After this PR:
(It has no effect on packaging Alluxio cause it is `provided`)
master:

![image](https://user-images.githubusercontent.com/31469905/116167889-d85bd600-a733-11eb-87fa-b247645adf88.png)
worker:

![image](https://user-images.githubusercontent.com/31469905/116167902-df82e400-a733-11eb-8c3d-66b6dcfffbc2.png)

pr-link: Alluxio#13299
change-id: cid-02422891708872733cdac586fc31fe25150fb25f
pr-link: Alluxio#13305
change-id: cid-a29e313701fe62976d2e996b6f4c0793a85eaf39
When enabling SSE, PUT/GET requests to S3 must use HTTPS or the response
will be a 403. Instead of having the user specify the need for HTTPS
manually, we should enable it by default whenever SSE is specified.

pr-link: Alluxio#13306
change-id: cid-cbdb8085d13da4cab83d054f5ff7d917b37aff6e
pr-link: Alluxio#13308
change-id: cid-3e30b8b8066e70b125e742b7cdee2b715f37e33e
`ExpectedException.none()` is deprecated in JUnit 4.13. It's recommended
to change it to `Assert.assertThrows`.

```
package org.junit.rules;
public class ExpectedException implements TestRule {
/**
* Returns a {@linkplain TestRule rule} that expects no exception to
* be thrown (identical to behavior without this rule).
*
* @deprecated Since 4.13
* {@link org.junit.Assert#assertThrows(Class,
org.junit.function.ThrowingRunnable)
* Assert.assertThrows} can be used to verify that your code throws a
specific
* exception.
*/
@deprecated
public static ExpectedException none() {
return new ExpectedException();
}
```

Considering the number of tests to update, I will do this module by
module. This PR takes care of the client package.

pr-link: Alluxio#12840
change-id: cid-78078c85087ffc82a885443b29634df54092c82f
pr-link: Alluxio#13302
change-id: cid-6ae2788ae89c93beac76514f85ace6324ccb3d86
Remove ceph module from underfs pom so that it will be skipped for maven
build. Temporary workaround for Alluxio#13309

pr-link: Alluxio#13310
change-id: cid-1d3a2deb581811ff8643863c8fc8c25e7abdc430
needed to have the publish tarball build pass after
Alluxio#13310

pr-link: Alluxio#13312
change-id: cid-597939de25d2c9ca16605d2c35b106582875005c
This PR is being made to address git [issue
Alluxio#13203](Alluxio#13203).

I tested that the Helm chart rendered using the following in my
`config.yaml`
```
tolerations: [ {"key": "test", "operator": "Equal", "value": "test",
"effect": "NoSchedule"} ]
```

and that the tolerations worked as intended on k8s nodes I tainted using
`kubectl taint nodes <node> test=test:NoSchedule`

**Note:** I tested the Helm chart using all default values; so no FUSE
nor logging-server deployments.

pr-link: Alluxio#13214
change-id: cid-81d850f3b53308a78b5bb207531b294ae09d8cdb
pr-link: Alluxio#13313
change-id: cid-2fb13b3e3bcbd4d52c65b95ddc9d587424793049
pr-link: Alluxio#13307
change-id: cid-8733a2486624645b792c74fbdc56af19258cd1bf
Fix Alluxio#10385

pr-link: Alluxio#13293
change-id: cid-3d846422f0541f8e6eaea1ca5a48063239e2155c
This change will allow for the Helm chart to specify
`serviceAccountName` to the Alluxio pods.

pr-link: Alluxio#13297
change-id: cid-44b49766eb0c3ed2fa2a44784cfb6ff08e42b8f0
Alluxio#13310 workaround for Alluxio#13309 doesn't work.
Remove the whole module validated to be able to solve the issue.

pr-link: Alluxio#13320
change-id: cid-40b3f2958b5349a089ab164d822b9c4a8a1b60e1
<img width="1124" alt="Screen Shot 2021-04-29 at 4 56 00 PM"
src="https://user-images.githubusercontent.com/14806853/116526083-cb91da80-a90b-11eb-8fc7-26f5343c5dfd.png">

pr-link: Alluxio#13316
change-id: cid-22ada119782755973ecf37aada8eb4eb54e5068f
This property is only used in tests. It also should not be set
explicitly by the users. It's confusing to display the property on the
doc page.
<img width="905" alt="Screen Shot 2021-04-29 at 4 53 54 PM"
src="https://user-images.githubusercontent.com/14806853/116525859-8a012f80-a90b-11eb-8cad-74f52a729686.png">

pr-link: Alluxio#13315
change-id: cid-894a1a818bb1ad835ba39d5c0ec4daadc1bc9893
Remove all the cephfs and ceph-hadoop related stuff from master branch.
as a workaround for Alluxio#13309

pr-link: Alluxio#13327
change-id: cid-e6dc92b5586412fccc4e8ea930b6ead2cda56e5c
Fix issue problem in Alluxio#13323
Need to change statefulset `podManagementPolicy` from default to
`Parallel`, so that multi-master node can start at same time.

pr-link: Alluxio#13325
change-id: cid-dc0bb964873214d4b1fdfdb680c1912e496f6d9c
Improve efficiency of `distributedLoad` in two fold:
1. calling `BlockInStream.create` to obtain the InputStream without RPC,
where the original `blockStore.getOutStream(blockId,...)` makes one RPC
to master to get blockInfo.
2. removing using `BlockOutStream` and only reads the block data to
alluxio job worker without writing it back to the block store on worker.
This will further save at least two RPCs and the overhead to transfer
data back to worker.

Note that, the original code is not removed to handle the case Pinned
Media, because currently reading a block does not support promoting this
block to a medium but only a tier. After discussed with @calvinjia , we
will leave this once we consolidate medium and tier on worker

pr-link: Alluxio#13146
change-id: cid-e4e50c5a23d8beda7b398f642a4263942780447e
Fix Alluxio#13287

pr-link: Alluxio#13288
change-id: cid-e58d20f50c0c9ebb5ba6b6d0cf3ea9016014cdb3
pr-link: Alluxio#13286
change-id: cid-26158f42c7f924e4c32d57343f8d9f02f389aef9
Add a test to verify that symlinks work when the underlying ufs supports
it

pr-link: Alluxio#13318
change-id: cid-97d57c402d8cc537fd8fe7c70e01af56fd2c3f7c
Trying to `./bin/alluxio free` an empty (0-bytes) files from Alluxio
results in the corresponding Alluxio job workers to repeatedly attempt
to free the file until reaching the job timeout.

This is because the implementation for `free` in
[FreeCommand](https://github.com/Alluxio/alluxio/blob/master/shell/src/main/java/alluxio/cli/fs/command/FreeCommand.java#L81)
uses `getInAlluxioPercentage() == 0` as the stopping condition, but
empty (length 0) files in Alluxio are [hardcoded as having 100 in
Alluxio
percentage](https://github.com/Alluxio/alluxio/blob/40af661d15104813f74eeda3790b7a366f7cb28f/core/server/master/src/main/java/alluxio/master/file/DefaultFileSystemMaster.java#L2114).

pr-link: Alluxio#13329
change-id: cid-267db50986cfab2156e31aec47a3cd6fee079bd5
The previous version slowed down the sync because the number of prefetch
threads were much smaller than the number of rpc threads that were
performing a sync in many cases.

pr-link: Alluxio#13334
change-id: cid-224a19d976bea2a55c6674d670861dad8d3a4c1f
RocksDB engineer claims that it is possible for the readOptions to get
garbage collected during iteration because RocksIterator doesn't hold a
reference to the object but instead a native resource owned by
readOptions.

pr-link: Alluxio#13340
change-id: cid-f2dc7c29f22e0851fe1d0b4b96fe5aaaa22ba806
Add unit test for client side FIFO Evictor.

pr-link: Alluxio#13339
change-id: cid-1b039bec331d002a3db73df943e6f741892a4231
### What changes are proposed in this pull request?

Update Readme

### Why are the changes needed?

Various minor tunings on readme file

### Does this PR introduce any user facing changes?

None

pr-link: Alluxio#13783
change-id: cid-f52f0d6d88e96507dc8252c70cd6f9d03d3c62f7
maobaolong and others added 3 commits July 14, 2021 11:20
Fix Alluxio#13726

### What changes are proposed in this pull request?

Export JVM related metrics

### Why are the changes needed?

Add ClassLoadingGaugeSet and CachedThreadStatesGaugeSet to Alluxio
MetricRegistry.

### Does this PR introduce any user facing changes?

Add more metrics.

pr-link: Alluxio#13762
change-id: cid-0bdad6a119292c20dc5378cfa8f2a104878cb46e
### What changes are proposed in this pull request?

Switching GRPC and WEB start sequence.

### Why are the changes needed?

In my tests I found that grpc starts quickly (about 1s) but the web
service takes about 15s. Switching the order can make GRPC accept the
request earlier but the impact on the web service is minimal when from
follower to leader,

### Does this PR introduce any user facing changes?

no user facing changes.

pr-link: Alluxio#13781
change-id: cid-2a896d3e9226149219ef8c0ece7ec28731cefe6c
@codecov-commenter
Copy link

codecov-commenter commented Aug 7, 2021

Codecov Report

Merging #13772 (28a6c42) into master (b769c6e) will decrease coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #13772      +/-   ##
============================================
- Coverage     43.59%   43.53%   -0.06%     
+ Complexity     9071     9062       -9     
============================================
  Files          1385     1385              
  Lines         80483    80489       +6     
  Branches       9801     9800       -1     
============================================
- Hits          35083    35043      -40     
- Misses        42446    42498      +52     
+ Partials       2954     2948       -6     
Impacted Files Coverage Δ
...common/src/main/java/alluxio/conf/PropertyKey.java 99.44% <100.00%> (+<0.01%) ⬆️
...main/java/alluxio/worker/AlluxioWorkerProcess.java 68.50% <100.00%> (ø)
...a/alluxio/exception/status/CancelledException.java 0.00% <0.00%> (-42.86%) ⬇️
...rc/main/java/alluxio/worker/block/PinListSync.java 76.92% <0.00%> (-23.08%) ⬇️
...e/common/src/main/java/alluxio/AbstractClient.java 59.35% <0.00%> (-12.26%) ⬇️
...luxio/worker/block/meta/StorageDirEvictorView.java 69.69% <0.00%> (-12.13%) ⬇️
.../src/main/java/alluxio/retry/TimeBoundedRetry.java 84.61% <0.00%> (-11.54%) ⬇️
.../worker/block/management/tier/SwapRestoreTask.java 62.83% <0.00%> (-10.62%) ⬇️
...ain/java/alluxio/worker/block/BlockMasterSync.java 60.65% <0.00%> (-4.92%) ⬇️
...src/main/java/alluxio/grpc/GrpcConnectionPool.java 84.02% <0.00%> (-4.87%) ⬇️
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b769c6e...28a6c42. Read the comment docs.

Copy link
Contributor

@apc999 apc999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments left.

@jiacheliu3 it seems having a property key here is necessary.
@ggezer as we discussed before, let's allow more keys and I will have separate issues to propose having internal / hidden property keys from public keys

@jiacheliu3
Copy link
Contributor

@apc999
when the worker contains more than 5 million blocks, the startup process fails due to timeout(for details, please refer to #13771),and the default timeout period is fixed,so i had to modify the code。

Fixed #13771

Please add the reason to the description section so people can know what this is about, after we merge this PR.

Copy link
Contributor

@jiacheliu3 jiacheliu3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvements! Added a few comments.

@ggezer
Copy link
Contributor

ggezer commented Aug 23, 2021

@shenghui361 can you please fix your github account email so that our checks are cleared?

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The PR/Issue does not have recent activities and will be closed automatically label Jan 27, 2023
@github-actions github-actions bot removed the stale The PR/Issue does not have recent activities and will be closed automatically label May 16, 2023
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The PR/Issue does not have recent activities and will be closed automatically label Jun 15, 2023
@github-actions github-actions bot removed the stale The PR/Issue does not have recent activities and will be closed automatically label Jun 25, 2024
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The PR/Issue does not have recent activities and will be closed automatically label Jul 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
POM Change stale The PR/Issue does not have recent activities and will be closed automatically
Projects
None yet
Development

Successfully merging this pull request may close these issues.