Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure ufs to only once load a native shared library #14058

Open
wants to merge 4 commits into
base: master-2.x
Choose a base branch
from

Conversation

jhonxue
Copy link
Contributor

@jhonxue jhonxue commented Sep 13, 2021

Fixes #14142

What changes are proposed in this pull request?

Make sure ufs to only once load a native shared library(.so) at runtime, through making sure the same classloader for the same ufs

Why are the changes needed?

Otherwise cephfs mount will failed cuz loading same native library for multiple times

Does this PR introduce any user facing changes?

N/A

Signed-off-by: Xue Yantao [email protected]

@codecov-commenter
Copy link

codecov-commenter commented Sep 13, 2021

Codecov Report

Merging #14058 (37139ca) into master (4d83358) will increase coverage by 0.05%.
The diff coverage is 85.71%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #14058      +/-   ##
============================================
+ Coverage     43.29%   43.34%   +0.05%     
- Complexity     9171     9230      +59     
============================================
  Files          1417     1425       +8     
  Lines         82084    82518     +434     
  Branches       9938     9984      +46     
============================================
+ Hits          35538    35771     +233     
- Misses        43584    43797     +213     
+ Partials       2962     2950      -12     
Impacted Files Coverage Δ
...a/alluxio/extensions/ExtensionFactoryRegistry.java 62.22% <85.71%> (+1.24%) ⬆️
...lluxio/client/metrics/MetricsHeartbeatContext.java 77.21% <0.00%> (-3.80%) ⬇️
...on/src/main/java/alluxio/collections/LockPool.java 79.78% <0.00%> (-3.20%) ⬇️
...o/worker/grpc/BlockWorkerClientServiceHandler.java 23.61% <0.00%> (-1.39%) ⬇️
.../java/alluxio/worker/block/DefaultBlockWorker.java 67.59% <0.00%> (-1.11%) ⬇️
.../java/alluxio/master/job/plan/PlanCoordinator.java 73.22% <0.00%> (-0.59%) ⬇️
...uxio/client/file/cache/LocalCacheFileInStream.java 75.32% <0.00%> (-0.16%) ⬇️
...in/java/alluxio/proxy/s3/S3RestServiceHandler.java 3.08% <0.00%> (-0.02%) ⬇️
.../main/java/alluxio/cli/fs/command/LoadCommand.java 0.00% <0.00%> (ø)
...n/src/main/java/alluxio/util/EnvironmentUtils.java 0.00% <0.00%> (ø)
... and 54 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d83358...37139ca. Read the comment docs.

Copy link
Contributor

@apc999 apc999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments left.

* @param factories list of factories
* @return list of factories that support the given path which may be an empty list
*/
public List<T> select(String path, S conf, List<T> factories) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this method to be public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this method to be public?

may be not. private is better

* @return list of factories that support the given path which may be an empty list
*/
public List<T> select(String path, S conf, List<T> factories) {
Preconditions.checkArgument(path != null, "path may not be null");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alluxio code convention:

Suggested change
Preconditions.checkArgument(path != null, "path may not be null");
Preconditions.checkNotNull(path, "path");

return eligibleFactories;
}

factories.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand why we can prevent loading native shared lib multiple times by calling clear here.
can you elaborate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand why we can prevent loading native shared lib multiple times by calling clear here.
can you elaborate?

At this line, we did not find any supported factories from the list of factories. So we clear it to avoid to twicely check them wether support or not.

@jhonxue jhonxue requested a review from apc999 September 13, 2021 08:35
@maobaolong
Copy link
Contributor

#10928 Not sure it is same root cause with this issue.

@jhonxue jhonxue closed this Sep 18, 2021
@jhonxue jhonxue reopened this Sep 18, 2021
@jhonxue jhonxue closed this Sep 18, 2021
@jhonxue jhonxue reopened this Sep 18, 2021
@jhonxue jhonxue closed this Sep 18, 2021
@jhonxue jhonxue reopened this Sep 18, 2021
@jhonxue jhonxue closed this Sep 18, 2021
@jhonxue jhonxue reopened this Sep 18, 2021
@jhonxue jhonxue closed this Sep 18, 2021
@jhonxue jhonxue reopened this Sep 18, 2021
@jhonxue jhonxue closed this Sep 18, 2021
@jhonxue jhonxue reopened this Sep 18, 2021
@jhonxue jhonxue closed this Sep 18, 2021
@jhonxue jhonxue reopened this Sep 18, 2021
Copy link
Contributor

@maobaolong maobaolong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this improvement, lt looks like solve our problem, left some question inline, just make sure there are no downside for this PR

@@ -131,19 +131,18 @@ private synchronized void init() {
Preconditions.checkArgument(path != null, "path may not be null");

List<T> factories = new ArrayList<>(mFactories);
List<T> eligibleFactories = select(path, conf, factories);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the first time findAll invoked, the mFactories should be empty, and this method will run like original logic, later new findAll will not load the extension or lib jar at all? So please show the effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the first time findAll invoked, the mFactories should be empty, and this method will run like original logic, later new findAll will not load the extension or lib jar at all? So please show the effect.

Yes, the new 'findAll' will not dynamicly load the ext or lib jar after the first load done. However, we can restart the alluxio master or worker to load the new updated jar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please highlight add this important limitation to the related issue or the description of this PR. It would be better if you write an approach description shortly to let reviewer know your solution and make it easy to review this PR

@@ -31,7 +31,6 @@
<groupId>io.github.opendataio</groupId>
<artifactId>libcephfs</artifactId>
<version>0.0.1</version>
<scope>provided</scope>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose you remove the provided scope, you want to include the libcephfs and related share lib into the ufs shaded with dependencies jar?

Copy link
Contributor Author

@jhonxue jhonxue Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose you remove the provided scope, you want to include the libcephfs and related share lib into the ufs shaded with dependencies jar?

Yes, this pr will solve the problem about the libcephfs_jni.so is already loaded in another classloader, so libcephfs can be included.

@jhonxue
Copy link
Contributor Author

jhonxue commented Sep 22, 2021

The native shared library was loaded in another classloader:

Below is the master log in my test:

2021-09-18 15:49:21,444 WARN UnderFileSystem$Factory - Failed to create UnderFileSystem by factory alluxio.underfs.cephfs.CephFSUnderFileSystemFactory@5e2694c1: java.lang.UnsatisfiedLinkError: Native Library /usr/lib64/libcephfs_jni.so.1.0.0 already loaded in another classloader
2021-09-18 15:49:21,444 ERROR FileSystemMasterClientServiceHandler - Exit (Error): Mount: request=alluxioPath: "/fs2"

@jhonxue
Copy link
Contributor Author

jhonxue commented Sep 22, 2021

After used this pr: the native shared lib problem is solved

We test underfs-cephfs and underfs-cephfs-hadoop
[root@s131 /data/alluxio-2.7.0-SNAPSHOT]# bin/alluxio fs mount
cephfs://xx.xx.xx.A:6789/ on /fs2 (cephfs, capacity=711.02TB, used=86.41GB(0%), not read-only, not shared})
cephfs://xx.xx.xx.B:6789/ on /fs3 (cephfs, capacity=24.59TB, used=58.50GB(0%), not read-only, not shared})
ceph://xx.xx.xx.A:6789/ on /fs1 (hdfs, capacity=-1B, used=-1B, read-only, not shared)
ceph://xx.xx.xx.B:6789/ on /fs4 (hdfs, capacity=-1B, used=-1B, read-only, not shared})
/data/alluxio-2.7.0-SNAPSHOT/underFSStorage on / (local, capacity=31.90GB, used=29.72GB(93%), not read-only, not shared, properties={})

@jhonxue jhonxue requested a review from maobaolong September 22, 2021 03:25
@jhonxue
Copy link
Contributor Author

jhonxue commented Sep 22, 2021

@apc999 PTAL

@LuQQiu LuQQiu self-requested a review September 23, 2021 02:47
@LuQQiu
Copy link
Contributor

LuQQiu commented Sep 23, 2021

Mainly saw in Ceph, C++ library

Copy link
Contributor

@LuQQiu LuQQiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making class loader changes, left some comments

@@ -201,6 +200,7 @@ private void scan(List<File> files, List<T> factories) {
LOG.debug("Discovered a factory implementation {} - {} in jar {}", factory.getClass(),
factory, jarPath);
register(factory, factories);
register(factory);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  /**
   * The base list of factories, which does not include any lib or extension factories. The only
   * factories in the base list will be built-in factories, and any additional factories
   * registered by tests. All other factories will be discovered and service loaded when extension
   * creation occurs.
   */
  private final List<T> mFactories = new CopyOnWriteArrayList<>();

we modify the basic assumption of what mFactories represents

will it be better to include a new list call mDynamicLoadedFactories?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. If mFactories contains -> return
  2. If mDynamicLoadedFactories contains -> return
  3. If not, renew the mDynamicLoadedFactories?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apc999 What's your thoughts

Copy link
Contributor Author

@jhonxue jhonxue Sep 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. If mFactories contains -> return
  2. If mDynamicLoadedFactories contains -> return
  3. If not, renew the mDynamicLoadedFactories?

Yea, i think it sounds better! By the way, may be that we can add a method "boolean supportsDynamicLoad()" in the public interface ExtensionFactory to distinguish the underFileSystemFactories, the default return is true, likes the existed method "boolean supportsPath(String path, S conf)",however, the specific underFileSystemFactories which do not support dynamic load can override the method, so we can still only use mFactories.

What's your thoughts? @apc999 @LuQQiu @maobaolong

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer this way, but is it complicated to add the logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer this way, but is it complicated to add the logic?
Based on your idea, the steps are :
1. If mFactories contains && !supportsDynamicLoad() -> return
2. If mDynamicLoadedFactories contains && !supportsDynamicLoad() -> return
3. If not, renew the mDynamicLoadedFactories?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workload SGTM, @apc999 @ggezer @yuzhu PTAL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiacheliu3 I think we need to make an agreement on how to address this issue. This part is kind of critical and nobody really familiar with it. More eyes need to be pulled in to think of how to load the extension factories

@jhonxue jhonxue requested a review from LuQQiu September 24, 2021 03:01
@jiacheliu3
Copy link
Contributor

@apc999 @LuQQiu Does the change look good to you?

@jiacheliu3
Copy link
Contributor

/ping @apc999 @LuQQiu

Copy link
Contributor

@maobaolong maobaolong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we patching this PR and it can works in our Env, It LGTM.

@jiacheliu3
Copy link
Contributor

/ping @LuQQiu

@LuQQiu
Copy link
Contributor

LuQQiu commented Jan 9, 2023

Sorry for the late review. @jhonxue can you help merge the latest master into the branch to solve the conflicts?

@jhonxue
Copy link
Contributor Author

jhonxue commented Jan 10, 2023

Sorry for the late review. @jhonxue can you help merge the latest master into the branch to solve the conflicts?

ok, my pleasure. Also done the conflicts

@@ -159,6 +165,7 @@ public List<T> findAllWithRecorder(String path, S conf, Recorder recorder) {
recorder.record("alluxio.underfs.version is not set by user");
}

eligibleFactories = select(path, conf, factories);
for (T factory : factories) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is select duplicate with the following for loop

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The PR/Issue does not have recent activities and will be closed automatically label Feb 10, 2023
@jiacheliu3
Copy link
Contributor

@LuQQiu @jhonxue Is this PR still valid? thanks!

@github-actions github-actions bot removed the stale The PR/Issue does not have recent activities and will be closed automatically label Apr 7, 2023
@github-actions
Copy link

github-actions bot commented Jun 5, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The PR/Issue does not have recent activities and will be closed automatically label Jun 5, 2023
@github-actions github-actions bot removed the stale The PR/Issue does not have recent activities and will be closed automatically label Jun 25, 2024
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The PR/Issue does not have recent activities and will be closed automatically label Jul 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
POM Change stale The PR/Issue does not have recent activities and will be closed automatically
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make sure ufs to only once load a native shared library
8 participants