
HADOOP-19236. Integration of Volcano Engine TOS in Hadoop. #7194

Open · wants to merge 1 commit into base: trunk

Conversation

@wojiaodoubao (Contributor) commented Dec 1, 2024

Description of PR

Volcano Engine is a fast-growing cloud vendor launched by ByteDance, and TOS is its object storage service. A common pattern is to store data in TOS and run Hadoop/Spark/Flink applications against it. However, Hadoop has no native support for TOS, which makes it hard for users to build their big data systems on top of it.

This work integrates TOS with Hadoop so that users can run their applications on TOS. After some simple configuration, applications can read and write TOS without any code change. The work is analogous to the existing AWS S3, Azure Blob, Aliyun OSS, Tencent COS, and Huawei Cloud Object Storage connectors in Hadoop.

Please see the issue for more details. https://issues.apache.org/jira/browse/HADOOP-19236
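As an illustration of the "simple configuration" mentioned above, Hadoop filesystem connectors are typically wired up by mapping a URI scheme to the connector class in core-site.xml. The property names, class name, and scheme below are hypothetical placeholders for illustration, not taken from this patch:

```xml
<!-- Hypothetical core-site.xml fragment: maps the tos:// scheme to the
     connector and supplies credentials. Exact keys depend on the patch. -->
<configuration>
  <property>
    <name>fs.tos.impl</name>
    <value>org.apache.hadoop.fs.tosfs.TosFileSystem</value>
  </property>
  <property>
    <name>fs.tos.endpoint</name>
    <value>https://tos-cn-beijing.volces.com</value>
  </property>
  <property>
    <name>fs.tos.access-key-id</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.tos.secret-access-key</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```

With such a mapping in place, existing applications can address paths like tos://bucket/path without code changes, which is the pattern the other object store connectors follow.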

How was this patch tested?

The unit tests need to connect to the TOS service. Set the six environment variables below before running them.

export TOS_ACCESS_KEY_ID={YOUR_ACCESS_KEY}
export TOS_SECRET_ACCESS_KEY={YOUR_SECRET_ACCESS_KEY}
export TOS_ENDPOINT={TOS_SERVICE_ENDPOINT}
export FILE_STORAGE_ROOT=/tmp/local_dev/
export TOS_BUCKET={YOUR_BUCKET_NAME}
export TOS_UNIT_TEST_ENABLED=true

Then change to the Hadoop project root directory and run the test command below.

mvn -Dtest=org.apache.hadoop.fs.tosfs.** test -pl org.apache.hadoop:hadoop-tos-core

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ patch 0m 15s #7194 does not apply to trunk. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help.
Subsystem Report/Notes
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7194/1/console
versions git=2.34.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ patch 0m 15s #7194 does not apply to trunk. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help.
Subsystem Report/Notes
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7194/2/console
versions git=2.34.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

import static org.apache.hadoop.fs.XAttrSetFlag.CREATE;
import static org.apache.hadoop.fs.XAttrSetFlag.REPLACE;

public class RawFileSystem extends FileSystem {
A reviewer (Member) commented:

Do we still need the RawFileSystem, maybe we can just name it as the TosFileSystem directly, right ?

The author replied:

The hadoop-tos module is derived from Volcano Engine EMR's FileSystem connector project. We should keep the classes as close to that project as possible, so that new features in the commercial version can be easily ported to hadoop-tos.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.tosfs.object.Constants;

public class RawFileStatus extends FileStatus {
A reviewer (Member) commented:

Similar to the following comment, maybe we can use TosFileStatus directly?


package org.apache.hadoop.fs.tosfs.conf;

public class ArgumentKey {
A reviewer (Member) commented:

ArgumentKey ? It's a bit unclear.. ?

import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileStore implements ObjectStorage {
A reviewer (Member) commented:

Do we still need this FileStore (which was designed for testing the abstracted ObjectStorage ) ?

The author replied:

I'd recommend keeping the FileStore, so that all the unit tests can run independently of TOS. Currently the hadoop-tos module still depends on TOS to run most test cases, but my plan is to eventually run them against both FileStore and TOS, so we can test without TOS.
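The idea of testing against a local stand-in for the object store can be sketched as follows. The ObjectStorage interface shown here is a simplified hypothetical, not the actual interface from this patch; it only illustrates how an in-memory FileStore-style implementation lets unit tests exercise filesystem logic without network access to TOS:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the ObjectStorage abstraction
// discussed above; the real interface in the patch is richer.
interface ObjectStorage {
    void put(String key, byte[] data);
    byte[] get(String key);   // returns null when the key is absent
    void delete(String key);
}

// An in-memory "FileStore"-style implementation: tests can swap this in
// for the TOS-backed storage and run fully offline.
class InMemoryFileStore implements ObjectStorage {
    private final Map<String, byte[]> objects = new HashMap<>();

    @Override
    public void put(String key, byte[] data) {
        // Defensive copy so later mutation of the caller's array
        // cannot change the stored object.
        objects.put(key, data.clone());
    }

    @Override
    public byte[] get(String key) {
        byte[] data = objects.get(key);
        return data == null ? null : data.clone();
    }

    @Override
    public void delete(String key) {
        objects.remove(key);
    }
}
```

A test suite parameterized over both InMemoryFileStore and the real TOS-backed storage would give the "test without TOS" property the comment above describes.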

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ patch 0m 13s #7194 does not apply to trunk. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help.
Subsystem Report/Notes
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7194/3/console
versions git=2.34.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@wojiaodoubao (Contributor, author) commented:

The 'apply patch to trunk' error is caused by downloading a bad URL. The CI should download 'https://github.com/apache/hadoop/pull/7194.patch' as input.patch, but what it actually downloads is '#7194', i.e. the HTML content of this page. I ran test-patch manually with the URL 'https://github.com/apache/hadoop/pull/7194.patch', and it works fine.

I'm still trying to figure out the cause. Does anybody know the reason? Thanks for any clues.

@wojiaodoubao (Contributor, author) commented:

I found the cause: this patch is too large (24,741 lines), exceeding the GitHub API's maximum of 20,000 lines. Since it's convenient for reviewers to have an overview of the whole module, I'll keep this PR open for review.

…ntation.

Contributed by: ZhengHu, SunXin, XianyinXin, Rascal Wu, FangBo, Yuanzhihuan.
3 participants