
HADOOP-19236. Integration of Volcano Engine TOS in Hadoop. #7194

Open · wants to merge 1 commit into base: trunk

Conversation

@wojiaodoubao (Contributor) commented Dec 1, 2024

Description of PR

Volcano Engine is a fast-growing cloud vendor launched by ByteDance, and TOS is its object storage service. A common pattern is to store data in TOS and run Hadoop/Spark/Flink applications against it. However, Hadoop has no native support for TOS, which makes it hard for users to build their big data systems on top of it.

This work integrates TOS with Hadoop so that users can run their applications on TOS. After some simple configuration, applications can read and write TOS without any code change. The work is analogous to the existing AWS S3, Azure Blob, Aliyun OSS, Tencent COS, and Huawei Cloud Object Storage connectors in Hadoop.

Please see the issue for more details. https://issues.apache.org/jira/browse/HADOOP-19236
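As an illustration of the "simple configuration" mentioned above, Hadoop filesystem connectors are typically wired up by mapping a URI scheme to the connector class in core-site.xml. The property names, class name, and scheme below are hypothetical placeholders for illustration, not taken from this patch:

```xml
<!-- Hypothetical core-site.xml fragment: maps the tos:// scheme to the
     connector and supplies credentials. Exact keys depend on the patch. -->
<configuration>
  <property>
    <name>fs.tos.impl</name>
    <value>org.apache.hadoop.fs.tosfs.TosFileSystem</value>
  </property>
  <property>
    <name>fs.tos.endpoint</name>
    <value>https://tos-cn-beijing.volces.com</value>
  </property>
  <property>
    <name>fs.tos.access-key-id</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.tos.secret-access-key</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```

With such a mapping in place, existing applications can address paths like tos://bucket/path without code changes, which is the pattern the other object store connectors follow.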

How was this patch tested?

The unit tests need to connect to the TOS service. Set the six environment variables below before running them.

export TOS_ACCESS_KEY_ID={YOUR_ACCESS_KEY}
export TOS_SECRET_ACCESS_KEY={YOUR_SECRET_ACCESS_KEY}
export TOS_ENDPOINT={TOS_SERVICE_ENDPOINT}
export FILE_STORAGE_ROOT=/tmp/local_dev/
export TOS_BUCKET={YOUR_BUCKET_NAME}
export TOS_UNIT_TEST_ENABLED=true

Then change to the Hadoop project root directory and run the test command below.

mvn -Dtest=org.apache.hadoop.fs.tosfs.** test -pl org.apache.hadoop:hadoop-tos-core

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ patch 0m 15s #7194 does not apply to trunk. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help.
Subsystem Report/Notes
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7194/1/console
versions git=2.34.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ patch 0m 15s #7194 does not apply to trunk. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help.
Subsystem Report/Notes
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7194/2/console
versions git=2.34.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

import static org.apache.hadoop.fs.XAttrSetFlag.CREATE;
import static org.apache.hadoop.fs.XAttrSetFlag.REPLACE;

public class RawFileSystem extends FileSystem {
A reviewer (Member) commented:

Do we still need the RawFileSystem, maybe we can just name it as the TosFileSystem directly, right ?

The author replied:

The hadoop-tos module is derived from Volcano Engine EMR's FileSystem connector project. We should keep the classes as close to that project as possible, so that new features in the commercial version can be easily ported to hadoop-tos.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.tosfs.object.Constants;

public class RawFileStatus extends FileStatus {
A reviewer (Member) commented:

Similar to the following comment, maybe we can use TosFileStatus directly?


package org.apache.hadoop.fs.tosfs.conf;

public class ArgumentKey {
A reviewer (Member) commented:

ArgumentKey ? It's a bit unclear.. ?

import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileStore implements ObjectStorage {
A reviewer (Member) commented:

Do we still need this FileStore (which was designed for testing the abstracted ObjectStorage ) ?

The author replied:

I'd recommend keeping the FileStore, so that all the unit tests can run independently of TOS. Currently the hadoop-tos module still depends on TOS to run most test cases, but my plan is to eventually run them against both FileStore and TOS, so we can test without TOS.
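The idea of testing against a local stand-in for the object store can be sketched as follows. The ObjectStorage interface shown here is a simplified hypothetical, not the actual interface from this patch; it only illustrates how an in-memory FileStore-style implementation lets unit tests exercise filesystem logic without network access to TOS:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the ObjectStorage abstraction
// discussed above; the real interface in the patch is richer.
interface ObjectStorage {
    void put(String key, byte[] data);
    byte[] get(String key);   // returns null when the key is absent
    void delete(String key);
}

// An in-memory "FileStore"-style implementation: tests can swap this in
// for the TOS-backed storage and run fully offline.
class InMemoryFileStore implements ObjectStorage {
    private final Map<String, byte[]> objects = new HashMap<>();

    @Override
    public void put(String key, byte[] data) {
        // Defensive copy so later mutation of the caller's array
        // cannot change the stored object.
        objects.put(key, data.clone());
    }

    @Override
    public byte[] get(String key) {
        byte[] data = objects.get(key);
        return data == null ? null : data.clone();
    }

    @Override
    public void delete(String key) {
        objects.remove(key);
    }
}
```

A test suite parameterized over both InMemoryFileStore and the real TOS-backed storage would give the "test without TOS" property the comment above describes.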

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ patch 0m 13s #7194 does not apply to trunk. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help.
Subsystem Report/Notes
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7194/3/console
versions git=2.34.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@wojiaodoubao (Contributor, author) commented:

The 'apply patch to trunk' error is caused by downloading a bad URL. The CI should download 'https://github.com/apache/hadoop/pull/7194.patch' as input.patch, but what it actually downloads is '#7194', i.e. the HTML content of this page. I ran test-patch manually with the URL 'https://github.com/apache/hadoop/pull/7194.patch', and it works fine.

I'm still trying to figure out the cause. Does anybody know the reason? Thanks for any clues.

@wojiaodoubao (Contributor, author) commented:

I found the cause: this patch is too large (24,741 lines), exceeding the GitHub API's maximum of 20,000 lines. Since it's convenient for reviewers to have an overview of the whole module, I'll keep this PR open for review.

…ntation.

Contributed by: ZhengHu, SunXin, XianyinXin, Rascal Wu, FangBo, Yuanzhihuan.
3 participants