diff --git a/README.md b/README.md
index 86edc6b2e02..eff84922217 100644
--- a/README.md
+++ b/README.md
@@ -35,6 +35,7 @@ Apache Gravitino is a high-performance, geo-distributed, and federated metadata
 ![Gravitino Architecture](docs/assets/gravitino-architecture.png)
 
 Gravitino aims to provide several key features:
+
 * Unified Metadata Management: Gravitino provides a unified model and API to manage different types of metadata, including relational (e.g., Hive, MySQL) and file-based (e.g., HDFS, S3) metadata sources.
 * End-to-End Data Governance: Gravitino offers a unified governance layer for managing metadata with features like access control, auditing, and discovery.
 * Direct Metadata Management: Gravitino connects directly to metadata sources via connectors, ensuring changes are instantly reflected between Gravitino and the underlying systems.
diff --git a/docs/getting-started.md b/docs/getting-started.md
index f729d418acf..4569db245e2 100644
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -8,19 +8,19 @@ There are several options for getting started with Apache Gravitino. Installing
 If you want to download and install Gravitino:
 
-  - on AWS, see [Getting started on Amazon Web Services](#getting-started-on-amazon-web-services)
-  - Google Cloud Platform, see [Getting started on Google Cloud Platform](#getting-started-on-google-cloud-platform)
-  - locally, see [Getting started locally](#getting-started-locally)
+- on AWS, see [Getting started on Amazon Web Services](#getting-started-on-amazon-web-services)
+- Google Cloud Platform, see [Getting started on Google Cloud Platform](#getting-started-on-google-cloud-platform)
+- locally, see [Getting started locally](#getting-started-locally)
 
-If you have your own Apache Gravitino setup and want to use Apache Hive: 
+If you have your own Apache Gravitino setup and want to use Apache Hive:
 
-  - on AWS or Google Cloud Platform, see [Installing Apache Hive on AWS or Google Cloud Platform](#installing-apache-hive-on-aws-or-google-cloud-platform)
-  - locally, see [Installing Apache Hive locally](#installing-apache-hive-locally)
+- on AWS or Google Cloud Platform, see [Installing Apache Hive on AWS or Google Cloud Platform](#installing-apache-hive-on-aws-or-google-cloud-platform)
+- locally, see [Installing Apache Hive locally](#installing-apache-hive-locally)
 
 If you prefer to get started quickly and use Docker for Gravitino, Apache Hive, Trino, and others:
 
-  - on AWS or Google Cloud Platform, see [Installing Gravitino playground on AWS or Google Cloud Platform](#installing-apache-gravitino-playground-on-aws-or-google-cloud-platform)
-  - locally, see [Installing Gravitino playground locally](#installing-apache-gravitino-playground-locally)
+- on AWS or Google Cloud Platform, see [Installing Gravitino playground on AWS or Google Cloud Platform](#installing-apache-gravitino-playground-on-aws-or-google-cloud-platform)
+- locally, see [Installing Gravitino playground locally](#installing-apache-gravitino-playground-locally)
 
 If you are using AWS and want to access the instance remotely, be sure to read [Accessing Gravitino on AWS externally](#accessing-apache-gravitino-on-aws-externally)
@@ -56,7 +56,6 @@ If you are using AWS and want to access the instance remotely, be sure to read [
 10. **Next steps** - Concluding thoughts and suggested next steps for users who have completed the setup.
-
 ## Getting started on Amazon Web Services
 
 To begin using Gravitino on AWS, follow these steps:
@@ -214,7 +213,7 @@ Gravitino provides a bundle of Docker images to launch a Gravitino playground, w
 includes Apache Hive, Apache Hadoop, Trino, MySQL, PostgreSQL, and Gravitino. You can use Docker
 Compose to start them all.
 
-Installing Docker and Docker Compose is a requirement for using the playground. 
+Installing Docker and Docker Compose is a requirement for using the playground.
 
 ```shell
 sudo apt install docker docker-compose
@@ -317,23 +316,21 @@ After completing these steps, you should be able to access the Gravitino REST in
 1. **Explore documentation:**
    - Delve deeper into the Gravitino documentation for advanced features and configuration options.
-   - Check out https://gravitino.apache.org/docs/latest
+   - Check out <https://gravitino.apache.org/docs/latest>
 
 2. **Community engagement:**
    - Join the Gravitino community forums to connect with other users, share experiences, and seek assistance if needed.
-   - Check out our GitHub repository: https://github.com/apache/gravitino
-   - Check out our Slack channel in ASF Slack: https://the-asf.slack.com
- 
+   - Check out our GitHub repository: <https://github.com/apache/gravitino>
+   - Check out our Slack channel in ASF Slack: <https://the-asf.slack.com>
+
 3. **Read our blogs:**
-   - Check out: https://gravitino.apache.org/blog
+   - Check out: <https://gravitino.apache.org/blog>
 
 4. **Continuous updates:**
-   - Stay informed about Gravitino updates and new releases to benefit from the latest features, optimizations, and security 
+   - Stay informed about Gravitino updates and new releases to benefit from the latest features, optimizations, and security
     enhancements.
-   - Check out our Website: https://gravitino.apache.org
+   - Check out our Website: <https://gravitino.apache.org>
-
 This document is just the beginning. You're welcome to customize your Gravitino setup based on your requirements and to explore the vast possibilities this powerful tool offers. If you encounter any issues or have questions, you can always connect with the Gravitino community for assistance.
-
diff --git a/docs/gravitino-server-config.md b/docs/gravitino-server-config.md
index 88edf23470f..cc60fe2cf73 100644
--- a/docs/gravitino-server-config.md
+++ b/docs/gravitino-server-config.md
@@ -64,12 +64,11 @@ The following table lists the storage configuration items:
 | `gravitino.entity.store.relational.jdbcPassword` | The password that the `JDBCBackend` needs to use when connecting the database. It is required for `MySQL`. | `gravitino` | Yes if the jdbc connection url is not `jdbc:h2` | 0.5.0 |
 | `gravitino.entity.store.relational.storagePath` | The storage path for embedded JDBC storage implementation. It supports both absolute and relative path, if the value is a relative path, the final path is `${GRAVITINO_HOME}/${PATH_YOU_HAVA_SET}`, default value is `${GRAVITINO_HOME}/data/jdbc` | `${GRAVITINO_HOME}/data/jdbc` | No | 0.6.0-incubating |
-
 :::caution
 We strongly recommend that you change the default value of `gravitino.entity.store.relational.storagePath`, as it's under the deployment directory and future version upgrades may remove it.
 :::
 
-#### Create JDBC backend schema and table 
+#### Create JDBC backend schema and table
 
 For H2 database, All tables needed by Gravitino are created automatically when the Gravitino server starts up.
 For MySQL, you should firstly initialize the database tables yourself by executing the ddl scripts in the `${GRAVITINO_HOME}/scripts/mysql/` directory.
 
@@ -94,7 +93,7 @@ Gravitino server uses tree lock to ensure the consistency of the data. The tree
 | Configuration item | Description | Default value | Since Version |
 |-------------------------------|--------------------------------------------------------------------------------------------------------------------------------|---------------|---------------|
-| `gravitino.auxService.names ` | The auxiliary service name of the Gravitino Iceberg REST server. Use **`iceberg-rest`** for the Gravitino Iceberg REST server. | (none) | 0.2.0 |
+| `gravitino.auxService.names` | The auxiliary service name of the Gravitino Iceberg REST server. Use **`iceberg-rest`** for the Gravitino Iceberg REST server. | (none) | 0.2.0 |
 
 Refer to [Iceberg REST catalog service](iceberg-rest-service.md) for configuration details.
 
@@ -107,8 +106,8 @@ To leverage the event listener, you must implement the `EventListenerPlugin` int
 | Property name | Description | Default value | Required | Since Version |
 |----------------------------------------|----------------------------------------------------------------------------------------------------------|---------------|----------|---------------|
 | `gravitino.eventListener.names` | The name of the event listener, For multiple listeners, separate names with a comma, like "audit,sync" | (none) | Yes | 0.5.0 |
-| `gravitino.eventListener.{name}.class` | The class name of the event listener, replace `{name}` with the actual listener name. | (none) | Yes | 0.5.0 |
-| `gravitino.eventListener.{name}.{key}` | Custom properties that will be passed to the event listener plugin. | (none) | Yes | 0.5.0 |
+| `gravitino.eventListener.{name}.class` | The class name of the event listener, replace `{name}` with the actual listener name. | (none) | Yes | 0.5.0 |
+| `gravitino.eventListener.{name}.{key}` | Custom properties that will be passed to the event listener plugin. | (none) | Yes | 0.5.0 |
 
 #### Event
 
@@ -149,7 +148,7 @@ The plugin provides several operational modes for how to process event, supporti
 - **SYNC**: Events are processed synchronously, immediately following the associated operation. This mode ensures events are processed before the operation's result is returned to the client, but it may delay the main process if event processing takes too long.
 
 - **ASYNC_SHARED**: This mode employs a shared queue and dispatcher for asynchronous event processing. It prevents the main process from being blocked, though there's a risk events might be dropped if not promptly consumed. Sharing a dispatcher can lead to poor isolation in case of slow listeners.
- 
+
 - **ASYNC_ISOLATED**: Events are processed asynchronously, with each listener having its own dedicated queue and dispatcher thread. This approach offers better isolation but at the expense of multiple queues and dispatchers.
 
 When processing pre-event, you could throw a `ForbiddenException` to skip the following executions. For more details, please refer to the definition of the plugin.
@@ -163,8 +162,8 @@ Gravitino provides a default implement to log basic audit information to a file,
 | Property name | Description | Default value | Required | Since Version |
 |---------------------------------------|----------------------------------------|---------------------------------------------|----------|----------------------------|
 | `gravitino.audit.enabled` | The audit log enable flag. | false | NO | 0.7.0-incubating |
-| `gravitino.audit.writer.className` | The class name of audit log writer. | org.apache.gravitino.audit.FileAuditWriter | NO | 0.7.0-incubating |
-| `gravitino.audit.formatter.className` | The class name of audit log formatter. | org.apache.gravitino.audit.SimpleFormatter | NO | 0.7.0-incubating |
+| `gravitino.audit.writer.className` | The class name of audit log writer. | org.apache.gravitino.audit.FileAuditWriter | NO | 0.7.0-incubating |
+| `gravitino.audit.formatter.className` | The class name of audit log formatter. | org.apache.gravitino.audit.SimpleFormatter | NO | 0.7.0-incubating |
 
 #### Audit log formatter
 
@@ -221,7 +220,6 @@ Below is a list of catalog properties that will be used by all Gravitino catalog
 | `cloud.name` | The property to specify the cloud that the catalog is running on. The valid values are `aws`, `azure`, `gcp`, `on_premise` and `other`. | (none) | No | 0.6.0-incubating |
 | `cloud.region-code` | The property to specify the region code of the cloud that the catalog is running on. | (none) | No | 0.6.0-incubating |
-
 The following table lists the catalog specific properties and their default paths:
 
 | catalog provider | catalog properties | catalog properties configuration file path |
@@ -255,5 +253,5 @@ Currently, due to the absence of a comprehensive user permission system, Graviti
 Apache Hadoop access. Ensure that the user starting the Gravitino server has Hadoop (HDFS, YARN, etc.) access permissions; otherwise, you may encounter a `Permission denied` error. There are two ways to resolve this error:
 
-* Grant Gravitino startup user permissions in Hadoop
-* Specify the authorized Hadoop username in the environment variables `HADOOP_USER_NAME` before starting the Gravitino server.
+- Grant Gravitino startup user permissions in Hadoop
+- Specify the authorized Hadoop username in the environment variables `HADOOP_USER_NAME` before starting the Gravitino server.
diff --git a/docs/hadoop-catalog-index.md b/docs/hadoop-catalog-index.md
index dfa7a187175..bc0d1cf3000 100644
--- a/docs/hadoop-catalog-index.md
+++ b/docs/hadoop-catalog-index.md
@@ -18,9 +18,9 @@ Gravitino Hadoop catalog index includes the following chapters:
 Apart from the above, you can also refer to the following topics to manage and access cloud storage like S3, GCS, ADLS, and OSS:
 
-- [Using Hadoop catalog to manage S3](./hadoop-catalog-with-s3.md). 
-- [Using Hadoop catalog to manage GCS](./hadoop-catalog-with-gcs.md). 
-- [Using Hadoop catalog to manage ADLS](./hadoop-catalog-with-adls.md). 
-- [Using Hadoop catalog to manage OSS](./hadoop-catalog-with-oss.md). 
+- [Using Hadoop catalog to manage S3](./hadoop-catalog-with-s3.md).
+- [Using Hadoop catalog to manage GCS](./hadoop-catalog-with-gcs.md).
+- [Using Hadoop catalog to manage ADLS](./hadoop-catalog-with-adls.md).
+- [Using Hadoop catalog to manage OSS](./hadoop-catalog-with-oss.md).
 
-More storage options will be added soon. Stay tuned!
\ No newline at end of file
+More storage options will be added soon. Stay tuned!
diff --git a/docs/hadoop-catalog-with-adls.md b/docs/hadoop-catalog-with-adls.md
index 880166776fd..080f7132dd6 100644
--- a/docs/hadoop-catalog-with-adls.md
+++ b/docs/hadoop-catalog-with-adls.md
@@ -17,7 +17,7 @@ To set up a Hadoop catalog with ADLS, follow these steps:
 3. Start the Gravitino server by running the following command:
 
 ```bash
-$ ${GRAVITINO_HOME}/bin/gravitino-server.sh start
+${GRAVITINO_HOME}/bin/gravitino-server.sh start
 ```
 
 Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS. In the rest of this document we will use `http://localhost:8090` as the Gravitino server URL, please replace it with your actual server URL.
@@ -32,11 +32,10 @@ Apart from configurations mentioned in [Hadoop-catalog-catalog-configuration](./
 |-------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------|
 | `filesystem-providers` | The file system providers to add. Set it to `abs` if it's a Azure Blob Storage fileset, or a comma separated string that contains `abs` like `oss,abs,s3` to support multiple kinds of fileset including `abs`. | (none) | Yes | 0.8.0-incubating |
 | `default-filesystem-provider` | The name default filesystem providers of this Hadoop catalog if users do not specify the scheme in the URI. Default value is `builtin-local`, for Azure Blob Storage, if we set this value, we can omit the prefix 'abfss://' in the location. | `builtin-local` | No | 0.8.0-incubating |
-| `azure-storage-account-name ` | The account name of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating |
+| `azure-storage-account-name` | The account name of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating |
 | `azure-storage-account-key` | The account key of Azure Blob Storage. | (none) | Yes | 0.8.0-incubating |
 | `credential-providers` | The credential provider types, separated by comma, possible value can be `adls-token`, `azure-account-key`. As the default authentication type is using account name and account key as the above, this configuration can enable credential vending provided by Gravitino server and client will no longer need to provide authentication information like account_name/account_key to access ADLS by GVFS. Once it's set, more configuration items are needed to make it works, please see [adls-credential-vending](security/credential-vending.md#adls-credentials) | (none) | No | 0.8.0-incubating |
-
 ### Configurations for a schema
 
 Refer to [Schema configurations](./hadoop-catalog.md#schema-properties) for more details.
 
@@ -316,6 +315,7 @@ Before running the following code, you need to install required packages:
 pip install pyspark==3.1.3
 pip install apache-gravitino==${GRAVITINO_VERSION}
 ```
+
 Then you can run the following code:
 
 ```python
@@ -366,7 +366,6 @@ os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gra
 - [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without Hadoop environment and `hadoop-azure` jar.
 - `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.
-
 Please choose the correct jar according to your environment.
 
 :::note
@@ -412,7 +411,7 @@ The following are examples of how to use the `hadoop fs` command to access the f
 2. Add the necessary jars to the Hadoop classpath.
-For ADLS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-azure-${gravitino-version}.jar` and `hadoop-azure-${hadoop-version}.jar` located at `${HADOOP_HOME}/share/hadoop/tools/lib/` to the Hadoop classpath. 
+For ADLS, you need to add `gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, `gravitino-azure-${gravitino-version}.jar` and `hadoop-azure-${hadoop-version}.jar` located at `${HADOOP_HOME}/share/hadoop/tools/lib/` to the Hadoop classpath.
 
 3. Run the following command to access the fileset:
 
@@ -453,7 +452,6 @@ fs = gvfs.GravitinoVirtualFileSystem(server_uri="http://localhost:8090", metalak
 fs.ls("gvfs://fileset/{adls_catalog}/{adls_schema}/{adls_fileset}/")
 ```
 
-
 ### Using fileset with pandas
 
 The following are examples of how to use the pandas library to access the ADLS fileset
@@ -539,4 +537,3 @@ spark = SparkSession.builder
 ```
 
 Python client and Hadoop command are similar to the above examples.
-
diff --git a/gradle/wrapper/gradle-wrapper.properties b/gradle/wrapper/gradle-wrapper.properties
index 18ed075b49a..62670fe108a 100644
--- a/gradle/wrapper/gradle-wrapper.properties
+++ b/gradle/wrapper/gradle-wrapper.properties
@@ -1,27 +1,8 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements. See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership. The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License. You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing,
-# software distributed under the License is distributed on an
-# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-# KIND, either express or implied. See the License for the
-# specific language governing permissions and limitations
-# under the License.
-#
-# Refer from https://github.com/gradle/gradle/blob/master/gradle/wrapper/gradle-wrapper.properties
 distributionBase=GRADLE_USER_HOME
 distributionPath=wrapper/dists
-# checksum was taken from https://gradle.org/release-checksums
 distributionSha256Sum=38f66cd6eef217b4c35855bb11ea4e9fbc53594ccccb5fb82dfd317ef8c2c5a3
-distributionUrl=https\://services.gradle.org/distributions/gradle-8.2-bin.zip
+distributionUrl=https\://services.gradle.org/distributions/gradle-8.13-bin.zip
 networkTimeout=10000
+validateDistributionUrl=true
 zipStoreBase=GRADLE_USER_HOME
 zipStorePath=wrapper/dists
diff --git a/gradlew b/gradlew
index 23efc17bab5..faf93008b77 100755
--- a/gradlew
+++ b/gradlew
@@ -15,6 +15,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+# SPDX-License-Identifier: Apache-2.0
+#
 
 ##############################################################################
 #
@@ -55,15 +57,13 @@
 #       Darwin, MinGW, and NonStop.
 #
 #   (3) This script is generated from the Groovy template
-#       https://github.com/gradle/gradle/blob/HEAD/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
+#       https://github.com/gradle/gradle/blob/HEAD/platforms/jvm/plugins-application/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
 #       within the Gradle project.
 #
 #       You can find Gradle at https://github.com/gradle/gradle/.
 #
 ##############################################################################
 
-# Refer from https://github.com/gradle/gradle/blob/master/gradlew
-
 # Attempt to set APP_HOME
 
 # Resolve links: $0 may be a link
@@ -85,25 +85,8 @@ done
 # This is normally unused
 # shellcheck disable=SC2034
 APP_BASE_NAME=${0##*/}
-APP_HOME=$( cd "${APP_HOME:-./}" && pwd -P ) || exit
-
-if [ ! -e $APP_HOME/gradle/wrapper/gradle-wrapper.jar ]; then
-  GRADLE_WRAPPER_FILENAME=$(grep 'distributionUrl' gradle/wrapper/gradle-wrapper.properties | awk -F '/' '{print $NF}')
-  # the GRADLE_WRAPPER_FILENAME could be either "gradle-X.Y[.Z]-all.zip" or "gradle-X.Y[.Z]-bin.zip"
-  GRADLE_VERSION=${GRADLE_WRAPPER_FILENAME#gradle-}
-  GRADLE_VERSION=${GRADLE_VERSION%.zip}
-  GRADLE_VERSION=${GRADLE_VERSION%-bin}
-  GRADLE_VERSION=${GRADLE_VERSION%-all}
-  # when GRADLE_VERSION is X.Y, the tag is vX.Y.0
-  if [ $(echo $GRADLE_VERSION | tr -cd '.' | wc -c) -eq 1 ]; then
-    GRADLE_TAG="v$GRADLE_VERSION.0"
-  else
-    GRADLE_TAG="v$GRADLE_VERSION"
-  fi
-  GRADLE_WRAPPER_URL="https://raw.githubusercontent.com/gradle/gradle/${GRADLE_TAG}/gradle/wrapper/gradle-wrapper.jar"
-  echo "Downloading $GRADLE_WRAPPER_URL"
-  curl -o $APP_HOME/gradle/wrapper/gradle-wrapper.jar "$GRADLE_WRAPPER_URL"
-fi
+# Discard cd standard output in case $CDPATH is set (https://github.com/gradle/gradle/issues/25036)
+APP_HOME=$( cd -P "${APP_HOME:-./}" > /dev/null && printf '%s\n' "$PWD" ) || exit
 
 # Use the maximum available, or set MAX_FD != -1 to use that value.
 MAX_FD=maximum
@@ -150,10 +133,13 @@ location of your Java installation."
     fi
 else
     JAVACMD=java
-    which java >/dev/null 2>&1 || die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
+    if ! command -v java >/dev/null 2>&1
+    then
+        die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
 
 Please set the JAVA_HOME variable in your environment to match the
 location of your Java installation."
+    fi
 fi
 
 # Increase the maximum file descriptors if we can.
@@ -161,7 +147,7 @@ if ! "$cygwin" && ! "$darwin" && ! "$nonstop" ; then
     case $MAX_FD in #(
       max*)
         # In POSIX sh, ulimit -H is undefined. That's why the result is checked to see if it worked.
-        # shellcheck disable=SC3045
+        # shellcheck disable=SC2039,SC3045
         MAX_FD=$( ulimit -H -n ) ||
             warn "Could not query maximum file descriptor limit"
     esac
@@ -169,7 +155,7 @@ if ! "$cygwin" && ! "$darwin" && ! "$nonstop" ; then
     '' | soft) :;; #(
       *)
         # In POSIX sh, ulimit -n is undefined. That's why the result is checked to see if it worked.
-        # shellcheck disable=SC3045
+        # shellcheck disable=SC2039,SC3045
         ulimit -n "$MAX_FD" ||
             warn "Could not set maximum file descriptor limit to $MAX_FD"
     esac
@@ -218,11 +204,11 @@ fi
 # Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
 DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"'
 
-# Collect all arguments for the java command;
-#   * $DEFAULT_JVM_OPTS, $JAVA_OPTS, and $GRADLE_OPTS can contain fragments of
-#     shell script including quotes and variable substitutions, so put them in
-#     double quotes to make sure that they get re-expanded; and
-#   * put everything else in single quotes, so that it's not re-expanded.
+# Collect all arguments for the java command:
+#   * DEFAULT_JVM_OPTS, JAVA_OPTS, and optsEnvironmentVar are not allowed to contain shell fragments,
+#     and any embedded shellness will be escaped.
+#   * For example: A user cannot expect ${Hostname} to be expanded, as it is an environment variable and will be
+#     treated as '${Hostname}' itself on the command line.
 
 set -- \
         "-Dorg.gradle.appname=$APP_BASE_NAME" \
diff --git a/spark-connector/spark-common/src/main/java/org/apache/gravitino/spark/connector/iceberg/IcebergPropertiesConstants.java b/spark-connector/spark-common/src/main/java/org/apache/gravitino/spark/connector/iceberg/IcebergPropertiesConstants.java
index 1a5ffa7278d..3d6ea418b42 100644
--- a/spark-connector/spark-common/src/main/java/org/apache/gravitino/spark/connector/iceberg/IcebergPropertiesConstants.java
+++ b/spark-connector/spark-common/src/main/java/org/apache/gravitino/spark/connector/iceberg/IcebergPropertiesConstants.java
@@ -76,7 +76,6 @@ public class IcebergPropertiesConstants {
   @VisibleForTesting
   public static final String ICEBERG_FORMAT_VERSION = IcebergConstants.FORMAT_VERSION;
 
-  @VisibleForTesting
   public static final String ICEBERG_CATALOG_CACHE_ENABLED = CatalogProperties.CACHE_ENABLED;
 
   static final String GRAVITINO_ICEBERG_CATALOG_BACKEND_NAME =