From 7448b9881485b2c0507e89f75af24988c37ccca0 Mon Sep 17 00:00:00 2001
From: Avril Aysha <68642378+avriiil@users.noreply.github.com>
Date: Sat, 1 Jun 2024 17:46:01 +0100
Subject: [PATCH 1/3] add context + api docs link

---
 docs/usage/deleting-rows-from-delta-lake-table.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/usage/deleting-rows-from-delta-lake-table.md b/docs/usage/deleting-rows-from-delta-lake-table.md
index e1833c84b9..6c852100e6 100644
--- a/docs/usage/deleting-rows-from-delta-lake-table.md
+++ b/docs/usage/deleting-rows-from-delta-lake-table.md
@@ -32,3 +32,7 @@ Here are the contents of the Delta table after the delete operation has been per

The end of the existing example output is unchanged:

```
|     2 | b        |
+-------+----------+
```

The following lines are added below it:

`dt.delete()` accepts an optional SQL `WHERE`-style predicate. If no predicate is provided, all rows will be deleted.

Read more in the [API docs](https://delta-io.github.io/delta-rs/api/delta_table/#deltalake.DeltaTable.delete).

From ee7e00f866553debdd56c78475eec57b6463a95c Mon Sep 17 00:00:00 2001
From: Avril Aysha <68642378+avriiil@users.noreply.github.com>
Date: Tue, 24 Sep 2024 12:20:47 +0100
Subject: [PATCH 2/3] create gcs docs

---
 docs/integrations/object-storage/gcs.md | 87 +++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 docs/integrations/object-storage/gcs.md

diff --git a/docs/integrations/object-storage/gcs.md b/docs/integrations/object-storage/gcs.md
new file mode 100644
index 0000000000..c5592ccc5c
--- /dev/null
+++ b/docs/integrations/object-storage/gcs.md
@@ -0,0 +1,87 @@

The new file contains:

# GCS Storage Backend

`delta-rs` offers native support for using Google Cloud Storage (GCS) as an object storage backend.

You don’t need to install any extra dependencies to read/write Delta tables to S3 with engines that use `delta-rs`. You do need to configure your AWS access credentials correctly.

## Note for boto3 users

Many Python engines use [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to connect to AWS. This library supports reading credentials automatically from your local `.aws/config` or `.aws/creds` file.

For example, if you’re running locally with the proper credentials in your local `.aws/config` or `.aws/creds` file, then you can write a Parquet file to S3 with pandas like this:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
df.to_parquet("s3://avriiil/parquet-test-pandas")
```

The `delta-rs` writer does not use `boto3` and therefore does not support taking credentials from your `.aws/config` or `.aws/creds` file. If you’re used to working with writers from Python engines like Polars, pandas or Dask, this may mean a small change to your workflow.

## Passing AWS Credentials

You can pass your AWS credentials explicitly by using:

- the `storage_options` kwarg
- environment variables
- EC2 metadata (if running on EC2 instances)
- AWS profiles
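For instance, the environment-variable route might look like the following sketch. The variable names are the standard AWS ones used elsewhere on this page; the values, bucket path, and DynamoDB table name are placeholders, and (as discussed in the locking-provider section below) S3 writes also need a locking provider or the unsafe-rename flag.

```python
# A minimal sketch of passing credentials via environment variables.
# All values below are placeholders; set them before writing.
import os

os.environ["AWS_REGION"] = "us-east-1"
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
# Locking-provider settings for safe concurrent writes (see below).
os.environ["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
os.environ["DELTA_DYNAMO_TABLE_NAME"] = "delta_log"

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3]})

# No storage_options needed: credentials are picked up from the environment.
df.write_delta("s3://bucket/delta_table")
```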
## Example

Let's work through an example with Polars. The same logic applies to other Python engines like pandas, Daft, Dask, etc.

Follow the steps below to use Delta Lake on S3 with Polars:

1. Install Polars and deltalake. For example, using:

   `pip install polars deltalake`

2. Create a dataframe with some toy data:

   `df = pl.DataFrame({'x': [1, 2, 3]})`

3. Set your `storage_options` correctly. The values below are placeholders for your own credentials:

   ```python
   storage_options = {
       "AWS_REGION": "<aws_region>",
       "AWS_ACCESS_KEY_ID": "<aws_access_key_id>",
       "AWS_SECRET_ACCESS_KEY": "<aws_secret_access_key>",
       "AWS_S3_LOCKING_PROVIDER": "dynamodb",
       "DELTA_DYNAMO_TABLE_NAME": "delta_log",
   }
   ```

4. Write data to the Delta table using the `storage_options` kwarg:

   ```python
   df.write_delta(
       "s3://bucket/delta_table",
       storage_options=storage_options,
   )
   ```

## Delta Lake on AWS S3: Safe Concurrent Writes

You need a locking provider to ensure safe concurrent writes when writing Delta tables to AWS S3. This is because AWS S3 does not guarantee mutual exclusion.

A locking provider guarantees that only one writer is able to create the same file. This prevents corrupted or conflicting data.

`delta-rs` uses DynamoDB to guarantee safe concurrent writes.

Run the code below in your terminal to create a DynamoDB table that will act as your locking provider.

```
aws dynamodb create-table \
    --table-name delta_log \
    --attribute-definitions AttributeName=tablePath,AttributeType=S AttributeName=fileName,AttributeType=S \
    --key-schema AttributeName=tablePath,KeyType=HASH AttributeName=fileName,KeyType=RANGE \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
```

If you don't want to use DynamoDB as your locking mechanism, you can set the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to `true` to enable unsafe S3 writes.

Read more in the [Usage](../../usage/writing/writing-to-s3-with-locking-provider.md) section.

## Delta Lake on GCS: Required permissions

From 5fa0d8a8890caf69c934b5cd6fd502eae1ec43b2 Mon Sep 17 00:00:00 2001
From: Avril Aysha <68642378+avriiil@users.noreply.github.com>
Date: Tue, 24 Sep 2024 15:33:53 +0100
Subject: [PATCH 3/3] update docs

---
 docs/integrations/object-storage/gcs.md | 91 +++++++-------------
 1 file changed, 24 insertions(+), 67 deletions(-)

diff --git a/docs/integrations/object-storage/gcs.md b/docs/integrations/object-storage/gcs.md
index c5592ccc5c..aa8682d3cc 100644
--- a/docs/integrations/object-storage/gcs.md
+++ b/docs/integrations/object-storage/gcs.md
@@ -2,86 +2,43 @@ # GCS Storage Backend

The updated file now reads:

`delta-rs` offers native support for using Google Cloud Storage (GCS) as an object storage backend.

You don’t need to install any extra dependencies to read/write Delta tables to GCS with engines that use `delta-rs`. You do need to configure your GCS access credentials correctly.

## Using Application Default Credentials

Application Default Credentials (ADC) is a strategy used by the Google Cloud authentication libraries to automatically find credentials based on the application environment.

If you are working from your local machine and have ADC set up, then you can read/write Delta tables from GCS directly, without having to pass your credentials explicitly.
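As a sketch of what this looks like with the `deltalake` Python package (assuming ADC is already configured, for example via `gcloud auth application-default login`; the GCS path is a placeholder):

```python
# Relies on Application Default Credentials: no storage_options are passed.
# The GCS path below is a placeholder.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"x": [1, 2, 3]})

# Write a Delta table to GCS using whatever credentials ADC resolves.
write_deltalake("gs://bucket/delta-table-adc", df)

# Read it back to confirm the round trip.
print(DeltaTable("gs://bucket/delta-table-adc").to_pandas())
```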
## Example: Write Delta tables to GCS with Polars

Using Polars, you can write a Delta table to GCS like this:

```python
# create a toy dataframe
import polars as pl
df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})

# define the path
table_path = "gs://bucket/delta-table"

# write the Delta table to GCS
df.write_delta(table_path)
```

## Passing GCS Credentials explicitly

Alternatively, you can pass GCS credentials to your query engine explicitly.

For Polars, you would do this using the `storage_options` keyword argument. This forwards your credentials to the `object_store` library that Polars uses under the hood. Read the [Polars documentation](https://docs.pola.rs/api/python/stable/reference/api/polars.DataFrame.write_delta.html) and the [`object_store` documentation](https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html#variants) for more information.
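As a sketch, passing a service account key file explicitly could look like the following. The `SERVICE_ACCOUNT` key name follows the `object_store` `GoogleConfigKey` naming linked above; the key file path and bucket are placeholders, so check the linked docs for the full list of accepted keys.

```python
# A sketch of passing GCS credentials explicitly via storage_options.
# The key file path and bucket name are placeholders.
import polars as pl

storage_options = {
    # Path to a service account JSON key file (GoogleConfigKey::ServiceAccount).
    "SERVICE_ACCOUNT": "/path/to/service-account.json",
}

df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
df.write_delta(
    "gs://bucket/delta-table",
    storage_options=storage_options,
)
```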
+- `storage.objects.create` +- `storage.objects.delete` (only required for uploads that overwrite an existing object) +- `storage.objects.get` (only required if you plan on using the Google Cloud CLI) +- `storage.objects.list` (only required if you plan on using the Google Cloud CLI) -## Delta Lake on GCS: Required permissions +For more information, see the [GCP documentation](https://cloud.google.com/storage/docs/uploading-objects)