This repository was archived by the owner on Jul 29, 2024. It is now read-only.

[QA] Some fixes part 1 #26

Open · wants to merge 8 commits into main

58 changes: 0 additions & 58 deletions src/pages/latest/delta-column-mapping.mdx

This file was deleted.

13 changes: 5 additions & 8 deletions src/pages/latest/delta-intro.mdx
@@ -14,26 +14,23 @@ metadata handling, and unifies [streaming](delta-streaming.md) and
[batch](delta-batch.md) data processing on top of existing data lakes, such as
S3, ADLS, GCS, and HDFS.

For a quick overview and benefits of Delta Lake, watch this YouTube video (3
minutes).

Specifically, Delta Lake offers:

- [ACID transactions](concurrency-control.md) on Spark: Serializable isolation
- [ACID transactions](/latest/concurrency-control) on Spark: Serializable isolation
levels ensure that readers never see inconsistent data.
- Scalable metadata handling: Leverages Spark distributed processing power to
handle all the metadata for petabyte-scale tables with billions of files at
ease.
- [Streaming](delta-streaming.md) and [batch](delta-batch.md) unification: A
- [Streaming](/latest/delta-streaming) and [batch](/latest/delta-batch) unification: A
table in Delta Lake is a batch table as well as a streaming source and sink.
Streaming data ingest, batch historic backfill, interactive queries all just
work out of the box.
- Schema enforcement: Automatically handles schema variations to prevent
insertion of bad records during ingestion.
- [Time travel](delta-batch.md#deltatimetravel): Data versioning enables
- [Time travel](/latest/delta-batch#deltatimetravel): Data versioning enables
rollbacks, full historical audit trails, and reproducible machine learning
experiments.
- [Upserts](delta-update.md#delta-merge) and
[deletes](delta-update.md#delta-delete): Supports merge, update and delete
- [Upserts](/latest/delta-update#upsert-into-a-table-using-merge) and
[deletes](/latest/delta-update#delete-from-a-table): Supports merge, update and delete
operations to enable complex use cases like change-data-capture,
slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
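
As a sketch of the upsert capability in the last bullet above (assuming an active SparkSession with the delta-spark package configured; the table path, update source, and column names are hypothetical), a merge in the Python API could look like:

```python
from delta.tables import DeltaTable

# Hypothetical target Delta table and a DataFrame of incoming updates.
target = DeltaTable.forPath(spark, "/tmp/delta/customers")
updates = spark.read.format("json").load("/tmp/customer-updates")

# Upsert: update matching rows, insert the rest.
(target.alias("t")
    .merge(updates.alias("u"), "t.customerId = u.customerId")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```
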
2 changes: 1 addition & 1 deletion src/pages/latest/delta-streaming.mdx
@@ -273,7 +273,7 @@ The preceding example continuously updates a table that contains the aggregate n

For applications with more lenient latency requirements, you can save computing resources with one-time triggers. Use these to update summary aggregation tables on a given schedule, processing only new data that has arrived since the last update.
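
For instance, a minimal sketch of such a one-time run (assuming an active SparkSession with Delta configured; the source and target paths are hypothetical):

```python
# Process only the data that arrived since the last run, update the summary
# table once, and then stop.
(spark.readStream
    .format("delta")
    .load("/tmp/delta/events")                                   # hypothetical source
    .groupBy("customerId")
    .count()
    .writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
    .trigger(once=True)                                          # one-time trigger
    .start("/tmp/delta/eventsByCustomer"))
```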

## Idempotent table writes in `foreachBatch`
## Idempotent table writes in foreachBatch
Collaborator


Btw, <code>...</code> works and looks better in Header2s (##).

Anyways, it seems we both did delta-streaming. https://github.com/delta-io/delta-docs/pull/39/files#diff-87d0d2367f1ffef73543eec8952b76ba882ec93954c372f7eab0ab85d3832f6fR284

Want to drop these changes from this PR? And we will use mine instead?


<Info title="Note" level="info">
Available in Delta Lake 2.0.0 and above.
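
A minimal sketch of such an idempotent write, assuming Delta Lake 2.0.0+, an active SparkSession, and hypothetical paths and application id: the `txnAppId` and `txnVersion` write options let Delta skip a micro-batch that was already committed, so retries do not produce duplicates.

```python
app_id = "customer-aggregation-app"  # hypothetical, unique per (stream, target table)

def write_to_delta(batch_df, batch_id):
    # The (txnAppId, txnVersion) pair identifies this micro-batch; replaying
    # the same pair is ignored, which makes the write idempotent.
    (batch_df.write
        .format("delta")
        .option("txnVersion", batch_id)
        .option("txnAppId", app_id)
        .mode("append")
        .save("/tmp/delta/aggregates"))  # hypothetical target path

streaming_df = spark.readStream.format("delta").load("/tmp/delta/events")  # hypothetical source
streaming_df.writeStream.foreachBatch(write_to_delta).start()
```
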
43 changes: 0 additions & 43 deletions src/pages/latest/getting-started.mdx

This file was deleted.

27 changes: 8 additions & 19 deletions src/pages/latest/optimizations-oss.mdx
@@ -9,11 +9,10 @@ Delta Lake provides optimizations that accelerate data lake operations.

To improve query speed, Delta Lake supports the ability to optimize the layout of data in storage. There are various ways to optimize the layout.

<a id="compaction-binpacking"></a>
### Compaction (bin-packing)

<Info title="Note" level="info">
Note

This feature is available in Delta Lake 1.2.0 and above.
</Info>

@@ -58,11 +57,9 @@ deltaTable.optimize().where("date='2021-11-18'").executeCompaction()

</CodeTabs>

For Scala, Java, and Python API syntax details, see the [Delta Lake APIs](/latest/delta-apidoc.html).
For Scala, Java, and Python API syntax details, see the [Delta Lake APIs](/latest/delta-apidoc).

<Info title="Note" level="info">
Note

* Bin-packing optimization is _idempotent_, meaning that if it is run twice on the same dataset, the second run has no effect.

* Bin-packing aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of tuples per file. However, the two measures are most often correlated.
@@ -77,20 +74,17 @@ Readers of Delta tables use snapshot isolation, which means that they are not in
## Data skipping

<Info title="Note" level="info">
Note

This feature is available in Delta Lake 1.2.0 and above.
</Info>

Data skipping information is collected automatically when you write data into a Delta Lake table. Delta Lake takes advantage of this information (minimum and maximum values for each column) at query time to provide faster queries. You do not need to configure data skipping; the feature is activated whenever applicable. However, its effectiveness depends on the layout of your data. For best results, apply [Z-Ordering](/latest/optimizations-oss.html#-z-ordering-multi-dimensional-clustering).
Data skipping information is collected automatically when you write data into a Delta Lake table. Delta Lake takes advantage of this information (minimum and maximum values for each column) at query time to provide faster queries. You do not need to configure data skipping; the feature is activated whenever applicable. However, its effectiveness depends on the layout of your data. For best results, apply [Z-Ordering](#zordering-multidimensional-clustering).

Collecting statistics on a column containing long values such as string or binary is an expensive operation. To avoid collecting statistics on such columns you can configure the [table property](/latest/delta-batch.html#-table-properties) `delta.dataSkippingNumIndexedCols`. This property indicates the position index of a column in the table’s schema. All columns with a position index less than the `delta.dataSkippingNumIndexedCols` property will have statistics collected. For the purposes of collecting statistics, each field within a nested column is considered as an individual column. To avoid collecting statistics on columns containing long values, either set the `delta.dataSkippingNumIndexedCols` property so that the long value columns are after this index in the table’s schema, or move columns containing long strings to an index position greater than the `delta.dataSkippingNumIndexedCols` property by using `[ALTER TABLE ALTER COLUMN](/latest/sql-ref-syntax-ddl-alter-table.html#alter-or-change-column)`.
Collecting statistics on a column containing long values such as string or binary is an expensive operation. To avoid collecting statistics on such columns you can configure the [table property](/latest/table-properties) `delta.dataSkippingNumIndexedCols`. This property indicates the position index of a column in the table’s schema. All columns with a position index less than the `delta.dataSkippingNumIndexedCols` property will have statistics collected. For the purposes of collecting statistics, each field within a nested column is considered as an individual column. To avoid collecting statistics on columns containing long values, either set the `delta.dataSkippingNumIndexedCols` property so that the long value columns are after this index in the table’s schema, or move columns containing long strings to an index position greater than the `delta.dataSkippingNumIndexedCols` property by using [`ALTER TABLE ALTER COLUMN`](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html#alter-or-change-column).
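
As a sketch (the table and column names below are hypothetical), both steps can be issued through Spark SQL from Python:

```python
# Collect statistics only on the first 5 columns of the table's schema.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5')
""")

# Move a long string column past that index so no statistics are collected for it.
spark.sql("ALTER TABLE events ALTER COLUMN long_payload AFTER category")
```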

<a id="zordering-multidimensional-clustering"></a>
## Z-Ordering (multi-dimensional clustering)

<Info title="Note" level="info">
Note

This feature is available in Delta Lake 2.0.0 and above.
</Info>

@@ -131,27 +125,24 @@ deltaTable.optimize().where("date='2021-11-18'").executeZOrderBy(eventType)

</CodeTabs>

For Scala, Java, and Python API syntax details, see the [Delta Lake APIs](/latest/delta-apidoc.html).
For Scala, Java, and Python API syntax details, see the [Delta Lake APIs](/latest/delta-apidoc).

If you expect a column to be commonly used in query predicates and if that column has high cardinality (that is, a large number of distinct values), then use `ZORDER BY`.

You can specify multiple columns for `ZORDER BY` as a comma-separated list. However, the effectiveness of the locality drops with each extra column. Z-Ordering on columns that do not have statistics collected on them would be ineffective and a waste of resources. This is because data skipping requires column-local stats such as min, max, and count. You can configure statistics collection on certain columns by reordering columns in the schema, or you can increase the number of columns to collect statistics on. See [Data skipping](https://docs.delta.io/latest/optimizations-oss.html#-data-skipping).
You can specify multiple columns for `ZORDER BY` as a comma-separated list. However, the effectiveness of the locality drops with each extra column. Z-Ordering on columns that do not have statistics collected on them would be ineffective and a waste of resources. This is because data skipping requires column-local stats such as min, max, and count. You can configure statistics collection on certain columns by reordering columns in the schema, or you can increase the number of columns to collect statistics on. See [Data skipping](#data-skipping).
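
As a sketch in the Python API (hypothetical table path and column names), Z-Ordering by a short list of columns looks like the single-column case above:

```python
from delta.tables import DeltaTable

# Assumes an active SparkSession with delta-spark 2.0.0+ configured.
deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")

# Each extra column dilutes the locality benefit, so keep the list short.
deltaTable.optimize().where("date='2021-11-18'").executeZOrderBy("eventType", "generation")
```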

<Info title="Note" level="info">
Note

* Z-Ordering is _not idempotent_. Every time the Z-Ordering is executed it will try to create a new clustering of data in all files (new and existing files that were part of previous Z-Ordering) in a partition.

* Z-Ordering aims to produce evenly-balanced data files with respect to the number of tuples, but not necessarily data size on disk. The two measures are most often correlated, but there can be situations when that is not the case, leading to skew in optimize task times.

* For example, if you `ZORDER BY` _date_ and your most recent records are all much wider (for example longer arrays or string values) than the ones in the past, it is expected that the `OPTIMIZE` job’s task durations will be skewed, as well as the resulting file sizes. This is, however, only a problem for the `OPTIMIZE` command itself; it should not have any negative impact on subsequent queries.
</Info>

<a id="multipart-checkpointing"></a>
## Multi-part checkpointing

<Info title="Note" level="info">
Note

This feature is available in Delta Lake 2.0.0 and above. This feature is in experimental support mode.
</Info>

@@ -160,7 +151,5 @@ Delta Lake table periodically and automatically compacts all the incremental upd
The Delta Lake protocol allows [splitting the checkpoint](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints) into multiple Parquet files. This parallelizes and speeds up writing the checkpoint. In Delta Lake, by default each checkpoint is written as a single Parquet file. To use this feature, set the SQL configuration `spark.databricks.delta.checkpoint.partSize=<n>`, where `n` is the limit on the number of actions (such as `AddFile`) at which Delta Lake on Apache Spark will start parallelizing the checkpoint and attempt to write a maximum of this many actions per checkpoint file.
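
For example, a sketch of enabling multi-part checkpoints from Python (the threshold value is arbitrary and assumes an active SparkSession):

```python
# Start splitting checkpoints once they would contain more than 100,000 actions,
# writing at most that many actions per checkpoint Parquet file.
spark.conf.set("spark.databricks.delta.checkpoint.partSize", "100000")
```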

<Info title="Note" level="info">
Note

This feature requires no reader side configuration changes. The existing reader already supports reading a checkpoint with multiple files.
</Info>