Commit 173cc03

docs: update outdated eventual consistency wording for cloud object stores (#12424)
Refreshes the Architecture page so it no longer describes modern cloud object stores as eventually consistent. AWS S3 has provided strong read-after-write consistency since December 2020, and GCS / Azure Blob have been strongly consistent since launch. The technical motivation for Nessie is preserved: cloud object stores still lack atomic multi-object swap / rename, which is the real reason a Hive-metastore-like component (or Nessie) is needed.

Closes #5349

Signed-off-by: mj006648 <uckdekf@gmail.com>
1 parent 476ce08 commit 173cc03

1 file changed

Lines changed: 5 additions & 3 deletions

site/docs/develop/index.md

@@ -1,7 +1,7 @@
 # Architecture
 
 Nessie builds on the recent ecosystem developments around table formats. The rise of
-very large metadata and eventually consistent cloud data lakes (S3 specifically) drove
+very large metadata and cloud data lakes (S3 specifically) drove
 the need for an updated model around metadata management. Where consistent directory
 listings in HDFS used to be sufficient, there were many features lacking. This includes
 snapshotting, consistency and fast planning. Apache Iceberg was created to help alleviate
@@ -21,8 +21,10 @@ require a pointer to the active metadata set to function. This pointer allows th
 current schema, files and partitions in the dataset. Iceberg currently relies on the Hive metastore or hdfs to perform
 this role. The requirements for this root pointer store is it must hold (at least) information about the location of the
 current up-to-date metadata file, and it must be able to update this location atomically. In Hive this is accomplished by
-locks and in hdfs by using atomic file swap operations. These operations don’t exist in eventually consistent cloud
-object stores, necessitating a Hive metastore for cloud data lakes. The Nessie system is designed to store the
+locks and in hdfs by using atomic file swap operations. While modern cloud object stores
+(S3 since December 2020, GCS and Azure Blob since launch) provide strong read-after-write
+consistency, they still lack atomic multi-object swap operations, necessitating a Hive
+metastore for cloud data lakes. The Nessie system is designed to store the
 root metadata pointer and perform atomic updates to this pointer, obviating the need for a Hive metastore. Removing the
 need for a Hive metastore simplifies deployment and broadens the reach of tools that can work with Iceberg tables.
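
The changed paragraph turns on one requirement: the root pointer store must hold the location of the current metadata file and be able to update it atomically. As a rough illustration only, that requirement can be sketched as a compare-and-swap on an in-memory map. The class `RootPointerStore`, its methods, and the example paths are hypothetical names chosen for this sketch; they are not Nessie's or Iceberg's API, and a real store would need durable, distributed storage rather than a process-local lock.

```python
import threading
from typing import Dict, Optional


class RootPointerStore:
    """Toy in-memory root pointer store (hypothetical; not Nessie's API)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Maps a table name to the location of its current metadata file.
        self._pointers: Dict[str, str] = {}

    def get(self, table: str) -> Optional[str]:
        """Return the current metadata location, or None if the table is unknown."""
        with self._lock:
            return self._pointers.get(table)

    def compare_and_swap(self, table: str, expected: Optional[str], new: str) -> bool:
        """Point `table` at `new` only if it still points at `expected`.

        The check and the write happen under one lock, so no other writer
        can slip in between them. Returns False when the pointer has moved.
        """
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True


if __name__ == "__main__":
    store = RootPointerStore()
    store.compare_and_swap("db.events", None, "s3://bucket/events/v1.metadata.json")

    # Two writers both read v1 as their base, then race to commit.
    base = store.get("db.events")
    first = store.compare_and_swap("db.events", base, "s3://bucket/events/v2a.metadata.json")
    second = store.compare_and_swap("db.events", base, "s3://bucket/events/v2b.metadata.json")
    print(first, second, store.get("db.events"))
```

In this sketch the second writer is rejected because the pointer no longer matches the base it read, so it would have to re-read the pointer and retry. Strong read-after-write consistency alone does not provide this conditional update, which is the gap the reworded paragraph describes.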
