docs/src/understand/use_cases/reproducibility.md
23 additions & 4 deletions
@@ -9,9 +9,9 @@ Data changes frequently. This makes the task of keeping track of its exact state
This has a negative impact on the work, as it becomes hard to:

-* Debug a data issue.
-* Validate machine learning training accuracy (re-running a model over different data gives different results).
-* Comply with data audits.
+* Debug a data issue
+* Validate machine learning training accuracy (re-running a model over different data gives different results)
+* Comply with data audits, and model audits in particular

In comparison, lakeFS exposes a Git-like interface to data that allows keeping track of more than just the current state of data. This makes reproducing its state at any point in time straightforward.
@@ -28,7 +28,9 @@ To read data at it’s current state, we can use a static path containing the re
The code above assumes that all objects in the repository under this path are stored in parquet format. If a different format is used, the applicable Spark read method should be used.

-In a lakeFS repository, we are capable of taking many commits over the data, making many points in time reproducible.
+### Using Commits
+
+In a lakeFS repository, we are capable of taking many [commits](../../understand/glossary.md#commit) over the data, making many points in time reproducible.
The ability to reference a specific `commit_id` in code simplifies reproducing the specific state of a data collection or even multiple collections. This has many applications that are common in data development, such as historical debugging, identifying deltas in a data collection, audit compliance, and more.
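
The "identifying deltas" case could be sketched roughly as follows. This is an illustration under assumptions, not the documented method: it assumes paths take the form `s3a://<repository>/<ref>/<path>` with a commit ID used as the ref, and that `spark` is a SparkSession already configured to reach lakeFS. The helper names `ref_path` and `rows_added_between` are hypothetical.

```python
def ref_path(repository: str, ref: str, prefix: str) -> str:
    """Build an s3a URI that pins a read to one ref (here, a commit ID).

    Assumes the s3a://<repository>/<ref>/<path> layout; adjust if your
    deployment addresses objects differently.
    """
    return f"s3a://{repository}/{ref}/{prefix.lstrip('/')}"


def rows_added_between(spark, repository: str, old_commit: str,
                       new_commit: str, prefix: str):
    """Return the distinct rows present at new_commit but absent at old_commit.

    `spark` is an existing SparkSession; both commits are read as parquet.
    """
    old_df = spark.read.parquet(ref_path(repository, old_commit, prefix))
    new_df = spark.read.parquet(ref_path(repository, new_commit, prefix))
    # DataFrame.subtract keeps rows of new_df that do not appear in old_df.
    return new_df.subtract(old_df)
```

Because both reads are pinned to commit IDs, re-running the comparison later should yield the same delta even if the branch has moved on.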

+### Using Tags
+
+In addition to commits, lakeFS supports [tags](../../understand/glossary.md#tag). A tag is a human-readable label that points to a specific commit.
+
+Tags are useful when you want to mark important points in time, such as:
+* A production data release
+* A specific model training dataset
+* A dataset used for an audit
+
+Instead of referencing a non-readable `commit_id`, you can reference the tag directly in your code. For example:
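
The example from the diff is not included in this excerpt. As a stand-in, here is a minimal sketch that assumes a tag name can be used in the ref position of an `s3a://<repository>/<ref>/<path>` path, the same way a branch or commit ID would be. The repository `example`, the tag `v1.0-training`, the `training_dataset/` prefix, and the helper `tagged_path` are all hypothetical.

```python
def tagged_path(repository: str, tag: str, prefix: str) -> str:
    """Build an s3a URI that reads a path as of the commit a tag points to.

    Assumes tags are accepted wherever a branch name or commit ID is.
    """
    return f"s3a://{repository}/{tag}/{prefix.lstrip('/')}"


# Hypothetical tag marking the dataset used for a model training run.
TRAINING_TAG = "v1.0-training"
path = tagged_path("example", TRAINING_TAG, "training_dataset/")

# With a SparkSession configured for lakeFS, the read would then be:
# df = spark.read.parquet(path)
```

Keeping the tag name in a single constant makes it easy to see, and to audit, which labeled snapshot a job was run against.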