
Commit d568929

mini refactor of reproducibility
1 parent a10a985 commit d568929

File tree

1 file changed: +23 -4 lines changed

docs/src/understand/use_cases/reproducibility.md

Lines changed: 23 additions & 4 deletions
@@ -9,9 +9,9 @@ Data changes frequently. This makes the task of keeping track of its exact state
 
 This has a negative impact on the work, as it becomes hard to:
 
-* Debug a data issue.
-* Validate machine learning training accuracy (re-running a model over different data gives different results).
-* Comply with data audits.
+* Debug a data issue
+* Validate machine learning training accuracy (re-running a model over different data gives different results)
+* Comply with data audits, and model audits in particular
 
 In comparison, lakeFS exposes a Git-like interface to data that allows keeping track of more than just the current state of data. This makes reproducing its state at any point in time straightforward.
 
@@ -28,7 +28,9 @@ To read data at it’s current state, we can use a static path containing the re
 
 The code above assumes that all objects in the repository under this path are stored in parquet format. If a different format is used, the applicable Spark read method should be used.
 
-In a lakeFS repository, we are capable of taking many commits over the data, making many points in time reproducible.
+### Using Commits
+
+In a lakeFS repository, we are capable of taking many [commits](../../understand/glossary.md#commit) over the data, making many points in time reproducible.
 
 ![Commit History](../../assets/img/reproduce-commit-history.png)
 

@@ -42,4 +44,21 @@ df = spark.read.parquet("s3://example/296e54fbee5e176f3f4f4aeb7e087f9d57515750e8
 
 The ability to reference a specific `commit_id` in code simplifies reproducing the specific state of a data collection or even multiple collections. This has many applications that are common in data development, such as historical debugging, identifying deltas in a data collection, audit compliance, and more.
 
+### Using Tags
+
+In addition to commits, lakeFS supports [tags](../../understand/glossary.md#tag). A tag is a human-readable label that points to a specific commit.
+
+Tags are useful when you want to mark important points in time, such as:
+* A production data release
+* A specific model training dataset
+* A dataset used for an audit
+
+Instead of referencing an opaque `commit_id`, you can reference the tag directly in your code. For example:
+```python
+df = spark.read.parquet("s3://example/v1.0/training_dataset/")
+```
+
+Here, `v1.0` is a tag that points to a specific commit. A tag is an immutable reference: it cannot be modified after creation
+(only deleted and recreated). Therefore, reading data through a tag will always return the exact same data state.
 
+Using tags makes it easier to work with reproducible datasets in a way that is readable, shareable, and stable over time.
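
As a minimal sketch of the guarantee the new section describes, the snippet below reads the same collection twice: once through a mutable branch and once through an immutable tag. The repository name `example`, the branch name `main`, the tag `v1.0`, and the `training_dataset/` path are assumptions based on the snippets above, and Spark is assumed to be configured against the lakeFS S3 gateway.

```python
from pyspark.sql import SparkSession

# Assumptions: a lakeFS repository named "example" reachable through its S3
# gateway, a branch "main", a tag "v1.0", and parquet data under
# training_dataset/. Adjust names and Spark/S3A configuration to your setup.
spark = SparkSession.builder.appName("reproducibility-sketch").getOrCreate()

# Reading through a branch returns whatever the branch currently points to,
# so the result may change as new commits land on "main".
latest_df = spark.read.parquet("s3://example/main/training_dataset/")

# Reading through a tag always returns the same data state, because a tag is
# an immutable pointer to a single commit.
v1_df = spark.read.parquet("s3://example/v1.0/training_dataset/")

print("rows on main:", latest_df.count())
print("rows at v1.0:", v1_df.count())
```

Re-running the tag read at any later time should return the same result, while the branch read may drift as new commits land on `main`.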
