docs/process.md: 22 additions & 0 deletions
@@ -71,6 +71,20 @@ You can configure parameters by providing a custom config file. You can find an
:rotating_light: Not all parameters are configurable yet :wink:
### :recycle: Incremental reprocessing
The optional top-level `previous_results` parameter reuses results from a prior run so that unchanged files are not reprocessed, which saves time and compute costs.
Point it to a `merged_results.jsonl` produced by an earlier run. URL inputs are always reprocessed; on the next run, each local input file is compared against that JSONL:
- Unchanged files: their previous samples are reused as-is.
- New or modified files: they are processed normally.
- Removed files: their samples are dropped from the output.
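
The three cases above can be sketched in Python. This is only an illustration, not the pipeline's actual implementation: the documentation does not say how files are compared, so the sketch assumes a SHA-256 content hash, and the `source_path` and `source_hash` record fields, as well as the `plan_incremental_run` helper, are hypothetical names. It covers local files only, since URL inputs are always reprocessed.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_incremental_run(input_files, previous_results_path):
    """Split local inputs into samples to reuse and files to process.

    Assumption: each JSONL record has hypothetical "source_path" and
    "source_hash" fields identifying the file it was produced from.
    """
    # Group the previous samples by the file they came from.
    previous = {}
    with open(previous_results_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                record = json.loads(line)
                previous.setdefault(record["source_path"], []).append(record)

    reused, to_process = [], []
    for path in map(Path, input_files):
        records = previous.get(str(path), [])
        digest = file_digest(path)
        if records and all(r["source_hash"] == digest for r in records):
            # Unchanged file: keep its previous samples as-is.
            reused.extend(records)
        else:
            # New or modified file: process it normally.
            to_process.append(path)

    # Samples of removed files are never copied into `reused`,
    # so they are dropped from the output.
    return reused, to_process
```

Calling `plan_incremental_run(["docs/a.md", "docs/b.md"], "merged_results.jsonl")` would return the samples that can be carried over and the paths that still need processing; the file names here are placeholders.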
## :scroll: More information on what's under the hood
Use `--input-data` to specify the path (absolute or relative to the root of the repository) to the JSONL file recording the output of the initial processing phase.
### :recycle: Incremental post-processing
Like the processing pipeline, the post-processor accepts an optional `previous_results` parameter to reuse results from a prior post-processing run and skip unchanged documents.
New post-processors can easily be implemented, and pipelines can be configured through lightweight YAML files. The post-processing stage produces a new JSONL file containing cleaned and optionally enhanced document samples.