Draft
41 commits
1f74185
add SLURM integration
svirpioj Feb 4, 2026
689556f
fix slurm issues
svirpioj Feb 11, 2026
597d655
add common get_outputs/inputs utility methods
svirpioj Mar 4, 2026
e1df177
fix SLURM support with YAML variables
svirpioj Mar 4, 2026
9dd1746
fix selecting partition for SLURM
svirpioj Mar 4, 2026
109f318
fix SLURM support with YAML variables
svirpioj Mar 4, 2026
3690d45
fix finding ready steps
svirpioj Mar 4, 2026
555e69a
check for failed jobs
svirpioj Mar 4, 2026
4b34e05
check for failed jobs, fix dependency check
svirpioj Mar 4, 2026
58bad5a
add some logging
svirpioj Mar 4, 2026
20a6cc7
fix get_job_status and use workdir only for logs and scripts
svirpioj Mar 4, 2026
fa89b62
remove unnecessary --monitor switch
svirpioj Mar 4, 2026
3df4663
fix variable expansion in opusfilter-slurm
svirpioj Mar 11, 2026
69d4cbe
fix constant expansion in opusfilter-slurm
svirpioj Mar 11, 2026
169d827
fix dependency graph issues
svirpioj Mar 18, 2026
b43cd49
restructure slurm main loop
svirpioj Mar 18, 2026
4a576d9
add missing step types to get_outputs calls
svirpioj Mar 18, 2026
15553e3
add depends_on field
svirpioj Mar 18, 2026
b9b88e3
add validation for the input configuration
svirpioj Mar 25, 2026
6cc5ccb
add --substep for opusfilter script and use it for SLURM jobs
svirpioj Apr 1, 2026
dfd3fac
use 1-based indexing in sbatch script names
svirpioj Apr 1, 2026
6f404db
fix dependency handling
svirpioj Apr 8, 2026
8a6edf3
add slurm-submit and slurm-status commands
svirpioj Apr 15, 2026
a3fdc4b
mark completed steps
svirpioj Apr 22, 2026
b372425
fix dependency formatting
svirpioj Apr 22, 2026
6cc93ff
add dependencies to slurm-status table
svirpioj Apr 22, 2026
0ea4eab
restore node to slurm-status table
svirpioj Apr 22, 2026
7313bf2
clean up slurm-status table with multiple dependencies
svirpioj Apr 22, 2026
74fd0f3
fix squeue / sacct commands
svirpioj Apr 22, 2026
b10aa63
fix squeue / sacct commands
svirpioj Apr 22, 2026
41f77bb
improve manifest handling in slurm scripts
svirpioj Apr 29, 2026
bfc0e7f
fix squeue / sacct commands
svirpioj Apr 29, 2026
694e818
fix resuming from existing slurm manifest
svirpioj Apr 29, 2026
a20c892
update CHANGELOG
svirpioj Jun 10, 2026
7d16ce4
add hf_read to known step types
svirpioj Jun 10, 2026
57359d6
change default opusfilter-work path and improve slurm-run exit logic
svirpioj Jun 10, 2026
afb6425
fix get_job_status for purged jobs
svirpioj Jun 10, 2026
42847d1
fix index for script and job names
svirpioj Jun 10, 2026
1ed3ba9
use text_file_open in ParallelWrapper
svirpioj Jun 10, 2026
9c1a4ef
use indices starting from 1 more consistently
svirpioj Jun 10, 2026
b912171
add inline filtering to hf_read and opus_read
svirpioj Jun 10, 2026
15 changes: 15 additions & 0 deletions bin/opusfilter-slurm
@@ -0,0 +1,15 @@
#!/usr/bin/env python3

import sys
import warnings

from opusfilter.cli.slurm import main

warnings.warn(
    "opusfilter-slurm is experimental. Please report issues at "
    "https://github.com/Helsinki-NLP/OpusFilter/issues",
    UserWarning,
    stacklevel=2
)

sys.exit(main())
1 change: 1 addition & 0 deletions docs/CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- support for Python 3.14
- `hf_read` function for reading data from Hugging Face datasets
- support for encoding text in JSONL files to preserve newlines
- `opusfilter-slurm-*` commands for running OpusFilter workflows in SLURM clusters

### Changed

63 changes: 63 additions & 0 deletions docs/functions/downloading_and_selecting_data.md
@@ -17,6 +17,8 @@ Parameters:
* `id_field`: field name for alignment between configs in cross-config mode (required for cross-config)
* `newline_replacement`: character to replace embedded newlines with in plain-text output (default: space)
* `split`: which dataset split to use (default: `train`)
* `filters`: an optional list of filter configurations (same format as the `filter` step) to apply before writing — avoids a separate `filter` step after reading
* `filterfalse`: if `true` and `filters` is set, write only pairs that are *rejected* by the filters (default: `false`)
* `streaming`: whether to stream the dataset (default: `true`)
* `trust_remote_code`: whether to trust remote code in dataset loading (default: `false`)
* `kwargs`: additional keyword arguments passed to `datasets.load_dataset`
@@ -43,6 +45,65 @@ For datasets with multi-line text (e.g. web-crawled data), use the
`.jsonl` file extension in `src_output` and `tgt_output` to produce
JSONL output, which preserves embedded newlines within each segment.
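
Why JSON encoding preserves newlines can be shown with a small sketch using only the Python standard library. It assumes, for illustration only, that each segment is stored as one JSON-encoded string per line; the exact record layout written by `hf_read` is not specified here and may differ.

```python
import json

# A segment with an embedded newline, as is common in web-crawled text.
segment = "First paragraph.\nSecond paragraph."

# Plain-text output has to replace the newline (a space is the stated
# default for newline_replacement), so the original break is lost.
plain_line = segment.replace("\n", " ")

# JSON encoding escapes the newline as the two characters backslash + n,
# so the segment still fits on one physical line (assumed layout: one
# JSON string per line).
jsonl_line = json.dumps(segment, ensure_ascii=False)

# Decoding restores the original text, including the newline.
restored = json.loads(jsonl_line)
```

The encoded line contains no raw newline character, which is what lets a line-oriented reader keep source and target segments in step.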

Example: loading a selection of language pairs from
`Helsinki-NLP/nemotron-cc-translated` using cross-config mode:

```yaml
steps:
  - type: hf_read
    parameters:
      dataset: Helsinki-NLP/nemotron-cc-translated
      src_config: eng
      tgt_config: !var lang
      id_field: warc_record_id
      src_field: text
      tgt_field: text
      src_output: !varstr "nemotron/{lang}.en.jsonl.gz"
      tgt_output: !varstr "nemotron/en.{lang}.jsonl.gz"
    variables:
      lang: [eus, gle, glg, isl, mkd, swe]
```

The dataset uses per-language configs named with
[ISO 639-3](https://en.wikipedia.org/wiki/ISO_639-3) three-letter codes
(e.g. `eus` for Basque, `isl` for Icelandic). Each config has a `text`
field and a `warc_record_id` field that aligns sentences across
languages. The `id_field: warc_record_id` parameter ensures correct
alignment even when language configs have different row counts.
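
The idea behind id-based alignment can be illustrated with a minimal sketch in plain Python. This is not the `hf_read` implementation: the helper name `align_by_id`, the in-memory index, and the sample rows are assumptions made for illustration, and a streaming reader may handle this differently.

```python
def align_by_id(src_rows, tgt_rows, id_field="warc_record_id", text_field="text"):
    """Yield (src_text, tgt_text) for rows whose id occurs in both inputs."""
    # Index the target rows by id (hypothetical in-memory approach).
    tgt_by_id = {row[id_field]: row[text_field] for row in tgt_rows}
    for row in src_rows:
        tgt_text = tgt_by_id.get(row[id_field])
        # Source rows without a counterpart are skipped instead of being
        # paired with whatever happens to sit at the same row position.
        if tgt_text is not None:
            yield row[text_field], tgt_text


# Made-up rows: the two sides differ in both row count and order.
src = [
    {"warc_record_id": "a", "text": "Hello"},
    {"warc_record_id": "b", "text": "Thanks"},
    {"warc_record_id": "c", "text": "Bye"},
]
tgt = [
    {"warc_record_id": "c", "text": "Agur"},
    {"warc_record_id": "a", "text": "Kaixo"},
]

pairs = list(align_by_id(src, tgt))
```

Pairing the same rows by position would match "Hello" with "Agur", which is the kind of misalignment that joining on an id field avoids.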

Inline filtering can be added to avoid writing all pairs before
filtering — data flows from the HF dataset through the filter pipeline
and only accepted pairs are written to disk:

```yaml
steps:
  - type: hf_read
    parameters:
      dataset: Helsinki-NLP/nemotron-cc-translated
      src_config: eng
      tgt_config: !var lang
      id_field: warc_record_id
      src_field: text
      tgt_field: text
      src_output: !varstr "filtered.{lang}.jsonl.gz"
      tgt_output: !varstr "filtered.eng.jsonl.gz"
      max_rows: 50000
      filters:
        - LengthFilter:
            min_length: 3
            max_length: 250
            unit: word
        - LengthRatioFilter:
            threshold: 3
            unit: char
        - AlphabetRatioFilter:
            threshold: 0.5
        - LongWordFilter:
            threshold: 40
    variables:
      lang: [eus, gle, glg, isl, mkd, swe]
```
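
The effect of `filters` and `filterfalse` on a stream of pairs can be sketched as follows. The two check functions are simplified stand-ins written for this example, not the OpusFilter filter classes, and `filter_stream` is a hypothetical helper name.

```python
def length_ok(src, tgt, min_length=3, max_length=250):
    # Stand-in for a word-length check: both sides must be within range.
    return all(min_length <= len(s.split()) <= max_length for s in (src, tgt))


def length_ratio_ok(src, tgt, threshold=3):
    # Stand-in for a character-length ratio check (longer / shorter).
    shorter, longer = sorted((len(src), len(tgt)))
    return shorter > 0 and longer / shorter < threshold


def filter_stream(pairs, checks, filterfalse=False):
    """Lazily yield the accepted pairs, or the rejected ones if filterfalse is set."""
    for src, tgt in pairs:
        accepted = all(check(src, tgt) for check in checks)
        if accepted != filterfalse:
            yield src, tgt


pairs = [
    ("This is a short sentence", "Tämä on lyhyt lause tässä"),
    ("Hi", "Hei"),                      # too few words
    ("a b c", "a b c d e f g h i j"),   # character ratio too high
]
checks = [length_ok, length_ratio_ok]

kept = list(filter_stream(pairs, checks))
rejected = list(filter_stream(pairs, checks, filterfalse=True))
```

Because `filter_stream` is a generator, each pair is checked as it arrives and only the selected pairs would need to be written to disk.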

## opus_read

Read a corpus from the OPUS corpus collection {cite:p}`tiedemann-2012-parallel` using
@@ -58,6 +119,8 @@ Parameters:
* `src_output`: output file for source language
* `tgt_output`: output file for target language
* `suppress_prompts`: `false` (default) prompts user to confirm before download, `true` to download without prompting
* `filters`: an optional list of filter configurations (same format as the `filter` step) to apply before writing — avoids a separate `filter` step after reading. OpusTools output is written to a temporary file, filtered, and then written to the final output.
* `filterfalse`: if `true` and `filters` is set, write only pairs that are *rejected* by the filters (default: `false`)
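
The temporary-file flow described for `filters` can be sketched with the standard library alone. The function name `filter_via_tempfile`, the `accept` callback, and the sample lines are assumptions for illustration; the first stage only imitates the raw files that OpusTools would produce.

```python
import os
import tempfile


def filter_via_tempfile(src_lines, tgt_lines, accept, src_output, tgt_output,
                        filterfalse=False):
    with tempfile.TemporaryDirectory() as tmpdir:
        raw_src = os.path.join(tmpdir, "raw.src")
        raw_tgt = os.path.join(tmpdir, "raw.tgt")
        # Stage 1: write the unfiltered output to temporary files.
        for path, lines in ((raw_src, src_lines), (raw_tgt, tgt_lines)):
            with open(path, "w", encoding="utf-8") as f:
                f.writelines(line + "\n" for line in lines)
        # Stage 2: read the pairs back, filter them, write the final outputs.
        with open(raw_src, encoding="utf-8") as fs, \
             open(raw_tgt, encoding="utf-8") as ft, \
             open(src_output, "w", encoding="utf-8") as out_s, \
             open(tgt_output, "w", encoding="utf-8") as out_t:
            for s, t in zip(fs, ft):
                s, t = s.rstrip("\n"), t.rstrip("\n")
                if accept(s, t) != filterfalse:
                    out_s.write(s + "\n")
                    out_t.write(t + "\n")
    # The temporary directory and the raw files are removed at this point.


outdir = tempfile.mkdtemp()
src_out = os.path.join(outdir, "out.en")
tgt_out = os.path.join(outdir, "out.fi")

# Toy acceptance rule: keep pairs where both sides are non-empty.
filter_via_tempfile(["Hello", "", "Bye"], ["Hei", "Moi", ""],
                    lambda s, t: bool(s) and bool(t), src_out, tgt_out)

with open(src_out, encoding="utf-8") as f:
    kept_src = f.read().splitlines()
with open(tgt_out, encoding="utf-8") as f:
    kept_tgt = f.read().splitlines()
```

Unlike the streaming case for `hf_read`, the full unfiltered corpus exists on disk temporarily in this flow, so only the final outputs are smaller.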

The `moses` preprocessing type (available with `OpusTools` version
1.6.2 and above) is recommended for those corpora for which it