This repository was archived by the owner on Apr 30, 2026. It is now read-only.

Commit c7563b7

Merge remote-tracking branch 'upstream/main' into hybrid-chunker

2 parents: a8273f4 + 2cc9889

11 files changed: 144 additions & 26 deletions

.github/mergify.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -67,7 +67,7 @@ pull_request_rules:
       - or:
           - files~=\.py$
           - files=pyproject.toml
-          - files=^requirements.*\.txt$
+          - files~=^requirements.*\.txt$
           - files=.github/workflows/functional-gpu-nvidia-t4-x1.yml
       - and:
           - -files~=\.py$
```
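The fix above swaps Mergify's exact-match `files=` operator for the regex operator `files~=`: under `files=`, the anchored pattern was compared as a literal string and could never match a real path. A minimal sketch of the difference, using plain Python `re` as a stand-in for Mergify's matcher:

```python
import re

# The pattern from the mergify.yml rule. It only behaves as intended when
# treated as a regular expression ("files~="); under literal comparison
# ("files="), a path would need to contain "^" and "$" verbatim to match.
PATTERN = r"^requirements.*\.txt$"

def regex_match(path: str) -> bool:
    """How "files~=" evaluates the condition: regex search."""
    return re.search(PATTERN, path) is not None

def exact_match(path: str) -> bool:
    """How "files=" evaluates the condition: literal string equality."""
    return path == PATTERN
```

With regex matching, `requirements.txt` and `requirements-dev.txt` both match, while `docs/requirements.txt` does not (the `^` anchor rejects it); literal comparison matches nothing.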

.github/workflows/lint.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -74,7 +74,7 @@ jobs:
           fetch-depth: 0
 
       - name: Setup Python 3.11
-        uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
+        uses: actions/setup-python@8d9ed9ac5c53483de85588cdf95a591a75ab9f55 # v5.5.0
         with:
           python-version: 3.11
           cache: pip
```

.github/workflows/pypi.yaml

Lines changed: 2 additions & 2 deletions
```diff
@@ -72,7 +72,7 @@ jobs:
           egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
 
       - name: "Download build artifacts"
-        uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8
+        uses: actions/download-artifact@95815c38cf2ff2164869cbab79da8d1f422bc89e # v4.2.1
         with:
           name: Packages
           path: dist
@@ -104,7 +104,7 @@ jobs:
           egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
 
       - name: "Download build artifacts"
-        uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8
+        uses: actions/download-artifact@95815c38cf2ff2164869cbab79da8d1f422bc89e # v4.2.1
         with:
           name: Packages
           path: dist
```

.github/workflows/test.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -80,7 +80,7 @@ jobs:
           brew install expect coreutils bash
 
       - name: Setup Python ${{ matrix.python }}
-        uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
+        uses: actions/setup-python@8d9ed9ac5c53483de85588cdf95a591a75ab9f55 # v5.5.0
         with:
           python-version: ${{ matrix.python }}
           cache: pip
@@ -93,7 +93,7 @@ jobs:
           pip cache remove llama_cpp_python
 
       - name: Cache huggingface
-        uses: actions/cache@0c907a75c2c80ebcb7f088228285e798b750cf8f # v4.2.1
+        uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
         with:
           path: ~/.cache/huggingface
           # config contains DEFAULT_MODEL
```

CHANGELOG.md

Lines changed: 12 additions & 0 deletions
```diff
@@ -10,6 +10,18 @@ Each `LLMBlock` in a `Pipeline` can now specify `model_family` or `model_id`
 
 The parameters `model_family`, `model_id`, and `num_instructions_to_generate` are no longer required in `PipelineContext` objects. They used to be required, and if passed in will still get used as before. However, they can now be omitted if your `Pipeline` contains no `LLMBlock` entries or if your `LLMBlock` config specifies these values in the `Pipeline` yaml.
 
+## v0.7.2
+
+### Fixes
+
+* When chunking knowledge documents, PDF or Markdown documents containing a table would often result in a "list index out of range" error. The cases of that error caused by chunking table content are now fixed. Users have also reported other cases where a "list index out of range" error can show up in the version of Docling we rely on; those specific cases won't be fixed until we upgrade the Docling version.
+
+## v0.7.1
+
+### Fixes
+
+* When mixing datasets, we were not always properly plumbing the user's expected system prompt through into the samples of the mixed dataset. Specifically, for the new `mix_datasets` API added in v0.7.0, we were never setting the system prompt at all. This release adds a system prompt parameter to that API and ensures it is used when creating mixed datasets.
+
 ## v0.7.0
 
 ### Features
```
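The v0.7.1 fix above is about attaching the user's expected system prompt to every sample when datasets are mixed. A hedged sketch of the idea, with illustrative names only (`mix_samples` and its parameters are not the actual SDG API):

```python
from typing import Dict, List

def mix_samples(
    datasets: List[List[Dict]], system_prompt: str
) -> List[Dict]:
    """Interleave samples from several datasets, stamping each sample
    with the caller-supplied system prompt (the piece that was missing
    before the fix). Hypothetical stand-in for the real mixing code."""
    mixed = []
    for ds in datasets:
        for sample in ds:
            # Copy the sample and attach the system prompt explicitly,
            # rather than relying on a default that was never set.
            mixed.append({**sample, "system": system_prompt})
    return mixed
```

The design point is that the prompt is a required parameter of the mixing step itself, so no sample can slip through without it.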

pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: Apache-2.0
 
 [build-system]
-requires = ["setuptools>=64", "setuptools_scm>=8"]
+requires = ["setuptools>=78.1.0", "setuptools_scm>=8"]
 build-backend = "setuptools.build_meta"
 
 [project]
```

requirements.txt

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: Apache-2.0
 click>=8.1.7,<9.0.0
-datasets>=2.18.0,<3.0.0
+datasets>=2.18.0
 docling-core[chunking]>=2.9.0
 docling[tesserocr]>=2.9.0; sys_platform != 'darwin'
 docling>=2.9.0; sys_platform == 'darwin'
```
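Dropping the `<3.0.0` cap on `datasets` lets pip resolve any 3.x release, not just the 2.x line. A simplified sketch of what the two specifier sets accept, using naive tuple comparison (real pip follows PEP 440 via the `packaging` library, which also handles pre-releases and local versions):

```python
def parse(version: str) -> tuple:
    """Naive version parse: '2.18.0' -> (2, 18, 0)."""
    return tuple(int(part) for part in version.split("."))

def allowed_old(version: str) -> bool:
    """Old pin: datasets>=2.18.0,<3.0.0 — caps at the 2.x line."""
    return parse("2.18.0") <= parse(version) < parse("3.0.0")

def allowed_new(version: str) -> bool:
    """New pin: datasets>=2.18.0 — no upper bound."""
    return parse("2.18.0") <= parse(version)
```

Under the old pin a 3.x release was rejected; under the new one it is accepted, which matters once the dependency's major-version API is compatible.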

src/instructlab/sdg/utils/chunkers.py

Lines changed: 12 additions & 3 deletions
```diff
@@ -5,14 +5,18 @@
 import json
 import logging
+import os
 import re
+import sys
 
 # Third Party
 from datasets import Dataset
 from docling.datamodel.base_models import InputFormat
 from docling.datamodel.document import ConversionResult
 from docling.datamodel.pipeline_options import (
+    AcceleratorDevice,
+    AcceleratorOptions,
     EasyOcrOptions,
@@ -51,7 +55,12 @@ def resolve_ocr_options(
         # Third Party
         from docling.models.tesseract_ocr_model import TesseractOcrModel
 
-        _ = TesseractOcrModel(True, ocr_options)
+        _ = TesseractOcrModel(
+            enabled=True,
+            artifacts_path=docling_model_path,
+            options=ocr_options,
+            accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CPU),
+        )
         return ocr_options
     except ImportError:
         # No tesserocr, so try something else
@@ -66,7 +75,6 @@ def resolve_ocr_options(
             recog_network="standard",
             download_enabled=True,
         )
-        accelerator_options = AcceleratorOptions(device="cpu")
         # triggers torch loading, import lazily
         # pylint: disable=import-outside-toplevel
         # Third Party
@@ -76,7 +84,7 @@ def resolve_ocr_options(
             enabled=True,
             artifacts_path=None,
             options=ocr_options,
-            accelerator_options=accelerator_options,
+            accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CPU),
         )
         return ocr_options
     except ImportError:
@@ -146,6 +154,7 @@ def _init_docling_converter(self):
             artifacts_path=self.docling_model_path,
             do_ocr=False,
         )
+
        # deactivate MPS acceleration on Github CI
        if os.getenv("CI") and sys.platform == "darwin":
            pipeline_options.accelerator_options = AcceleratorOptions(
```
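The last hunk's guard disables MPS acceleration on macOS GitHub runners by forcing a CPU device. The decision logic can be sketched as a pure function (parameterized for testability; `select_device` and its string return values are illustrative stand-ins for Docling's `AcceleratorOptions`/`AcceleratorDevice`):

```python
def select_device(ci: bool, platform: str) -> str:
    """Pick the accelerator device the way the chunker's CI guard does.

    ci: whether we are running under CI (the code checks os.getenv("CI")).
    platform: sys.platform value, e.g. "darwin" or "linux".
    """
    # Deactivate MPS acceleration on GitHub CI macOS runners, where the
    # Metal backend is unavailable or unreliable; elsewhere let the
    # library auto-detect the best device.
    if ci and platform == "darwin":
        return "cpu"
    return "auto"
```

Keeping the condition in one small function makes the CI-only special case easy to unit-test without spinning up a converter.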

src/instructlab/sdg/utils/taxonomy.py

Lines changed: 7 additions & 7 deletions
```diff
@@ -3,7 +3,7 @@
 # Standard
 from pathlib import Path
 from tempfile import mkdtemp
-from typing import Dict, List, Tuple, Union
+from typing import Dict, List, Union
 import glob
 import logging
 import os
@@ -122,7 +122,7 @@ def _get_documents(
     source: Dict[str, Union[str, List[str]]],
     skip_checkout: bool = False,
     document_output_dir: Path = None,
-) -> Tuple[List[Path], List[Path]]:
+) -> List[Path]:
     """
     Retrieve file paths (Markdown and PDFs) from a Git repository.
 
@@ -143,8 +143,8 @@ def _get_documents(
     repo_url = source.get("repo")
     commit_hash = source.get("commit")
     file_patterns = source.get("patterns", [])
-
-    try:  # pylint: disable=too-many-nested-blocks
+    # pylint: disable=too-many-nested-blocks
+    try:
         repo = git.Repo.clone_from(repo_url, document_output_dir)
 
         if not skip_checkout and commit_hash:
@@ -178,7 +178,7 @@ def _get_documents(
                 logger.info(f"Skipping non-file path: {file_path}")
 
         if filepaths:
-            return filepaths, filepaths
+            return filepaths
         raise SystemExit("Couldn't find knowledge documents")
 
     except (OSError, git.exc.GitCommandError, FileNotFoundError) as e:
@@ -212,13 +212,13 @@ def _read_taxonomy_file(
     task_description = contents.get("task_description", None)
     domain = contents.get("domain")
     documents = contents.get("document")
-    doc_filepaths, _ = None, None
+    doc_filepaths = None
     if documents:
         os.makedirs(document_output_dir, exist_ok=True)
         unique_output_dir = mkdtemp(
             prefix=f"{leaf_node_path}_", dir=document_output_dir
        )
-        doc_filepaths, _ = _get_documents(
+        doc_filepaths = _get_documents(
             source=documents,
             document_output_dir=unique_output_dir,
         )
```
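This refactor removes a redundant return shape: `_get_documents` used to return `(filepaths, filepaths)`, forcing every caller to unpack and discard a duplicate (`doc_filepaths, _ = ...`). A minimal sketch of the simplified contract (`get_documents` here is an illustrative reduction, not the real function, which clones a Git repository):

```python
from pathlib import Path
from typing import List

def get_documents(paths: List[str]) -> List[Path]:
    """Return the Markdown/PDF document paths from a candidate list.

    Returns a single list (the old code returned the same list twice
    in a tuple) and exits if no knowledge documents are found, mirroring
    the SystemExit behavior in the diff above.
    """
    filepaths = [Path(p) for p in paths if p.endswith((".md", ".pdf"))]
    if filepaths:
        return filepaths
    raise SystemExit("Couldn't find knowledge documents")
```

Call sites become `doc_filepaths = get_documents(...)`, and the `Tuple` import can be dropped, which is exactly what the hunks above do.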

tests/functional/test_chunkers.py

Lines changed: 6 additions & 7 deletions
```diff
@@ -90,13 +90,12 @@ def test_chunk_documents(
         chunk_word_count=500,
     )
     chunks = chunker.chunk_documents()
-
-    # Check that we have more chunks than expected.
-    assert (
-        len(chunks) > expected_chunks
-    ), f"Expected more than {expected_chunks} chunks, got {len(chunks)}"
-
-    # Check that no chunk is empty and each chunk's length is within the allowed limit.
+    assert len(chunks) > expected_chunks
+    if contains_text:
+        # Normalize spaces and remove newlines for more flexible text comparison
+        normalized_chunk = " ".join(chunks[0].replace("\n", " ").split())
+        normalized_text = " ".join(contains_text.split())
+        assert normalized_text in normalized_chunk
     for chunk in chunks:
         assert chunk, "Chunk should not be empty"
         assert len(chunk) < 2500, f"Chunk length {len(chunk)} exceeds maximum allowed"
```
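The whitespace normalization used in the updated test is worth pulling out: it collapses newlines and runs of spaces so expected text can be found in a chunk regardless of how the chunker wrapped its lines. As a standalone helper:

```python
def normalize(text: str) -> str:
    """Collapse newlines and repeated whitespace into single spaces,
    matching the normalization in the test above."""
    # str.split() with no argument splits on any whitespace run and
    # drops empties, so joining with a single space canonicalizes spacing.
    return " ".join(text.replace("\n", " ").split())
```

Two strings that differ only in line wrapping or spacing normalize to the same value, which is what makes the `in` containment check in the test robust.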
