Commit c72a1f2

ravwojdyla-agent, ravwojdyla, and claude authored
datakit: biodiversity stitch_pages must be a generator (#5451)
* `stitch_pages` is now a generator (`yield`) instead of returning `list[dict]`
* zephyr's `_reduce_gen` only flattens reducer output when the reducer is a generator function; a regular function returning `list[dict]` emits the list itself as a single record
* inline docstring note documents the constraint so the next reader doesn't re-introduce the bug

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Rafal Wojdyla <ravwojdyla@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent b715527 commit c72a1f2

1 file changed

Lines changed: 11 additions & 9 deletions

lib/marin/src/marin/datakit/download/biodiversity.py

@@ -36,26 +36,28 @@
 SOURCE_NAME = "biodiversity-heritage-library"
 
 
-def stitch_pages(item_id: str, pages: Iterator[dict]) -> list[dict]:
+def stitch_pages(item_id: str, pages: Iterator[dict]) -> Iterator[dict]:
     """Reducer: join ordered page texts into one item-level record.
 
     Pages arrive sorted by ``page_num`` via the group_by ``sort_by`` key.
     Empty page texts are dropped; items left with no usable pages emit
     nothing and are counted under ``biodiversity/dropped_items``.
+
+    Must be a generator: Zephyr's ``_reduce_gen`` only flattens reducer
+    output when the reducer is a generator function; a regular function
+    returning ``list[dict]`` would emit the list as a single record.
     """
     texts = [str(p["text"]) for p in pages if p.get("text")]
     if not texts:
         counters.increment("biodiversity/dropped_items")
-        return []
+        return
     counters.increment("biodiversity/kept_items")
     counters.increment("biodiversity/pages_stitched", len(texts))
-    return [
-        {
-            "text": PAGE_SEPARATOR.join(texts),
-            "source": SOURCE_NAME,
-            "item_id": item_id,
-        }
-    ]
+    yield {
+        "text": PAGE_SEPARATOR.join(texts),
+        "source": SOURCE_NAME,
+        "item_id": item_id,
+    }
 
 
 def transform(input_path: str, output_path: str) -> None:
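
The constraint named in the commit message can be illustrated with a small standalone sketch. It does not reproduce Zephyr's code: `reduce_flatten` is a hypothetical stand-in that assumes the flattening decision is made with `inspect.isgeneratorfunction`, and the two reducers are simplified versions of `stitch_pages` (no counters, no `source` field, and a newline assumed in place of `PAGE_SEPARATOR`, whose value is not shown in the diff).

```python
import inspect
from collections.abc import Callable, Iterable, Iterator


def reduce_flatten(reducer: Callable, key: str, values: Iterable[dict]) -> Iterator:
    """Hypothetical stand-in for the behavior the commit attributes to ``_reduce_gen``.

    Assumption: output is flattened only when ``reducer`` is a generator function.
    """
    if inspect.isgeneratorfunction(reducer):
        # Generator reducer: every yielded item becomes its own record.
        yield from reducer(key, iter(values))
    else:
        # Regular function: the return value is emitted as ONE record,
        # even when that value is a list.
        yield reducer(key, iter(values))


def stitch_as_list(item_id: str, pages: Iterator[dict]) -> list[dict]:
    """Pre-fix shape: returns a list (simplified, separator is an assumption)."""
    texts = [str(p["text"]) for p in pages if p.get("text")]
    if not texts:
        return []
    return [{"text": "\n".join(texts), "item_id": item_id}]


def stitch_as_generator(item_id: str, pages: Iterator[dict]) -> Iterator[dict]:
    """Post-fix shape: yields zero or one record (simplified)."""
    texts = [str(p["text"]) for p in pages if p.get("text")]
    if not texts:
        return  # a bare return in a generator yields nothing
    yield {"text": "\n".join(texts), "item_id": item_id}


pages = [{"text": "page one"}, {"text": ""}, {"text": "page two"}]
blank = [{"text": ""}]

list_records = list(reduce_flatten(stitch_as_list, "item-1", pages))
gen_records = list(reduce_flatten(stitch_as_generator, "item-1", pages))
empty_list_records = list(reduce_flatten(stitch_as_list, "item-2", blank))
empty_gen_records = list(reduce_flatten(stitch_as_generator, "item-2", blank))

print(list_records)        # one record that is itself a list
print(gen_records)         # one dict record
print(empty_list_records)  # [[]]: an empty list leaks through as a record
print(empty_gen_records)   # []: nothing is emitted
```

Under these assumptions the list-returning reducer produces a nested list for kept items and a stray empty list for dropped items, while the generator version produces exactly the item-level dicts, which matches the motivation given for the change.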
