Releases: Unstructured-IO/unstructured

0.22.18

08 Apr 14:02
d299095

What's Changed

  • fix(chunking): preserve semantic headers in carried table chunks by @cragwolfe in #4313
  • feat: add page number support to v1 html partition by @badGarnet in #4327

Full Changelog: 0.22.16...0.22.18

0.22.16

03 Apr 20:44
264d569

Enhancements

  • Formula markdown export (element_to_md / elements_to_md): New keyword-only formula_markdown_style ("auto", "display_math", "plain"; default "auto"). In "auto", display math ($$ ... $$) is used only when the text looks like notation (heuristic score) and contains no $/$$ (avoids breaking Markdown and noisy OCR captions). "display_math" wraps whenever safe (still falls back to plain if $ would corrupt fences). "plain" emits text only. Optional normalize_formula (default True) maps common Unicode operators to LaTeX-like tokens; normalize_formula stays before keyword-only options so positional encoding / no_group_by_page callers are unchanged. Unicode is never mapped to \\sqrt{}. Module constants: FORMULA_MARKDOWN_AUTO, FORMULA_MARKDOWN_DISPLAY_MATH, FORMULA_MARKDOWN_PLAIN.
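The three styles described above can be sketched as follows. This is a minimal illustration, not the library's implementation: formula_to_md_sketch and _looks_like_notation are hypothetical names, the character-counting score is a toy stand-in for the real heuristic, and the exact "$$ ... $$" spacing is an assumption. Only the option names, their values, and the constant names come from the entry.

```python
FORMULA_MARKDOWN_AUTO = "auto"
FORMULA_MARKDOWN_DISPLAY_MATH = "display_math"
FORMULA_MARKDOWN_PLAIN = "plain"

_STYLES = {
    FORMULA_MARKDOWN_AUTO,
    FORMULA_MARKDOWN_DISPLAY_MATH,
    FORMULA_MARKDOWN_PLAIN,
}


def _looks_like_notation(text: str) -> bool:
    """Toy heuristic (assumption): at least two math-like characters."""
    math_chars = set("=+^_\\{}<>")
    return sum(ch in math_chars for ch in text) >= 2


def formula_to_md_sketch(
    text: str, *, formula_markdown_style: str = FORMULA_MARKDOWN_AUTO
) -> str:
    """Render formula text as Markdown under one of the three styles."""
    if formula_markdown_style not in _STYLES:
        raise ValueError(f"unknown style: {formula_markdown_style!r}")
    # A literal "$" inside the text would corrupt the "$$" fences, so every
    # style falls back to plain text in that case.
    if formula_markdown_style == FORMULA_MARKDOWN_PLAIN or "$" in text:
        return text
    if formula_markdown_style == FORMULA_MARKDOWN_DISPLAY_MATH:
        return f"$$ {text} $$"
    # "auto": wrap only when the text looks like notation.
    return f"$$ {text} $$" if _looks_like_notation(text) else text
```

Under these assumptions, "E = mc^2" is wrapped in "auto" mode, while an OCR caption such as "Figure 3 shows results" stays plain.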

0.22.15

Security

  • security: fix(deps): upgrade vulnerable transitive dependencies [security]

0.22.14

Enhancements

  • Deduplicate PDF rendering: Remove _render_pdf_pages and delegate to unstructured-inference's convert_pdf_to_image (which already has lazy per-page rendering). Peak memory for path_only=True drops from O(n_pages) to O(1 page) — 97% reduction on a 100-page PDF. Bumps inference dep to >=1.6.2.
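The memory difference comes from when pages are materialized. The sketch below contrasts eager and lazy rendering with a generator; it does not call convert_pdf_to_image, and render_all, render_lazily, fake_render, and save_page_sizes are hypothetical stand-ins used only to show the pattern.

```python
from typing import Callable, Iterator, List


def render_all(n_pages: int, render: Callable[[int], bytes]) -> List[bytes]:
    """Eager: every rendered page is alive at once, so memory is O(n_pages)."""
    return [render(i) for i in range(n_pages)]


def render_lazily(n_pages: int, render: Callable[[int], bytes]) -> Iterator[bytes]:
    """Lazy: pages are rendered one at a time as the caller asks for them."""
    for i in range(n_pages):
        yield render(i)


def fake_render(page_index: int) -> bytes:
    """Stand-in for a real rasterizer: returns a 1 KiB dummy page."""
    return bytes(1024)


def save_page_sizes(pages: Iterator[bytes]) -> List[int]:
    """Consume pages one by one, keeping only a small summary of each."""
    sizes = []
    for page in pages:
        sizes.append(len(page))  # stand-in for writing the page to disk
    return sizes
```

When the consumer writes each page out and drops it, as in save_page_sizes, only one rendered page needs to be held at a time.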

0.22.13

Enhancements

  • Speed up standardize_quotes: Replace loop-based character replacement with a single str.translate() call using a pre-computed translation table. Also fixes a pre-existing bug where left smart quotes were never normalized due to duplicate dictionary keys.
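A minimal sketch of the translation-table approach, assuming a small illustrative subset of quote characters. The names SMART_QUOTES, QUOTE_TABLE, and standardize_quotes_sketch are hypothetical, and the BUGGY dictionary shows only one plausible shape of a duplicate-key bug, not the code that was actually fixed.

```python
# Hypothetical subset of code points; the library's real table is larger.
SMART_QUOTES = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
}

# Built once at import time. str.maketrans accepts a dict whose keys are
# single characters and returns a table keyed by code point.
QUOTE_TABLE = str.maketrans(SMART_QUOTES)


def standardize_quotes_sketch(text: str) -> str:
    """Normalize smart quotes in a single pass over the string."""
    return text.translate(QUOTE_TABLE)


# A dict literal silently keeps only the last value for a repeated key, which
# is one way a left-quote entry can vanish without any error being raised.
BUGGY = {
    '"': "\u201c",  # left double quote, overwritten by the next line
    '"': "\u201d",  # right double quote wins
}
```

Keying the table by the character to be replaced, as SMART_QUOTES does, makes each key unique by construction.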

0.22.12

02 Apr 16:27
6ada488

What's Changed

  • mem: exclude unused spaCy pipeline components to reduce model memory by @KRRT7 in #4296
  • fix: pdfminer drops extractable text by @qued in #4310

Full Changelog: 0.22.10...0.22.12

0.22.10

31 Mar 15:50
b6cf510

What's Changed

  • fix(chunking): preserve nested table structure in reconstruction by @cragwolfe in #4301
  • Replace lazyproperty with functools.cached_property by @KRRT7 in #4282
  • mem: reduce PaddleOCR rec_batch_num from 6 to 1 by @KRRT7 in #4295
  • fix: isolate Table elements in pre-chunks by @claytonlin1110 in #4307
  • feat(chunking): repeat table headers on continuation chunks by @cragwolfe in #4298
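For readers unfamiliar with the lazyproperty replacement listed above, the standard-library decorator behaves as in this sketch. The Document class and its attributes are hypothetical and are not taken from the library.

```python
from functools import cached_property


class Document:
    """Hypothetical class used only to show the caching behaviour."""

    def __init__(self, text: str) -> None:
        self._text = text
        self.parse_calls = 0

    @cached_property
    def words(self) -> list:
        # Runs on first access only; the result is then stored in the
        # instance __dict__ and returned directly on later accesses.
        self.parse_calls += 1
        return self._text.split()
```

Accessing doc.words twice runs the method body once. The cache can be cleared with del doc.words if the value needs to be recomputed.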

Full Changelog: 0.22.6...0.22.10

0.22.6

26 Mar 21:20
b0e86a4

What's Changed

  • fix(deps): Update security updates [SECURITY] by @utic-renovate[bot] in #4303
  • fix: Self-contained script for version extraction in release CI by @vladimir-kivi-ds in #4304

Full Changelog: 0.22.4...0.22.6

0.22.4

26 Mar 19:32
78dfb30

What's Changed

New Contributors

Full Changelog: 0.21.5...0.22.4

0.21.5

24 Feb 15:28
5302352

What's Changed

New Contributors

Full Changelog: 0.21.2...0.21.5

0.21.2

23 Feb 01:16
4a77a8c

fix: self-install pinned spaCy model at runtime with SHA256 verificat…

0.21.1

22 Feb 20:48
47b8b5e

What's Changed

Full Changelog: 0.21.0...0.21.1

0.21.0

22 Feb 19:36
3db7b4f

Fixes

  • Replace NLTK with spaCy to remediate CVE-2025-14009: NLTK's downloader uses zipfile.extractall() without path validation, enabling RCE via malicious packages (CVSS 10.0, no patch available). spaCy models install as pip packages, eliminating the vulnerable downloader entirely.
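For context, path validation before extraction generally looks like the sketch below. It is a generic illustration, not code from NLTK, spaCy, or this library; find_unsafe_members, safe_extract, and build_archive are hypothetical helpers, and the check covers only path traversal, not other archive risks such as symlinks or decompression bombs.

```python
import io
import os
import tempfile
import zipfile
from typing import List


def find_unsafe_members(archive: zipfile.ZipFile, dest_dir: str) -> List[str]:
    """Return member names that would resolve outside dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    unsafe = []
    for name in archive.namelist():
        target = os.path.realpath(os.path.join(dest_root, name))
        try:
            inside = os.path.commonpath([dest_root, target]) == dest_root
        except ValueError:  # e.g. paths on different drives on Windows
            inside = False
        if not inside:
            unsafe.append(name)
    return unsafe


def safe_extract(archive: zipfile.ZipFile, dest_dir: str) -> None:
    """Refuse the whole archive if any member escapes dest_dir."""
    unsafe = find_unsafe_members(archive, dest_dir)
    if unsafe:
        raise ValueError(f"unsafe paths in archive: {unsafe!r}")
    archive.extractall(dest_dir)


def build_archive(names: List[str]) -> zipfile.ZipFile:
    """Build a small in-memory archive for demonstration purposes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in names:
            zf.writestr(name, "data")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)
```

An archive containing a member named "../evil.txt" is rejected before anything is written to disk, while an archive with only nested relative paths extracts normally.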