Commit 51d558a

Merge remote-tracking branch 'origin/master'
# Conflicts:
#	pyproject.toml
2 parents 3c9abae + 79b1e20 commit 51d558a

1 file changed, 11 additions and 0 deletions

CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -5,6 +5,17 @@
 * Andrea Sponziello
 ### **Copyright**: Tiledesk SRL
 
+---
+## [2026-04-02]
+### 0.10.0
+- Feature: Unified ingestion for all document types (PDF, DOCX, HTML with scraping) through the new endpoint `POST /api/ingestion`. Requests are routed automatically to the appropriate pipeline (OCR, hybrid, default) based on the provided options.
+- Feature: Automatic extraction of all HTML tables from web pages in every scraping path (Trafilatura, Playwright+BS4, Playwright+stealth, fallback selector). Each table becomes a dedicated document, converted to Markdown with its structure preserved and tagged with metadata (columns, index, type).
+- Feature: Extraction of embedded images from DOCX documents: the loader now finds both modern (`w:drawing`) and legacy (`w:pict`) images, stores them locally, and generates captions automatically with an LLM vision model. Captions are attached to the relevant text paragraphs and referenced in the `"ref_images"` metadata.
+- Improvement: Introduced the `CommonChunkMetadata` Pydantic schema as the single metadata contract between pipeline and vectorstore, with safe defaults and forward compatibility.
+- Feature: Optional generation of “situated context” via LLM for each ingested chunk, following Anthropic's Contextual Retrieval technique. The LLM provider and model can be configured directly in the API call.
+- Fix: Resolved a bug in the hybrid Pinecone upsert caused by the unwanted inclusion of `namespace: None` in the metadata, which resulted in a 400 error.
+- Fix: Improved chat history management for gpt-5.x and situated context for text content.
+- Docs: Updated the roadmap and the unified ingestion documentation, detailing implementation status, pipelines, technical details of image and table extraction, OCR engines, and notes on future developments.
 
 ---
 ## [2026-04-11]
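
The Pinecone fix listed under 0.10.0 can be illustrated with a minimal sketch. It assumes the 400 error came from a `None` value reaching the vector metadata, which Pinecone does not accept as a metadata value. The helper name `strip_none_metadata` and the sample keys are hypothetical and are not taken from this repository; the actual fix in the codebase may differ.

```python
from typing import Any, Dict


def strip_none_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `metadata` without the keys whose value is None.

    Falsy but valid values (0, "", False, empty lists) are kept on purpose:
    only an explicit None is dropped.
    """
    return {key: value for key, value in metadata.items() if value is not None}


# Hypothetical chunk metadata, for illustration only.
raw_metadata = {"source": "report.pdf", "namespace": None, "page": 0}
clean_metadata = strip_none_metadata(raw_metadata)
print(clean_metadata)
```

Filtering on `is not None` rather than on truthiness keeps legitimate values such as page `0` or an empty string, and returning a new dictionary leaves the caller's original metadata untouched.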
