[Ideas] AI-Powered Parser Function for Unstructured Data #1442

xtangcode · 2025-11-19T10:08:41Z

xtangcode
Nov 19, 2025
Collaborator

Description

We propose adding an AI-driven parser function to Apache Cloudberry that converts unstructured data from diverse formats into structured JSON and enables seamless storage in tables. This function will serve as a critical bridge between unstructured data sources and structured systems, with core capabilities such as:

Broad Format Support: Natively handle common unstructured formats, including:
- PDFs (both text-based and scanned, with OCR for image-heavy files).
- Microsoft Word documents (.docx).
- Images/figures (e.g., PNG, JPG, SVG) containing text, charts, or diagrams.
- Plain text files and Markdown.

LLM Agnosticism: Allow users to integrate their preferred large language models (LLMs) or vision-language models (VLMs), whether open-source (e.g., Llama 3, Mistral, Vicuna) or proprietary (e.g., GPT-4, Claude, PaLM). This flexibility ensures compatibility with existing user workflows, privacy requirements, and cost constraints.
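
One way to picture the LLM-agnostic design is a small registry of interchangeable model backends. The sketch below is only an illustration under assumptions: `register_backend`, `ai_parse`, and `fake_local_model` are hypothetical names, not part of Cloudberry or of the prototype, and the stub stands in for a real model call so the example runs offline.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a user-chosen backend name to a callable that
# takes a prompt string and returns the model's raw text response.
ModelFn = Callable[[str], str]
_BACKENDS: Dict[str, ModelFn] = {}


def register_backend(name: str, fn: ModelFn) -> None:
    """Register a model backend under a user-chosen name."""
    _BACKENDS[name] = fn


def ai_parse(document_text: str, backend: str) -> str:
    """Send the document text to the selected backend and return its raw output."""
    if backend not in _BACKENDS:
        raise KeyError(f"unknown backend: {backend!r}")
    prompt = "Extract the key fields from this document as JSON:\n" + document_text
    return _BACKENDS[backend](prompt)


# Stand-in for a real model call (an on-premises open-source model, a hosted
# API, ...). It ignores the prompt and returns a fixed JSON string.
def fake_local_model(prompt: str) -> str:
    return '{"invoice_no": "A-1", "amount": "42.00"}'


register_backend("local-stub", fake_local_model)
print(ai_parse("Invoice A-1, total 42.00", backend="local-stub"))
```

Swapping models then means registering a different callable under another name; the calling code stays the same.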

The parser will extract context-aware structured data (e.g., key-value pairs, tables, entities, relationships) from unstructured sources, normalize it into JSON (with auto-generated schemas or user-defined schema overrides), and persist the output in Cloudberry tables for querying, analytics, or integration with downstream agents, tools, or apps.
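
The normalize-then-persist step could look roughly like the sketch below. It is a minimal illustration, not the proposed implementation: `normalize` and the `parsed_docs` table are hypothetical, the "schema" is just a mapping from field name to Python type, and an in-memory SQLite database stands in for a Cloudberry table (where a JSON or JSONB column would be the likely target).

```python
import json
import sqlite3
from typing import Dict, Optional


def normalize(raw_output: str, schema: Optional[Dict[str, type]] = None) -> str:
    """Parse model output and return canonical JSON text.

    With no schema, the parsed object is kept as-is. With a user-defined
    schema, only the listed fields are kept and each value is coerced to
    the requested type; missing fields become null.
    """
    data = json.loads(raw_output)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object from the model")
    if schema is not None:
        data = {
            key: (None if data.get(key) is None else typ(data[key]))
            for key, typ in schema.items()
        }
    return json.dumps(data, sort_keys=True)


# Raw model output (hard-coded here) with an extra field and a string amount.
raw = '{"amount": "42.50", "invoice_no": "A-1", "note": "n/a"}'
doc = normalize(raw, schema={"invoice_no": str, "amount": float})

# SQLite as a stand-in for a Cloudberry table holding the parsed JSON.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parsed_docs (source TEXT, doc TEXT)")
conn.execute("INSERT INTO parsed_docs VALUES (?, ?)", ("invoice.pdf", doc))
row = conn.execute(
    "SELECT doc FROM parsed_docs WHERE source = ?", ("invoice.pdf",)
).fetchone()
print(row[0])  # {"amount": 42.5, "invoice_no": "A-1"}
```

Coercing at normalization time keeps the stored JSON consistent across documents even when different models return the same field in different shapes.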

Use case/motivation

Unstructured data (documents, images, etc.) is a cornerstone of modern data ecosystems, but its lack of structure blocks seamless integration with AI agents, analytics pipelines, and downstream applications. This feature addresses that gap. Key use cases:

Enterprise Data Pipelines: Teams can parse invoices (PDF), contracts (Word), or HR documents to extract critical fields (dates, amounts, employee IDs) into JSON, storing them in Cloudberry tables for ERP integration or automated reporting. By supporting user-chosen LLMs, organizations with strict privacy policies (e.g., healthcare, finance) can use on-premises open-source models instead of proprietary cloud LLMs.
Research & Academic Workflows: Scientists can convert scanned lab reports (PDF) or experimental figures (images) into structured data (e.g., results, methodologies, chart values) to feed into AI agents for literature reviews or cross-study analysis. Flexibility in LLMs lets researchers use specialized models trained on scientific text (e.g., BioLlama) for higher accuracy.
Open-Source Ecosystem Integration: Apache Cloudberry’s community users (e.g., startups, nonprofits) often rely on cost-effective or custom-trained LLMs. This parser’s LLM-agnostic design lets them leverage their existing model investments (e.g., a fine-tuned Mistral model) to process unstructured data without vendor lock-in.
AI Agent Enablement: By structuring unstructured data into JSON/tables, the parser empowers Cloudberry-integrated AI agents to access and act on diverse data (e.g., technical manuals, customer feedback) without manual preprocessing, unlocking automation for support, content moderation, and more.

Industry tools like Databricks’ ai_parse_document and Snowflake Cortex’s parse_document validate this need, but Apache Cloudberry’s open-source nature and focus on user choice make LLM agnosticism a differentiator—ensuring the feature serves a broader, more diverse user base.

Related issues

No response

Are you willing to submit a PR?

Yes, I am willing to submit a PR!

xtangcode · 2025-11-19T10:18:19Z

xtangcode
Nov 19, 2025
Collaborator Author

To illustrate how this AI parser feature would work in practice, we’ve developed a simple Python user-defined function (UDF) prototype. It processes three sample PDF files and shows how the feature would convert unstructured data into JSON stored in tables, ready for integration with AI agents, analytics tools, and other applications.

ai_parser.mp4

0 replies

yjhjstz · 2025-11-19T23:34:27Z

yjhjstz
Nov 19, 2025
Collaborator

That's a good extension of Directory Table.

1 reply

xtangcode Nov 20, 2025
Collaborator Author

Totally agree
