[Ideas] AI-Powered Parser Function for Unstructured Data #1442
xtangcode
asked this question in
Ideas / Feature Requests
To illustrate how this AI parser feature would work in practice, we’ve developed a simple Python user-defined function (UDF) prototype. It processes three sample PDF files and shows how the feature would convert unstructured data into JSON stored in tables, ready for use by AI agents, analytics tools, and other applications.
Attachment: ai_parser.mp4
That's a good extension of Directory Table.
Description
We propose adding an AI-driven parser function to Apache Cloudberry that converts unstructured data from diverse formats into structured JSON and enables seamless storage in tables. This function will serve as a critical bridge between unstructured data sources and structured systems, with core capabilities such as:
Broad Format Support: Natively handle common unstructured formats, including:
LLM Agnosticism: Allow users to integrate their preferred large language models (LLMs) or vision-language models (VLMs), whether open-source (e.g., Llama 3, Mistral, Vicuna) or proprietary (e.g., GPT-4, Claude, PaLM). This flexibility ensures compatibility with existing user workflows, privacy requirements, and cost constraints.
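One way to picture the LLM agnosticism described above is a small registry that dispatches to whichever model backend the user selects. This is only a sketch: `register_backend`, `ai_parse`, and `echo_backend` are hypothetical names invented for illustration, not part of Apache Cloudberry or of the prototype, and the stub backend stands in for a real LLM/VLM call so the example runs offline.

```python
import json
from typing import Callable, Dict

# Hypothetical backend signature: raw document text in, JSON string out.
Backend = Callable[[str], str]

_BACKENDS: Dict[str, Backend] = {}


def register_backend(name: str, backend: Backend) -> None:
    """Register a model backend under a user-chosen name (hypothetical API)."""
    _BACKENDS[name] = backend


def ai_parse(document_text: str, model: str) -> str:
    """Dispatch parsing to the backend the user selected."""
    if model not in _BACKENDS:
        raise ValueError(f"unknown model backend: {model!r}")
    return _BACKENDS[model](document_text)


def echo_backend(document_text: str) -> str:
    """Stub standing in for a real model call; reports only the text length."""
    return json.dumps({"length": len(document_text)})


register_backend("echo", echo_backend)
result = ai_parse("abc", model="echo")
```

A real backend would wrap an open-source or proprietary model behind the same signature, so switching models would only change the `model` argument.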
The parser will extract context-aware structured data (e.g., key-value pairs, tables, entities, relationships) from unstructured sources, normalize it into JSON (with auto-generated schemas or user-defined schema overrides), and persist the output in Cloudberry tables for querying, analytics, or integration with downstream agents, tools or apps.
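The normalize-and-persist step described above might look roughly like the following. This is a minimal sketch under stated assumptions: `normalize`, `persist`, and the `parsed_docs` table are hypothetical names, the "auto-generated schema" is reduced to keeping every extracted key, and an in-memory SQLite database stands in for a Cloudberry table purely so the example is self-contained.

```python
import json
import sqlite3
from typing import Any, Dict, Optional


def normalize(extracted: Dict[str, Any],
              schema: Optional[Dict[str, type]] = None) -> str:
    """Return a JSON string, applying a user-defined schema override if given."""
    if schema is None:
        # No override: keep every extracted key as-is.
        record = dict(extracted)
    else:
        # Override: keep only declared fields, coerce each to its declared
        # type, and emit null for fields the extraction did not produce.
        record = {
            key: (cast(extracted[key]) if extracted.get(key) is not None else None)
            for key, cast in schema.items()
        }
    return json.dumps(record, sort_keys=True)


def persist(conn: sqlite3.Connection, source: str, payload: str) -> None:
    """Store one parsed document as a (source, JSON text) row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parsed_docs (source TEXT, content TEXT)"
    )
    conn.execute("INSERT INTO parsed_docs VALUES (?, ?)", (source, payload))
    conn.commit()


# SQLite is only a stand-in for a Cloudberry table in this sketch.
conn = sqlite3.connect(":memory:")
extracted = {"invoice_no": "A-17", "total": "42.50", "notes": "paid"}
payload = normalize(extracted, schema={"invoice_no": str, "total": float})
persist(conn, "invoice.pdf", payload)
rows = conn.execute("SELECT source, content FROM parsed_docs").fetchall()
```

Here the schema override drops the undeclared `notes` field and coerces `total` from a string to a number before the row is stored.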
Use case/motivation
Unstructured data (documents, images, etc.) is a cornerstone of modern data ecosystems, but its lack of structure blocks seamless integration with AI agents, analytics pipelines, and downstream applications. This feature addresses this gap, with key use cases:
Industry tools like Databricks’ ai_parse_document and Snowflake Cortex’s parse_document validate this need, but Apache Cloudberry’s open-source nature and focus on user choice make LLM agnosticism a differentiator—ensuring the feature serves a broader, more diverse user base.
Related issues
No response
Are you willing to submit a PR?