dawsongzhao0523

Stars

Intelligent Documents

31 repositories

A curated list of resources for Document Understanding (DU) topic

1,366 159 Updated Jun 2, 2023

Convert PDF to markdown + JSON quickly with high accuracy

Python 21,221 1,299 Updated Feb 25, 2025

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

Python 5,749 286 Updated Feb 21, 2025

No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

Python 4,594 361 Updated Feb 25, 2025

Knowledge Agents and Management in the Cloud

Python 3,721 361 Updated Feb 24, 2025

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

Python 6,238 513 Updated Nov 3, 2024

There can be more than Notion and Miro. AFFiNE(pronounced [ə‘fain]) is a next-gen knowledge base that brings planning, sorting and creating all together. Privacy first, open-source, customizable an…

TypeScript 45,975 3,038 Updated Feb 25, 2025

Camelot: PDF Table Extraction for Humans

Python 3,674 359 Updated Jan 5, 2023

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

HTML 10,252 852 Updated Feb 20, 2025

Data processing with ML, LLM and Vision LLM

Python 4,369 428 Updated Feb 12, 2025

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON formats.

Python 26,469 2,032 Updated Feb 25, 2025

A realtime serving engine for Data-Intensive Generative AI Applications

Rust 967 125 Updated Feb 25, 2025

#1 Locally hosted web application that allows you to perform various operations on PDF files

Java 52,052 4,296 Updated Feb 25, 2025

🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration, designed to streamline every aspect of building versatile AI-first applications

Makefile 2,223 110 Updated Feb 24, 2025

A system for agentic LLM-powered data processing and ETL

Python 1,682 152 Updated Feb 23, 2025

Vision model based document ingestion

Rust 1,671 85 Updated Feb 21, 2025

Knowledge Table is an open-source package designed to simplify extracting and exploring structured data from unstructured documents.

Python 512 72 Updated Nov 19, 2024

OCR, layout analysis, reading order, table recognition in 90+ languages

Python 16,374 1,065 Updated Feb 24, 2025

Eclipse RDF4J: scalable RDF for Java

Java 375 166 Updated Feb 14, 2025

PDF scientific paper translation with preserved formats - AI-based full-text bilingual translation of PDF documents that fully preserves the layout; supports services such as Google/DeepL/Ollama/OpenAI and provides CLI/GUI/Docker/Zotero

Python 17,664 1,425 Updated Feb 25, 2025

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

Rust 965 37 Updated Dec 21, 2024

Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured …

Python 2,409 194 Updated Feb 4, 2025

Get your documents ready for gen AI

Python 22,098 1,258 Updated Feb 24, 2025

Open-source platform for extracting structured data from documents using AI.

JavaScript 1,253 41 Updated Feb 21, 2025

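Natural language processing experiments (sougou dataset): TF-IDF, text classification, clustering, word vectors, sentiment recognition, relation extraction, etc.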

Python 1,708 767 Updated Jul 18, 2022

pingcap/autoflow is a Graph RAG based and conversational knowledge base tool built with TiDB Serverless Vector Storage. Demo: https://tidb.ai

TypeScript 2,338 129 Updated Feb 24, 2025

This repository includes the official implementation of OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs.

Python 628 61 Updated Feb 7, 2025

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Python 2,114 126 Updated Dec 24, 2024

BISHENG is an open LLM devops platform for next generation Enterprise AI applications. Powerful and comprehensive features include: GenAI workflow, RAG, Agent, Unified model management, Evaluation,…

Python 7,710 1,299 Updated Feb 25, 2025

PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation

Python 857 74 Updated Feb 20, 2025