dawsongzhao0523

zhaodongsheng dawsongzhao0523

15 followers · 36 following

Stars

Intelligent Documents

31 repositories

tstanislawek / awesome-document-understanding

A curated list of resources for Document Understanding (DU) topic

1,366 159 Updated Jun 2, 2023

VikParuchuri / marker

Convert PDF to markdown + JSON quickly with high accuracy

Python 21,221 1,299 Updated Feb 25, 2025

QuivrHQ / MegaParse

File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

Python 5,749 286 Updated Feb 21, 2025

Zipstack / unstract

No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

Python 4,594 361 Updated Feb 25, 2025

run-llama / llama_cloud_services

Knowledge Agents and Management in the Cloud

Python 3,721 361 Updated Feb 24, 2025

adithya-s-k / omniparse

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

Python 6,238 513 Updated Nov 3, 2024

toeverything / AFFiNE

There can be more than Notion and Miro. AFFiNE(pronounced [ə‘fain]) is a next-gen knowledge base that brings planning, sorting and creating all together. Privacy first, open-source, customizable an…

TypeScript 45,975 3,038 Updated Feb 25, 2025

atlanhq / camelot

Camelot: PDF Table Extraction for Humans

Python 3,674 359 Updated Jan 5, 2023

Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

HTML 10,252 852 Updated Feb 20, 2025

katanaml / sparrow

Data processing with ML, LLM and Vision LLM

Python 4,369 428 Updated Feb 12, 2025

opendatalab / MinerU

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF into Markdown and JSON formats.

Python 26,469 2,032 Updated Feb 25, 2025

tensorlakeai / indexify

A realtime serving engine for Data-Intensive Generative AI Applications

Rust 967 125 Updated Feb 25, 2025

Stirling-Tools / Stirling-PDF

#1 Locally hosted web application that allows you to perform various operations on PDF files

Java 52,052 4,296 Updated Feb 25, 2025

instill-ai / instill-core

🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration, designed to streamline every aspect of building versatile AI-first applications

Makefile 2,223 110 Updated Feb 24, 2025

ucbepic / docetl

A system for agentic LLM-powered data processing and ETL

Python 1,682 152 Updated Feb 23, 2025

lumina-ai-inc / chunkr

Vision model based document ingestion

Rust 1,671 85 Updated Feb 21, 2025

whyhow-ai / knowledge-table

Knowledge Table is an open-source package designed to simplify extracting and exploring structured data from unstructured documents.

Python 512 72 Updated Nov 19, 2024

VikParuchuri / surya

OCR, layout analysis, reading order, table recognition in 90+ languages

Python 16,374 1,065 Updated Feb 24, 2025

eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java

Java 375 166 Updated Feb 14, 2025

Byaidu / PDFMathTranslate

PDF scientific paper translation with preserved formats - AI-based full-text bilingual translation of PDF documents with the layout fully preserved; supports services such as Google/DeepL/Ollama/OpenAI and provides CLI/GUI/Docker/Zotero

Python 17,664 1,425 Updated Feb 25, 2025

yobix-ai / extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

Rust 965 37 Updated Dec 21, 2024

CatchTheTornado / text-extract-api

Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured …

Python 2,409 194 Updated Feb 4, 2025

DS4SD / docling

Get your documents ready for gen AI

Python 22,098 1,258 Updated Feb 24, 2025

DocumindHQ / documind

Open-source platform for extracting structured data from documents using AI.

JavaScript 1,253 41 Updated Feb 21, 2025

Roshanson / TextInfoExp

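Natural language processing experiments (sougou dataset): TF-IDF, text classification, clustering, word vectors, sentiment recognition, relation extraction, and more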

Python 1,708 767 Updated Jul 18, 2022

pingcap / autoflow

pingcap/autoflow is a Graph RAG based and conversational knowledge base tool built with TiDB Serverless Vector Storage. Demo: https://tidb.ai

TypeScript 2,338 129 Updated Feb 24, 2025

AkariAsai / OpenScholar

This repository includes the official implementation of OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs.

Python 628 61 Updated Feb 7, 2025

X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Python 2,114 126 Updated Dec 24, 2024

dataelement / bisheng

BISHENG is an open LLM devops platform for next generation Enterprise AI applications. Powerful and comprehensive features include: GenAI workflow, RAG, Agent, Unified model management, Evaluation,…

Python 7,710 1,299 Updated Feb 25, 2025

microsoft / PIKE-RAG

PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation

Python 857 74 Updated Feb 20, 2025
