v0.3.7 readme
bigbrother666sh committed Jan 18, 2025
1 parent 3e07d63 commit 7ac3b6f
Showing 8 changed files with 199 additions and 109 deletions.
22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,25 @@
# V0.3.7

- Added WeChat Official Account subscription messages as an information source via the wxbot solution (not very elegant, but the best approach currently available).

- Upgraded to support Crawl4ai 0.4.247.

- Added a pre-processing step and a completely redesigned recommended-link extraction strategy, which significantly improve extraction quality. Even small models such as 7b can now handle complex focus points reasonably well (for example, explanations that include time or metric constraints).

- Provided a custom extractor interface so that users can customize extraction to their actual needs.

- Bug fixes and other improvements (crawl4ai browser lifecycle management, asynchronous llm wrapper, etc.).

# V0.3.6
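- Switched to Crawl4ai as the underlying crawler framework. Crawl4ai and Crawlee actually differ little in fetching results, and both are built on Playwright, but Crawl4ai's html2markdown feature is very practical and helps LLM information extraction a great deal. Crawl4ai's architecture also fits my design thinking better;
- Built on Crawl4ai's html2markdown, added a deep scraper that further separates a page's standalone links from its main text, which makes the subsequent LLM extraction more precise. Because html2markdown and the deep scraper already clean the raw web page data well, they greatly reduce the interference and misleading input the LLM receives, which protects the quality of the final results and also cuts unnecessary token consumption;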
42 changes: 16 additions & 26 deletions README.md
@@ -2,60 +2,49 @@

**[English](README_EN.md) | [日本語](README_JP.md) | [한국어](README_KR.md)**

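🚀 **Chief Intelligence Officer** (Wiseflow) is an agile information mining tool that precisely extracts specific information from a variety of given sources by relying on the reasoning and analytical capabilities of large models, with no human involvement at any point in the process.
🚀 **AI Intelligence Officer** (Wiseflow) is an agile information mining tool that precisely extracts specific information from a variety of given sources by relying on the reasoning and analytical capabilities of large models, with no human involvement at any point in the process.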

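**What we lack is not information, but the ability to filter noise out of massive amounts of information so that the valuable information surfaces.**

🌱 See how AI Intelligence Officer helps you save time, filter out irrelevant information, and organize the key points you care about! 🌱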

https://github.com/user-attachments/assets/fc328977-2366-4271-9909-a89d9e34a07b

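## 🔥 Late, but Here: V0.3.6
## 🔥 V0.3.7 is Here

V0.3.6 is an improved version of V0.3.5 that addresses a great deal of community feedback. All users are advised to upgrade.
This upgrade brings a wxbot integration solution that makes it easy to add WeChat Official Accounts as information sources. For details, see [weixin_mp/README.md](./weixin_mp/README.md)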

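- Switched to Crawl4ai as the underlying crawler framework. Crawl4ai and Crawlee actually differ little in fetching results, and both are built on Playwright, but Crawl4ai's html2markdown feature is very practical and helps LLM information extraction a great deal. Crawl4ai's architecture also fits my design thinking better;
- Built on Crawl4ai's html2markdown, added a deep scraper that further separates a page's standalone links from its main text, which makes the subsequent LLM extraction more precise. Because html2markdown and the deep scraper already clean the raw web page data well, they greatly reduce the interference and misleading input the LLM receives, which protects the quality of the final results and also cuts unnecessary token consumption;
We also provide extractors designed specifically for WeChat Official Account articles, along with a custom extractor interface that lets users customize extraction to their actual needs.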

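*Distinguishing list pages from article pages is a headache for every crawler project, especially since modern web pages tend to add large amounts of recommended reading to the sidebar and bottom of article pages, leaving almost no difference between the two in text statistics.*
*I originally wanted to use a large vision model for layout analysis here, but in practice found that capturing undisturbed web page screenshots greatly increases program complexity and reduces processing efficiency...*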

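- Restructured the extraction strategy, LLM prompts, and so on;
This upgrade also further strengthens information extraction: it greatly improves the analysis of links within pages, and it enables models at the 7b and 14b scale to complete extraction based on complex focus points (for example, explanations that contain time or metric constraints) reasonably well.

In addition, this upgrade adapts to Crawl4ai 0.4.247 and includes many program improvements. For details, see [CHANGELOG.md](./CHANGELOG.md)

*About prompts: as I understand it, a good prompt is clear workflow guidance in which every step is explicit enough that it is hard to get wrong. But I am skeptical of the value of overly complex prompts, which is hard to evaluate. If you have a better approach, PRs are welcome.*
Thanks to the following community contributors for their PRs during this phase: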

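- Introduced a large vision model that automatically recognizes high-weight images (weights are currently evaluated by Crawl4ai) before extraction and adds the relevant information to the page text;
- Continued to reduce the dependencies in requirement.txt; json_repair is no longer needed (in practice, having the LLM generate JSON output still noticeably increases processing time and failure rate, so I now use a simpler approach and add post-processing of the results)
- Made a small adjustment to the structure of the pb info form, adding the web_title and reference fields.
- @ourines contributed the install_pocketbase.sh script (the Docker run option has been temporarily removed, as it did not seem very convenient for users...)
- @ibaoger contributed the pocketbase installation script for Windows
- @tusik contributed the asynchronous llm wrapper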

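**Upgrading to V0.3.6 still requires rebuilding the pocketbase database. Please delete the pb/pb_data folder and run the setup again**

**In V0.3.6, SECONDARY_MODEL in .env must be replaced with VL_MODEL. Please refer to the latest [env_sample](./env_sample)**
**V0.3.7 reintroduces SECONDARY_MODEL, mainly to reduce usage costs**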

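### V0.3.6 Test Report
### V0.3.7 Test Report

We ran side-by-side tests on four real-world tasks and six real web page samples in total, comparing the performance of the deepseekV2.5, Qwen2.5-32B-Instruct, Qwen2.5-14B-Instruct, and Qwen2.5-72B-Instruct models provided by siliconflow.
For the test results, please refer to the [report](./test/reports/wiseflow_report_v036_bigbrother666/README.md)
With the latest extraction strategy, we found that models at the 7b scale can also perform link analysis and extraction tasks well. For the test results, please refer to the [report](./test/reports/wiseflow_report_v037_bigbrother666/README.md)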

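We have also open-sourced the test scripts and welcome everyone to submit more test results. wiseflow is an open-source project, and we hope that, through everyone's contributions, we can build an "information crawling tool that anyone can use"!
For information summarization tasks, however, we still recommend models no smaller than the 32b scale. For specific recommendations, please refer to the latest [env_sample](./env_sample)

For details, please refer to [test/README.md](./test/README.md)
We continue to welcome more test results so that together we can explore the best ways to use wiseflow with various information sources.

At this stage, **submitting test results is equivalent to submitting project code**: you will likewise be accepted as a contributor and may even be invited to take part in commercial projects!
At this stage, **submitting test results is equivalent to submitting project code**: you will likewise be accepted as a contributor and may even be invited to take part in commercial projects! For details, please refer to [test/README.md](./test/README.md)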


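🌟**V0.3.x Plan**

- Attempt to support WeChat Official Account subscription without wxbot (V0.3.7);
- ~~Attempt to support WeChat Official Account subscription without wxbot (V0.3.7); [Done]~~
- Introduce support for RSS sources and search engines (V0.3.8);
- Attempt partial support for social platforms (V0.3.9).

Alongside these three versions, I will keep improving the deep scraper and the LLM extraction strategy. Continued feedback on application scenarios and on source URLs with unsatisfactory extraction results is welcome in [issue #136](https://github.com/TeamWiseFlow/wiseflow/issues/136).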


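## ✋ How Is wiseflow Different from Traditional Crawler Tools, AI Search, and Knowledge Base (RAG) Projects?

Since the release of V0.3.0 at the end of June 2024, wiseflow has received wide attention from the open-source community and has even attracted unsolicited coverage from quite a few independent media outlets. First of all, thank you!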
@@ -260,6 +249,7 @@ As a popular lightweight database, PocketBase currently has Go/Javascript/Python
## 🤝 This Project Is Based on the Following Excellent Open-Source Projects:

- crawl4ai(Open-source LLM Friendly Web Crawler & Scraper) https://github.com/unclecode/crawl4ai
- pocketbase (Open Source realtime backend in 1 file) https://github.com/pocketbase/pocketbase
- python-pocketbase (pocketBase client SDK for python) https://github.com/vaphes/pocketbase

The development of this project was inspired by [GNE](https://github.com/GeneralNewsExtractor/GeneralNewsExtractor), [AutoCrawler](https://github.com/kingname/AutoCrawler), and [SeeAct](https://github.com/OSU-NLP-Group/SeeAct).
46 changes: 19 additions & 27 deletions README_EN.md
@@ -2,58 +2,49 @@

**[简体中文](README.md) | [日本語](README_JP.md) | [한국어](README_KR.md)**

🚀 **Chief Intelligence Officer** (Wiseflow) is an agile information mining tool that can precisely extract specific information from various given sources by leveraging the thinking and analytical capabilities of large models, requiring no human intervention throughout the process.
🚀 **AI Intelligence Officer** (Wiseflow) is an agile information mining tool that can precisely extract specific information from various given sources by leveraging the thinking and analytical capabilities of large models, requiring no human intervention throughout the process.

**What we lack is not information, but the ability to filter out noise from massive information, thereby revealing valuable information.**

🌱 See how AI Intelligence Officer helps you save time, filter irrelevant information, and organize key points of interest! 🌱

https://github.com/user-attachments/assets/fc328977-2366-4271-9909-a89d9e34a07b

## 🔥 V0.3.6 is Here
## 🔥 V0.3.7 is Here

V0.3.6 is an enhanced version of V0.3.5, incorporating numerous improvements based on community feedback. We recommend that all users upgrade.
This upgrade brings a wxbot integration solution, making it convenient to add WeChat Official Accounts as information sources. For details, see [weixin_mp/README.md](./weixin_mp/README.md)

- Switched to Crawl4ai as the underlying web crawling framework. Although Crawl4ai and Crawlee both rely on Playwright with similar fetching results, Crawl4ai's html2markdown feature is quite practical for LLM information extraction. Additionally, Crawl4ai's architecture better aligns with my design philosophy.
- Built upon Crawl4ai's html2markdown, we added a deep scraper to further differentiate standalone links from the main content, facilitating more precise LLM extraction. The preprocessing done by html2markdown and deep scraper significantly cleans up raw web data, minimizing interference and misleading information for LLMs, ensuring higher quality outcomes while reducing unnecessary token consumption.
We also provide extractors designed specifically for WeChat Official Account articles, along with a custom extractor interface that lets users customize extraction to their actual needs.

*Distinguishing between list pages and article pages is a common challenge in web scraping projects, especially when modern webpages often include extensive recommended readings in sidebars and footers of articles, making it difficult to differentiate them through text statistics.*
*Initially, I considered using large visual models for layout analysis, but found that obtaining undistorted webpage screenshots greatly increases program complexity and reduces processing efficiency...*
This upgrade further strengthens information extraction capabilities, not only greatly optimizing the analysis of links within pages but also enabling models of 7b and 14b scale to better complete extractions based on complex focus points (such as those containing time and metric restrictions in explanations).

- Restructured extraction strategies and LLM prompts;
Additionally, this upgrade adapts to Crawl4ai version 0.4.247 and makes many program improvements. For details, see [CHANGELOG.md](./CHANGELOG.md)

*Regarding prompts, I believe that a good prompt serves as clear workflow guidance, with each step being explicit enough to minimize errors. However, I am skeptical about the value of overly complex prompts, which are hard to evaluate. If you have better solutions, feel free to submit a PR.*
Thanks to the following community contributors for their PRs during this phase:

- Introduced large visual models to automatically recognize high-weight images (currently evaluated by Crawl4ai) before extraction and append relevant information to the page text;
- Continued to reduce dependencies in requirement.txt; json_repair is no longer needed (in practice, having LLMs generate JSON format still noticeably increases processing time and failure rates, so I now adopt a simpler approach with additional post-processing of results)
- Made minor adjustments to the pb info form structure, adding web_title and reference fields.
- @ourines contributed the install_pocketbase.sh script (the Docker running solution has been temporarily removed as it wasn't very convenient for users...)
- @ibaoger contributed the install_pocketbase.ps1 script for windows users
- @tusik contributed the asynchronous llm wrapper
- @ourines contributed the install_pocketbase.sh script (docker running solution has been temporarily removed as it wasn't very convenient for users...)
- @ibaoger contributed the pocketbase installation script for Windows
- @tusik contributed the asynchronous llm wrapper

**Upgrading to V0.3.6 requires restructuring the PocketBase database. Please delete the pb/pb_data folder and re-run the setup**

**In V0.3.6, replace SECONDARY_MODEL with VL_MODEL in the .env file. Refer to the latest [env_sample](./env_sample)**

### V0.3.6 Test Report
**V0.3.7 reintroduces SECONDARY_MODEL, mainly to reduce usage costs**

### V0.3.7 Test Report

We conducted horizontal tests across four real-world tasks and six real web samples using deepseekV2.5, Qwen2.5-32B-Instruct, Qwen2.5-14B-Instruct, and Qwen2.5-72B-Instruct models provided by siliconflow. For detailed test results, please refer to [report](./test/reports/wiseflow_report_v036_bigbrother666/README.md).
Under the latest extraction strategy, we found that models of 7b scale can also perform link analysis and extraction tasks well. For test results, please refer to [report](./test/reports/wiseflow_report_v037_bigbrother666/README.md)

We have also open-sourced our testing scripts. We welcome everyone to submit more test results. Wiseflow is an open-source project aiming to create an "information retrieval tool accessible to everyone"!
However, for information summarization tasks, we still recommend using models no smaller than 32b scale. For specific recommendations, please refer to the latest [env_sample](./env_sample)

Refer to [test/README.md](./test/README.md)
We continue to welcome more test results to jointly explore the best usage solutions for wiseflow under various information sources.

At this stage, **submitting test results is equivalent to contributing code**, and contributors may even be invited to participate in commercial projects!
At this stage, **submitting test results is equivalent to submitting project code**: you will likewise be accepted as a contributor and may even be invited to participate in commercial projects! For details, please refer to [test/README.md](./test/README.md)


🌟**V0.3.x Roadmap**

- Attempt to support WeChat Official Account subscription without wxbot (V0.3.7);
- ~~Attempt to support WeChat Official Account subscription without wxbot (V0.3.7);~~
- Introduce support for RSS feeds and search engines (V0.3.8);
- Attempt partial support for social platforms (V0.3.9).

Throughout these versions, I will continuously improve the deep scraper and LLM extraction strategies. We welcome continuous feedback on application scenarios and sources where extraction performance is unsatisfactory. Please provide feedback in [issue #136](https://github.com/TeamWiseFlow/wiseflow/issues/136).


## ✋ How is wiseflow Different from Traditional Crawler Tools, AI Search, and Knowledge Base (RAG) Projects?

@@ -260,6 +251,7 @@ If you have any questions or suggestions, please feel free to leave a message vi
## 🤝 This Project is Based on the Following Excellent Open-Source Projects:

- crawl4ai(Open-source LLM Friendly Web Crawler & Scraper) https://github.com/unclecode/crawl4ai
- pocketbase (Open Source realtime backend in 1 file) https://github.com/pocketbase/pocketbase
- python-pocketbase (pocketBase client SDK for python) https://github.com/vaphes/pocketbase

Also inspired by [GNE](https://github.com/GeneralNewsExtractor/GeneralNewsExtractor), [AutoCrawler](https://github.com/kingname/AutoCrawler), and [SeeAct](https://github.com/OSU-NLP-Group/SeeAct).