ragflow on htmls #8082

xyzhu68 · 2025-06-05T07:50:55Z

xyzhu68
Jun 5, 2025

I use CAD software that ships with a large number of help files in HTML format. The HTML files are structured: subdirectories group them by CAD topic.
Is it possible to use all of these HTML files (possibly several thousand) in a knowledge base? Can RAGFlow parse HTML with JS correctly?
Or do you have any suggestions for handling HTML files for RAG?
Thank you!

xyzhu68 · 2025-06-05T07:52:40Z

dosubot[bot]
bot Jun 5, 2025

RAGFlow can handle large numbers of HTML files in a knowledge base, but it only parses the static HTML content. The built-in HTML parser extracts the title and main text using the readability library and does not execute or process JavaScript, so any content generated or modified by JavaScript in your help files will not be included in the knowledge base (see source).
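To see what "static only" means for a given help file, here is a minimal sketch using Python's standard `html.parser`. It is not RAGFlow's parser (which, as noted above, relies on the readability library); it only illustrates that text written into the page by a script never appears in a static pass. The class name, helper name, and sample page are made up for illustration.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collect the <title> and visible text, skipping <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._skip_depth = 0    # > 0 while inside <script> or <style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return              # script/style source is never visible text
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        else:
            self.chunks.append(text)


def extract_static(html):
    """Return (title, text) as a static, non-executing parser would see them."""
    parser = StaticTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)


# Hypothetical help page: the steps are filled in by JavaScript at load time.
SAMPLE = """<html><head><title>Extrude</title></head><body>
<p>Static intro.</p>
<div id="steps"></div>
<script>document.getElementById("steps").textContent = "Step 1: pick a face";</script>
</body></html>"""

title, text = extract_static(SAMPLE)
print(title)
print(text)
```

Running a check like this over a sample of your files is a quick way to estimate how many pages lose content without JavaScript.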

If your help files rely on JavaScript to render important content, you’ll need to preprocess them with an external tool (like a headless browser or site scraper that can render JS and output static HTML) before importing them into RAGFlow.
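One possible way to do that preprocessing is headless Chrome/Chromium, whose `--dump-dom` flag prints the page's DOM after scripts have run. The sketch below only wraps that command from Python. The executable name `chromium` is an assumption (on your machine it may be `google-chrome`, `chrome`, or a full path), and the function names are hypothetical. Pages that load content asynchronously after the initial load may need extra waiting that this sketch does not handle.

```python
import subprocess
from pathlib import Path


def dump_dom_command(html_file, browser="chromium"):
    """Build the command line that prints the rendered DOM of a local file.

    `browser` is an assumed executable name; adjust it for your system.
    """
    url = Path(html_file).resolve().as_uri()   # file:///abs/path/to/page.html
    return [browser, "--headless", "--disable-gpu", "--dump-dom", url]


def render_to_static(html_file, out_file, browser="chromium", timeout=60):
    """Run the browser and save the rendered DOM as a static HTML file."""
    result = subprocess.run(
        dump_dom_command(html_file, browser),
        capture_output=True,
        text=True,
        encoding="utf-8",
        timeout=timeout,
        check=True,            # raise if the browser exits with an error
    )
    Path(out_file).write_text(result.stdout, encoding="utf-8")
```

For example, `render_to_static("help/modeling/extrude.html", "static/extrude.html")` would write a flattened copy that can then be uploaded like any other static HTML file (the paths here are placeholders).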

For structured directories, you can batch upload the HTML files. Just make sure they’re parsed after upload, since parsing is required before the files become searchable in RAGFlow (see source). Also, RAGFlow currently requires you to specify the file parsing method when creating the knowledge base, which can be limiting if you have mixed file types, but there are suggestions to make this more flexible in the future (see source).

If you need to include dynamic content, preprocess your HTML files to flatten them into static HTML before adding them to RAGFlow. Otherwise, for static help files, you can use them directly.

To reply, just mention @dosu.


4 replies

xyzhu68 Jun 6, 2025
Author

Hi, I uploaded over 7700 HTML files. But after hours of uploading, when I clicked OK, I got the error "RequestEntityTooLarge '413'". What is the limit on the number of files? Can I increase it? RAGFlow is running on localhost.

dosubot[bot] bot Jun 6, 2025

The "RequestEntityTooLarge 413" error happens because RAGFlow enforces limits on both the total upload size and the number of files per batch. For local deployments, the default batch upload limit is 32 files at a time, with a total size limit of 1GB per upload. There’s no cap on the total number of files in your account, but you’ll need to split your 7700 HTML files into smaller batches to upload them successfully.
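Splitting the files into batches can be scripted. This is a minimal sketch, assuming the 32-file figure mentioned above; `find_html` and `iter_batches` are hypothetical helper names, and the sketch does not check the total size of each batch.

```python
from pathlib import Path


def find_html(root):
    """Return every .html file under `root`, including subdirectories, sorted."""
    return sorted(p for p in Path(root).rglob("*.html") if p.is_file())


def iter_batches(items, size=32):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in for 7700 real paths, just to show the batch arithmetic.
batches = list(iter_batches(list(range(7700)), size=32))
print(len(batches), len(batches[-1]))
```

With 7700 files and a batch size of 32 this gives 241 batches, the last one holding 20 files.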

If you want to increase the file size limit, edit docker/.env to set MAX_CONTENT_LENGTH (e.g., 1073741824 for 1GB) and update client_max_body_size in nginx/nginx.conf to match. The 32-file batch limit is not recommended to change manually, but if you use the HTTP API or Python SDK for uploads, this batch limit is removed and you can automate large-scale uploads more efficiently. More details are in the FAQ and docker/.env files.


xyzhu68 Jun 6, 2025
Author

If I use the HTTP API to upload files, can I use a wildcard character like "*" in the request (e.g. --form 'file=@./*.html')? How can I preserve the folder structure of the HTML files?

dosubot[bot] bot Jun 6, 2025

You can't use wildcard characters like * directly in the HTTP API request (e.g., --form 'file=@./*.html'), since the API expects each file to be specified individually in the multipart form data—wildcards are expanded by your shell, not by the API itself. To upload many files, you'll need to enumerate them explicitly (for example, with a shell loop or a tool that supports multiple --form entries).
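As a rough sketch of that enumeration, the Python below expands the wildcard itself and builds one `--form 'file=@...'` entry per file. The endpoint URL is passed in by the caller and the `Authorization: Bearer` header is an assumption, so check both against the HTTP API reference for your RAGFlow version. The function names are hypothetical. File names containing commas or semicolons may need extra quoting for curl's `--form` syntax, which this sketch does not handle.

```python
from pathlib import Path


def html_files(folder):
    """Expand the equivalent of ./*.html in Python instead of the shell."""
    return sorted(str(p) for p in Path(folder).glob("*.html"))


def curl_upload_command(url, api_key, files):
    """Build a curl command with one --form entry per file.

    `url` and the Bearer header are assumptions, not confirmed API details.
    """
    cmd = [
        "curl", "--request", "POST",
        "--url", url,
        "--header", f"Authorization: Bearer {api_key}",
    ]
    for path in files:
        cmd += ["--form", f"file=@{path}"]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, one call per batch of files.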

RAGFlow's HTTP API also doesn't preserve the original folder structure of your files during upload. All files are treated as flat uploads, and their directory paths are not stored or reconstructed in the knowledge base. If you need to keep track of the folder structure, you could encode the relative path into the filename before upload (like subdir1_subdir2_file.html), or upload a zip file and extract it server-side (though native zip extraction isn't supported out of the box).
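The filename-encoding idea could look like the sketch below. The underscore separator follows the `subdir1_subdir2_file.html` example above. Because directory names may themselves contain underscores, two different paths can map to the same flat name, so the sketch reports such collisions instead of silently overwriting. `flat_name` and `plan_flat_names` are hypothetical helper names.

```python
from pathlib import Path


def flat_name(root, path, sep="_"):
    """Encode the path relative to `root` into a single file name."""
    rel = Path(path).relative_to(root)
    return sep.join(rel.parts)


def plan_flat_names(root, paths, sep="_"):
    """Map each flat name to its source path; fail if two sources collide."""
    plan = {}
    for path in paths:
        name = flat_name(root, path, sep)
        if name in plan:
            raise ValueError(f"{path} and {plan[name]} both map to {name}")
        plan[name] = path
    return plan


print(flat_name("help", "help/subdir1/subdir2/file.html"))
```

Copying each file to its flat name in a staging folder (for example with `shutil.copyfile`) before upload keeps the originals untouched, and the mapping can be saved alongside if you need to reconstruct the paths later.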

In short: enumerate files for upload, and if folder structure is important, consider encoding it in filenames or handling it in your own workflow.


ZhenhangTung · 2025-06-13T02:50:58Z

ZhenhangTung
Jun 13, 2025
Collaborator

"Can ragflow parse html with js correctly?"
Do you mean rendering HTML with CSS & JS?

1 reply

xyzhu68 Jun 16, 2025
Author

Yes, in some HTML files, some of the content is built by JS, so it may be difficult for RAGFlow to extract text from it. I don't need the CSS.

Replies: 2 comments 5 replies
