How to read html files #14939

SarahSohana27 · 2024-07-24T15:55:16Z

I have some documents in .html files. How do I load them into LlamaIndex? I tried UnstructuredReader, but it's not working.

@dosu · 2024-07-24T16:03:28Z

dosubot[bot]
Bot Jul 24, 2024

To properly read and load HTML files into LlamaIndex if UnstructuredReader is not working, you can use the UnstructuredFileLoader class provided in the llama_index.readers.file module. Here is an example of how you can use this class:

from llama_index.readers.file import UnstructuredFileLoader
from pathlib import Path

file_path = Path('./output1.html')

loader = UnstructuredFileLoader(
    file_path=file_path, continue_on_failure=False, headers={"User-Agent": "value"}
)
documents = loader.load_data()

# Process the documents as needed
for doc in documents:
    print(doc.text)

This code initializes the UnstructuredFileLoader with the path to your HTML file and loads the HTML content into Document objects, which you can then process as needed. Make sure to install the necessary package with:

pip install llama-index-readers-file

This approach uses the Unstructured.io service to partition HTML documents and extract their text content ^[1]^[2]^[3]^[4].

To continue talking to Dosu, mention @dosu.

6 replies

dosubot[bot] Bot Jul 24, 2024

To resolve the import error, you should use the UnstructuredReader class instead of UnstructuredFileLoader. Here is the correct way to load HTML files using UnstructuredReader:

Ensure you have the llama-index-readers-file package installed:

pip install llama-index-readers-file

Use the following code to load your HTML files:

from pathlib import Path
from llama_index.readers.file.unstructured import UnstructuredReader

loader = UnstructuredReader()
documents = loader.load_data(file=Path("./output1.html"))

# Process the documents as needed
for doc in documents:
    print(doc.text)

If you encounter a "failed to find libmagic" error, you can try installing python-magic-bin:

pip install python-magic-bin==0.4.14

On macOS, you may also need to install libmagic using Homebrew:

brew install libmagic

This should resolve the import error and allow you to load HTML files successfully ^[1]^[2].

SarahSohana27 Jul 24, 2024
Author

@dosu I'm getting this error while running it:
raise ImportError('failed to find libmagic. Check your installation')
ImportError: failed to find libmagic. Check your installation

dosubot[bot] Bot Jul 24, 2024

To resolve the ImportError: failed to find libmagic issue, you can try the following solutions:

Install the python-magic-bin package:
```
pip install python-magic-bin==0.4.14
```
If you are on macOS, you may also need to install libmagic using Homebrew:
```
brew install libmagic
```

After resolving the libmagic issue, you can use the UnstructuredReader class as shown in the example:

from pathlib import Path
from llama_index.readers.file.unstructured import UnstructuredReader

loader = UnstructuredReader()
documents = loader.load_data(file=Path("./output1.html"))

# Process the documents as needed
for doc in documents:
    print(doc.text)

This should help you load your HTML files successfully ^[1].

SarahSohana27 Jul 24, 2024
Author

Still having the same issue.

dosubot[bot] Bot Jul 24, 2024

To further troubleshoot the ImportError: failed to find libmagic issue, please ensure the following steps are completed:

Install python-magic-bin:
```
pip install python-magic-bin==0.4.14
```
On macOS, install libmagic using Homebrew:
```
brew install libmagic
```
Verify the installation:
Ensure that libmagic is correctly installed and accessible. You can check if libmagic is installed by running:
```
file --version
```
Set the MAGIC environment variable:
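If the above steps do not resolve the issue, you may need to set the MAGIC environment variable to point to the location of the magic.mgc file. The path below is only an example: the version directory depends on the libmagic release you have installed, and Homebrew on Apple Silicon installs under /opt/homebrew rather than /usr/local. For example: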
```
export MAGIC=/usr/local/Cellar/libmagic/5.39/share/misc/magic.mgc
```
Restart your Python environment:
After making these changes, restart your Python environment to ensure the changes take effect.

If the issue persists, please provide more details about your operating system and Python environment so that further assistance can be provided.
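
To gather those details in one go, a small diagnostic along these lines may help. It is only a sketch: `libmagic_report` is a hypothetical helper name, and it relies on the standard library's `ctypes.util.find_library` to ask the system loader where libmagic lives. The Python binding may also search other locations, so a `None` result is a hint rather than proof.

```python
import ctypes.util
import os
import platform
import sys


def libmagic_report() -> dict:
    """Collect details that help debug 'failed to find libmagic'."""
    magic_db = os.environ.get("MAGIC")
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        # Confirms which interpreter (and so which environment) is running.
        "executable": sys.executable,
        # Library name or path if the system loader can locate libmagic, else None.
        "libmagic": ctypes.util.find_library("magic"),
        # Value of the MAGIC environment variable, if set.
        "MAGIC": magic_db,
        # True only when MAGIC is set and points at an existing file.
        "MAGIC_exists": bool(magic_db) and os.path.exists(magic_db),
    }


if __name__ == "__main__":
    for key, value in libmagic_report().items():
        print(f"{key}: {value}")
```

Run it with the same interpreter that raises the ImportError and paste the output into your reply. If `executable` is not the environment you installed the packages into, that mismatch is worth fixing first.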

cameronlares · 2026-05-25T00:30:51Z

I hit a similar class of problem while trying to get HTML into an LLM/RAG flow.

If the goal is just “HTML file → clean text for indexing,” you may be able to avoid the Unstructured/libmagic path entirely by preprocessing the HTML first, then passing plain text into LlamaIndex.

That’s the pattern I ended up building around because raw HTML eats a ton of tokens 😅

I made a small Apify actor for the cleanup step:
https://apify.com/larelabs/refinery-html-to-llm-cleaner

It accepts raw HTML or URLs and returns clean text + word_count. Not a LlamaIndex integration, just a plug-in preprocessing step before ingestion.

Ghraven · 2026-05-27T09:27:39Z

A simple path is to avoid UnstructuredReader unless you specifically need its parsing stack.

For local .html files, SimpleDirectoryReader can load them as text, and you can restrict the directory to HTML files:

from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=[".html", ".htm"],
    recursive=True,
    encoding="utf-8",
)
documents = reader.load_data()

That will keep the HTML markup in the document text. If you want cleaner text before indexing, I would usually either preprocess with BeautifulSoup first, or use the file reader package's HTMLTagReader when you want to extract a specific tag/section from the HTML:

pip install llama-index-readers-file

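from pathlib import Path

from llama_index.readers.file import HTMLTagReader

# ignore_no_id=True would skip any tag without an id attribute, which includes most <body> tags
reader = HTMLTagReader(tag="body", ignore_no_id=False)
documents = reader.load_data(Path("./data/page.html"))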

So the rough choice is:

use SimpleDirectoryReader if treating the HTML file as text is enough
use HTMLTagReader or a BeautifulSoup cleanup step if the raw markup is hurting retrieval quality

Hope that helps narrow down the failure mode a bit.
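
For the cleanup step mentioned above, here is a minimal sketch of the idea that needs neither libmagic nor Unstructured. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup. The names `TextExtractor` and `html_to_text` are made up for illustration, and the whitespace handling is deliberately crude, so treat it as a starting point rather than a finished cleaner.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with spaces, then collapse runs of whitespace.
        # This can leave a stray space around inline tags.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<h1>Title</h1><script>var x = 1;</script><p>Hello &amp; welcome</p>"
    print(html_to_text(sample))
```

The resulting string can then be wrapped in `Document(text=...)` from `llama_index.core` and indexed like any other document. A word count, if you want one, is just `len(text.split())`.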
