How to read html files #14939
Replies: 3 comments 6 replies
-
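To read and load HTML files into LlamaIndex, you can use UnstructuredReader from llama_index.readers.file:

from pathlib import Path

from llama_index.readers.file import UnstructuredReader

file_path = Path("./output1.html")

# UnstructuredReader takes no file path at construction; pass it to load_data().
loader = UnstructuredReader()
documents = loader.load_data(file=file_path)

# Process the documents as needed
for doc in documents:
    print(doc.text)

This code initializes the reader and loads the HTML file into a list of documents. The reader comes from the llama-index-readers-file package:

pip install llama-index-readers-file

This approach uses the unstructured library for parsing, so that package has to be installed as well.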
-
I hit a similar class of problem while trying to get HTML into an LLM/RAG flow. If the goal is just “HTML file → clean text for indexing,” you may be able to avoid the Unstructured/libmagic path entirely by preprocessing the HTML first, then passing plain text into LlamaIndex. That’s the pattern I ended up building around, because raw HTML eats a ton of tokens 😅 I made a small Apify actor for the cleanup step. It accepts raw HTML or URLs and returns clean text plus a word_count. It is not a LlamaIndex integration, just a plug-in preprocessing step before ingestion.
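For anyone who wants to try the "HTML → clean text" step locally, here is a minimal sketch using only Python's standard library. It is not the Apify actor mentioned above, and the names html_to_text and _TextExtractor are made up for illustration. It assumes that dropping script/style contents and collapsing whitespace is enough cleanup for your pages.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse all whitespace runs.
        # Note: this can leave a space before punctuation that follows an inline tag.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the cleaned text and its word count for one HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = parser.text()
    return {"text": text, "word_count": len(text.split())}


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print(html_to_text(sample))
```

The resulting text can then be wrapped in a Document from llama_index.core (for example Document(text=result["text"])) before indexing, so the index never sees the raw markup.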
-
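A simple path is to avoid UnstructuredReader. For local .html files, SimpleDirectoryReader is enough:

from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=[".html", ".htm"],
    recursive=True,
    encoding="utf-8",
)
documents = reader.load_data()

That will keep the HTML markup in the document text. If you want cleaner text before indexing, I would usually either preprocess with BeautifulSoup first, or use the file reader package's HTMLTagReader:

pip install llama-index-readers-file

from pathlib import Path

from llama_index.readers.file import HTMLTagReader

# ignore_no_id=True would skip any tag without an id attribute, which is the usual case for <body>.
reader = HTMLTagReader(tag="body", ignore_no_id=False)
documents = reader.load_data(Path("./data/page.html"))

So the rough choice is: SimpleDirectoryReader if raw markup in the document text is acceptable, or BeautifulSoup preprocessing or HTMLTagReader if you want cleaner text.
Hope that helps narrow down the failure mode a bit.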
-
I have some documents in .html files. How do I load them in LlamaIndex? I tried UnstructuredReader, but it's not working.