Open
Description
crawl4ai version
0.6.2
Expected Behavior
When i try to run code for scraping local HTML files the expected outcome is to get the markdown text from the given HTML input.
Current Behavior
Currently, when I pass a local HTML file path to the crawler.arun(url=fileurl, config=crawler_config) it gives me the following error -- Error: cannot access local variable 'captured_console' where it is not associated with a value
Is this reproducible?
Yes
Inputs Causing the Bug
-URL(s): Local HTML File Url (with absolute path)
-In the crawler_config pass capture_console_messages = False (it is False by default)
Steps to Reproduce
1. Get a local URL file path.
2. Pass the url file path to crawler.arun() function
3. call the main function.
Code snippets
async def crawl_local_files(files: List[str], max_concurrent: int = 5):
browser_config = BrowserConfig(
headless=True,
verbose=False,
extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, capture_console_messages = False)
crawler = AsyncWebCrawler(config=browser_config)
await crawler.start()
try:
semaphore = asyncio.Semaphore(max_concurrent)
async def process_local_file(fileurl: str):
async with semaphore:
result = await crawler.arun(
url=fileurl,
config=crawler_config
)
if result.success:
# print(result.markdown)
await process_and_store_document(fileurl, result.markdown)
else:
print(f"Failed : {fileurl} - Error: {result.error_message}")
await asyncio.gather(*[process_local_file(fileurl) for fileurl in files])
finally:
await crawler.close()
def get_file_paths() -> List[str]:
"""
Get a list of file paths from the local directory.
"""
print("Getting file paths from the local directory...")
file_paths = []
for file in os.listdir("./assets/"):
print(file)
if file.endswith(".html"):
file_path = os.path.join("./assets/", file)
abs_path = os.path.abspath(file_path)
fileurl = f"file://{abs_path}"
file_paths.append(fileurl)
return file_paths
async def main():
print("Starting the main function...")
files = get_file_paths()
await crawl_local_files(files)
if __name__ == "__main__":
asyncio.run(main())
OS
macOS Sonoma 14.7.4
Python version
3.11.7
Browser
No response
Browser version
No response