Rich Soup

Inspired by BeautifulSoup. Instead of parsing static HTML and using tags, it fully renders the page and the entire DOM (including JS/CSS & slop) using Playwright. Then, it uses semantics; i.e: avg font size versus larger font sizes, lines, gaps, spacing, hierachy/reading order; etc, to reconstruct the page into a clean JSON/Markdown format. Currently, the options are either:

BeautifulSoup; static only, messy.
Playwright; lower level, manual.
Rich Soup builds on Playwright to give the DX of BeautifalSoup but can render properly like Playwright.

Primarily intended for document-like pages; i.e: Microsoft Learn, whitepapers (PDF-like), Wiki-like sites. Best part is it uses the layout, not tags, and it's not static! It can extract from garbled DOMs with hundreds of divs and hydration from React and Astro islands and Tailwind, etc etc, perfectly fine.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
src		src
tests		tests
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
pyproject.toml		pyproject.toml
pyrightconfig.json		pyrightconfig.json
pytest.ini		pytest.ini
ruff.toml		ruff.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Rich Soup

About

Uh oh!

Releases

Packages

Languages

intercepted16/rich-soup

Folders and files

Latest commit

History

Repository files navigation

Rich Soup

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages