gazpacho is a simple, fast, and modern web scraping library. The library is stable and installs with zero dependencies.
Install with pip at the command line:

    pip install -U gazpacho
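
To check the install and the zero-dependency claim on your own machine, one option is to read the package metadata with the standard library. This is a minimal sketch, not part of gazpacho: `declared_dependencies` is a hypothetical helper name, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, requires


def declared_dependencies(package):
    """Return the requirements a package declares, or None if it is not installed."""
    try:
        # requires() returns None when the package declares no requirements
        return requires(package) or []
    except PackageNotFoundError:
        return None


# Hypothetical check: None means gazpacho is not installed in this environment
print(declared_dependencies('gazpacho'))
```

Any entries that carry an `extra == ...` marker are optional extras rather than hard runtime dependencies.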
Give this a try:

    from gazpacho import get, Soup

    url = 'https://scrape.world/books'
    html = get(url)
    soup = Soup(html)
    # Match every div whose class contains 'book-'
    books = soup.find('div', {'class': 'book-'}, partial=True)

    def parse(book):
        name = book.find('h4').text
        # Skip the first character, then convert the token before the first space
        price = float(book.find('p').text[1:].split(' ')[0])
        return name, price

    [parse(book) for book in books]

Import gazpacho following the convention:

    from gazpacho import get, Soup

Use the get function to download raw HTML:

    url = 'https://scrape.world/soup'
    html = get(url)
    print(html[:50])
    # '<!DOCTYPE html>\n<html lang="en">\n <head>\n <met'

Adjust get requests with optional params and headers:

    get(
        url='https://httpbin.org/anything',
        params={'foo': 'bar', 'bar': 'baz'},
        headers={'User-Agent': 'gazpacho'}
    )

Use the Soup wrapper on raw HTML to enable parsing:

    soup = Soup(html)

Soup objects can alternatively be initialized with the .get classmethod:

    soup = Soup.get(url)

Use the .find method to target and extract HTML tags:

    h1 = soup.find('h1')
    print(h1)
    # <h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>

Use the attrs argument to isolate tags that contain specific HTML element attributes:

    soup.find('div', attrs={'class': 'section-'})

Element attributes are partially matched by default. Turn this off by setting partial to False:

    soup.find('div', {'class': 'soup'}, partial=False)

Override the mode argument (one of 'auto', 'first', or 'all') to guarantee the return behaviour:

    print(soup.find('span', mode='first'))
    # <span class="navbar-toggler-icon"></span>
    len(soup.find('span', mode='all'))
    # 8

Soup objects have html, tag, attrs, and text attributes:

    dir(h1)
    # ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']

Use them accordingly:

    print(h1.html)
    # <h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>
    print(h1.tag)
    # h1
    print(h1.attrs)
    # {'id': 'firstHeading', 'class': 'firstHeading', 'lang': 'en'}
    print(h1.text)
    # Soup

If you use gazpacho, consider adding the badge to your project README.md:

    [](https://github.com/maxhumber/gazpacho)

For feature requests or bug reports, please use GitHub Issues.
For PRs, please read the CONTRIBUTING.md document.
