katacek/python-events-scraper-demo


Readme overview

  1. Find upcoming Python events all around the world!
  2. Creating Actor
  3. Publishing Actor to Store
  4. Monetizing Actor
  5. Creating Actor using CLI (command line interface)
  6. Creating Actor using GitHub repository
  7. What next (useful links)

Prerequisites

  • Account on Apify Console: for creating the Actor through the web interface
  • Node.js version 18 or higher with NPM installed: for using Apify CLI
  • Billing details and payment method set up: for Actor monetization

Find upcoming Python events all around the world!

We will try to find upcoming Python events all around the world, and the best website to find those is Python's official website.

Visit Python's official website events section: https://www.python.org/events/


As you can see there are a lot of upcoming events there. We will try to scrape all the upcoming events with their dates and locations and make an Actor out of it, and, in the end, publish it to Apify Store so that anybody from the community can use it.

Creating Actor

  1. Visit the page to be scraped and inspect it using the browser developer tools (aka DevTools)
  • page: https://www.python.org/events/
  • devTools: press F12 or Right-click a page and select Inspect
  • in the Elements tab, look for the selector for the content we want to scrape
    • (In Firefox it's called the Inspector). You can use this tab to inspect the page's HTML on the left hand side, and its CSS on the right. The items in the HTML view are called elements.
    • All elements are wrapped in HTML tags, such as `<p>` for a paragraph, `<a>` for a link, …
    • using the selector tool, find the selector: `.list-recent-events.menu li` for our case
  • you can test the selector in the devtools directly: just put `document.querySelector('.list-recent-events.menu li');` into the Console tab and see the result (it prints the first match)
  • if you use `document.querySelectorAll()`, it returns all matching elements
  • to filter only the upcoming events, use `document.querySelectorAll('.list-recent-events.menu li:not(.most-recent-events)');`
    • good selectors are: simple, human-readable, unique, and semantically connected to the data
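If you prefer to check selectors outside the browser, the same queries can be run locally with BeautifulSoup, whose `.select()` accepts CSS selectors, including `:not()`. This is a sketch against a hand-written HTML sample that merely mimics the events list, not the real page markup:

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the events list structure (an assumption,
# not the real markup from python.org).
html = """
<ul class="list-recent-events menu">
  <li>
    <h3 class="event-title"><a href="/events/python-events/1/">PyCon Demo</a></h3>
    <time datetime="2025-03-01">01 March 2025</time>
    <span class="event-location">Windhoek, Namibia</span>
  </li>
  <li class="most-recent-events">
    <h3 class="event-title"><a href="/events/python-events/2/">Past Event</a></h3>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Equivalent of document.querySelectorAll('.list-recent-events.menu li')
all_items = soup.select(".list-recent-events.menu li")

# Equivalent of the :not(.most-recent-events) filter from the Console
upcoming = soup.select(".list-recent-events.menu li:not(.most-recent-events)")

print(len(all_items), len(upcoming))  # 2 1
```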
  2. Create Actor from Apify templates
  • Visit https://console.apify.com/actors/development/my-actors and click Develop new on the top right corner
  • Under Python section, select Start with Python template
  • Check the basic structure, information about the template, … and click Use this template
    • there are also links to various resources / tutorial videos
  • name the actor 😁
  3. Source code adjustments
  • in input_schema.json, update the prefill and add a default value for the start URL:
```json
{
    "title": "Scrape data from a web page",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "URL of the page",
            "type": "string",
            "description": "The URL of website you want to get the data from.",
            "editor": "textfield",
            "prefill": "https://www.python.org/events/",
            "default": "https://www.python.org/events/"
        }
    },
    "required": ["url"]
}
```
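To sanity-check the edited schema, you can parse it with Python's `json` module and assert that the fields you just changed are present. The schema is inlined here to keep the sketch self-contained; in a real project you would read the Actor's input_schema.json from disk instead:

```python
import json

# Inlined copy of the schema above (normally you would read input_schema.json
# from the Actor's source directory).
schema_text = '''
{
    "title": "Scrape data from a web page",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "URL of the page",
            "type": "string",
            "description": "The URL of website you want to get the data from.",
            "editor": "textfield",
            "prefill": "https://www.python.org/events/",
            "default": "https://www.python.org/events/"
        }
    },
    "required": ["url"]
}
'''

schema = json.loads(schema_text)
url_prop = schema["properties"]["url"]
print(url_prop["prefill"], url_prop["default"])
```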
  • in main.py, we are going to replace this part of the code using the selectors we found earlier
  • first, change the input fallback on line 30 as well:

```python
actor_input = await Actor.get_input() or {'url': 'https://www.python.org/events/'}
```

and replace the original code

```python
# Parse the HTML content using Beautiful Soup and lxml parser.
soup = BeautifulSoup(response.content, 'lxml')

# Extract all headings from the page (tag name and text).
headings = []
for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
    heading_object = {'level': heading.name, 'text': heading.text}
    Actor.log.info(f'Extracted heading: {heading_object}')
    headings.append(heading_object)

# Save the extracted headings to the dataset, which is a table-like storage.
await Actor.push_data(headings)
```

with the following:

```python
# Defines a function to extract event details from the HTML response.
def extract_event_data(html):
    # Parses the HTML using BeautifulSoup.
    soup = BeautifulSoup(html, 'html.parser')
    # Initializes an empty events list and sets a baseUrl for constructing full URLs.
    events = []
    baseUrl = 'https://www.python.org'

    # Finds all <li> elements inside .list-recent-events.menu
    for event in soup.select('.list-recent-events.menu li'):
        # Extract the event title <a> element.
        title_tag = event.select_one('.event-title a')
        # Extract the event date inside a <time> tag.
        date_tag = event.select_one('time')
        # Extract the event location.
        location_tag = event.select_one('.event-location')

        # Extracts text values and ensures they have default values ('N/A' if missing).
        title = title_tag.get_text(strip=True) if title_tag else 'N/A'
        url = title_tag['href'] if title_tag and 'href' in title_tag.attrs else 'N/A'
        date = date_tag.get_text(separator=' ', strip=True) if date_tag else 'N/A'
        location = location_tag.get_text(strip=True) if location_tag else 'N/A'
        # Constructs the full event URL by appending the relative href to baseUrl
        # (only when an href was actually found).
        fullUrl = f"{baseUrl}{url}" if url != 'N/A' else 'N/A'

        # Adds the extracted data into the events list.
        events.append({
            'title': title,
            'url': fullUrl,
            'date': date,
            'location': location
        })

    return events

# Calls the extract_event_data() function with the page's HTML content.
events = extract_event_data(response.content)

# Saves the extracted event data to Apify's dataset storage (like a database for structured data).
await Actor.push_data(events)
```
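Before building on the platform, you can exercise the extraction logic locally against a small HTML fixture. The sketch below inlines a trimmed copy of extract_event_data (same selectors and fallbacks as above) together with an invented fixture, and checks the 'N/A' handling for an event that has no <time> tag or location:

```python
from bs4 import BeautifulSoup

def extract_event_data(html):
    # Same selectors and fallbacks as in the Actor's main.py above.
    soup = BeautifulSoup(html, 'html.parser')
    events = []
    baseUrl = 'https://www.python.org'
    for event in soup.select('.list-recent-events.menu li'):
        title_tag = event.select_one('.event-title a')
        date_tag = event.select_one('time')
        location_tag = event.select_one('.event-location')
        title = title_tag.get_text(strip=True) if title_tag else 'N/A'
        url = title_tag['href'] if title_tag and 'href' in title_tag.attrs else 'N/A'
        date = date_tag.get_text(separator=' ', strip=True) if date_tag else 'N/A'
        location = location_tag.get_text(strip=True) if location_tag else 'N/A'
        fullUrl = f"{baseUrl}{url}" if url != 'N/A' else 'N/A'
        events.append({'title': title, 'url': fullUrl, 'date': date, 'location': location})
    return events

# Invented fixture for local testing; the second event deliberately lacks
# a date and a location to exercise the 'N/A' fallbacks.
fixture = '''
<ul class="list-recent-events menu">
  <li>
    <h3 class="event-title"><a href="/events/python-events/100/">PyCon Demo</a></h3>
    <time datetime="2025-04-01">01 April 2025</time>
    <span class="event-location">Windhoek, Namibia</span>
  </li>
  <li>
    <h3 class="event-title"><a href="/events/python-events/101/">Mystery Meetup</a></h3>
  </li>
</ul>
'''

events = extract_event_data(fixture)
print(events)
```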
  • now, just hit the button Save, Build & Start
  • the Actor starts and takes you to the Log tab
  • results are in the Output tab
    • can be exported in various formats
    • can be also seen in Storages (main left menu) -> Datasets
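Beyond the Console UI, dataset items can also be downloaded over Apify's REST API (the /v2/datasets/{datasetId}/items endpoint, which accepts a format query parameter such as json or csv). The helper below only builds that URL; authentication and the actual HTTP call are left out, so treat it as a sketch of the endpoint shape rather than a full client:

```python
# Sketch: build the REST URL for downloading a dataset's items.
# "YOUR_DATASET_ID" below is a placeholder, not a real ID.
API_BASE = "https://api.apify.com/v2"

def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Return the endpoint for fetching a dataset's items in the given format."""
    return f"{API_BASE}/datasets/{dataset_id}/items?format={fmt}"

print(dataset_items_url("YOUR_DATASET_ID", "csv"))
# https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?format=csv
```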

Publishing Actor to Store

  • Go to the Actor detail page in the Apify Console
    • go to the Publication tab
    • fill in all the details
    • press Publish to store
    • check it out by clicking on Store (main menu on the left) -> search for the name of your Actor
  • docs here

Monetizing Actor

  • on the Actor detail page -> Publication tab, open the Monetization card and follow the setup guide
  • basic info here
  • detailed info about pricing models here

Creating Actor through CLI

```shell
brew install apify-cli   # macOS (Homebrew)
# or
npm -g install apify-cli
```

```shell
apify create
```
  • select a name, Python, and the Start with Python template
```shell
cd your-actor-name
```
  • in input_schema.json, update the prefill and add a default value for the start URL https://www.python.org/events/ as we did before
  • navigate to main.py and replace the same part of the code as before
  • run `apify run` and see the results in the storage/datasets/default folder 🚀
  • push to the Apify platform:
```shell
apify login
apify push
```

Go to your browser and see, it is there!


Creating Actor through GitHub repository

You can easily create a new Actor from your GitHub repository: just fork this repo to your workspace and follow this online guide.


What next (useful links)

Did you enjoy scraping and want to learn more? Just check out one of the following links

About

Scrapes upcoming events from https://www.python.org - demo Actor for PyCon Namibia 25
