Labels: Fellowship Opportunity, Lead: @mekarpeles, Module: Import, Needs: Breakdown, Priority: 2, Theme: Trusted Book Providers, Type: Feature Request
Description
Feature Request
Here is the sitemap of books for Book Dash:
https://bookdash.org/books-sitemap.xml
This Python code gives us a list of the ~850 URLs:
import json
import requests
import xml.etree.ElementTree as ET

# Step 1: Get the XML from the sitemap URL
url = "https://bookdash.org/books-sitemap.xml"
response = requests.get(url)
response.raise_for_status()  # Raises an error if the request failed

# Step 2: Parse the XML
root = ET.fromstring(response.content)

# Step 3: Extract all <loc> elements (the rows of the sitemap)
namespaces = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}  # sitemap XML namespace
locs = root.findall('.//ns:loc', namespaces)

# Step 4: Collect and print each URL
book_urls = [loc.text for loc in locs]
print(json.dumps(book_urls))
Here is the list of the URLs:
https://gist.githubusercontent.com/mekarpeles/dc5e89708bbf7cae38b35f071021a12a/raw/99abd429075ae5246caa15b8878ad0c59456f38a/bookdashorg_urls.json
re: metadata:
- cover: the `<picture>` element
- title: the `<h1>` element
- ISBN: the text "ISBN:" appears on the page immediately before the ISBN value
- language: links with href containing "/languages/"
- description: use the `meta og:description` tag
- subjects: links with href starting with "/themes"
- authors: search for links to "https://bookdash.org/team-members/"
- publisher: bookdash
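The field mapping above could be sketched roughly as follows. The sample HTML fragment and the helper name `extract_record` are hypothetical (they mimic the described page structure, not the actual site markup), and a real scraper would likely use an HTML parser such as BeautifulSoup rather than regular expressions:

```python
# Minimal sketch of the metadata mapping, run against a made-up HTML
# fragment that imitates the structure described above.
import re

SAMPLE_HTML = """
<meta property="og:description" content="A story about sharing.">
<h1>The Best Thing Ever</h1>
<picture><img src="/covers/best-thing.jpg"></picture>
<p>ISBN: 978-1-4314-2001-1</p>
<a href="/languages/english/">English</a>
<a href="/themes/friendship/">Friendship</a>
<a href="https://bookdash.org/team-members/jane-doe/">Jane Doe</a>
"""

def extract_record(html: str) -> dict:
    """Pull the fields described above out of one book page."""
    title = re.search(r"<h1>(.*?)</h1>", html)
    cover = re.search(r'<picture><img src="([^"]+)"', html)
    isbn = re.search(r"ISBN:\s*([0-9Xx-]+)", html)
    desc = re.search(r'property="og:description" content="([^"]+)"', html)
    languages = re.findall(r'href="/languages/([^/"]+)', html)
    subjects = re.findall(r'href="/themes/([^/"]+)', html)
    authors = re.findall(
        r'href="https://bookdash\.org/team-members/[^"]*"[^>]*>([^<]+)<', html)
    return {
        "title": title.group(1) if title else None,
        "covers": [cover.group(1)] if cover else [],
        "isbn_13": [isbn.group(1).replace("-", "")] if isbn else [],
        "description": desc.group(1) if desc else None,
        "languages": languages,
        "subjects": subjects,
        "authors": [{"name": a} for a in authors],
        "publisher": "bookdash",
    }

record = extract_record(SAMPLE_HTML)
print(record["title"])  # The Best Thing Ever
```

The exact JSON shape should follow whatever the Open Library import endpoint expects; the dict keys here are only an approximation of that schema.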
Problem / Opportunity
Write a basic Python requests scraper that respectfully fetches each book page, extracts the metadata in the correct formats, and emits each record as an Open Library book in JSON.
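The "respectful" crawl loop could look something like this. The delay value, User-Agent string, and function names are illustrative assumptions; the fetch function is injectable so the loop can be exercised without touching the network:

```python
# Sketch of a polite crawl loop: a fixed pause between requests, an
# identifying User-Agent, and a pluggable fetch function for testing.
import time
import urllib.request

USER_AGENT = "openlibrary-bookdash-import/0.1"  # illustrative UA string

def default_fetch(url: str) -> str:
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

def crawl(urls, delay=1.0, fetch=default_fetch):
    """Fetch each book page, sleeping between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i:  # no pause needed before the first request
            time.sleep(delay)
        pages[url] = fetch(url)
    return pages

# Example with a stand-in fetcher (no network access):
pages = crawl(["https://bookdash.org/books/example/"],
              delay=0, fetch=lambda u: "<h1>stub</h1>")
print(len(pages))  # 1
```

A production version would also want retry/backoff on errors and possibly a check of robots.txt, but those are left out of this sketch.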
Proposal
Breakdown
Related files
Refer to this map of common Endpoints:
*
Requirements Checklist
Checklist of requirements that must be satisfied before this issue can be closed:
- [ ]
Stakeholders
Instructions for Contributors
- Before creating a new branch or pushing up changes to a PR, please first run these commands to ensure your repository is up to date, as the pre-commit bot may add commits to your PRs upstream.