Import bookdash.org as a Trusted Book Provider (with editions for all languages) #10856

@mekarpeles

Description

Feature Request

Here is the sitemap of books for bookdash.org:
https://bookdash.org/books-sitemap.xml

This Python code gives us a list of the ~850 URLs:

import json
import requests
import xml.etree.ElementTree as ET

# Step 1: Get the XML from the sitemap URL
url = "https://bookdash.org/books-sitemap.xml"
response = requests.get(url)
response.raise_for_status()  # Raises an error if the request failed

# Step 2: Parse the XML
root = ET.fromstring(response.content)

# Step 3: Extract all <loc> elements (which are like the rows of the sitemap)
namespaces = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}  # Define XML namespace
locs = root.findall('.//ns:loc', namespaces)

# Step 4: Collect the text of each <loc> and print the URLs as a JSON list
book_urls = [loc.text for loc in locs]
print(json.dumps(book_urls))

Here is the list of the URLs:
https://gist.githubusercontent.com/mekarpeles/dc5e89708bbf7cae38b35f071021a12a/raw/99abd429075ae5246caa15b8878ad0c59456f38a/bookdashorg_urls.json

Regarding metadata, each field can be found on a book page as follows:

  • cover: the <picture> element
  • title: the <h1> element
  • ISBN: the label "ISBN:" appears on the page immediately before the ISBN
  • language: the contents of the <a> with an href containing "/languages/"
  • description: use the og:description meta tag
  • subjects: each <a> with an href starting with "/themes"
  • authors: search for links to "https://bookdash.org/team-members/"
  • publisher: bookdash
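
The field rules above could be sketched with the standard library's html.parser, so no extra dependency is needed. This is only a rough sketch: the names BookPageParser and parse_book_page are hypothetical, the ISBN pattern is a guess at how the number is formatted, and the real page markup may need different handling (for example, a cover served only through <source srcset>).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed format: the label "ISBN:" followed by digits and optional hyphens.
ISBN_RE = re.compile(r"ISBN:?\s*([0-9][0-9-]{8,15}[0-9Xx])")


class BookPageParser(HTMLParser):
    """Hypothetical helper that collects the listed fields from one book page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.meta = {"title": None, "cover": None, "description": None,
                     "languages": [], "subjects": [], "authors": [],
                     "publisher": "bookdash"}
        self.page_text = []       # every text node, searched later for the ISBN
        self._in_picture = False
        self._capture = None      # field currently collecting text, if any
        self._capture_tag = None
        self._buffer = []

    def _begin(self, field, tag):
        if self._capture is None:
            self._capture, self._capture_tag, self._buffer = field, tag, []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "picture":
            self._in_picture = True
        elif tag == "img" and self._in_picture and not self.meta["cover"]:
            self.meta["cover"] = attrs.get("src")
        elif tag == "meta" and attrs.get("property") == "og:description":
            self.meta["description"] = attrs.get("content")
        elif tag == "h1" and self.meta["title"] is None:
            self._begin("title", tag)
        elif tag == "a":
            # Compare on the path so relative and absolute hrefs both match.
            path = urlparse(attrs.get("href") or "").path
            if path.startswith("/languages/"):
                self._begin("languages", tag)
            elif path.startswith("/themes"):
                self._begin("subjects", tag)
            elif path.startswith("/team-members/"):
                self._begin("authors", tag)

    def handle_endtag(self, tag):
        if tag == "picture":
            self._in_picture = False
        elif self._capture and tag == self._capture_tag:
            text = " ".join("".join(self._buffer).split())
            if self._capture == "title":
                self.meta["title"] = text
            elif text and text not in self.meta[self._capture]:
                self.meta[self._capture].append(text)
            self._capture = None

    def handle_data(self, data):
        self.page_text.append(data)
        if self._capture:
            self._buffer.append(data)


def parse_book_page(html):
    """Return a dict of the scraped fields for one page of HTML."""
    parser = BookPageParser()
    parser.feed(html)
    parser.close()
    meta = dict(parser.meta)
    match = ISBN_RE.search(" ".join(parser.page_text))
    meta["isbn"] = match.group(1).replace("-", "") if match else None
    return meta
```

Calling parse_book_page(response.text) on a fetched page would return a plain dict, which keeps the extraction step testable against saved HTML without touching the network.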

Problem / Opportunity

Write a basic Python scraper using requests that respectfully fetches each book page, extracts the metadata in the right formats, and emits each book as an Open Library book record in JSON.
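
One possible shape for that scraper is sketched below. Everything named here is an assumption for illustration: to_openlibrary_record and scrape_all are hypothetical helpers, the record keys (title, authors, publishers, isbn_13, source_records, and so on) reflect my understanding of the Open Library import format and should be checked against the import schema, and the "bookdash:" source prefix is made up. Open Library may also expect language codes such as "eng" rather than language names, and a publish date, neither of which is handled here.

```python
import json
import time


def to_openlibrary_record(meta, page_url):
    """Map scraped fields to an Open Library-style record (key names assumed)."""
    slug = page_url.rstrip("/").rsplit("/", 1)[-1]
    record = {
        "title": meta["title"],
        "authors": [{"name": name} for name in meta.get("authors", [])],
        "publishers": [meta.get("publisher", "bookdash")],
        "subjects": list(meta.get("subjects", [])),
        # Passed through as scraped; mapping names to codes is left open.
        "languages": list(meta.get("languages", [])),
        # Hypothetical source prefix, not an existing Open Library convention.
        "source_records": [f"bookdash:{slug}"],
    }
    isbn = meta.get("isbn")
    if isbn:
        record["isbn_13" if len(isbn) == 13 else "isbn_10"] = [isbn]
    if meta.get("description"):
        record["description"] = meta["description"]
    if meta.get("cover"):
        record["cover"] = meta["cover"]
    return record


def scrape_all(urls, fetch, parse, delay_seconds=1.0, sleep=time.sleep):
    """Fetch pages one at a time, pausing between requests to stay polite.

    fetch(url) returns the page HTML and parse(html) returns a dict of
    scraped fields; both are passed in so the loop can be tested offline.
    """
    records = []
    for index, url in enumerate(urls):
        if index:
            sleep(delay_seconds)  # pause before every request after the first
        records.append(to_openlibrary_record(parse(fetch(url)), url))
    return records


def dump_records(records):
    """Serialize the records as a JSON array."""
    return json.dumps(records, indent=2, ensure_ascii=False)
```

In real use, fetch could wrap requests.get with a timeout, a descriptive User-Agent, and raise_for_status(), while the delay between requests is the main lever for keeping load on bookdash.org low.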

Proposal

Breakdown

Related files

Refer to this map of common Endpoints:
*

Requirements Checklist

Checklist of requirements that need to be satisfied in order for this issue to be closed:

  • [ ]

Stakeholders


Instructions for Contributors

Metadata

Labels

  • Fellowship Opportunity
  • Lead: @mekarpeles - Issues overseen by Mek (Staff: Program Lead) [managed]
  • Module: Import - Issues related to the configuration or use of importbot and other bulk import systems. [managed]
  • Needs: Breakdown - This big issue needs a checklist or subissues to describe a breakdown of work. [managed]
  • Priority: 2 - Important, as time permits. [managed]
  • Theme: Trusted Book Providers
  • Type: Feature Request - Issue describes a feature or enhancement we'd like to implement. [managed]
