
Add structured data extractor (JSON-LD, OpenGraph, Twitter Card) #286

@justinhsu1477

Have you searched if there is an existing feature request for this?

  • I have searched the existing requests

Feature description

Most modern sites embed structured metadata (<script type="application/ld+json">, <meta property="og:*">, Twitter Cards, microdata) for SEO and social sharing. This data is intentionally stable across UI redesigns, which aligns well with Scrapling's adaptive-by-default philosophy.

Today users have to parse it manually:

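import json

# Collect every JSON-LD block, flattening top-level arrays and @graph containers.
scripts = page.css('script[type="application/ld+json"]::text').getall()
data = []
for s in scripts:
    try:
        parsed = json.loads(s)
    except json.JSONDecodeError:
        continue  # skip malformed blocks instead of failing the whole page
    if isinstance(parsed, list):
        data.extend(parsed)
    elif isinstance(parsed, dict) and "@graph" in parsed:
        # the dict check avoids a TypeError when a block is a bare number or null
        data.extend(parsed["@graph"])
    else:
        data.append(parsed)

# Map og:* meta tags to a dict keyed by the property name without its prefix.
og = {
    m.attrib["property"].replace("og:", ""): m.attrib.get("content")
    for m in page.css('meta[property^="og:"]')
}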

Proposed first-class API on Selector:

page.json_ld()          # list[dict]; flattens @graph; tolerant of malformed JSON
page.opengraph()        # dict of og:* meta
page.twitter_card()     # dict of twitter:* meta
page.microdata()        # list[dict] parsed from itemscope/itemprop
page.structured_data()  # everything above, grouped by source
page.metadata()         # normalized summary: {title, description, image, type, ...} fused across sources
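
To make the "fused across sources" idea for metadata() concrete, here is a rough sketch of one possible fusion rule. Everything in it is an assumption for discussion, not existing Scrapling code: the function name fuse_metadata, the FIELD_SOURCES table, the priority order (JSON-LD, then OpenGraph, then Twitter Card), and the choice to treat the first JSON-LD object as the page's primary entity.

```python
from typing import Any

# Hypothetical lookup table: for each normalized field, the (source, key)
# pairs to try in priority order. The order is an assumption, not a spec.
FIELD_SOURCES = {
    "title": [("json_ld", "headline"), ("json_ld", "name"),
              ("opengraph", "title"), ("twitter", "title")],
    "description": [("json_ld", "description"),
                    ("opengraph", "description"), ("twitter", "description")],
    "image": [("json_ld", "image"), ("opengraph", "image"), ("twitter", "image")],
    "type": [("json_ld", "@type"), ("opengraph", "type"), ("twitter", "card")],
}


def fuse_metadata(json_ld: list, opengraph: dict, twitter: dict) -> dict[str, Any]:
    """Build a normalized summary by taking the first non-empty value per field."""
    # Assumption: the first JSON-LD object describes the page's main entity.
    primary = next((node for node in json_ld if isinstance(node, dict)), {})
    sources = {"json_ld": primary, "opengraph": opengraph, "twitter": twitter}

    summary: dict[str, Any] = {}
    for field, candidates in FIELD_SOURCES.items():
        for source, key in candidates:
            value = sources[source].get(key)
            if value:  # skip missing keys and empty strings
                summary[field] = value
                break
    return summary
```

The inputs are plain lists and dicts (the shapes json_ld(), opengraph() and twitter_card() are proposed to return), so a rule like this could be tested without fetching a page. Whether fields that no source provides should be omitted, as here, or returned as None is one of the naming / scope questions open for feedback.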

Why this fits Scrapling:

  • Aligns with the existing ROADMAP item "Add the ability to auto-detect schemas in pages and manipulate them".
  • No new heavy deps (json is stdlib; microdata can be done with the existing lxml).
  • Pure addition on Selector — doesn't touch fetchers or the adaptive parser.
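
As a feasibility check on the "no new heavy deps" point, below is a minimal microdata sketch. Note the swap: the proposal is to use the existing lxml, but this version uses the standard library's html.parser so it runs on its own. The names MicrodataParser and extract_microdata and the output shape ({"type": ..., "properties": {name: [values]}}) are hypothetical, and the sketch deliberately ignores itemref, itemid and URL resolution.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed onto the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose property value comes from an attribute rather than their text.
VALUE_ATTRS = {"meta": "content", "a": "href", "link": "href",
               "img": "src", "time": "datetime"}


class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []    # top-level items found so far
        self._scopes = []  # stack of items whose itemscope element is still open
        self._open = []    # stack of open elements: (tag, opens_scope, text_target)

    @staticmethod
    def _add(scope, names, value):
        # itemprop may list several names; properties are multi-valued.
        for name in names:
            scope["properties"].setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        names = (attrs.get("itemprop") or "").split()
        parent = self._scopes[-1] if self._scopes else None
        opens_scope, text_target = False, None
        if "itemscope" in attrs:
            item = {"type": attrs.get("itemtype"), "properties": {}}
            if names and parent is not None:
                self._add(parent, names, item)  # nested item used as a property value
            else:
                self.items.append(item)
            if tag not in VOID_TAGS:
                self._scopes.append(item)
                opens_scope = True
        elif names and parent is not None:
            attr = VALUE_ATTRS.get(tag)
            if attr in attrs:
                self._add(parent, names, attrs[attr])
            elif tag not in VOID_TAGS:
                # Value is the element's text; collect it until the end tag.
                text_target = (parent, names, [])
        if tag not in VOID_TAGS:
            self._open.append((tag, opens_scope, text_target))

    def handle_data(self, data):
        for _, _, target in self._open:
            if target is not None:
                target[2].append(data)

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _, _ in self._open):
            return  # stray closing tag; ignore it
        while self._open:
            open_tag, opens_scope, target = self._open.pop()
            if target is not None:
                scope, names, chunks = target
                self._add(scope, names, "".join(chunks).strip())
            if opens_scope:
                self._scopes.pop()
            if open_tag == tag:
                break


def extract_microdata(html: str) -> list:
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

A usage example on a small product snippet:

```python
html = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Widget</span>
  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
    <meta itemprop="priceCurrency" content="USD">
    <span itemprop="price">9.99</span>
  </div>
</div>
"""
items = extract_microdata(html)
print(items[0]["properties"]["name"])  # ['Widget']
```

An lxml version would follow the same rules (itemscope opens an item, itemprop attaches a value to the nearest enclosing item) but could walk the already-parsed tree instead of tracking open tags by hand.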

Reference implementations in the ecosystem: extruct, metascraper.

Happy to send a PR against dev if there's interest — would love early feedback on naming / scope before implementing.
