🧪 PoC: Crawler + IAB Taxonomy Classification using OpenAI #59

@hanishi

I built a simple headless-browser crawler that extracts content and internal links from web pages and classifies each page against the IAB taxonomy via OpenAI.
It’s not production-ready, but it should be useful as a proof of concept, especially if you’re working on automated tagging, contextual ad targeting, or content classification pipelines.

GitHub 👉 https://github.com/hanishi/pekko-playwright

Highlights:
• Reactive architecture using Apache Pekko (a fork of Akka) + Playwright for DOM-aware extraction
• Starts from a target element and gathers clean text + filtered internal links
• IAB taxonomy classification using OpenAI’s API (currently via pageContent → OpenAI → taxonomy_id)
• Practical motivations: improve contextual tagging for better CPM and cleaner ad delivery environments
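
To make the link-filtering and classification steps above concrete, here is a minimal sketch in Python. It is not the repository's code: the function names (`internal_links`, `classify`), the same-host rule for "internal", the prompt wording, the 4000-character cutoff, and the placeholder taxonomy IDs are all assumptions for illustration. The model call is passed in as a plain function, so no particular OpenAI client API is implied.

```python
from typing import Callable, Iterable, List, Optional, Set
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against base_url and keep unique same-host http(s) links.

    Assumption: "internal" means "same host as the page being crawled".
    """
    base_host = urlparse(base_url).netloc.lower()
    seen: Set[str] = set()
    result: List[str] = []
    for href in hrefs:
        # Make the link absolute and drop any #fragment so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel:, ...
        if parsed.netloc.lower() != base_host:
            continue  # skip external sites
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def classify(
    page_content: str,
    ask_model: Callable[[str], str],
    allowed_ids: Set[str],
) -> Optional[str]:
    """pageContent -> model -> taxonomy_id, rejecting IDs outside the taxonomy.

    ask_model is a hypothetical stand-in for whatever sends the prompt to
    OpenAI and returns the text of the reply.
    """
    prompt = (
        "Classify the page into one IAB taxonomy category. "
        "Reply with the taxonomy ID only.\n\n" + page_content[:4000]
    )
    answer = ask_model(prompt).strip()
    # Models can return IDs that do not exist, so validate before trusting.
    return answer if answer in allowed_ids else None


if __name__ == "__main__":
    links = internal_links(
        "https://example.com/news/",
        ["/sports", "story.html#top", "https://other.example/x", "/sports"],
    )
    print(links)
    # Placeholder IDs and a fake model, for demonstration only.
    print(classify("Match report ...", lambda p: "example-id-1", {"example-id-1"}))
```

Checking the returned ID against a known set matters for a tagging pipeline, since an invalid taxonomy_id passed downstream is worse than no tag at all.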

Why?

As someone working in AdTech, I’ve seen how poor or missing taxonomy tagging leads to:
• Lower CPMs due to mismatched bids
• Unwanted ads on sensitive content
• Frustration on both the publisher and buyer sides

So this PoC is my small step toward cleaner, better-targeted, and more trustworthy ad environments.
Hope it’s useful to someone. Feedback and collaboration are welcome!
