Description
I built a simple headless-browser crawler that extracts content + internal links from web pages and integrates IAB taxonomy classification via OpenAI.
It’s not production-ready, but it is useful as a proof of concept, especially if you’re working on automated tagging, contextual ad targeting, or content classification pipelines.
GitHub 👉 https://github.com/hanishi/pekko-playwright
Highlights:
• Reactive architecture using Apache Pekko (the open-source fork of Akka) + Playwright for DOM-aware extraction
• Starts from a target element and gathers clean text + filtered internal links
• IAB taxonomy classification using OpenAI’s API (currently via pageContent → OpenAI → taxonomy_id)
• Practical motivations: improve contextual tagging for better CPM and cleaner ad delivery environments
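To make the link filtering and the pageContent → OpenAI → taxonomy_id flow more concrete, here is a minimal sketch. It is written in Python for brevity and is not the code from the repository, which is built on Pekko and Playwright. The function names `internal_links` and `classify_page` are hypothetical, "internal" is assumed to mean "same host as the crawled page", and the classifier is passed in as a plain callable so that no particular OpenAI client call is assumed.

```python
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(page_url: str, hrefs: Iterable[str]) -> list[str]:
    """Resolve hrefs against page_url and keep unique same-host http(s) links.

    Assumption: a link is "internal" when its host matches the page's host.
    """
    base_host = urlparse(page_url).netloc.lower()
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        # Resolve relative links and drop the #fragment part.
        absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, and similar schemes
        if parsed.netloc.lower() != base_host:
            continue  # external link
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def classify_page(page_content: str, classifier: Callable[[str], str]) -> str:
    """Pass extracted text to a classifier and return its taxonomy ID.

    `classifier` is a hypothetical stand-in for the OpenAI-backed step.
    """
    return classifier(page_content).strip()
```

In a real crawler the `hrefs` would come from the anchors found under the target element, and `classifier` would wrap the API request that maps page text to an IAB taxonomy ID.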
Why?
As someone working in AdTech, I’ve seen how poor or missing taxonomy tagging leads to:
• Lower CPMs due to mismatched bids
• Unwanted ads on sensitive content
• Frustration on both the publisher and buyer sides
So this PoC is my small step toward cleaner, better-targeted, and more trustworthy ad environments.
Hope it’s useful to someone. Feedback and collaboration are welcome!