# pipescraper

A pipe-based news article scraping and metadata extraction library for Python.
pipescraper provides a natural, verb-based interface for scraping news websites and extracting structured article metadata using the intuitive pipe (`>>`) operator. Built on top of trafilatura, with supplementary time extraction via newspaper4k, pipescraper combines powerful extraction capabilities with an elegant, chainable API.
```python
from pipescraper import *

# Your scraping pipeline reads like a story
result = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=10)
    >> ExtractArticles()
    >> FilterArticles(lambda a: a.language == 'en')
    >> ToDataFrame()
    >> SaveAs("articles.csv")
)
```

💡 **How to read `>>`:** Read the `>>` operator as "pipe to" or "then". For example, the code above reads as: "Take the URL, then fetch links, then extract articles, then filter for English articles..."
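Under the hood, a pipeline like this is presumably built by overloading Python's right-shift operator on the verb objects. The following is a minimal, self-contained stand-in (not pipescraper's actual classes) showing how a plain value on the left can be piped into a verb on the right via `__rrshift__`:

```python
class Verb:
    """A pipeline step: receives the left-hand value, returns a new one."""
    def __init__(self, fn):
        self.fn = fn

    def __rrshift__(self, upstream):
        # Called for `value >> verb` when `value` (e.g. a URL string or a
        # list of links) does not itself implement __rshift__.
        return self.fn(upstream)

# Toy verbs standing in for FetchLinks / ExtractArticles / etc.
Upper = Verb(str.upper)
Exclaim = Verb(lambda s: s + "!")

result = "hello" >> Upper >> Exclaim
print(result)  # HELLO!
```

Because each step returns a plain value, verbs compose left to right with no nesting, which is what makes the pipelines above read like a sequence of instructions.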
## Why pipescraper?

```python
# ❌ Traditional logic is nested, hard to read, and error-prone
urls = fetch_links("https://www.bbc.com/news", max_links=10)  # Replace with your target URL
articles = []
for url in urls:
    time.sleep(1)
    art = extract_article(url)
    if art.language == 'en' and art.author:
        articles.append(art)
save_to_csv(articles, "articles.csv")
```

```python
# ✅ pipescraper: Clear and intuitive
("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=10)
    >> ExtractArticles(delay=1.0)
    >> FilterArticles(lambda a: a.language == 'en' and bool(a.author))
    >> ToDataFrame()
    >> SaveAs("articles.csv")
)
```

## Features

- 🔗 **Pipe-based syntax** – Chain operations naturally with the `>>` operator
- 📰 **Comprehensive metadata extraction** – Extract URL, source, title, text, author, dates, language, and more
- ⏰ **Publication time parsing** – Supplement trafilatura's date extraction with full timestamp support
- 🤖 **Respectful scraping** – Built-in robots.txt compliance and request throttling
- 🔍 **Google News Search** – Search for keywords or sentences across regions and time periods ⭐ NEW
- 🔧 **Automatic URL Decoding** – Parallel batch decoder for Google News (bypasses consent wall) ⭐ NEW
- 🐼 **Pandas integration** – Export to DataFrame with CSV, JSON, Excel support
- 🎯 **Flexible filtering** – Filter articles by language, author, content length, or custom criteria
- 🧹 **Automatic deduplication** – Remove duplicate articles by URL
- ⚡ **Parallel scraping** – Turbocharge batch extraction with multi-threaded workers
- 🔧 **PipeFrame integration** – Use all PipeFrame verbs (select, filter, mutate, arrange, etc.) for data manipulation
- 📊 **PipePlotly integration** – Create visualizations with Grammar of Graphics using ggplot, geom_bar, geom_point, etc.
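The deduplication feature above keys on article URL; while pipescraper's internals aren't shown here, the idea can be sketched in a few lines of plain Python (`dedupe_by_url` and the dict-shaped articles are illustrative, not the library's API):

```python
def dedupe_by_url(articles):
    """Keep the first article seen for each URL, preserving input order."""
    seen = {}
    for art in articles:
        # setdefault only stores the first article for a given URL
        seen.setdefault(art["url"], art)
    return list(seen.values())

articles = [
    {"url": "https://example.com/a", "title": "First"},
    {"url": "https://example.com/b", "title": "Second"},
    {"url": "https://example.com/a", "title": "Duplicate of first"},
]
print([a["title"] for a in dedupe_by_url(articles)])  # ['First', 'Second']
```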
## Installation

```bash
# Basic installation
pip install pipescraper

# Install with all optional integrations (PipeFrame & PipePlotly)
pip install "pipescraper[all]"
```

Or install from source:

```bash
git clone https://github.com/Yasser03/pipescraper.git
cd pipescraper
pip install -e .
```

## Quick Start

```python
from pipescraper import FetchLinks, ExtractArticles, ToDataFrame, SaveAs

# Simple pipeline: URL → Links → Articles → DataFrame → CSV
df = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=10)
    >> ExtractArticles()
    >> ToDataFrame()
    >> SaveAs("articles.csv"))

print(f"Scraped {len(df)} articles successfully! 🎉")
```

Chain operations naturally without nested function calls or loops:
```python
# PipeScraper approach (reads like a recipe)
articles = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=20)
    >> ExtractArticles(skip_errors=True)
    >> Deduplicate()
    >> LimitArticles(10)
)
```

## Available Verbs

| Verb | Purpose | Example |
|---|---|---|
| `FetchLinks()` | Fetch article links from a base URL | `>> FetchLinks(max_links=50, delay=1.0)` |
| `ExtractArticles()` | Extract metadata from URLs | `>> ExtractArticles(workers=5, extract_time=True)` |
| `FetchGoogleNews()` | Search Google News | `>> FetchGoogleNews(search="SpaceX", period="1d")` |
| `FilterArticles()` | Filter by criteria | `>> FilterArticles(lambda a: a.language == 'en')` |
| `LimitArticles()` | Limit number of articles | `>> LimitArticles(10)` |
| `Deduplicate()` | Remove duplicates | `>> Deduplicate()` |
| `ToDataFrame()` | Convert to DataFrame | `>> ToDataFrame(include_text=True)` |
| `ToPipeFrame()` | Convert to PipeFrame | `>> ToPipeFrame()` |
| `SaveAs()` | Save to file | `>> SaveAs("output.csv")` |
## Google News Search

Search for specific topics on Google News, leveraging a high-performance parallel decoder that resolves consent-gated URLs automatically.

```python
# Search for multiple related topics
search_articles = (FetchGoogleNews(
        search=["latest AI breakthroughs", "quantum computing news"],
        period="7d",
        max_results=20)
    >> ExtractArticles(workers=5)
    >> ToDataFrame())
```

## Parallel Scraping

Scrape large batches safely and concurrently using multi-threaded workers.
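A `workers=N` option of this kind typically fans per-URL work out over a thread pool. A self-contained sketch of that pattern with the standard library (`extract` is a stand-in, not pipescraper's function):

```python
from concurrent.futures import ThreadPoolExecutor

def extract(url):
    # Stand-in for per-URL extraction; the real step would fetch and parse.
    return {"url": url, "title": f"Title for {url}"}

urls = [f"https://example.com/article/{i}" for i in range(50)]

# Executor.map runs calls concurrently but yields results in input order
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(extract, urls))

print(len(results))  # 50
```

Threads suit this workload because extraction is I/O-bound: while one worker waits on a response, others keep fetching.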
```python
# Scrape 50 articles in parallel using 10 workers
df = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=50)
    >> ExtractArticles(workers=10)
    >> ToDataFrame())
```

## Extracted Metadata

Each article contains the following fields:
| Field | Description | Source |
|---|---|---|
| `url` | Article URL | Input |
| `source` | Domain/source name | Parsed |
| `title` | Article headline | Trafilatura / newspaper4k |
| `text` | Main article content | Trafilatura |
| `description` | Article summary | Trafilatura |
| `author` | Author name(s) | Trafilatura / newspaper4k |
| `date_published` | Publication date (YYYY-MM-DD) | Trafilatura / newspaper4k |
| `time_published` | Publication time (HH:MM:SS) | newspaper4k ⭐ |
| `language` | Language code (e.g., 'en') | Trafilatura |
| `tags` | Article tags/categories | Trafilatura |
| `image_url` | Main article image | Trafilatura / newspaper4k |

> ⭐ **Note:** `time_published` is extracted via newspaper4k to supplement trafilatura, which only provides dates.
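Since date and time arrive as two separate string fields, combining them into one timestamp is a common follow-up step. A small sketch with the standard library (the sample values are illustrative):

```python
from datetime import datetime

# Hypothetical field values shaped as the table above describes
date_published = "2026-01-15"   # from trafilatura (YYYY-MM-DD)
time_published = "09:30:00"     # supplemented by newspaper4k (HH:MM:SS)

published = datetime.strptime(
    f"{date_published} {time_published}", "%Y-%m-%d %H:%M:%S"
)
print(published.isoformat())  # 2026-01-15T09:30:00
```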
## PipeFrame & PipePlotly Integration

Install PipeFrame (`pip install "pipescraper[pipeframe]"`) and PipePlotly (`pip install "pipescraper[pipeplotly]"`) for seamless end-to-end pipelines:

```python
from pipescraper import FetchLinks, ExtractArticles, ToPipeFrame
from pipeframe import filter, arrange, group_by, summarize
from pipeplotly import ggplot, aes, geom_bar, theme_minimal

# Full pipeline: Scrape -> Filter -> Arrange -> Plot
fig = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(max_links=20)
    >> ExtractArticles()
    >> ToPipeFrame()
    >> filter(lambda df: df['author'].notna())
    >> arrange('date_published', ascending=False)
    >> ggplot(aes(x='source'))
    >> geom_bar()
    >> theme_minimal())

fig.show()
```

## Respectful Scraping

Configure delays and robots.txt compliance:
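The `respect_robots` option follows standard robots.txt semantics, which the Python standard library also exposes. A self-contained sketch of the check (rules inlined here instead of fetched over the network):

```python
from urllib.robotparser import RobotFileParser

# Parse an inlined robots.txt rather than fetching one
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot/1.0", "https://example.com/news/story"))  # True
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/x"))   # False
```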
```python
result = ("https://www.bbc.com/news"  # Replace with your target URL
    >> FetchLinks(
        max_links=50,
        respect_robots=True,
        delay=3.0,
        user_agent="MyBot/1.0 (contact@example.com)"
    )
    >> ExtractArticles(delay=2.0)
    >> FilterArticles(lambda a: a.language == 'en' and bool(a.author))
    >> Deduplicate()
    >> LimitArticles(20)
    >> ToDataFrame(include_text=False)
    >> SaveAs("respectful_scrape.csv"))
```

## Single Article Extraction

Extract from a specific URL or list of URLs without link discovery:
```python
df = ("https://www.bbc.com/news/specific-article"  # Replace with your target URL
    >> ExtractArticles()
    >> ToDataFrame()
    >> SaveAs("single_article.json"))
```

## Comparison with Trafilatura

| Feature | pipescraper | Trafilatura |
|---|---|---|
| Content extraction | ✅ (via trafilatura) | ✅ |
| Metadata extraction | ✅ Enhanced | ⚠️ Basic |
| Publication time | ✅ (via newspaper4k) | ❌ (date only) |
| Pipe syntax | ✅ | ❌ |
| Link discovery | ✅ | ❌ |
| Batch / Parallel | ✅ | Manual |
| DataFrame export | ✅ (CSV/JSON/Excel) | ❌ |
| Google News Search | ✅ | ❌ |
**Design Decision:** pipescraper uses a dual-engine approach. Trafilatura provides industry-leading content extraction, while newspaper4k complements it by capturing the exact `time_published`, ensuring complete temporal metadata.
## Documentation

- Tutorial Notebook – A complete, hands-on, end-to-end walkthrough
- API Reference – Detailed core documentation
- Examples – More advanced usage examples
- Contributing Guide – How to contribute
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

MIT License – see the LICENSE file for details.
## Author

**Dr. Yasser Mustafa**
AI & Data Science Specialist | Theoretical Physics PhD

- 🎓 PhD in Theoretical Nuclear Physics
- 💼 10+ years in production AI/ML systems
- 🔬 48+ research publications
- 🏢 Experience: Government (Abu Dhabi), Media (Track24), Recruitment (Reed), Energy (ADNOC)
- 📍 Based in Newcastle upon Tyne, UK
- ✉️ yasser.mustafan@gmail.com
- 🔗 LinkedIn | GitHub
PipeScraper was born from the need for a more intuitive, pipe-based approach to news scraping, combining the analytical power of trafilatura with the elegance of a functional programming interface.
If PipeScraper helps your work, please consider giving it a star! ⭐
## Citation

If you use PipeScraper in your research or project, please cite it as follows:

```bibtex
@software{pipescraper2026,
  author  = {Mustafa, Yasser},
  title   = {PipeScraper: A pipe-based news article scraping and metadata extraction library},
  url     = {https://github.com/Yasser03/pipescraper},
  version = {0.3.0},
  year    = {2026}
}
```

## Acknowledgments

- trafilatura – Core content extraction engine
- newspaper4k – Supplementary time extraction
- pipeframe – Inspiration for pipe-based syntax
- pipeplotly – Pipe pattern implementation reference
## Support

- Issues: Report bugs or request features
- Discussions: Ask questions, share use cases

Made with ❤️ by Dr. Yasser Mustafa