Home
Welcome to the official PipeScraper Wiki. This is the comprehensive technical manual and reference guide for the pipescraper library.
- Core Philosophy
- Installation & Setup
- Mastering the Pipe
- Google News: Discovery & Decoding
- Extraction Engines
- High-Performance Patterns
- Data Science Integrations
- Advanced Configuration
- Author & Citation
- Learning Resources
PipeScraper is designed around Functional Data Pipelining. Instead of complex nested objects or callback hell, data flows through a chain of composable "verbs."
- Readability: Code should look like a description of the task.
- Portability: Pipelines are objects that can be shared and reused.
- Speed: Built-in parallelism for I/O bound extraction tasks.
- Python 3.8+
- pandas
- trafilatura
- newspaper4k
- gnews
```shell
pip install pipescraper
pip install "pipescraper[all]"
```

The `[all]` extra includes pipeframe for manipulation and pipeplotly for visualization.
The >> operator in PipeScraper uses the __rrshift__ magic method. It allows you to chain a Producer (like a URL string or FetchGoogleNews) to a series of Transformers (like ExtractArticles) and finally to a Consumer (like ToDataFrame).
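Under the hood, that chaining can be sketched in a few lines. This is a simplified, hypothetical `Verb` class, not the real pipescraper classes, showing only the `__rrshift__` mechanic:

```python
# Minimal sketch of >> chaining via the reflected right-shift operator.
# When the left operand (e.g. a plain string) has no __rshift__ for our
# type, Python falls back to the right operand's __rrshift__.
class Verb:
    def __init__(self, fn):
        self.fn = fn

    def __rrshift__(self, upstream):
        # `upstream >> self` ends up here, so any plain value
        # (like a URL string) can start the chain.
        return self.fn(upstream)

upper = Verb(str.upper)
exclaim = Verb(lambda s: s + "!")

result = "hello" >> upper >> exclaim  # "HELLO!"
```

Each verb returns a plain value, so the next `>>` works the same way, which is what makes the Producer → Transformer → Consumer chain read left to right.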
```python
from pipescraper import FetchGoogleNews, ExtractArticles, ToDataFrame

# The Producer: FetchGoogleNews
# The Transformer: ExtractArticles
# The Consumer: ToDataFrame
df = (FetchGoogleNews(search="Artificial Intelligence") >>
      ExtractArticles(workers=10) >>
      ToDataFrame())
```

One of the most powerful features of v0.3.0 is the Mimetic Decoder.
Google News URLs (e.g., news.google.com/rss/articles/...) redirect to article destinations. However, Google often serves a Consent Wall to automated tools.
We use a specialized GoogleNewsFetcher that:
- Discovery: Uses `gnews` and RSS to find the encoded links.
- Mimicry: Performs a `batchexecute` POST request mimicking a Chrome browser.
- Authentication: Uses a hardcoded `SOCS` cookie to signal consent.
- Parallelism: Decoding 20 URLs sequentially takes ~30s; PipeScraper does it in ~5s using threads.
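The parallel decoding step can be sketched with the standard library. Here `decode_one` is a stand-in for the real per-URL `batchexecute` request (which needs the network); only the thread-pool pattern is the point:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real decoder: maps an encoded Google News URL to a
# fake destination instead of issuing the batchexecute POST.
def decode_one(encoded_url: str) -> str:
    return encoded_url.replace("news.google.com/rss/articles/", "example.com/")

def decode_all(urls, workers=10):
    # Decoding is I/O-bound, so a thread pool gives near-linear speedup;
    # pool.map preserves the input order of the URLs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_one, urls))
```

Swapping `decode_one` for an actual HTTP call is all that separates this sketch from the real pipeline stage.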
You can search for multiple topics at once. PipeScraper will run them in parallel and merge the results:
```python
FetchGoogleNews(search=["NVIDIA", "AMD stock", "Intel AI"], max_results=5)
# Returns a single deduplicated list of URLs.
```

PipeScraper uses a Dual-Engine Strategy:
- Trafilatura (Primary): Best-in-class main text and metadata extraction.
- Newspaper4k (Supplement): Specifically used to extract publication time, which Trafilatura often misses.
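The supplement step amounts to backfilling fields the primary engine missed. A minimal sketch of that merge, with the two engines' outputs represented as hypothetical dicts (the real engines are trafilatura and newspaper4k):

```python
# Hypothetical merge of the two engines' results, represented as dicts.
# Only the field-filling logic is sketched here.
def merge_extractions(primary: dict, supplement: dict) -> dict:
    merged = dict(primary)
    for key, value in supplement.items():
        # Fill only fields the primary engine left empty or missing,
        # e.g. publication time, which the primary often lacks.
        if not merged.get(key) and value:
            merged[key] = value
    return merged
```

Primary values always win; the supplement never overwrites, it only fills gaps.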
Every extraction results in an Article object:
- `url`, `source`, `title`
- `text` (cleaned body text)
- `description`
- `author`
- `date_published` & `time_published`
- `language`, `tags`, `image_url`
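As a rough illustration, the Article record could be modelled as a dataclass. Field names follow the list above; the types and defaults are assumptions, not the library's actual definition:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative shape of the Article record described above.
@dataclass
class Article:
    url: str
    source: str
    title: str
    text: str = ""            # cleaned body text
    description: str = ""
    author: str = ""
    date_published: Optional[str] = None
    time_published: Optional[str] = None
    language: str = ""
    tags: List[str] = field(default_factory=list)
    image_url: str = ""
```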
Always use the workers parameter in ExtractArticles for production workflows.
- Safe: 3-5 workers.
- Aggressive: 10-20 workers (ensure you aren't being rate-limited).
The SaveAs verb supports .csv, .json, .parquet, and .xlsx. Parquet is recommended for large datasets to preserve types.
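Extension-based dispatch of this kind can be sketched as follows; `WRITERS` and `writer_for` are illustrative, not the library's internals:

```python
from pathlib import Path

# Hypothetical mapping from file suffix to a pandas writer method name.
WRITERS = {
    ".csv": "to_csv",
    ".json": "to_json",
    ".parquet": "to_parquet",  # preserves dtypes; needs pyarrow or fastparquet
    ".xlsx": "to_excel",
}

def writer_for(path: str) -> str:
    # Route on the (case-insensitive) file extension.
    suffix = Path(path).suffix.lower()
    try:
        return WRITERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported output format: {suffix}")
```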
Transform your results with dplyr-style syntax:
```python
from pipeframe import select, filter, mutate

df = (articles >> ToPipeFrame() >>
      mutate(word_count=lambda d: d['text'].str.split().str.len()) >>
      filter(lambda d: d['word_count'] > 500) >>
      select('title', 'word_count'))
```

Visualizing news trends:
```python
from pipeplotly import ggplot, aes, geom_bar

(df >> ggplot(aes(x='source')) >> geom_bar() >> show())
```

Dr. Yasser Mustafa
AI & Data Science Specialist
Newcastle Upon Tyne, UK
GitHub.com/Yasser03 | LinkedIn
```bibtex
@software{pipescraper2026,
  author  = {Mustafa, Yasser},
  title   = {PipeScraper: High-Performance News Extraction},
  url     = {https://github.com/Yasser03/pipescraper},
  version = {0.3.0},
  year    = {2026}
}
```

- Tutorial Notebook - A complete, hands-on, end-to-end walkthrough
- API Reference - Detailed core documentation
- Examples - More advanced usage examples
- Contributing Guide - How to contribute
PipeScraper: Making news scraping as easy as natural language.