📖 PipeScraper Wiki

Welcome to the official PipeScraper Wiki. This is the comprehensive technical manual and reference guide for the pipescraper library.

📑 Table of Contents

Core Philosophy
Installation & Setup
Mastering the Pipe >>
Google News: Discovery & Decoding
Extraction Engines
High-Performance Patterns
Data Science Integrations
Advanced Configuration
Author & Citation
Learning Resources

🎯 Core Philosophy

PipeScraper is designed around Functional Data Pipelining. Instead of complex nested objects or callback hell, data flows through a chain of small, focused "verbs."

Readability: Code should look like a description of the task.
Portability: Pipelines are objects that can be shared and reused.
Speed: Built-in parallelism for I/O-bound extraction tasks.

📦 Installation & Setup

Requirements

Python 3.8+
pandas
trafilatura
newspaper4k
gnews

Basic Install

pip install pipescraper

Full Data Science Suite

pip install "pipescraper[all]"

This includes pipeframe for manipulation and pipeplotly for visualization.

🔗 Mastering the Pipe `>>`

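The >> operator in PipeScraper is implemented with the __rrshift__ magic method, which Python calls on the right-hand operand when the left-hand operand does not handle >> itself (as with a plain URL string). This lets you chain a Producer (like a URL string or FetchGoogleNews) to a series of Transformers (like ExtractArticles) and finally to a Consumer (like ToDataFrame).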

Example Workflow

from pipescraper import FetchGoogleNews, ExtractArticles, ToDataFrame

# The Producer: FetchGoogleNews
# The Transformer: ExtractArticles
# The Consumer: ToDataFrame
df = (FetchGoogleNews(search="Artificial Intelligence") >> 
      ExtractArticles(workers=10) >> 
      ToDataFrame())
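
To make the mechanics concrete, here is a minimal, self-contained sketch of how a >> pipeline can be built on __rrshift__. The Verb, Split, and Upper classes are hypothetical stand-ins written for illustration; they are not PipeScraper's actual classes.

```python
class Verb:
    """Hypothetical base class: a pipeline step applied via `data >> step`."""

    def apply(self, data):
        raise NotImplementedError

    def __rrshift__(self, left):
        # Python calls this for `left >> self` when `left` does not
        # implement `>>` for this operand (e.g. a plain str or list).
        return self.apply(left)


class Split(Verb):
    def apply(self, data):
        return data.split()


class Upper(Verb):
    def apply(self, data):
        return [word.upper() for word in data]


result = "breaking news today" >> Split() >> Upper()
print(result)  # ['BREAKING', 'NEWS', 'TODAY']
```

The chain is evaluated left to right, so each verb receives the output of the previous one. When the left-hand side is itself a pipeline object rather than plain data, it would need to be evaluated first or define its own __rshift__; how PipeScraper handles that case is not covered by this sketch.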

🌍 Google News: Discovery & Decoding

One of the most powerful features of v0.3.0 is the Mimetic Decoder.

The Challenge

Google News URLs (e.g., news.google.com/rss/articles/...) redirect to article destinations. However, Google often serves a Consent Wall to automated tools.

The PipeScraper Solution

We use a specialized GoogleNewsFetcher that:

Discovery: Uses gnews and RSS to find encoded links.
Mimicry: Performs a batchexecute POST request mimicking a Chrome browser.
Authentication: Uses a hardcoded SOCS cookie to signal consent.
Parallelism: Decoding 20 URLs sequentially takes ~30s; PipeScraper does it in ~5s using threads.
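
The parallelism step can be sketched with the standard library alone. In this illustration, decode_url is a hypothetical stand-in that sleeps to simulate one network round trip; it is not the library's decoder, and the timings it produces are not the figures quoted above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def decode_url(encoded_url):
    """Hypothetical stand-in for one network round trip to decode a link."""
    time.sleep(0.05)  # simulate I/O latency
    return encoded_url.replace("encoded", "decoded")


def decode_all(encoded_urls, workers=10):
    # pool.map preserves input order, so results line up with the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_url, encoded_urls))


urls = [f"https://example.com/encoded/{i}" for i in range(20)]

start = time.perf_counter()
decoded = decode_all(urls, workers=10)
elapsed = time.perf_counter() - start

print(len(decoded), decoded[0])
print(f"{elapsed:.2f}s with threads vs ~{0.05 * len(urls):.2f}s sequentially")
```

Threads help here because each task spends most of its time waiting on the network, so the waits overlap instead of adding up.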

Search Aggregation

You can search for multiple topics at once. PipeScraper will run them in parallel and merge the results:

FetchGoogleNews(search=["NVIDIA", "AMD stock", "Intel AI"], max_results=5)
# Returns a single deduplicated list of URLs.
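
As a rough illustration of the merge step, the sketch below runs several searches concurrently and deduplicates the combined results while keeping first-seen order. The search_topic function and its fake index are assumptions made for the example; the real fetcher's internals may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def search_topic(topic):
    """Hypothetical stand-in for a single-topic search; returns fake URLs."""
    fake_index = {
        "NVIDIA": ["https://example.com/a", "https://example.com/b"],
        "AMD stock": ["https://example.com/b", "https://example.com/c"],
        "Intel AI": ["https://example.com/c", "https://example.com/d"],
    }
    return fake_index.get(topic, [])


def search_many(topics, workers=3):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_topic = list(pool.map(search_topic, topics))
    # dict.fromkeys keeps the first occurrence of each URL, in order.
    merged = (url for urls in per_topic for url in urls)
    return list(dict.fromkeys(merged))


print(search_many(["NVIDIA", "AMD stock", "Intel AI"]))
```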

🧪 Extraction Engines

PipeScraper uses a Dual-Engine Strategy:

Trafilatura (Primary): Best-in-class main text and metadata extraction.
Newspaper4k (Supplement): Specifically used to extract publication time, which Trafilatura often misses.
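
One way to picture the dual-engine idea is a merge in which the primary result wins and the supplement only fills gaps. The sketch below uses plain dictionaries as hypothetical engine outputs; merge_extractions is an illustrative helper, not a PipeScraper function.

```python
def merge_extractions(primary, supplement, supplement_fields=("time_published",)):
    """Fill gaps in the primary result with values from the supplement."""
    merged = dict(primary)
    for field in supplement_fields:
        if not merged.get(field):
            merged[field] = supplement.get(field)
    return merged


# Hypothetical outputs from the two engines for the same page.
primary = {"title": "Chip demand rises", "text": "Body...", "time_published": None}
supplement = {"title": "Chip demand rises | Example", "time_published": "14:30:00"}

article = merge_extractions(primary, supplement)
print(article)
```

Restricting the supplement to a named set of fields keeps the primary engine's text and title untouched, even when the two engines disagree.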

Structured Data (The Article Class)

Every extraction results in an Article object:

url, source, title
text (Cleaned body text)
description
author
date_published & time_published
language, tags, image_url
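
For orientation, the fields above could be modeled as a dataclass like the following. This is only a sketch: the field names come from the list above, but the types and defaults are assumptions and may not match the library's actual Article class.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Article:
    """Illustrative container; field types and defaults are assumptions."""
    url: str
    source: Optional[str] = None
    title: Optional[str] = None
    text: Optional[str] = None
    description: Optional[str] = None
    author: Optional[str] = None
    date_published: Optional[str] = None
    time_published: Optional[str] = None
    language: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    image_url: Optional[str] = None


article = Article(url="https://example.com/a", title="Chip demand rises", tags=["chips"])
row = asdict(article)  # a plain dict, ready to become one table row
print(row["title"], row["tags"], row["author"])
```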

⚡ High-Performance Patterns

Batch Processing

Always use the workers parameter in ExtractArticles for production workflows.

Safe: 3-5 workers.
Aggressive: 10-20 workers (ensure you aren't being rate-limited).

Persistence

The SaveAs verb supports .csv, .json, .parquet, and .xlsx. Parquet is recommended for large datasets to preserve types.
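
The format-by-extension behavior can be approximated as below. The save_as helper is a hypothetical sketch, not the SaveAs implementation: it writes CSV and JSON with the standard library and hands Parquet and Excel to pandas, which in turn needs an engine such as pyarrow or openpyxl installed.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_as(rows, path):
    """Write a list of dicts to `path`, choosing the format from its suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        fieldnames = list(rows[0]) if rows else []
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    elif suffix == ".json":
        path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")
    elif suffix in (".parquet", ".xlsx"):
        import pandas as pd  # imported lazily; these formats need extra engines

        frame = pd.DataFrame(rows)
        if suffix == ".parquet":
            frame.to_parquet(path, index=False)
        else:
            frame.to_excel(path, index=False)
    else:
        raise ValueError(f"Unsupported format: {suffix}")
    return path


rows = [{"title": "Chip demand rises", "word_count": 512}]
with tempfile.TemporaryDirectory() as tmp:
    out = save_as(rows, Path(tmp) / "articles.json")
    loaded = json.loads(out.read_text(encoding="utf-8"))
print(loaded)
```

CSV stores every value as text, so numeric and date columns must be re-parsed on load, whereas Parquet stores column types alongside the data.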

📊 Data Science Integrations

PipeFrame

Transform your results with dplyr-style syntax:

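from pipescraper import ToPipeFrame  # assumption: exported next to ToDataFrame
from pipeframe import select, filter, mutate  # note: shadows the built-in filter

# `articles` is the result of an earlier extraction pipeline
df = (articles >> ToPipeFrame() >>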
      mutate(word_count=lambda d: d['text'].str.split().str.len()) >> 
      filter(lambda d: d['word_count'] > 500) >> 
      select('title', 'word_count'))

PipePlotly

Visualizing news trends:

from pipeplotly import ggplot, aes, geom_bar, show  # assumption: show is exported by pipeplotly

(df >> ggplot(aes(x='source')) >> geom_bar() >> show())

👨‍💻 Author & Citation

Author

Dr. Yasser Mustafa
AI & Data Science Specialist
Newcastle Upon Tyne, UK
GitHub.com/Yasser03 | LinkedIn

Citation

@software{pipescraper2026,
  author = {Mustafa, Yasser},
  title = {PipeScraper: High-Performance News Extraction},
  url = {https://github.com/Yasser03/pipescraper},
  version = {0.3.0},
  year = {2026}
}

🎓 Learning Resources

Tutorial Notebook - A complete, hands-on, end-to-end walkthrough
API Reference - Detailed core documentation
Examples - More advanced usage examples
Contributing Guide - How to contribute

PipeScraper: Making news scraping as easy as natural language. 🚀
