Use Case for Generators: News Feeds Aggregation Using Generators #2380

@dnovatchev

In response to:
QT4CG-147-02: NW to chase up DN and LQ about follow-up to the generator discussion


Use Case: News Feeds Aggregation Using Generators

Contents

  • The Problem
  • Actors
  • Goals
  • Functional Requirements
  • Constraints / Assumptions / Preconditions
  • Proposed High-Level Solution
  • Known Approaches That Are Problematic
  • Benefits of the Generators Approach
  • End-to-End Flow
    • Brief Description of the Core Processes in the Pipeline
    • Notes on the Process Pipeline
  • Why This Fits the Generator Datatype Extremely Well
  • Alternative Flows
    • Alternative Flow-1: A Feed Temporarily Stops Producing New Items
    • Alternative Flow-2: Partial Consumption of the Pipeline
    • Alternative Flow-3: Editor Inserts or Reorders Items
  • Exception Flows
    • Exception Flow-1: Feed Unreachable or Network Failure
    • Exception Flow-2: Malformed Feed Data
    • Exception Flow-3: Resource Exhaustion Risk
  • Postconditions
  • References

The Problem

Modern RSS/JSON aggregators must process hundreds of continuously updating feeds without excessive memory usage or latency, while supporting filtering, merging, and prioritization in real time.


Actors

  • End-User
  • Editor
  • Administrator
  • System components (internal processes acting as secondary actors)
  • External services (RSS providers, APIs, social signals)

Goals

  • End-User
    “As a user, I want to get the latest, up-to-the-minute news from many important sources. I want each brief news item to be presented with a link to more detailed information from the original source.”

  • Editor
    “As an editor, I want to be alerted to any change in the aggregated news-stream, as it happens continuously, and to have powerful ways of inserting, reordering, appending, prepending or deleting one or more news-items.”

  • Administrator
    “As an administrator, I want to start, stop, or restart the system, manage the configured feeds, and monitor operational health and error conditions.”


Functional Requirements

  • Consume RSS / Atom / JSON-LD feeds incrementally
  • Filter items by topic or sensitivity
  • Merge multiple feeds chronologically
  • Produce continuously updated summaries

Constraints / Assumptions / Preconditions

Assumptions

  • Feeds may be large or unbounded
  • Items arrive over time

Constraint

  • Memory usage must remain bounded

Preconditions

  • At least one news feed is configured
  • Feeds are RSS, Atom, or JSON-LD, and items are timestamped
  • Items within a feed are presented in reverse-chronological order
  • Each item contains a content link or, optionally, inline content
  • Items may belong to multiple categories

Proposed High-Level Solution

Each feed is modeled as a generator producing yield values lazily.
The ordered set of values produced by successive, demand-driven calls to move-next() is called the yield of the generator.

A generator’s yield may be finite or infinite, and may be empty for a given generator instance without implying exhaustion of the underlying data source.
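
To make the terminology concrete, here is a minimal sketch in Python (used for illustration only; the proposed generator datatype is language-neutral). The page data is invented, and Python's next() plays the role of move-next():

from typing import Iterator

def feed_generator(pages: list[list[str]]) -> Iterator[str]:
    # Model of a feed as a generator: values are produced lazily,
    # one per demand-driven request.
    for page in pages:            # hypothetical already-fetched pages
        for item in page:
            yield item            # one yield value per request

# Demand-driven consumption: nothing is produced until requested.
g = feed_generator([["item-3", "item-2"], ["item-1"]])
print(next(g))   # -> item-3
print(next(g))   # -> item-2
# The rest of the yield ("item-1") is never materialized unless requested.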

Known Approaches That Are Problematic

These approaches require full materialization in memory:

  • Eager sequences (XPath)
  • DOM-style loading
  • Materialized feeds

Benefits of the Generators Approach

  • Bounded memory usage
  • Low latency
  • Composability
  • Deterministic control of evaluation

End-to-End Flow

+-------------------------------+
| 1. Feed Fetching              |
| Input:  external providers    |
| Output: G_rawItems            |
+---------------+---------------+
                |
+---------------v---------------+
| 2. Normalization              |
| Input:  G_rawItems            |
| Output: G_normalizedItems     |
+---------------+---------------+
                |
+---------------v---------------+
| 3. Filtering                  |  <-- unwanted content removed
| Input:  G_normalizedItems     |
| Output: G_filteredItems       |
+---------------+---------------+
                |
+---------------v---------------+
| 4. Topic Classification       |
| Input:  G_filteredItems       |
| Output: G_classifiedItems     |
+---------------+---------------+
                |
+---------------v---------------+
| 5. Clustering                 |
| Input:  G_classifiedItems     |
| Output: G_clusteredItems      |
+---------------+---------------+
                |
+---------------v---------------+
| 6. Ranking                    |
| Input:  G_clusteredItems      |
| Output: G_rankedItems         |
+---------------+---------------+
                |
+---------------v---------------+
| 7. Summary Page Generation    |
| Input:  G_rankedItems         |
| Output: G_summaryPageItems,   |
|         HTML                  |
+---------------+---------------+
                |
+---------------v---------------+
| 8. Detail Page Generation     |
| Input:  G_summaryPageItems    |
| Output: HTML Detail Pages     |
+-------------------------------+

Remarks

  1. The participating generator instances are named using the convention G_{name}.
  2. Every stage except the final one produces a new generator.
  3. Every stage except the very first uses a generator as its input.
  4. Arrow semantics: the output generator of one stage is the input for the next stage.
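
As an illustration of Remarks 2–4, the whole pipeline is a composition of functions that each take a generator and return a new one. The stage functions below are hypothetical, heavily simplified Python stand-ins for Processes 2–6:

from typing import Iterable, Iterator

def normalize(raw: Iterable[dict]) -> Iterator[dict]:
    # Stand-in for Process 2: map each raw item to a uniform shape.
    return (dict(item, normalized=True) for item in raw)

def keep_allowed(items: Iterable[dict]) -> Iterator[dict]:
    # Stand-in for Process 3: drop unwanted items.
    return (i for i in items if not i.get("blocked"))

def rank(items: Iterable[dict]) -> Iterator[dict]:
    # Stand-in for Process 6: attach a score.
    return (dict(i, score=1.0) for i in items)

def pipeline(g_raw_items: Iterable[dict]) -> Iterator[dict]:
    # Arrow semantics of the diagram: the output generator of one
    # stage is the input of the next; nothing runs until consumed.
    return rank(keep_allowed(normalize(g_raw_items)))

g = pipeline(iter([{"title": "a"}, {"title": "b", "blocked": True}]))
print(list(g))   # [{'title': 'a', 'normalized': True, 'score': 1.0}]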

Brief Description of the Core Processes in the Pipeline

Process 1 — Feed Fetching & Acquisition

Goal:
Continuously pull RSS / Atom / JSON-LD feeds from CNN, Fox, NBC, BBC, etc.

Includes:

  • Periodic polling (e.g., every 5 minutes)
  • Detection of new items (GUID, URL hash, published timestamps)
  • N-way merging to ensure the resulting yield is sorted in reverse-chronological order
  • Basic sanity validation (e.g., XML schema validity)

Output:
A generator whose yield values are raw feed items (XML / JSON documents) → input to Process 2.
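
A possible sketch of the lazy N-way merge, assuming (per the preconditions) that each feed is already reverse-chronological. heapq.merge pulls from the feed generators on demand, so memory stays bounded by the number of feeds, not the number of items; feed names and timestamps are invented:

import heapq
from typing import Iterator

def feed(name: str, timestamps: list[int]) -> Iterator[dict]:
    # Hypothetical feed: yields items newest-first, as required.
    for ts in timestamps:
        yield {"source": name, "published": ts}

def g_raw_items(*feeds: Iterator[dict]) -> Iterator[dict]:
    # N-way merge of reverse-chronological inputs into a single
    # reverse-chronological yield; merging is fully lazy.
    return heapq.merge(*feeds, key=lambda i: i["published"], reverse=True)

merged = g_raw_items(feed("cnn", [90, 40, 10]), feed("bbc", [80, 50]))
print([i["published"] for i in merged])   # [90, 80, 50, 40, 10]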


Process 2 — Parsing & Normalization

Goal:
Convert heterogeneous raw feed items into a uniform internal format.

Normalized fields include:

  • Title
  • Description / Summary
  • Full text (if available)
  • URL
  • Publication time (converted to UTC)
  • Source
  • Images, categories, tags
  • Named entities (optional NLP-based enrichment)

Output:
A generator yielding clean, normalized NewsItem documents → input to Process 3.
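
A minimal sketch of this stage, assuming RSS-style input fields (title, link, pubDate, and source are illustrative names, not a fixed schema). Malformed items are quarantined rather than propagated, which also anticipates Exception Flow 2:

from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Iterable, Iterator

QUARANTINE: list[dict] = []   # malformed items kept for auditing

def normalize(raw_items: Iterable[dict]) -> Iterator[dict]:
    # Yield uniform NewsItem dicts; field names are illustrative.
    for raw in raw_items:
        try:
            yield {
                "title": raw["title"],
                "summary": raw.get("description", ""),
                "url": raw["link"],
                # RFC 2822 pubDate (RSS style) converted to UTC:
                "published": parsedate_to_datetime(raw["pubDate"])
                             .astimezone(timezone.utc),
                "source": raw["source"],
            }
        except (KeyError, ValueError, TypeError):
            # Malformed item: quarantine it and keep the stream going.
            QUARANTINE.append(raw)

item = {"title": "T", "link": "https://x.example", "source": "x",
        "pubDate": "Mon, 06 Jan 2025 10:00:00 +0200"}
print(next(normalize(iter([item])))["published"])   # 2025-01-06 08:00:00+00:00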


Process 3 — Content Filtering & Exclusion Rules

Goal:
Remove unwanted items early using configurable rule sets.

Examples:

  • Blocked topics: politics, celebrity gossip, violence, etc.
  • Blocked entities: Donald Trump, Joe Biden, Kanye West, etc.
  • Blocked publishers (optional)
  • Expiration rules:
    • Tech news stale after 48 hours
    • Breaking news stale after 6 hours

Techniques:

  • Keyword filtering
  • Named Entity Recognition (NER)
  • Sensitive-topic classifiers (ML-based)
  • Freshness scoring

Output:
A generator yielding allowed, filtered NewsItem documents → input to Process 4.
Rejected items are stored separately for auditing.
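
A sketch of this stage under an assumed rule configuration (the blocked terms, the kind field, and the expiration limits are all illustrative). Each rule is just a condition inside a generator, so rejected items never reach downstream stages:

from datetime import datetime, timedelta
from typing import Iterable, Iterator

BLOCKED_TERMS = {"celebrity", "gossip"}                 # example rule set
MAX_AGE = {"tech": timedelta(hours=48),
           "breaking": timedelta(hours=6)}              # expiration rules

def keep_allowed(items: Iterable[dict], now: datetime) -> Iterator[dict]:
    # `now` must be timezone-aware, matching the UTC timestamps
    # produced by the normalization stage.
    for item in items:
        text = (item["title"] + " " + item["summary"]).lower()
        if any(term in text for term in BLOCKED_TERMS):
            continue                      # blocked topic or entity
        limit = MAX_AGE.get(item.get("kind", ""), timedelta(days=7))
        if now - item["published"] > limit:
            continue                      # stale by expiration rules
        yield item                        # allowed item flows downstream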


Process 4 — Topic Classification

Goal:
Assign each item to one or more topics.

Example topics:

  • Politics
  • World
  • Tech
  • Health
  • Sports
  • Business
  • Disasters / Urgent events
  • Crime / Safety
  • Entertainment

Approaches:

  • Fine-tuned BERT classifier (preferred)
  • TF-IDF + SVM (simpler)
  • Feed-provided category tags (fallback)

Output:
A generator yielding categorized NewsItem documents → input to Process 5.
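
A sketch showing only the fallback chain (model prediction, then feed-provided tags, then a default topic). classify_ml is a hypothetical placeholder where a fine-tuned BERT or TF-IDF + SVM classifier would plug in:

from typing import Iterable, Iterator, Optional

def classify_ml(item: dict) -> Optional[list[str]]:
    # Placeholder for a real model (fine-tuned BERT, TF-IDF + SVM).
    # Returning None simulates "no confident prediction".
    return None

def classify(items: Iterable[dict]) -> Iterator[dict]:
    for item in items:
        topics = classify_ml(item) or item.get("tags") or ["world"]
        yield dict(item, topics=topics)   # an item may carry several topics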


Process 5 — Similarity Analysis & Clustering

Goal:
Group news items from different sources describing the same event.

Techniques:

  • Semantic vector embeddings (e.g., SBERT, Ada embeddings)
  • Cosine similarity
  • Hierarchical clustering or DBSCAN

Produces:

  • Clusters of highly similar articles
  • A primary (best) representative per cluster

Output:
A generator yielding clusters of related articles → input to Process 6.

Note:
To better match streaming behavior, clustering may operate within bounded windows (e.g., sliding windows) while still consuming the input generator.
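
A crude sketch of the bounded-window variant: greedy grouping by cosine similarity within a fixed-size window, standing in for hierarchical clustering or DBSCAN. The vec field is a hypothetical precomputed embedding (e.g., SBERT), assumed non-zero:

import math
from itertools import islice
from typing import Iterable, Iterator

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def clusters(items: Iterable[dict], window: int = 100,
             threshold: float = 0.8) -> Iterator[list[dict]]:
    it = iter(items)
    while True:
        batch = list(islice(it, window))   # bounded window, not the stream
        if not batch:
            return
        groups: list[list[dict]] = []
        for item in batch:
            for group in groups:
                if cosine(item["vec"], group[0]["vec"]) >= threshold:
                    group.append(item)     # same event as the group head
                    break
            else:
                groups.append([item])      # start a new cluster
        yield from groups                  # each value is one cluster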


Process 6 — Ranking, Urgency, and Freshness Scoring

Goal:
Prioritize which news appears on the Summary Page.

Computed scores:

  • Freshness score (more recent → higher)
  • Urgency score (disasters, crises, violence)
  • Coverage score (number of sources reporting)
  • Engagement score (optional: social signals)

Weighted formula:

FinalScore = a*Urgency + b*Freshness + c*Coverage + d*EditorRules

Items with the highest scores per topic are selected.

This stage does not require a total ordering of the stream; instead, a partial ordering (e.g., top-K per topic) keeps memory bounded.

Editor-driven operations (insert, remove, reorder) are modeled as generator transformations applied downstream of ranking.

Output:
A generator yielding ranked clusters → input to Process 7.
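
A sketch of bounded top-K ranking over successive windows of the cluster stream; the weights and window size are illustrative, and the per-topic split is omitted for brevity:

import heapq
from itertools import islice
from typing import Iterable, Iterator

A, B, C, D = 3.0, 2.0, 1.0, 5.0   # illustrative weights a, b, c, d

def final_score(cluster: dict) -> float:
    # FinalScore = a*Urgency + b*Freshness + c*Coverage + d*EditorRules
    return (A * cluster["urgency"] + B * cluster["freshness"]
            + C * cluster["coverage"] + D * cluster.get("editor", 0.0))

def ranked(clusters: Iterable[dict], window: int = 500,
           k: int = 10) -> Iterator[dict]:
    # Top-K per bounded window: a partial ordering that keeps memory
    # bounded instead of totally sorting an unbounded stream.
    it = iter(clusters)
    while batch := list(islice(it, window)):
        yield from heapq.nlargest(k, batch, key=final_score)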


Process 7 — Summary Page Generation

This stage consumes the input generator and produces finite views intended for presentation.

Goal:
Build a continuously updated Summary Page (“Front Page”) containing:

  • Top events per topic
  • Short summaries
  • Links to primary articles
  • “Read similar news” (cluster siblings)
  • Source icons
  • Timestamp of most recent update

The page auto-refreshes and always reflects the newest items.


Process 8 — Detailed Pages & Cross-Links

This stage consumes its input generator and produces finite presentation views.

For each cluster:

  • Canonical article (primary representative)
  • Related articles across sources
  • Timeline of developments
  • Additional metadata (images, entities, tags)

Cross-links include:

  • “More like this…”
  • “Earlier developments…”
  • “Follow-up stories…”
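
Both presentation stages reduce to consuming a finite prefix of an otherwise unbounded generator. A minimal sketch for the Summary Page, with itertools.islice playing the role of take(n) and purely illustrative markup:

from itertools import islice
from typing import Iterable

def summary_page(ranked: Iterable[dict], n: int = 20) -> str:
    # Consume only the first n ranked clusters (take(n)) and render
    # a minimal HTML fragment; upstream stays unmaterialized.
    rows = (f'<li><a href="{c["url"]}">{c["title"]}</a></li>'
            for c in islice(ranked, n))
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"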

Notes on the Process Pipeline

  • Feed Fetching typically wraps one or more data providers
    → produces G_rawItems lazily (RSS, JSON APIs, DB cursors, web services)
  • Every stage is expressible as for-each, filter, append, prepend, insert-at, remove-where, concat, fold, etc., producing a new generator derived from the previous one
  • No stage requires full materialization unless explicitly demanded
    (e.g., to-array, bounded sort, pagination)
  • Infinite generators are valid through stage 6; stages 7–8 typically consume finite prefixes (take(n))

Why This Fits the Generator Datatype Extremely Well

  • The pipeline is a composition of generator transformers
  • Each box maps almost 1-to-1 to generator operations
  • External data providers integrate naturally at Stage 1
  • Sorting can be introduced in different ways:
    • External merge-sort over generators
    • Bounded-window ranking
    • Top-K lazy ranking – e.g. using heaps.

Alternative Flows

Alternative Flow 1 — Feed Temporarily Stops Producing New Items

Condition:
A feed is reachable but has no new items since the last polling cycle.

Flow:

  1. The feed generator advances (move-next()).
  2. The data provider returns no new items.
  3. The feed-generator instance yields no items during this interval.
  4. Downstream generators remain operational.
  5. If all feeds are empty, no new items are added downstream.

Result:
The pipeline continues uninterrupted; no special handling is required.


Alternative Flow 2 — Partial Consumption of the Pipeline

Condition:
Only a finite prefix of the stream is required (e.g., top N items).

Flow:

  1. Downstream consumers apply take(N).
  2. Upstream generators are evaluated only as needed.
  3. Remaining potential yield values are never materialized.

Result:
Latency and memory usage remain bounded. The pipeline supports early termination naturally.
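
A small demonstration of this property: the unbounded upstream generator below is advanced exactly three times, and the rest of its potential yield is never produced:

from itertools import islice

def upstream():
    # Unbounded source that reveals how far it was actually evaluated.
    n = 0
    while True:
        n += 1
        print(f"producing item {n}")
        yield n

top3 = list(islice(upstream(), 3))   # take(3): only 3 items are produced
print(top3)                          # [1, 2, 3]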


Alternative Flow 3 — Editor Inserts or Reorders Items

Condition:
An editor manually modifies the aggregated stream.

Flow:

  1. Editor operations are applied as generator transformations
    (append, prepend, insert-at, remove-at, remove-where).
  2. A new generator with the modified yield is produced.
  3. Downstream stages consume it transparently.

Result:
Editorial control integrates seamlessly without breaking the pipeline.
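
A sketch of two of these editor operations as generator transformations; the names mirror the operations listed above, and the implementations are illustrative:

from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

def insert_at(items: Iterable[T], pos: int, new: T) -> Iterator[T]:
    # insert-at: a new generator whose yield has `new` at position pos.
    it = iter(items)
    yield from islice(it, pos)
    yield new
    yield from it

def remove_where(items: Iterable[T],
                 pred: Callable[[T], bool]) -> Iterator[T]:
    # remove-where: drop every item matching the predicate.
    return (i for i in items if not pred(i))

stream = iter(["a", "b", "d"])
print(list(remove_where(insert_at(stream, 2, "c"), lambda s: s == "a")))
# -> ['b', 'c', 'd']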


Exception Flows

Exception Flow 1 — Feed Unreachable or Network Failure

Condition:
A feed cannot be reached during polling.

Flow:

  1. The data provider reports an error or timeout.
  2. The next instance of the feed generator yields no items during this polling interval.
  3. The error is logged for monitoring.
  4. A retry policy (e.g., exponential backoff) is applied.

Result:
The system continues operating with remaining feeds.
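
A sketch of one polling cycle with retries; fetch is a hypothetical provider call, and on total failure the generator instance simply produces an empty yield, exactly as step 2 above describes:

import time
from typing import Callable, Iterator

def resilient_poll(fetch: Callable[[], list[dict]],
                   max_retries: int = 3) -> Iterator[dict]:
    for attempt in range(max_retries):
        try:
            yield from fetch()            # one polling cycle's items
            return
        except OSError as exc:            # timeout or network failure
            print(f"poll failed ({exc!r}); retrying")   # stand-in for logging
            time.sleep(2 ** attempt)      # exponential backoff: 1s, 2s, 4s
    # All retries failed: this instance yields no items for the interval.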


Exception Flow 2 — Malformed Feed Data

Condition:
A feed item is malformed (invalid XML/JSON or schema validation problems, e.g. missing required fields).

Flow:

  1. The normalization stage detects the issue.
  2. The item is discarded or quarantined.
  3. Processing continues with subsequent items.

Result:
Malformed data does not propagate downstream.


Exception Flow 3 — Resource Exhaustion Risk

Condition:
A downstream operation risks exceeding memory limits.

Flow:

  1. Bounded strategies (windowing, top-K selection) are applied.
  2. Full materialization is avoided.
  3. If needed, the operation degrades gracefully (e.g., reduced clustering depth).

Result:
System stability is preserved under load.


Postconditions

Upon successful execution:

Functional Outcomes

  • End users see an up-to-date Summary Page.
  • Each summary item links to a Detailed Page.
  • Editors can intervene using generator operations.
  • Administrators retain full system control.

Technical Guarantees

  • Memory usage remains bounded.
  • Latency is minimized through lazy evaluation.
  • Full materialization occurs only when explicitly requested.

System State

  • All generators remain composable.
  • Generator composition remains valid after alternative and exceptional flows.
  • Empty generators correctly represent exhaustion of a given generator instance (without implying exhaustion of the underlying source).
  • Infinite yields are supported up to stages that require finiteness.

References

  1. RSS 2.0 Specification,
    https://www.rssboard.org/rss-specification

  2. The Atom Syndication Format (RFC 4287),
    https://www.rfc-editor.org/rfc/rfc4287

  3. JSON-LD Specification,
    https://json-ld.org/spec/

  4. TF-IDF, “Understanding TF-IDF (Term Frequency-Inverse Document Frequency)”,
    https://www.geeksforgeeks.org/machine-learning/understanding-tf-idf-term-frequency-inverse-document-frequency/

  5. TF-IDF + SVM, “Strengthening Fake News Detection: Leveraging SVM and Sophisticated Text Vectorization Techniques. Defying BERT?”,
    https://arxiv.org/html/2411.12703v1

  6. Sentence-BERT (SBERT), Reimers, N. & Gurevych, I., 2019,
    https://arxiv.org/abs/1908.10084

  7. Fine-tuned BERT, “Fine-tuning a BERT model”,
    https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert

  8. Ada Embeddings (OpenAI), “Text and Code Embeddings by Contrastive Pre-Training”, Neelakantan et al., 2022,
    https://arxiv.org/abs/2201.10005

  9. Cosine Similarity,
    https://en.wikipedia.org/wiki/Cosine_similarity

  10. Hierarchical Clustering,
    https://en.wikipedia.org/wiki/Hierarchical_clustering

  11. DBSCAN, Ester et al., 1996,
    https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf
