Description
In response to:
QT4CG-147-02: NW to chase up DN and LQ about follow-up to the generator discussion
Use Case: News Feeds Aggregation Using Generators
Contents
Use Case: News Feeds Aggregation Using Generators
- Actors
- Goals
- Functional Requirements
- Constraints / Assumptions / Preconditions
- Proposed High-Level Solution
- Known Approaches that are Problematic
- Benefits of the Generators Approach
- End-to-End Flow
- Brief Description of the Core Processes in the Pipeline
- Notes on the Process Pipeline
- Why This Fits the Generator Datatype Extremely Well
- Alternative Flows
- Alternative Flow-1: A Feed Temporarily Stops Producing New Items
- Alternative Flow-2: Partial Consumption of the Pipeline
- Alternative Flow-3: Editor Inserts or Reorders Items
- Exception Flows
- Exception Flow-1: Feed Unreachable or Network Failure
- Exception Flow-2: Malformed Feed Data
- Exception Flow-3: Resource Exhaustion Risk
- Postconditions
- References
The Problem
Modern RSS/JSON aggregators must process hundreds of continuously updating feeds without excessive memory usage or latency, while supporting filtering, merging, and prioritization in real time.
Actors
- End-User
- Editor
- Administrator
- System components (internal processes acting as secondary actors)
- External services (RSS providers, APIs, social signals)
Goals
- End-User: “As a user, I want to get the latest, up-to-the-minute news from many important sources. I want each brief news item to be presented with a link to more detailed information from the original source.”
- Editor: “As an editor, I want to be alerted to any change in the aggregated news-stream, as it happens continuously, and to have powerful ways of inserting, reordering, appending, prepending or deleting one or more news-items.”
- Administrator: “As an administrator, I want to start, stop, or restart the system, manage the configured feeds, and monitor operational health and error conditions.”
Functional Requirements
- Consume RSS / Atom / JSON-LD feeds incrementally
- Filter items by topic or sensitivity
- Merge multiple feeds chronologically
- Produce continuously updated summaries
Constraints / Assumptions / Preconditions
Assumptions
- Feeds may be large or unbounded
- Items arrive over time
Constraint
- Memory usage must remain bounded
Preconditions
- At least one news feed is configured
- Feeds are RSS or JSON-LD and timestamped
- Items within a feed are presented in reverse-chronological order
- Each item contains a content link or, optionally, inline content
- Items may belong to multiple categories
Proposed High-Level Solution
Each feed is modeled as a generator producing yield values lazily.
The ordered set of values produced by successive, demand-driven calls to move-next() is called the yield of the generator.
A generator’s yield may be finite or infinite, and may be empty for a given generator instance without implying exhaustion of the underlying data source.
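The sketches in this document use Python, whose built-in generators exhibit the same demand-driven semantics as the proposed datatype; a call to Python's next() plays the role of move-next(). This is purely illustrative: the feed contents are invented, and nothing here prescribes the eventual XPath/XQuery surface syntax.

```python
# Minimal sketch of the generator concept: each next() call (move-next())
# produces one yield value lazily; nothing else is materialized.

def feed_generator(items):
    """Model a feed as a generator; values are produced only on demand."""
    for item in items:
        yield item          # one yield value per move-next() call

g = feed_generator(["item-1", "item-2", "item-3"])
print(next(g))  # "item-1"; the rest of the yield is not yet materialized
print(next(g))  # "item-2"
```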
Known Approaches That Are Problematic
These approaches require full materialization in memory:
- Eager sequences (XPath)
- DOM-style loading
- Materialized feeds
Benefits of the Generators Approach
- Bounded memory usage
- Low latency
- Composability
- Deterministic control of evaluation
End-to-End Flow
+-------------------------------+
| 1. Feed Fetching |
| Input: external providers |
| Output: G_rawItems |
+---------------+---------------+
|
+---------------v---------------+
| 2. Normalization |
| Input: G_rawItems |
| Output: G_normalizedItems |
+---------------+---------------+
|
+---------------v---------------+
| 3. Filtering | <-- unwanted content removed
| Input: G_normalizedItems |
| Output: G_filteredItems |
+---------------+---------------+
|
+---------------v---------------+
| 4. Topic Classification |
| Input: G_filteredItems |
| Output: G_classifiedItems |
+---------------+---------------+
|
+---------------v---------------+
| 5. Clustering |
| Input: G_classifiedItems |
| Output: G_clusteredItems |
+---------------+---------------+
|
+---------------v---------------+
| 6. Ranking |
| Input: G_clusteredItems |
| Output: G_rankedItems |
+---------------+---------------+
|
+---------------v---------------+
| 7. Summary Page Generation |
| Input: G_rankedItems |
| Output: G_summaryPageItems, |
| HTML |
+---------------+---------------+
|
+---------------v---------------+
| 8. Detail Page Generation |
| Input: G_summaryPageItems |
| Output: HTML Detail Pages |
+-------------------------------+
Remarks
- The participating generator instances are named using the convention G_{name}.
- Every stage except the final one produces a new generator.
- Every stage except the very first uses a generator as its input.
- Arrow semantics: the output generator of one stage is the input for the next stage.
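To make the arrow semantics concrete, here is a runnable toy version of the first four stages in Python. The item fields and the trivial stage bodies are hypothetical stand-ins, not part of the proposal; the point is that every stage consumes one generator and returns a new one, so no stage materializes the full stream.

```python
def fetch():                                  # 1. Feed Fetching (G_rawItems)
    yield {"title": " Quake hits coast ", "topic": None, "blocked": False}
    yield {"title": "Celebrity gossip", "topic": None, "blocked": True}

def normalize(g):                             # 2. Normalization
    return ({**it, "title": it["title"].strip()} for it in g)

def filter_items(g):                          # 3. Filtering
    return (it for it in g if not it["blocked"])

def classify(g):                              # 4. Topic Classification
    return ({**it, "topic": "Disasters"} for it in g)

# Each arrow in the diagram is just "output generator -> next stage's input".
pipeline = classify(filter_items(normalize(fetch())))
for item in pipeline:                         # demand-driven evaluation
    print(item["topic"], "-", item["title"])
```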
Brief Description of the Core Processes in the Pipeline
Process 1 — Feed Fetching & Acquisition
Goal:
Continuously pull RSS / Atom / JSON-LD feeds from CNN, Fox, NBC, BBC, etc.
Includes:
- Periodic polling (e.g., every 5 minutes)
- Detection of new items (GUID, URL hash, published timestamps)
- N-way merging to ensure the resulting yield is sorted in reverse-chronological order
- Basic sanity validation (e.g., XML schema validity)
Output:
A generator whose yield values are raw feed items (XML / JSON documents) → input to Process 2.
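One way to realize the N-way merge lazily, assuming each feed is already newest-first, is Python's heapq.merge, which pulls one item at a time from each input. The timestamps and feed contents below are made up for the sketch.

```python
import heapq

feed_a = iter([("2024-05-03T10:00Z", "A: story 3"),
               ("2024-05-03T08:00Z", "A: story 1")])
feed_b = iter([("2024-05-03T09:00Z", "B: story 2")])

# Each input is newest-first, so reverse=True keeps the merged yield
# newest-first as well; items are pulled on demand, never all at once.
merged = heapq.merge(feed_a, feed_b, key=lambda t: t[0], reverse=True)
for ts, title in merged:
    print(ts, title)
```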
Process 2 — Parsing & Normalization
Goal:
Convert heterogeneous raw feed items into a uniform internal format.
Normalized fields include:
- Title
- Description / Summary
- Full text (if available)
- URL
- Publication time (converted to UTC)
- Source
- Images, categories, tags
- Named entities (optional NLP-based enrichment)
Output:
A generator yielding clean, normalized NewsItem documents → input to Process 3.
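A sketch of normalization as a generator transformation follows. The NewsItem shape is cut down from the field list above, and the raw input is an invented RSS-like item; real code would also convert timestamps to UTC and handle the alternative field names of Atom and JSON-LD.

```python
from dataclasses import dataclass

@dataclass
class NewsItem:
    title: str
    summary: str
    url: str
    published_utc: str
    source: str

def normalize(g_raw):
    for raw in g_raw:                       # one item pulled per demand
        yield NewsItem(
            title=raw.get("title") or raw.get("headline", ""),
            summary=raw.get("description", ""),
            url=raw["link"],
            published_utc=raw["pubDate"],   # assumed already UTC here
            source=raw.get("source", "unknown"),
        )

raw_items = iter([{"title": "Quake hits coast", "link": "https://example.org/a",
                   "pubDate": "2024-05-03T10:00Z", "source": "BBC"}])
for item in normalize(raw_items):
    print(item)
```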
Process 3 — Content Filtering & Exclusion Rules
Goal:
Remove unwanted items early using configurable rule sets.
Examples:
- Blocked topics: politics, celebrity gossip, violence, etc.
- Blocked entities: Donald Trump, Joe Biden, Kanye West, etc.
- Blocked publishers (optional)
- Expiration rules:
- Tech news stale after 48 hours
- Breaking news stale after 6 hours
Techniques:
- Keyword filtering
- Named Entity Recognition (NER)
- Sensitive-topic classifiers (ML-based)
- Freshness scoring
Output:
A generator yielding allowed, filtered NewsItem documents → input to Process 4.
Rejected items are stored separately for auditing.
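An illustrative filtering generator applying blocked-topic and expiration rules from the lists above; the thresholds, item fields, and the quarantine list are assumptions for the sketch.

```python
from datetime import datetime, timedelta, timezone

BLOCKED_TOPICS = {"politics", "celebrity gossip"}
MAX_AGE = {"tech": timedelta(hours=48), "breaking": timedelta(hours=6)}

def filter_items(g_items, rejected):
    now = datetime.now(timezone.utc)
    for it in g_items:
        age_limit = MAX_AGE.get(it["kind"], timedelta(days=7))
        too_old = now - it["published"] > age_limit
        if it["topic"] in BLOCKED_TOPICS or too_old:
            rejected.append(it)            # stored separately for auditing
        else:
            yield it

rejected = []
fresh = {"topic": "tech", "kind": "tech",
         "published": datetime.now(timezone.utc) - timedelta(hours=1)}
print(list(filter_items(iter([fresh]), rejected)), rejected)
```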
Process 4 — Topic Classification
Goal:
Assign each item to one or more topics.
Example topics:
- Politics
- World
- Tech
- Health
- Sports
- Business
- Disasters / Urgent events
- Crime / Safety
- Entertainment
Approaches:
- Fine-tuned BERT classifier (preferred)
- TF-IDF + SVM (simpler)
- Feed-provided category tags (fallback)
Output:
A generator yielding categorized NewsItem documents → input to Process 5.
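Whatever classifier is chosen, the stage itself is a lazy map over the filtered stream. The keyword table below is a toy stand-in for the ML classifiers named above; the fallback to feed-provided category tags mirrors the third approach.

```python
KEYWORDS = {"earthquake": "Disasters", "league": "Sports"}

def classify(g_items):
    for it in g_items:
        hit = next((t for kw, t in KEYWORDS.items()
                    if kw in it["title"].lower()), None)
        # Fall back to the feed's own category tags when no topic matches.
        yield {**it, "topics": [hit] if hit else it.get("feed_tags", ["World"])}

items = iter([{"title": "Earthquake shakes region"},
              {"title": "Election results", "feed_tags": ["Politics"]}])
for it in classify(items):
    print(it["title"], "->", it["topics"])
```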
Process 5 — Similarity Analysis & Clustering
Goal:
Group news items from different sources describing the same event.
Techniques:
- Semantic vector embeddings (e.g., SBERT, Ada embeddings)
- Cosine similarity
- Hierarchical clustering or DBSCAN
Produces:
- Clusters of highly similar articles
- A primary (best) representative per cluster
Output:
A generator yielding clusters of related articles → input to Process 6.
Note:
To better match streaming behavior, clustering may operate within bounded windows (e.g., sliding windows) while still consuming the input generator.
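A sketch of that bounded-window behavior: the input generator is consumed in fixed-size windows so memory stays bounded, and items sharing a crude similarity key are grouped within each window. A real deployment would replace the toy token key with SBERT embeddings and cosine similarity.

```python
from itertools import islice

def windowed_clusters(g_items, window_size=100):
    while True:
        window = list(islice(g_items, window_size))   # bounded buffer
        if not window:
            return
        clusters = {}
        for it in window:
            key = frozenset(it["title"].lower().split()[:3])  # toy key
            clusters.setdefault(key, []).append(it)
        for cluster in clusters.values():
            yield cluster        # one cluster of similar articles at a time

items = iter([{"title": "Quake hits coastal town"},
              {"title": "Quake hits coastal region"},
              {"title": "League final tonight"}])
for cluster in windowed_clusters(items, window_size=10):
    print([it["title"] for it in cluster])
```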
Process 6 — Ranking, Urgency, and Freshness Scoring
Goal:
Prioritize which news appears on the Summary Page.
Computed scores:
- Freshness score (more recent → higher)
- Urgency score (disasters, crises, violence)
- Coverage score (number of sources reporting)
- Engagement score (optional: social signals)
Weighted formula:
FinalScore = a*Urgency + b*Freshness + c*Coverage + d*EditorRules
Items with the highest scores per topic are selected.
This stage does not require a full total ordering; instead a partial ordering (e.g., top-K per topic) preserves bounded memory.
Editor-driven operations (insert, remove, reorder) are modeled as generator transformations applied downstream of ranking.
Output:
A generator yielding ranked clusters → input to Process 7.
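A top-K-per-topic sketch using bounded heaps, matching the weighted formula above. The weights a..d and the score fields are illustrative values, and the function is meant to run over a finite window of the stream (cf. the windowing note in Process 5), not an unbounded one.

```python
import heapq

A, B, C, D = 0.4, 0.3, 0.2, 0.1      # a*Urgency + b*Freshness + c*Coverage + d*EditorRules

def final_score(cluster):
    return (A * cluster["urgency"] + B * cluster["freshness"]
            + C * cluster["coverage"] + D * cluster["editor_rules"])

def top_k_per_topic(g_clusters, k=3):
    heaps = {}                              # topic -> min-heap capped at k
    for i, c in enumerate(g_clusters):
        heap = heaps.setdefault(c["topic"], [])
        entry = (final_score(c), i, c)      # i breaks score ties
        if len(heap) < k:
            heapq.heappush(heap, entry)
        else:
            heapq.heappushpop(heap, entry)  # memory stays bounded at k
    for topic, heap in heaps.items():
        for score, _, c in sorted(heap, reverse=True):
            yield topic, score, c

clusters = iter([{"topic": "Tech", "urgency": 0.1, "freshness": 0.9,
                  "coverage": 0.5, "editor_rules": 0.0},
                 {"topic": "Tech", "urgency": 0.9, "freshness": 0.8,
                  "coverage": 0.7, "editor_rules": 0.0}])
for topic, score, c in top_k_per_topic(clusters, k=1):
    print(topic, round(score, 2))
```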
Process 7 — Summary Page Generation
This stage consumes the input generator and produces finite views intended for presentation.
Goal:
Build a continuously updated Summary Page (“Front Page”) containing:
- Top events per topic
- Short summaries
- Links to primary articles
- “Read similar news” (cluster siblings)
- Source icons
- Timestamp of most recent update
The page auto-refreshes and always reflects the newest items.
Process 8 — Detailed Pages & Cross-Links
This stage consumes its input generator and produces finite presentation views.
For each cluster:
- Canonical article (primary representative)
- Related articles across sources
- Timeline of developments
- Additional metadata (images, entities, tags)
Cross-links include:
- “More like this…”
- “Earlier developments…”
- “Follow-up stories…”
Notes on the Process Pipeline
- Feed Fetching typically wraps one or more data providers → produces G_rawItems lazily (RSS, JSON APIs, DB cursors, web services)
- Every stage is expressible as for-each, filter, append, prepend, insert-at, remove-where, concat, fold, etc., producing a new generator derived from the previous one
- No stage requires full materialization unless explicitly demanded (e.g., to-array, bounded sort, pagination)
- Infinite generators are valid until stage 6; stages 7–8 typically consume finite prefixes (take(n))
Why This Fits the Generator Datatype Extremely Well
- The pipeline is a composition of generator transformers
- Each box maps almost 1-to-1 to generator operations
- External data providers integrate naturally at Stage 1
- Sorting can be introduced in different ways:
- External merge-sort over generators
- Bounded-window ranking
- Top-K lazy ranking (e.g., using heaps)
Alternative Flows
Alternative Flow 1 — Feed Temporarily Stops Producing New Items
Condition:
A feed is reachable but has no new items since the last polling cycle.
Flow:
- The feed generator advances (move-next()).
- The data provider returns no new items.
- The feed-generator instance yields no items during this interval.
- Downstream generators remain operational.
- If all feeds are empty, no new items are added downstream.
Result:
The pipeline continues uninterrupted; no special handling is required.
Alternative Flow 2 — Partial Consumption of the Pipeline
Condition:
Only a finite prefix of the stream is required (e.g., top N items).
Flow:
- Downstream consumers apply take(N).
- Upstream generators are evaluated only as needed.
- Remaining potential yield values are never materialized.
Result:
Latency and memory usage remain bounded. The pipeline supports early termination naturally.
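In the Python sketches, take(N) corresponds to itertools.islice: only the first N yield values are ever computed, so even an unbounded upstream generator is safe to compose with.

```python
from itertools import count, islice

unbounded = ({"id": n} for n in count())      # an infinite generator
top_five = islice(unbounded, 5)               # take(5)
print([it["id"] for it in top_five])          # upstream stops after 5 pulls
```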
Alternative Flow 3 — Editor Inserts or Reorders Items
Condition:
An editor manually modifies the aggregated stream.
Flow:
- Editor operations are applied as generator transformations (append, prepend, insert-at, remove-at, remove-where).
- A new generator with the modified yield is produced.
- Downstream stages consume it transparently.
Result:
Editorial control integrates seamlessly without breaking the pipeline.
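A sketch of one such editorial operation, insert-at, as a pure generator transformation: the original generator is left untouched, and a new generator with the modified yield is returned, as described above.

```python
def insert_at(g, index, new_item):
    for i, item in enumerate(g):
        if i == index:
            yield new_item         # splice the editor's item in lazily
        yield item                 # (if index is past the end, nothing is inserted)

stream = iter(["a", "b", "c"])
print(list(insert_at(stream, 1, "EDITOR-NOTE")))  # ['a', 'EDITOR-NOTE', 'b', 'c']
```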
Exception Flows
Exception Flow 1 — Feed Unreachable or Network Failure
Condition:
A feed cannot be reached during polling.
Flow:
- The data provider reports an error or timeout.
- The next instance of the feed generator yields no items during this polling interval.
- The error is logged for monitoring.
- A retry policy (e.g., exponential backoff) is applied.
Result:
The system continues operating with remaining feeds.
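A sketch of this failure behavior: a polling generator that yields nothing for an interval when the provider errors, logs the failure, and backs off exponentially. fetch_once and its failure mode are hypothetical stand-ins for a real provider.

```python
import time

def polling_generator(fetch_once, base_delay=1.0, max_delay=60.0):
    delay = base_delay
    while True:
        try:
            for item in fetch_once():
                yield item
            delay = base_delay               # success resets the backoff
        except OSError as err:
            print("feed error, retrying:", err)   # logged for monitoring
            time.sleep(min(delay, max_delay))     # exponential backoff
            delay *= 2
            # no items are yielded this interval; downstream simply sees
            # an empty contribution from this feed and keeps running

calls = {"n": 0}
def fetch_once():
    calls["n"] += 1
    if calls["n"] == 1:
        raise OSError("timeout")             # first poll fails
    return iter(["recovered item"])

g = polling_generator(fetch_once, base_delay=0.01)
print(next(g))   # retries once, then yields "recovered item"
```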
Exception Flow 2 — Malformed Feed Data
Condition:
A feed item is malformed (invalid XML/JSON, or schema-validation problems such as missing required fields).
Flow:
- The normalization stage detects the issue.
- The item is discarded or quarantined.
- Processing continues with subsequent items.
Result:
Malformed data does not propagate downstream.
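A sketch of quarantine during normalization: malformed items are caught and set aside, and the generator simply continues with the next item, so bad data never reaches downstream stages. The required fields and the quarantine list are assumptions for the sketch.

```python
def normalize_safe(g_raw, quarantine):
    for raw in g_raw:
        try:
            yield {"title": raw["title"], "url": raw["link"]}  # required fields
        except (KeyError, TypeError) as err:
            quarantine.append((raw, repr(err)))  # kept for later inspection

bad_and_good = iter([{"title": "ok", "link": "https://example.org/x"},
                     {"title": "missing link"}])
q = []
print(list(normalize_safe(bad_and_good, q)))
print("quarantined:", q)
```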
Exception Flow 3 — Resource Exhaustion Risk
Condition:
A downstream operation risks exceeding memory limits.
Flow:
- Bounded strategies (windowing, top-K selection) are applied.
- Full materialization is avoided.
- If needed, the operation degrades gracefully (e.g., reduced clustering depth).
Result:
System stability is preserved under load.
Postconditions
Upon successful execution:
Functional Outcomes
- End users see an up-to-date Summary Page.
- Each summary item links to a Detailed Page.
- Editors can intervene using generator operations.
- Administrators retain full system control.
Technical Guarantees
- Memory usage remains bounded.
- Latency is minimized through lazy evaluation.
- Full materialization occurs only when explicitly requested.
System State
- All generators remain composable.
- Generator composition remains valid after alternative and exceptional flows.
- Empty generators correctly represent exhaustion.
- Infinite yields are supported up to stages that require finiteness.
References
- RSS 2.0 Specification, https://www.rssboard.org/rss-specification
- Atom Publishing Protocol (RFC 5023), https://www.rfc-editor.org/rfc/rfc5023
- JSON-LD Specification, https://json-ld.org/spec/
- TF-IDF: “Understanding TF-IDF (Term Frequency-Inverse Document Frequency)”, https://www.geeksforgeeks.org/machine-learning/understanding-tf-idf-term-frequency-inverse-document-frequency/
- TF-IDF + SVM: “Strengthening Fake News Detection: Leveraging SVM and Sophisticated Text Vectorization Techniques. Defying BERT?”, https://arxiv.org/html/2411.12703v1
- Sentence-BERT (SBERT): Reimers, N. & Gurevych, I., 2019, https://arxiv.org/abs/1908.10084
- Fine-tuned BERT: “Fine-tuning a BERT model”, https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert
- Ada Embeddings (OpenAI): Radford et al., 2021, https://arxiv.org/abs/2103.00020
- Cosine Similarity, https://en.wikipedia.org/wiki/Cosine_similarity
- Hierarchical Clustering, https://en.wikipedia.org/wiki/Hierarchical_clustering
- DBSCAN: Ester et al., 1996, https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf