Description
In response to:
QT4CG-147-02: NW to chase up DN and LQ about follow-up to the generator discussion
Use Case: News Feeds Aggregation Using Generators
Contents
Use Case: News Feeds Aggregation Using Generators
- Actors
- Goals
- Functional Requirements
- Constraints / Assumptions / Preconditions
- Proposed High-Level Solution
- Known Approaches that are Problematic
- Benefits of the Generators Approach
- End-to-End Flow
- Brief Description of the Core Processes in the Pipeline
- Notes on the Process Pipeline
- Why This Fits the Generator Datatype Extremely Well
- Alternative Flows
- Alternative Flow-1: A Feed Temporarily Stops Producing New Items
- Alternative Flow-2: Partial Consumption of the Pipeline
- Alternative Flow-3: Editor Inserts or Reorders Items
- Exception Flows
- Exception Flow-1: Feed Unreachable or Network Failure
- Exception Flow-2: Malformed Feed Data
- Exception Flow-3: Resource Exhaustion Risk
- Postconditions
- References
The Problem
Modern RSS/JSON aggregators must process hundreds of continuously updating feeds without excessive memory usage or latency, while supporting filtering, merging, and prioritization in real time.
Actors
- End-User
- Editor
- Administrator
- System components (internal processes acting as secondary actors)
- External services (RSS providers, APIs, social signals)
Goals
- End-User: “As a user, I want to get the latest, up-to-the-minute news from many important sources. I want each brief news item to be presented with a link to more detailed information from the original source.”
- Editor: “As an editor, I want to be alerted to any change in the aggregated news-stream, as it happens continuously, and to have powerful ways of inserting, reordering, appending, prepending or deleting one or more news-items.”
- Administrator: “As an administrator, I want to start, stop, or restart the system, manage the configured feeds, and monitor operational health and error conditions.”
Functional Requirements
- Consume RSS / Atom / JSON-LD feeds incrementally
- Filter items by topic or sensitivity
- Merge multiple feeds chronologically
- Produce continuously updated summaries
Constraints / Assumptions / Preconditions
Assumptions
- Feeds may be large or unbounded
- Items arrive over time
Constraint
- Memory usage must remain bounded
Preconditions
- At least one news feed is configured
- Feeds are RSS or JSON-LD and timestamped
- Items within a feed are presented in reverse-chronological order
- Each item contains a content link or, optionally, inline content
- Items may belong to multiple categories
Proposed High-Level Solution
Each feed is modeled as a generator producing yield values lazily.
The ordered set of values produced by successive, demand-driven calls to move-next() is called the yield of the generator.
A generator’s yield may be finite or infinite, and may be empty for a given generator instance without implying exhaustion of the underlying data source.
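The sketches in this document use Python, whose built-in generators exhibit the same demand-driven semantics as the proposed datatype; a call to Python's next() plays the role of move-next(). This is purely illustrative: the feed contents are invented, and nothing here prescribes the eventual XPath/XQuery surface syntax.

```python
# Minimal sketch of the generator concept: each next() call (move-next())
# produces one yield value lazily; nothing else is materialized.

def feed_generator(items):
    """Model a feed as a generator; values are produced only on demand."""
    for item in items:
        yield item          # one yield value per move-next() call

g = feed_generator(["item-1", "item-2", "item-3"])
print(next(g))  # "item-1"; the rest of the yield is not yet materialized
print(next(g))  # "item-2"
```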
Known Approaches That Are Problematic
These approaches require full materialization in memory:
- Eager sequences (XPath)
- DOM-style loading
- Materialized feeds
Benefits of the Generators Approach
- Bounded memory usage
- Low latency
- Composability
- Deterministic control of evaluation
End-to-End Flow
+-------------------------------+
| 1. Feed Fetching |
| Input: external providers |
| Output: G_rawItems |
+---------------+---------------+
|
+---------------v---------------+
| 2. Normalization |
| Input: G_rawItems |
| Output: G_normalizedItems |
+---------------+---------------+
|
+---------------v---------------+
| 3. Filtering | <-- unwanted content removed
| Input: G_normalizedItems |
| Output: G_filteredItems |
+---------------+---------------+
|
+---------------v---------------+
| 4. Topic Classification |
| Input: G_filteredItems |
| Output: G_classifiedItems |
+---------------+---------------+
|
+---------------v---------------+
| 5. Clustering |
| Input: G_classifiedItems |
| Output: G_clusteredItems |
+---------------+---------------+
|
+---------------v---------------+
| 6. Ranking |
| Input: G_clusteredItems |
| Output: G_rankedItems |
+---------------+---------------+
|
+---------------v---------------+
| 7. Summary Page Generation |
| Input: G_rankedItems |
| Output: G_summaryPageItems, |
| HTML |
+---------------+---------------+
|
+---------------v---------------+
| 8. Detail Page Generation |
| Input: G_summaryPageItems |
| Output: HTML Detail Pages |
+-------------------------------+
Remarks
- The participating generator instances are named using the convention G_{name}.
- Every stage except the final one produces a new generator.
- Every stage except the very first uses a generator as its input.
- Arrow semantics: the output generator of one stage is the input for the next stage.
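To make the arrow semantics concrete, here is a runnable toy version of the first four stages in Python. The item fields and the trivial stage bodies are hypothetical stand-ins, not part of the proposal; the point is that every stage consumes one generator and returns a new one, so no stage materializes the full stream.

```python
def fetch():                                  # 1. Feed Fetching (G_rawItems)
    yield {"title": " Quake hits coast ", "topic": None, "blocked": False}
    yield {"title": "Celebrity gossip", "topic": None, "blocked": True}

def normalize(g):                             # 2. Normalization
    return ({**it, "title": it["title"].strip()} for it in g)

def filter_items(g):                          # 3. Filtering
    return (it for it in g if not it["blocked"])

def classify(g):                              # 4. Topic Classification
    return ({**it, "topic": "Disasters"} for it in g)

# Each arrow in the diagram is just "output generator -> next stage's input".
pipeline = classify(filter_items(normalize(fetch())))
for item in pipeline:                         # demand-driven evaluation
    print(item["topic"], "-", item["title"])
```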
Brief Description of the Core Processes in the Pipeline
Process 1 — Feed Fetching & Acquisition
Goal:
Continuously pull RSS / Atom / JSON-LD feeds from CNN, Fox, NBC, BBC, etc.
Includes:
- Periodic polling (e.g., every 5 minutes)
- Detection of new items (GUID, URL hash, published timestamps)
- N-way merging to ensure the resulting yield is sorted in reverse-chronological order
- Basic sanity validation (e.g., XML schema validity)
Output:
A generator whose yield values are raw feed items (XML / JSON documents) → input to Process 2.
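One way to realize the N-way merge lazily, assuming each feed is already newest-first, is Python's heapq.merge, which pulls one item at a time from each input. The timestamps and feed contents below are made up for the sketch.

```python
import heapq

feed_a = iter([("2024-05-03T10:00Z", "A: story 3"),
               ("2024-05-03T08:00Z", "A: story 1")])
feed_b = iter([("2024-05-03T09:00Z", "B: story 2")])

# Each input is newest-first, so reverse=True keeps the merged yield
# newest-first as well; items are pulled on demand, never all at once.
merged = heapq.merge(feed_a, feed_b, key=lambda t: t[0], reverse=True)
for ts, title in merged:
    print(ts, title)
```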
Process 2 — Parsing & Normalization
Goal:
Convert heterogeneous raw feed items into a uniform internal format.
Normalized fields include:
- Title
- Description / Summary
- Full text (if available)
- URL
- Publication time (converted to UTC)
- Source
- Images, categories, tags
- Named entities (optional NLP-based enrichment)
Output:
A generator yielding clean, normalized NewsItem documents → input to Process 3.
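A sketch of normalization as a generator transformation follows. The NewsItem shape is cut down from the field list above, and the raw input is an invented RSS-like item; real code would also convert timestamps to UTC and handle the alternative field names of Atom and JSON-LD.

```python
from dataclasses import dataclass

@dataclass
class NewsItem:
    title: str
    summary: str
    url: str
    published_utc: str
    source: str

def normalize(g_raw):
    for raw in g_raw:                       # one item pulled per demand
        yield NewsItem(
            title=raw.get("title") or raw.get("headline", ""),
            summary=raw.get("description", ""),
            url=raw["link"],
            published_utc=raw["pubDate"],   # assumed already UTC here
            source=raw.get("source", "unknown"),
        )

raw_items = iter([{"title": "Quake hits coast", "link": "https://example.org/a",
                   "pubDate": "2024-05-03T10:00Z", "source": "BBC"}])
for item in normalize(raw_items):
    print(item)
```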
Process 3 — Content Filtering & Exclusion Rules
Goal:
Remove unwanted items early using configurable rule sets.
Examples:
- Blocked topics: politics, celebrity gossip, violence, etc.
- Blocked entities: Donald Trump, Joe Biden, Kanye West, etc.
- Blocked publishers (optional)
- Expiration rules:
- Tech news stale after 48 hours
- Breaking news stale after 6 hours
Techniques:
- Keyword filtering
- Named Entity Recognition (NER)
- Sensitive-topic classifiers (ML-based)
- Freshness scoring
Output:
A generator yielding allowed, filtered NewsItem documents → input to Process 4.
Rejected items are stored separately for auditing.
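An illustrative filtering generator applying blocked-topic and expiration rules from the lists above; the thresholds, item fields, and the quarantine list are assumptions for the sketch.

```python
from datetime import datetime, timedelta, timezone

BLOCKED_TOPICS = {"politics", "celebrity gossip"}
MAX_AGE = {"tech": timedelta(hours=48), "breaking": timedelta(hours=6)}

def filter_items(g_items, rejected):
    now = datetime.now(timezone.utc)
    for it in g_items:
        age_limit = MAX_AGE.get(it["kind"], timedelta(days=7))
        too_old = now - it["published"] > age_limit
        if it["topic"] in BLOCKED_TOPICS or too_old:
            rejected.append(it)            # stored separately for auditing
        else:
            yield it

rejected = []
fresh = {"topic": "tech", "kind": "tech",
         "published": datetime.now(timezone.utc) - timedelta(hours=1)}
print(list(filter_items(iter([fresh]), rejected)), rejected)
```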
Process 4 — Topic Classification
Goal:
Assign each item to one or more topics.
Example topics:
- Politics
- World
- Tech
- Health
- Sports
- Business
- Disasters / Urgent events
- Crime / Safety
- Entertainment
Approaches:
- Fine-tuned BERT classifier (preferred)
- TF-IDF + SVM (simpler)
- Feed-provided category tags (fallback)
Output:
A generator yielding categorized NewsItem documents → input to Process 5.
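Whatever classifier is chosen, the stage itself is a lazy map over the filtered stream. The keyword table below is a toy stand-in for the ML classifiers named above; the fallback to feed-provided category tags mirrors the third approach.

```python
KEYWORDS = {"earthquake": "Disasters", "league": "Sports"}

def classify(g_items):
    for it in g_items:
        hit = next((t for kw, t in KEYWORDS.items()
                    if kw in it["title"].lower()), None)
        # Fall back to the feed's own category tags when no topic matches.
        yield {**it, "topics": [hit] if hit else it.get("feed_tags", ["World"])}

items = iter([{"title": "Earthquake shakes region"},
              {"title": "Election results", "feed_tags": ["Politics"]}])
for it in classify(items):
    print(it["title"], "->", it["topics"])
```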
Process 5 — Similarity Analysis & Clustering
Goal:
Group news items from different sources describing the same event.
Techniques:
- Semantic vector embeddings (e.g., SBERT, Ada embeddings)
- Cosine similarity
- Hierarchical clustering or DBSCAN
Produces:
- Clusters of highly similar articles
- A primary (best) representative per cluster
Output:
A generator yielding clusters of related articles → input to Process 6.
Note:
To better match streaming behavior, clustering may operate within bounded windows (e.g., sliding windows) while still consuming the input generator.
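A sketch of that bounded-window behavior: the input generator is consumed in fixed-size windows so memory stays bounded, and items sharing a crude similarity key are grouped within each window. A real deployment would replace the toy token key with SBERT embeddings and cosine similarity.

```python
from itertools import islice

def windowed_clusters(g_items, window_size=100):
    while True:
        window = list(islice(g_items, window_size))   # bounded buffer
        if not window:
            return
        clusters = {}
        for it in window:
            key = frozenset(it["title"].lower().split()[:3])  # toy key
            clusters.setdefault(key, []).append(it)
        for cluster in clusters.values():
            yield cluster        # one cluster of similar articles at a time

items = iter([{"title": "Quake hits coastal town"},
              {"title": "Quake hits coastal region"},
              {"title": "League final tonight"}])
for cluster in windowed_clusters(items, window_size=10):
    print([it["title"] for it in cluster])
```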
Process 6 — Ranking, Urgency, and Freshness Scoring
Goal:
Prioritize which news appears on the Summary Page.
Computed scores:
- Freshness score (more recent → higher)
- Urgency score (disasters, crises, violence)
- Coverage score (number of sources reporting)
- Engagement score (optional: social signals)
Weighted formula:
FinalScore = a*Urgency + b*Freshness + c*Coverage + d*EditorRules
Items with the highest scores per topic are selected.
This stage does not require a full total ordering; instead a partial ordering (e.g., top-K per topic) preserves bounded memory.
Editor-driven operations (insert, remove, reorder) are modeled as generator transformations applied downstream of ranking.
Output:
A generator yielding ranked clusters → input to Process 7.
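A top-K-per-topic sketch using bounded heaps, matching the weighted formula above. The weights a..d and the score fields are illustrative values, and the function is meant to run over a finite window of the stream (cf. the windowing note in Process 5), not an unbounded one.

```python
import heapq

A, B, C, D = 0.4, 0.3, 0.2, 0.1      # a*Urgency + b*Freshness + c*Coverage + d*EditorRules

def final_score(cluster):
    return (A * cluster["urgency"] + B * cluster["freshness"]
            + C * cluster["coverage"] + D * cluster["editor_rules"])

def top_k_per_topic(g_clusters, k=3):
    heaps = {}                              # topic -> min-heap capped at k
    for i, c in enumerate(g_clusters):
        heap = heaps.setdefault(c["topic"], [])
        entry = (final_score(c), i, c)      # i breaks score ties
        if len(heap) < k:
            heapq.heappush(heap, entry)
        else:
            heapq.heappushpop(heap, entry)  # memory stays bounded at k
    for topic, heap in heaps.items():
        for score, _, c in sorted(heap, reverse=True):
            yield topic, score, c

clusters = iter([{"topic": "Tech", "urgency": 0.1, "freshness": 0.9,
                  "coverage": 0.5, "editor_rules": 0.0},
                 {"topic": "Tech", "urgency": 0.9, "freshness": 0.8,
                  "coverage": 0.7, "editor_rules": 0.0}])
for topic, score, c in top_k_per_topic(clusters, k=1):
    print(topic, round(score, 2))
```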
Process 7 — Summary Page Generation
This stage consumes the input generator and produces finite views intended for presentation.
Goal:
Build a continuously updated Summary Page (“Front Page”) containing:
- Top events per topic
- Short summaries
- Links to primary articles
- “Read similar news” (cluster siblings)
- Source icons
- Timestamp of most recent update
The page auto-refreshes and always reflects the newest items.
Process 8 — Detailed Pages & Cross-Links
This stage consumes its input generator and produces finite presentation views.
For each cluster:
- Canonical article (primary representative)
- Related articles across sources
- Timeline of developments
- Additional metadata (images, entities, tags)
Cross-links include:
- “More like this…”
- “Earlier developments…”
- “Follow-up stories…”
Notes on the Process Pipeline
- Feed Fetching typically wraps one or more data providers → produces G_rawItems lazily (RSS, JSON APIs, DB cursors, web services)
- Every stage is expressible as for-each, filter, append, prepend, insert-at, remove-where, concat, fold, etc., producing a new generator derived from the previous one
- No stage requires full materialization unless explicitly demanded (e.g., to-array, bounded sort, pagination)
- Infinite generators are valid until stage 6; stages 7–8 typically consume finite prefixes (take(n))
Why This Fits the Generator Datatype Extremely Well
- The pipeline is a composition of generator transformers
- Each box maps almost 1-to-1 to generator operations
- External data providers integrate naturally at Stage 1
- Sorting can be introduced in different ways:
- External merge-sort over generators
- Bounded-window ranking
- Top-K lazy ranking (e.g., using heaps)
Alternative Flows
Alternative Flow 1 — Feed Temporarily Stops Producing New Items
Condition:
A feed is reachable but has no new items since the last polling cycle.
Flow:
- The feed generator advances (move-next()).
- The data provider returns no new items.
- The feed-generator instance yields no items during this interval.
- Downstream generators remain operational.
- If all feeds are empty, no new items are added downstream.
Result:
The pipeline continues uninterrupted; no special handling is required.
Alternative Flow 2 — Partial Consumption of the Pipeline
Condition:
Only a finite prefix of the stream is required (e.g., top N items).
Flow:
- Downstream consumers apply take(N).
- Upstream generators are evaluated only as needed.
- Remaining potential yield values are never materialized.
Result:
Latency and memory usage remain bounded. The pipeline supports early termination naturally.
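In the Python sketches, take(N) corresponds to itertools.islice: only the first N yield values are ever computed, so even an unbounded upstream generator is safe to compose with.

```python
from itertools import count, islice

unbounded = ({"id": n} for n in count())      # an infinite generator
top_five = islice(unbounded, 5)               # take(5)
print([it["id"] for it in top_five])          # upstream stops after 5 pulls
```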
Alternative Flow 3 — Editor Inserts or Reorders Items
Condition:
An editor manually modifies the aggregated stream.
Flow:
- Editor operations are applied as generator transformations (append, prepend, insert-at, remove-at, remove-where).
- A new generator with the modified yield is produced.
- Downstream stages consume it transparently.
Result:
Editorial control integrates seamlessly without breaking the pipeline.
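A sketch of one such editorial operation, insert-at, as a pure generator transformation: the original generator is left untouched, and a new generator with the modified yield is returned, as described above.

```python
def insert_at(g, index, new_item):
    for i, item in enumerate(g):
        if i == index:
            yield new_item         # splice the editor's item in lazily
        yield item                 # (if index is past the end, nothing is inserted)

stream = iter(["a", "b", "c"])
print(list(insert_at(stream, 1, "EDITOR-NOTE")))  # ['a', 'EDITOR-NOTE', 'b', 'c']
```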
Exception Flows
Exception Flow 1 — Feed Unreachable or Network Failure
Condition:
A feed cannot be reached during polling.
Flow:
- The data provider reports an error or timeout.
- The next instance of the feed generator yields no items during this polling interval.
- The error is logged for monitoring.
- A retry policy (e.g., exponential backoff) is applied.
Result:
The system continues operating with remaining feeds.
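A sketch of this failure behavior: a polling generator that yields nothing for an interval when the provider errors, logs the failure, and backs off exponentially. fetch_once and its failure mode are hypothetical stand-ins for a real provider.

```python
import time

def polling_generator(fetch_once, base_delay=1.0, max_delay=60.0):
    delay = base_delay
    while True:
        try:
            for item in fetch_once():
                yield item
            delay = base_delay               # success resets the backoff
        except OSError as err:
            print("feed error, retrying:", err)   # logged for monitoring
            time.sleep(min(delay, max_delay))     # exponential backoff
            delay *= 2
            # no items are yielded this interval; downstream simply sees
            # an empty contribution from this feed and keeps running

calls = {"n": 0}
def fetch_once():
    calls["n"] += 1
    if calls["n"] == 1:
        raise OSError("timeout")             # first poll fails
    return iter(["recovered item"])

g = polling_generator(fetch_once, base_delay=0.01)
print(next(g))   # retries once, then yields "recovered item"
```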
Exception Flow 2 — Malformed Feed Data
Condition:
A feed item is malformed (invalid XML/JSON, or schema-validation problems such as missing required fields).
Flow:
- The normalization stage detects the issue.
- The item is discarded or quarantined.
- Processing continues with subsequent items.
Result:
Malformed data does not propagate downstream.
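A sketch of quarantine during normalization: malformed items are caught and set aside, and the generator simply continues with the next item, so bad data never reaches downstream stages. The required fields and the quarantine list are assumptions for the sketch.

```python
def normalize_safe(g_raw, quarantine):
    for raw in g_raw:
        try:
            yield {"title": raw["title"], "url": raw["link"]}  # required fields
        except (KeyError, TypeError) as err:
            quarantine.append((raw, repr(err)))  # kept for later inspection

bad_and_good = iter([{"title": "ok", "link": "https://example.org/x"},
                     {"title": "missing link"}])
q = []
print(list(normalize_safe(bad_and_good, q)))
print("quarantined:", q)
```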
Exception Flow 3 — Resource Exhaustion Risk
Condition:
A downstream operation risks exceeding memory limits.
Flow:
- Bounded strategies (windowing, top-K selection) are applied.
- Full materialization is avoided.
- If needed, the operation degrades gracefully (e.g., reduced clustering depth).
Result:
System stability is preserved under load.
Postconditions
Upon successful execution:
Functional Outcomes
- End users see an up-to-date Summary Page.
- Each summary item links to a Detailed Page.
- Editors can intervene using generator operations.
- Administrators retain full system control.
Technical Guarantees
- Memory usage remains bounded.
- Latency is minimized through lazy evaluation.
- Full materialization occurs only when explicitly requested.
System State
- All generators remain composable.
- Generator composition remains valid after alternative and exceptional flows.
- Empty generators correctly represent exhaustion.
- Infinite yields are supported up to stages that require finiteness.
References
- RSS 2.0 Specification, https://www.rssboard.org/rss-specification
- Atom Publishing Protocol (RFC 5023), https://www.rfc-editor.org/rfc/rfc5023
- JSON-LD Specification, https://json-ld.org/spec/
- TF-IDF: “Understanding TF-IDF (Term Frequency-Inverse Document Frequency)”, https://www.geeksforgeeks.org/machine-learning/understanding-tf-idf-term-frequency-inverse-document-frequency/
- TF-IDF + SVM: “Strengthening Fake News Detection: Leveraging SVM and Sophisticated Text Vectorization Techniques. Defying BERT?”, https://arxiv.org/html/2411.12703v1
- Sentence-BERT (SBERT): Reimers, N. & Gurevych, I., 2019, https://arxiv.org/abs/1908.10084
- Fine-tuned BERT: “Fine-tuning a BERT model”, https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert
- Ada Embeddings (OpenAI): Radford et al., 2021, https://arxiv.org/abs/2103.00020
- Cosine Similarity, https://en.wikipedia.org/wiki/Cosine_similarity
- Hierarchical Clustering, https://en.wikipedia.org/wiki/Hierarchical_clustering
- DBSCAN: Ester et al., 1996, https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf