Deduplicate articles #1193

Open
@twm

Common causes of duplicate articles:

  • Change of generator (most common)
  • HTTP → HTTPS transition (less common these days)
  • Domain name change (e.g. /feed/361/all/ or /feed/504/all/)
  • /feed/633/all/

This should be addressed by introducing a post-processing step, as discussed in #415. The dataflow would look something like:

graph LR

F(Feed) --> R1(RawArticle)
F --> R2(RawArticle)
F --> R3(RawArticle)
F --> R4(RawArticle)

R1 --> A1(Article)
R2 --> A1(Article)
R3 --> A3(Article)
R4 --> A4[\"(filtered out)"/]
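A rough sketch of what that post-processing step might look like, to make the dataflow concrete. Everything here is an assumption for illustration: the `RawArticle` and `Article` shapes, the `dedup_key` heuristic (URL path plus normalized title, ignoring scheme and host), and the `keep` filter predicate are all hypothetical and not taken from the existing code or from #415.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable
from urllib.parse import urlsplit


@dataclass(frozen=True)
class RawArticle:
    """One entry exactly as fetched from the feed (hypothetical shape)."""
    guid: str
    url: str
    title: str


@dataclass
class Article:
    """Deduplicated article shown to the reader (hypothetical shape)."""
    title: str
    url: str
    sources: list[RawArticle] = field(default_factory=list)


def dedup_key(raw: RawArticle) -> tuple[str, str]:
    """Assumed heuristic: ignore the GUID, scheme, and host so that a
    generator change, an HTTP -> HTTPS move, or a domain change still maps
    to the same key. Only the URL path and normalized title are compared."""
    path = urlsplit(raw.url).path.rstrip("/")
    title = " ".join(raw.title.split()).casefold()
    return (path, title)


def postprocess(
    raws: Iterable[RawArticle],
    keep: Callable[[RawArticle], bool] = lambda raw: True,
) -> list[Article]:
    """Collapse raw articles into articles, dropping those `keep` rejects."""
    articles: dict[tuple[str, str], Article] = {}
    for raw in raws:
        if not keep(raw):
            continue  # the "(filtered out)" branch of the graph
        key = dedup_key(raw)
        article = articles.get(key)
        if article is None:
            article = articles[key] = Article(title=raw.title, url=raw.url)
        else:
            # A later raw entry refreshes the displayed fields.
            article.title, article.url = raw.title, raw.url
        article.sources.append(raw)
    return list(articles.values())


# Made-up data mirroring the graph: R1 and R2 merge, R3 stands alone,
# R4 is filtered out (here: because its title is empty).
raws = [
    RawArticle("1", "http://example.com/posts/a/", "Post A"),
    RawArticle("tag:2", "https://example.com/posts/a", "Post A"),
    RawArticle("3", "https://example.com/posts/b", "Post B"),
    RawArticle("4", "https://example.com/posts/spam", ""),
]
articles = postprocess(raws, keep=lambda raw: bool(raw.title.strip()))
```

Keeping every `RawArticle` attached to its `Article` means the key heuristic can be changed later and the articles regenerated without refetching. A path-plus-title key will miss duplicates where the generator change also rewrites the URL structure, so a real implementation would likely need a fuzzier match.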
