- `paraphrase-multilingual-MiniLM-L12-v2` for non-English language sources.
Once all source embeddings are generated, a pairwise source similarity matrix is produced.
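As a rough illustration of that step, the sketch below builds a pairwise similarity matrix from per-source embedding vectors. It assumes cosine similarity as the metric, which this README does not confirm, and the source names and toy 3-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero embedding as similar to nothing
    return dot / (norm_a * norm_b)


def pairwise_similarity(embeddings):
    """Build a symmetric {source: {source: score}} matrix."""
    names = list(embeddings)
    matrix = {name: {} for name in names}
    for i, a in enumerate(names):
        for b in names[i:]:
            score = cosine_similarity(embeddings[a], embeddings[b])
            matrix[a][b] = score
            matrix[b][a] = score
    return matrix


# Hypothetical toy embeddings; real sentence embeddings are much longer.
source_embeddings = {
    "source-a": [1.0, 0.0, 0.0],
    "source-b": [1.0, 1.0, 0.0],
    "source-c": [0.0, 0.0, 1.0],
}
similarity = pairwise_similarity(source_embeddings)
```

Each row of `similarity` can then be sorted by score to rank the other sources against a given source.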
## Description
Two jobs are involved in generating source suggestions. You can list them in EKS under `source-suggestions-prod`.
- **feed-accumulator** runs hourly. It fetches the `feed.json` for each locale, accumulates the articles into a CSV file, and writes that file back to S3. The output is available here https://brave-today-cdn.brave.com/source-suggestions/articles_history.en_US.csv and here. The `articles_history` file is used only by the backend job **source-sim-matrix**; the client does not use it.
- **source-sim-matrix** runs twice a week. It pulls the `articles_history` CSV and the publishers JSON from S3, performs clustering on the article text, and produces the source-suggestions JSON for each locale. Non-English locales use a multilingual clustering model. The browser uses this file to determine which publishers to show in the suggested publisher cards in the feed; the suggestions appear about every 7-8 cards.
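The accumulation step performed by **feed-accumulator** can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names, the shape of the feed entries, and de-duplication by URL are hypothetical, and the S3 read and write are left out so the example runs on in-memory strings.

```python
import csv
import io
import json

# Hypothetical column names; the real feed.json schema may differ.
FIELDS = ["url", "publisher_id", "title"]


def accumulate(history_csv, feed_json):
    """Append feed articles not yet in the history, keyed by URL."""
    rows = list(csv.DictReader(io.StringIO(history_csv)))
    seen = {row["url"] for row in rows}
    for article in json.loads(feed_json):
        if article["url"] not in seen:
            rows.append({field: article.get(field, "") for field in FIELDS})
            seen.add(article["url"])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample data standing in for the S3 objects.
history = "url,publisher_id,title\nhttps://example.com/a,pub-1,First\n"
feed = json.dumps([
    {"url": "https://example.com/a", "publisher_id": "pub-1", "title": "First"},
    {"url": "https://example.com/b", "publisher_id": "pub-2", "title": "Second"},
])
updated = accumulate(history, feed)
```

Keying on the URL keeps an hourly run from adding the same article twice, so running the function again on an unchanged feed leaves the history as it was.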
## Running locally
To collect and accumulate article history:
Run this to download the files needed to run the script locally: