
Commit 985f334

Merge pull request #153 from brave/dep-updates-and-logging
Dep updates, logging, Makefile, local dev updates
2 parents 4ee6d36 + 4e7a8cc commit 985f334

7 files changed: +124 −15 lines

.gitignore (+5)

@@ -11,3 +11,8 @@ __pycache__/
 /sources.csv.json
 .pytest_cache/*
 .idea/
+articles_history.en_US.csv
+source_similarity_t10.en_US.json
+source_similarity_t10_hr.en_US.json
+feed.en_US.json
+sources.en_US.json

Makefile (+2)

@@ -0,0 +1,2 @@
+create-local-env:
+	./local-env.sh

README.md (+25 −7)

@@ -16,17 +16,35 @@ pip install -r requirements.txt
 - `paraphrase-multilingual-MiniLM-L12-v2` for non-english language sources.
 Once all source embeddings are generated, a pairwise source similarity matrix is produced.

+
+## Description
+There are two jobs involved in generating source suggestions; both can be listed in EKS under `source-suggestions-prod`.
+
+- **feed-accumulator** runs hourly. It fetches the feed.json for each locale, accumulates them into a CSV file, and writes the result back to S3. The output is available at https://brave-today-cdn.brave.com/source-suggestions/articles_history.en_US.csv. The articles_history file is only used by the backend job source-sim-matrix; the client does not use it.
+
+- **source-sim-matrix** runs twice a week. It pulls the articles_history CSV and the publishers JSON from S3, performs clustering on the article text, and produces the source-suggestions JSON for each locale:
+  - https://brave-today-cdn.brave.com/source-suggestions/source_similarity_t10.en_US.json
+  - https://brave-today-cdn.brave.com/source-suggestions/source_similarity_t10_hr.en_US.json
+
+Non-English locales use a multilingual clustering model. The browser uses this file to determine which publishers to show in the suggested-publisher cards in the feed; a suggestion card appears roughly every 7-8 cards.
+
 ## Running locally
 To collect and accumulate article history:
+
+Run this to download the files needed to run the scripts locally:
+```sh
+make create-local-env
 ```
-export NO_UPLOAD=1
-export NO_DOWNLOAD=1
-python source-feed-accumulator.py
+
+```sh
+NO_UPLOAD=1 NO_DOWNLOAD=1 python source-feed-accumulator.py
 ```

 To compute source embeddings and produce the source similarity matrix:
-```
-export NO_UPLOAD=1
-export NO_DOWNLOAD=1
-python sources-similarity-matrix.py
+```sh
+NO_UPLOAD=1 NO_DOWNLOAD=1 python source-similarity-matrix.py
 ```
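For orientation, here is a rough sketch, not the repo's actual code, of how per-source embeddings could be turned into the pairwise similarity matrix and top-10 suggestion lists that the README describes. The publisher ids and vectors are made up, and 384 is simply the MiniLM-L12-v2 embedding size; the real job also handles clustering and per-locale output, which this sketch omits.

```python
# Hypothetical sketch: build a pairwise cosine-similarity matrix from
# per-source embeddings and keep the 10 most similar sources per publisher.
# Publisher ids and vectors are invented; this is not the repo's implementation.
import numpy as np
from sentence_transformers import util

source_embeddings = {
    "publisher_a": np.random.rand(384),  # 384 = MiniLM-L12-v2 embedding size
    "publisher_b": np.random.rand(384),
    "publisher_c": np.random.rand(384),
}

ids = list(source_embeddings)
matrix = util.cos_sim(
    np.stack([source_embeddings[i] for i in ids]),
    np.stack([source_embeddings[i] for i in ids]),
)  # shape: (num_sources, num_sources)

top_suggestions = {}
for row, source_id in enumerate(ids):
    ranked = sorted(
        ((other, float(matrix[row][col])) for col, other in enumerate(ids) if other != source_id),
        key=lambda pair: pair[1],
        reverse=True,
    )
    top_suggestions[source_id] = ranked[:10]  # 10 most similar sources
```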

embeddings.py (+35 −2)

@@ -1,6 +1,7 @@
 import numpy as np
 from sentence_transformers import util
 from structlog import get_logger
+import time  # Add import for timing

 import config

@@ -17,15 +18,47 @@ def compute_source_similarity(source_1, source_2, function='cosine'):


 def get_source_representation_from_titles(titles, model):
-    if len(titles) < config.MINIMUM_ARTICLE_HISTORY_SIZE:
+    num_titles = len(titles)
+    logger.info("get_source_representation_from_titles called", num_titles=num_titles)
+
+    if num_titles < config.MINIMUM_ARTICLE_HISTORY_SIZE:
+        logger.warn(
+            "Not enough titles for source representation",
+            num_titles=num_titles,
+            min_required=config.MINIMUM_ARTICLE_HISTORY_SIZE
+        )
         return np.zeros((1, EMBEDDING_DIMENSIONALITY))

-    return model.encode(titles).mean(axis=0)
+    start_time = time.time()
+    embeddings = model.encode(titles)
+    end_time = time.time()
+    logger.info(
+        "Model encoding finished",
+        num_titles=num_titles,
+        duration_sec=round(end_time - start_time, 3)
+    )
+
+    return embeddings.mean(axis=0)


 def compute_source_representation_from_articles(articles_df, publisher_id, model):
+    logger.info(
+        "compute_source_representation_from_articles called",
+        publisher_id=publisher_id,
+        dataframe_shape=articles_df.shape
+    )
+
+    start_time = time.time()
     publisher_bucket_df = articles_df[articles_df.publisher_id == publisher_id]
+    end_time = time.time()
+    logger.info(
+        "DataFrame filtering finished",
+        publisher_id=publisher_id,
+        duration_sec=round(end_time - start_time, 3),
+        filtered_shape=publisher_bucket_df.shape
+    )

     titles = [
         title for title in publisher_bucket_df.title.to_numpy() if title is not None]
+    # Pass the model to the helper function for encoding
     return get_source_representation_from_titles(titles, model)
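A minimal, hypothetical usage sketch of the two helpers touched above, assuming it is run from the repo root and that the title lists are long enough to clear `config.MINIMUM_ARTICLE_HISTORY_SIZE`; the model name comes from the README and the titles are invented:

```python
# Hypothetical usage of the instrumented helpers; titles are made up.
from sentence_transformers import SentenceTransformer

from embeddings import (
    compute_source_similarity,
    get_source_representation_from_titles,
)

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

titles_a = ["Browser update released", "Privacy feature announced", "New search index"]
titles_b = ["Quarterly earnings report", "Stock market summary", "Merger talks continue"]

# Each call now logs the title count and, when encoding runs, the encoding
# duration added in this commit. If a source has fewer titles than
# config.MINIMUM_ARTICLE_HISTORY_SIZE, a warning is logged and a zero
# vector is returned instead.
repr_a = get_source_representation_from_titles(titles_a, model)
repr_b = get_source_representation_from_titles(titles_b, model)

# Cosine similarity between the two mean-title representations.
print(compute_source_similarity(repr_a, repr_b))
```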

local-env.sh (+52)

@@ -0,0 +1,52 @@
+#!/bin/bash
+
+# Remove existing virtual environment
+rm -rf .venv
+
+# Create a new virtual environment
+python3 -m venv .venv
+
+# Activate the virtual environment
+source .venv/bin/activate
+
+# Ensure the correct Python version is being used
+pyenv global 3.9.11
+eval "$(pyenv init --path)"
+
+# Install the required packages
+echo "Install requirements"
+pip install -r requirements.txt
+
+# Print completion messages
+echo "---------------------------"
+echo ".venv recreated and sourced"
+echo "Set python version to 3.9.11"
+echo "Installed requirements"
+echo "Complete"
+echo "---------------------------"
+
+# download these files
+urls=(
+    "https://brave-today-cdn.brave.com/brave-today/feed.en_US.json"
+    "https://brave-today-cdn.brave.com/source-suggestions/articles_history.en_US.csv"
+    "https://brave-today-cdn.brave.com/sources.en_US.json"
+)
+
+for url in "${urls[@]}"; do
+    # Extract filename from URL
+    filename=$(basename "$url")
+
+    # Download the file using wget
+    wget -O "$filename" "$url"
+
+    # Check if download was successful
+    if [ $? -eq 0 ]; then
+        echo "Successfully downloaded: $filename"
+    else
+        echo "Failed to download: $filename"
+    fi
+done
+
+# Keep the virtual environment active
+exec "$SHELL"

requirements.txt (+3 −4)

@@ -3,12 +3,11 @@ numpy==1.23.5
 pandas==1.5.1
 requests==2.32.3
 scipy==1.10.0
-sentence-transformers==2.7.0
-sentry-sdk==2.8.0
+sentence-transformers==3.0.1
+sentry-sdk==1.45.0
 tqdm==4.66.4
 boto3==1.26.14
 botocore==1.29.14
 structlog==23.3.0
 torch==2.6.0
-torchvision==0.21.0
-transformers==4.51.3
+transformers==4.48.0

source-feed-accumulator.py (+2 −2)

@@ -14,8 +14,8 @@
 def sanitize_articles_history(lang_region):
     articles_history_df = pd.read_csv(config.OUTPUT_DIR + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region))
     articles_history_df = articles_history_df.drop_duplicates().dropna()
-    cutoff_date = pd.Timestamp.now().normalize() - pd.Timedelta(days=3*31)
-    # purge articles older than 3 months
+    cutoff_date = pd.Timestamp.now().normalize() - pd.Timedelta(days=2*31)
+    # purge articles older than 2 months
     articles_history_df = articles_history_df[pd.to_datetime(
         articles_history_df.iloc[:, 2]) > cutoff_date]
     articles_history_df.to_csv(config.OUTPUT_DIR + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region), index=False)
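As a toy illustration of the tightened cutoff (not the real CSV; its third column is whatever `sanitize_articles_history` reads via `iloc[:, 2]`, assumed here to be the publish date), the 2*31-day filter behaves like this:

```python
# Toy data: the third column stands in for the publish date that
# sanitize_articles_history filters on via iloc[:, 2].
import pandas as pd

now = pd.Timestamp.now().normalize()
articles_history_df = pd.DataFrame({
    "publisher_id": ["a", "b", "c"],
    "title": ["old article", "recent article", "new article"],
    "publish_time": [now - pd.Timedelta(days=120), now - pd.Timedelta(days=30), now],
})

cutoff_date = now - pd.Timedelta(days=2 * 31)
# Rows older than roughly two months (62 days) are purged: the 120-day-old
# row is dropped, the other two are kept.
recent_df = articles_history_df[pd.to_datetime(articles_history_df.iloc[:, 2]) > cutoff_date]
print(recent_df)
```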
