Improved code quality #91


Draft. Wants to merge 4 commits into base: master.
8 changes: 8 additions & 0 deletions .flake8
@@ -0,0 +1,8 @@
[flake8]
ignore = E203, W503, F541
max-line-length = 120
max-doc-length = 120
max-complexity = 10
exclude = .venv,cookiecutter,.git,.local,.idea,.mypy_cache,.pytest_cache
per-file-ignores =
    __init__.py:F401
21 changes: 21 additions & 0 deletions .github/workflows/pull_request.yml
@@ -0,0 +1,21 @@
name: Test

on:
  pull_request

jobs:
  pull_request:
    name: Test
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Setup Python Version
        uses: actions/setup-python@v4
        with:
          python-version: 3.9
          cache: 'pip' # caching pip dependencies
      - name: Install Python dependencies
        run: pip install -r requirements.dev.txt
      - name: Run pre-commit
        uses: pre-commit/[email protected]


reported by reviewdog 🐶
[semgrep] An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

Source: https://semgrep.dev/r/yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha.third-party-action-not-pinned-to-commit-sha


Cc @thypon @bcaller
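
For illustration, a pinned reference resolves the tag to a full-length commit SHA and keeps the tag in a trailing comment. A minimal sketch, with a placeholder SHA rather than the actual commit behind [email protected]:

      - name: Run pre-commit
        uses: pre-commit/action@0123456789abcdef0123456789abcdef01234567 # v3.0.0

Since this repository already uses Renovate, digest-pinned actions can generally still be kept up to date automatically as long as the version comment is preserved.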

2 changes: 1 addition & 1 deletion .github/workflows/push.yml
@@ -15,7 +15,7 @@ jobs:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v1
        uses: aws-actions/configure-aws-credentials@v2


reported by reviewdog 🐶
[semgrep] An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

Source: https://semgrep.dev/r/yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha.third-party-action-not-pinned-to-commit-sha


Cc @thypon @bcaller
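
The same fix would apply to this step; a sketch with a placeholder SHA, not the real commit behind the v2 tag:

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@89abcdef0123456789abcdef0123456789abcdef # v2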

        with:
          aws-access-key-id: ${{ secrets.GDBP_AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.GDBP_AWS_SECRET_ACCESS_KEY }}
32 changes: 32 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,32 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-ast
      - id: requirements-txt-fixer
      - id: check-docstring-first

  - repo: https://github.com/psf/black
    rev: 23.7.0
    hooks:
      - id: black

  - repo: https://github.com/PyCQA/flake8
    rev: 6.0.0
    hooks:
      - id: flake8

  - repo: https://github.com/PyCQA/isort
    rev: 5.12.0
    hooks:
      - id: isort
        args: ["--profile", "black"]

  - repo: https://github.com/PyCQA/bandit
    rev: 1.7.5
    hooks:
      - id: bandit
        entry: bandit --quiet -r -x tests/ src/*.py
2 changes: 1 addition & 1 deletion SECURITY.md
@@ -2,4 +2,4 @@

## Reporting a Vulnerability

See https://hackerone.com/brave for details.
See https://hackerone.com/brave for details.
83 changes: 47 additions & 36 deletions config.py
@@ -1,44 +1,55 @@
import os

# Disable uploads to S3. Useful when running locally or in CI.
NO_UPLOAD = os.getenv('NO_UPLOAD', None)
NO_DOWNLOAD = os.getenv('NO_DOWNLOAD', None)
NO_UPLOAD = os.getenv("NO_UPLOAD", None)
NO_DOWNLOAD = os.getenv("NO_DOWNLOAD", None)

PCDN_URL_BASE = os.getenv('PCDN_URL_BASE', 'https://pcdn.brave.software')
PUB_S3_BUCKET = os.getenv('PUB_S3_BUCKET', 'brave-today-cdn-development')
PCDN_URL_BASE = os.getenv("PCDN_URL_BASE", "https://pcdn.brave.software")
PUB_S3_BUCKET = os.getenv("PUB_S3_BUCKET", "brave-today-cdn-development")
# Canonical ID of the public S3 bucket
BRAVE_TODAY_CANONICAL_ID = os.getenv('BRAVE_TODAY_CANONICAL_ID', None)
BRAVE_TODAY_CLOUDFRONT_CANONICAL_ID = os.getenv('BRAVE_TODAY_CLOUDFRONT_CANONICAL_ID', None)

LANG_REGION_MODEL_MAP = os.getenv('LANG_REGION_MODEL_MAP', [
    ('en_US', "sentence-transformers/all-MiniLM-L6-v2"),
    ('en_CA', "sentence-transformers/all-MiniLM-L6-v2"),
    ('en_GB', "sentence-transformers/all-MiniLM-L6-v2"),
    ('es_ES', "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
    ('es_MX', "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
    ('pt_BR', "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
    ('ja_JP', "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
    ('de_DE', "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
    ('fr_FR', "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
    ('en_AU', "sentence-transformers/all-MiniLM-L6-v2"),
    ('en_IN', "sentence-transformers/all-MiniLM-L6-v2"),

])

SOURCES_JSON_FILE = os.getenv('SOURCES_JSON_FILE', 'sources.{LANG_REGION}')
FEED_JSON_FILE = os.getenv('FEED_JSON_FILE', 'feed.{LANG_REGION}')

OUTPUT_DIR = os.getenv('OUTPUT_DIR', 'output')

ARTICLE_HISTORY_FILE = os.getenv('ARTICLE_HISTORY_FILE', "articles_history.{LANG_REGION}.csv")
BRAVE_TODAY_CANONICAL_ID = os.getenv("BRAVE_TODAY_CANONICAL_ID", None)
BRAVE_TODAY_CLOUDFRONT_CANONICAL_ID = os.getenv(
    "BRAVE_TODAY_CLOUDFRONT_CANONICAL_ID", None
)

LANG_REGION_MODEL_MAP = os.getenv(
    "LANG_REGION_MODEL_MAP",
    [
        ("en_US", "sentence-transformers/all-MiniLM-L6-v2"),
        ("en_CA", "sentence-transformers/all-MiniLM-L6-v2"),
        ("en_GB", "sentence-transformers/all-MiniLM-L6-v2"),
        ("es_ES", "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
        ("es_MX", "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
        ("pt_BR", "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
        ("ja_JP", "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
        ("de_DE", "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
        ("fr_FR", "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"),
        ("en_AU", "sentence-transformers/all-MiniLM-L6-v2"),
        ("en_IN", "sentence-transformers/all-MiniLM-L6-v2"),
    ],
)

SOURCES_JSON_FILE = os.getenv("SOURCES_JSON_FILE", "sources.{LANG_REGION}")
FEED_JSON_FILE = os.getenv("FEED_JSON_FILE", "feed.{LANG_REGION}")

OUTPUT_DIR = os.getenv("OUTPUT_DIR", "output")

ARTICLE_HISTORY_FILE = os.getenv(
    "ARTICLE_HISTORY_FILE", "articles_history.{LANG_REGION}.csv"
)
# Don't compute the embedding for a source that has less than 30 collected articles
MINIMUM_ARTICLE_HISTORY_SIZE = os.getenv('MINIMUM_ARTICLE_HISTORY_SIZE', 30)
SIMILARITY_CUTOFF_RATIO = os.getenv('SIMILARITY_CUTOFF_RATIO', 0.9)
SOURCE_SIMILARITY_T10 = os.getenv('SOURCE_SIMILARITY_T10', "source_similarity_t10.{LANG_REGION}")
SOURCE_SIMILARITY_T10_HR = os.getenv('SOURCE_SIMILARITY_T10_HR', "source_similarity_t10_hr.{LANG_REGION}")

SOURCE_EMBEDDINGS = os.getenv('SOURCE_EMBEDDINGS', "SOURCE_EMBEDDINGS.{LANG_REGION}")

if SENTRY_URL := os.getenv('SENTRY_URL'):
MINIMUM_ARTICLE_HISTORY_SIZE = os.getenv("MINIMUM_ARTICLE_HISTORY_SIZE", 30)
SIMILARITY_CUTOFF_RATIO = os.getenv("SIMILARITY_CUTOFF_RATIO", 0.9)
SOURCE_SIMILARITY_T10 = os.getenv(
    "SOURCE_SIMILARITY_T10", "source_similarity_t10.{LANG_REGION}"
)
SOURCE_SIMILARITY_T10_HR = os.getenv(
    "SOURCE_SIMILARITY_T10_HR", "source_similarity_t10_hr.{LANG_REGION}"
)

SOURCE_EMBEDDINGS = os.getenv("SOURCE_EMBEDDINGS", "SOURCE_EMBEDDINGS.{LANG_REGION}")

if SENTRY_URL := os.getenv("SENTRY_URL"):
    import sentry_sdk

    sentry_sdk.init(dsn=SENTRY_URL, traces_sample_rate=0)
11 changes: 6 additions & 5 deletions embeddings.py
@@ -9,16 +9,16 @@
logger = get_logger()


def compute_source_similarity(source_1, source_2, function='cosine'):
    if function == 'dot':
def compute_source_similarity(source_1, source_2, function="cosine"):
    if function == "dot":
        return util.dot_score(source_1, np.transpose(source_2))
    elif function == 'cosine':
    elif function == "cosine":
        return util.pytorch_cos_sim(source_1, source_2)[0][0]


def get_source_representation_from_titles(titles, model):
    if len(titles) < config.MINIMUM_ARTICLE_HISTORY_SIZE:
        return np.zeros((1, EMBEDDING_DIMENSIONALITY))
        return np.zeros((1, EMBEDDING_DIMENSIONALITY))

    return model.encode(titles).mean(axis=0)

@@ -27,5 +27,6 @@ def compute_source_representation_from_articles(articles_df, publisher_id, model
    publisher_bucket_df = articles_df[articles_df.publisher_id == publisher_id]

    titles = [
        title for title in publisher_bucket_df.title.to_numpy() if title is not None]
        title for title in publisher_bucket_df.title.to_numpy() if title is not None
    ]
    return get_source_representation_from_titles(titles, model)
18 changes: 18 additions & 0 deletions renovate.json
@@ -2,5 +2,23 @@
"$schema": "https://docs.renovatebot.com/renovate-schema.json",
"extends": [
"config:base"
],
"schedule": [
"every 7 days"
],
"baseBranches": [
"master"
],
"pre-commit": {
"enabled": true
},
"pip_requirements": {
"fileMatch": ["requirements.*"]
},
"packageRules": [
{
"packagePatterns": ["^regex$"],
"enabled": false
}
]
}
10 changes: 10 additions & 0 deletions requirements.dev.txt
@@ -0,0 +1,10 @@
-r requirements.txt
bandit==1.7.5
black==23.7.0
bpython==0.24
flake8==6.1.0
isort==5.12.0
pip-check-reqs==2.4.4
pre-commit==3.3.3
pylint==2.17.5
pytest==7.4.0
6 changes: 3 additions & 3 deletions requirements.txt
@@ -1,14 +1,14 @@
boto3==1.26.14
botocore==1.29.14
feedparser==6.0.10
numpy==1.23.5
pandas==1.5.1
requests==2.31.0
scipy==1.10.0
sentence-transformers==2.2.2
sentry-sdk==1.29.2
tqdm==4.66.1
boto3==1.26.14
botocore==1.29.14
structlog==23.1.0
torch==2.0.1
torchvision==0.15.2
tqdm==4.66.1
transformers==4.31.0
53 changes: 37 additions & 16 deletions source-feed-accumulator.py
@@ -12,34 +12,52 @@


def sanitize_articles_history(lang_region):
    articles_history_df = pd.read_csv(config.OUTPUT_DIR + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region))
    articles_history_df = pd.read_csv(
        config.OUTPUT_DIR + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region)
    )
    articles_history_df = articles_history_df.drop_duplicates().dropna()
    articles_history_df.to_csv(config.OUTPUT_DIR + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region), index=False)
    articles_history_df.to_csv(
        config.OUTPUT_DIR + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region),
        index=False,
    )


def accumulate_articles(articles, lang_region):
    for i, article in tqdm(enumerate(articles)):
        title = article['title'].replace('\r', '').replace('\n', '').replace('"', '')
        description = article['description'].replace('\r', '').replace('\n', '').replace('"', '')
        publish_time = article['publish_time']
        publisher_id = article['publisher_id']

        with open(config.OUTPUT_DIR + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region), "a") as f:
            f.write('"' + '","'.join([title, description, publish_time, publisher_id]) + '"\n')
        title = article["title"].replace("\r", "").replace("\n", "").replace('"', "")
        description = (
            article["description"].replace("\r", "").replace("\n", "").replace('"', "")
        )
        publish_time = article["publish_time"]
        publisher_id = article["publisher_id"]

        with open(
            config.OUTPUT_DIR
            + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region),
            "a",
        ) as f:
            f.write(
                '"'
                + '","'.join([title, description, publish_time, publisher_id])
                + '"\n'
            )


for lang_region, model in config.LANG_REGION_MODEL_MAP:
    logger.info(f"Starting feeds accumulator for {lang_region}")

    feed_file = f'{config.FEED_JSON_FILE.format(LANG_REGION=lang_region)}.json'
    feed_file = f"{config.FEED_JSON_FILE.format(LANG_REGION=lang_region)}.json"

    pathlib.Path(config.OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

    if not config.NO_DOWNLOAD:
        download_file(feed_file, config.PUB_S3_BUCKET, f"brave-today/{feed_file}")
        download_file(config.OUTPUT_DIR + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region),
                      config.PUB_S3_BUCKET,
                      f"source-suggestions/{config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region)}")
        download_file(
            config.OUTPUT_DIR
            + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region),
            config.PUB_S3_BUCKET,
            f"source-suggestions/{config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region)}",
        )

    with open(feed_file) as feeds:
        feeds_data = json.loads(feeds.read())
@@ -51,8 +69,11 @@ def accumulate_articles(articles, lang_region):
    sanitize_articles_history(lang_region)

    if not config.NO_UPLOAD:
        upload_file(config.OUTPUT_DIR + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region),
                    config.PUB_S3_BUCKET,
                    f"source-suggestions/{config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region)}")
        upload_file(
            config.OUTPUT_DIR
            + config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region),
            config.PUB_S3_BUCKET,
            f"source-suggestions/{config.ARTICLE_HISTORY_FILE.format(LANG_REGION=lang_region)}",
        )

    logger.info("Finished sanitizing articles_history.")