26 commits
93f2db0
fixed saint.tech #5571 issue
wprashed May 11, 2026
b59c197
fix: remove trailing whitespace in saint provider files
wprashed May 11, 2026
ad0574e
fix: remove trailing whitespace in saint provider files
wprashed May 11, 2026
5bf282f
fix: remove trailing whitespace in saint provider test file
wprashed May 11, 2026
bb17519
fix: ruff-format code style issues in saint.py
wprashed May 11, 2026
d1355fa
fixed the ruff format issue
wprashed May 11, 2026
e82983d
Update saint.py
wprashed May 11, 2026
42378c3
Update saint.py
wprashed May 11, 2026
d4ed16c
docs: regenerate DAGs.md for saint_workflow
wprashed May 11, 2026
6cf40d4
Create media_properties.md
wprashed May 11, 2026
be50bd4
chore: remove intermediate generated media_properties.md
wprashed May 11, 2026
b348a3d
ci: pin @actions/artifact to ^2.3.2 to avoid CommonJS breakages
wprashed May 11, 2026
07bd01d
ci: remove --deploy flag from pipenv install in ingestion_server
wprashed May 11, 2026
459eba2
fixed some issues
wprashed May 12, 2026
f2aa76c
Update test_saint.py
wprashed May 12, 2026
b97c0b9
Create uv.lock
wprashed May 12, 2026
8824320
fixed merge issue
wprashed May 12, 2026
c427a26
Update CODEOWNERS
wprashed May 12, 2026
2588cb1
fixed merge issues
wprashed May 12, 2026
145614a
fixed merge issue
wprashed May 12, 2026
9efcbdf
Update media.py
wprashed May 12, 2026
2ada125
Update media_type_config.py
wprashed May 12, 2026
9029e1d
Update .pre-commit-config.yaml
wprashed May 12, 2026
c3fa339
Update justfile
wprashed May 12, 2026
e1741c0
fixed
wprashed May 12, 2026
6c44f67
Finalize CI and provider fixes
wprashed May 12, 2026
2 changes: 1 addition & 1 deletion .github/actions/load-img/action.yml
@@ -27,7 +27,7 @@ runs:
- name: Install `@actions/artifact`
shell: bash
run: |
-pnpm install @actions/artifact -w
+pnpm install @actions/artifact@^2.3.2 -w

- name: Download images
uses: actions/github-script@v7
1 change: 1 addition & 0 deletions .github/workflows/ci_cd.yml
@@ -113,6 +113,7 @@ jobs:
setup_python: "true"
# Node.js is needed by lint actions.
install_recipe: "node-install"
+locales: "test"

- name: Cache pre-commit envs
uses: actions/cache@v4
1 change: 1 addition & 0 deletions .gitignore
@@ -76,3 +76,4 @@ dist

# ov development environment
v8-compile-cache*
+.vale/.styles
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -8,7 +8,7 @@ repos:
files: ^frontend/.*$
# Check if the i18n files have been downloaded by checking if the Arabic translation exists
# Download the i18n files if they do not exist
-entry: bash -c 'if [ ! -f "$(dirname "$dir")"/frontend/src/locales/ar.json ]; then just frontend/run i18n; fi'
+entry: bash -c 'if [ ! -f frontend/i18n/locales/ar.json ]; then just frontend/run i18n; fi'
language: system
pass_filenames: false

2 changes: 1 addition & 1 deletion .vale/justfile
@@ -26,7 +26,7 @@ _files separator="\n":
echo "$files"

VALE_BIN_NAME := "vale_" + os() + "_" + arch()
-VALE_STYLES_PATH := "/opt/.local/share/vale/styles"
+VALE_STYLES_PATH := env_var_or_default("VALE_STYLES_PATH", justfile_directory() + "/.styles")

@_link_openverse_styles:
mkdir -p {{ VALE_STYLES_PATH }}
7 changes: 7 additions & 0 deletions api/api/examples/image_responses.py
@@ -89,6 +89,13 @@
"logo_url": None,
"media_count": 2500,
},
+{
+"source_name": "saint_tech",
+"display_name": "SAiNT (IMG Saxony-Anhalt)",
+"source_url": "https://saint.tech/en",
+"logo_url": None,
+"media_count": 2500,
+},
]

image_detail_200_example = base_image
2 changes: 1 addition & 1 deletion api/test/factory/models/media.py
@@ -53,7 +53,7 @@ class Meta:
"""The foreign identifier isn't necessarily a UUID but for test purposes it's fine if it looks like one"""

license = Faker("random_element", elements=ALL_LICENSES)
-provider = Faker("random_element", elements=("flickr", "stocksnap"))
+provider = Faker("random_element", elements=("flickr", "stocksnap", "saint_tech"))

foreign_landing_url = Faker("globally_unique_url")
url = Faker("globally_unique_url")
2 changes: 1 addition & 1 deletion api/test/fixtures/media_type_config.py
@@ -86,7 +86,7 @@ def indexes(self):
report_factory=model_factories.ImageReportFactory,
sensitive_class=SensitiveImage,
deleted_class=DeletedImage,
-providers=("flickr", "stocksnap"),
+providers=("flickr", "stocksnap", "saint_tech"),
categories=("photograph",),
tags=("cat", "Cat"),
q="dog",
2 changes: 2 additions & 0 deletions catalog/dags/common/loader/provider_details.py
@@ -38,6 +38,7 @@
WORDPRESS_DEFAULT_PROVIDER = "wordpress"
PHYLOPIC_DEFAULT_PROVIDER = "phylopic"
CC_MIXTER_DEFAULT_PROVIDER = "ccmixter"
+SAINT_DEFAULT_PROVIDER = "saint_tech"

# Finnish parameters
FINNISH_SUB_PROVIDERS = {
@@ -157,6 +158,7 @@ class AudioCategory:
# Default image category by source
DEFAULT_IMAGE_CATEGORY = {
"stocksnap": ImageCategory.PHOTOGRAPH,
+"saint_tech": ImageCategory.PHOTOGRAPH,
# Remains to be assigned
"animaldiversity": ImageCategory.PHOTOGRAPH,
"brooklynmuseum": ImageCategory.DIGITIZED_ARTWORK,
98 changes: 98 additions & 0 deletions catalog/dags/providers/provider_api_scripts/saint.py
@@ -0,0 +1,98 @@
"""
Content Provider: SAiNT (IMG Saxony-Anhalt)

ETL Process: Use the API to identify all CC licensed media.

Output: TSV file containing the media and the respective meta-data.

Notes: https://saint.tech/api/docs
"""

import logging

from airflow.models import Variable

from common.licenses import get_license_info
from common.loader import provider_details as prov
from providers.provider_api_scripts.provider_data_ingester import ProviderDataIngester


logger = logging.getLogger(__name__)


class SaintDataIngester(ProviderDataIngester):
    providers = {
        "image": prov.SAINT_DEFAULT_PROVIDER,
    }
    endpoint = "https://saint.tech/api/poi"
    creator = "IMG Saxony-Anhalt"
    creator_url = "https://saint.tech/en"

    def get_next_query_params(self, prev_query_params: dict | None) -> dict:
        if not prev_query_params:
            return {
                "page": 1,
                "pageSize": 100,
                "api_key": Variable.get("API_KEY_SAINT", default_var=""),
            }
        else:
            return {
                **prev_query_params,
                "page": prev_query_params["page"] + 1,
            }

    def get_batch_data(self, response_json) -> list[dict] | None:
        if response_json and (data := response_json.get("data")):
            return data
        return None

    def get_record_data(self, data: dict) -> dict | None:
        # Expected fields based on typical Swagger UI schemas for POI
        if not (foreign_identifier := data.get("id")):
            return None

        # Look for image
        if not (image := data.get("PrimaryImage")):
            return None

        if not (url := image.get("url")):
            return None

        # Try to find license
        license_url = (image.get("license") or {}).get("url")
        if not license_url:
            return None

        license_info = get_license_info(license_url)
        if license_info is None:
            return None

        foreign_landing_url = f"https://saint.tech/poi/{foreign_identifier}"

        title = data.get("title")

        raw_record_data = {
            "foreign_landing_url": foreign_landing_url,
            "url": url,
            "license_info": license_info,
            "foreign_identifier": str(foreign_identifier),
            "title": title,
            "creator": self.creator,
            "creator_url": self.creator_url,
        }

        if width := image.get("width"):
            raw_record_data["width"] = width
        if height := image.get("height"):
            raw_record_data["height"] = height

        return {k: v for k, v in raw_record_data.items() if v is not None}


def main():
    ingester = SaintDataIngester()
    ingester.ingest_records()


if __name__ == "__main__":
    main()
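
Reviewer's sketch (not part of the diff): the paging behaviour of `get_next_query_params` and `get_batch_data` above can be exercised without Airflow. The block below is a minimal stand-alone approximation. `next_query_params`, `fake_fetch`, and `iter_records` are hypothetical names introduced only for illustration, the two canned pages are made-up data, and the loop assumes the base `ProviderDataIngester` stops once a response yields no batch, which this diff does not show.

```python
from typing import Iterator, Optional

PAGE_SIZE = 100  # same pageSize value the ingester in the diff sends


def next_query_params(prev: Optional[dict], api_key: str = "") -> dict:
    """Start at page 1, then bump the page while keeping the other params."""
    if not prev:
        return {"page": 1, "pageSize": PAGE_SIZE, "api_key": api_key}
    return {**prev, "page": prev["page"] + 1}


def fake_fetch(params: dict) -> dict:
    """Hypothetical stand-in for the HTTP request: two pages, then nothing."""
    pages = {
        1: [{"id": 1}, {"id": 2}],
        2: [{"id": 3}],
    }
    return {"data": pages.get(params["page"], [])}


def iter_records(api_key: str = "") -> Iterator[dict]:
    """Request successive pages until a response has no usable "data" batch."""
    params = None
    while True:
        params = next_query_params(params, api_key)
        batch = fake_fetch(params).get("data")
        if not batch:  # a missing or empty batch ends the loop (assumed behaviour)
            break
        yield from batch


if __name__ == "__main__":
    print([record["id"] for record in iter_records("test_key")])
```

Because the API key travels inside the query parameters, every later page reuses the key from the first request rather than reading the Airflow Variable again.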
6 changes: 6 additions & 0 deletions catalog/dags/providers/provider_workflows.py
@@ -26,6 +26,7 @@
from providers.provider_api_scripts.phylopic import PhylopicDataIngester
from providers.provider_api_scripts.provider_data_ingester import ProviderDataIngester
from providers.provider_api_scripts.rawpixel import RawpixelDataIngester
+from providers.provider_api_scripts.saint import SaintDataIngester
from providers.provider_api_scripts.science_museum import ScienceMuseumDataIngester
from providers.provider_api_scripts.smithsonian import SmithsonianDataIngester
from providers.provider_api_scripts.smk import SmkDataIngester
@@ -320,6 +321,11 @@ def _process_configuration_overrides(self):
ingester_class=RawpixelDataIngester,
pull_timeout=timedelta(hours=12),
),
+ProviderWorkflow(
+ingester_class=SaintDataIngester,
+start_date=datetime(2024, 1, 1),
+schedule_string="@monthly",
+),
ProviderWorkflow(
ingester_class=ScienceMuseumDataIngester,
start_date=datetime(2020, 1, 1),
126 changes: 126 additions & 0 deletions catalog/tests/dags/providers/provider_api_scripts/test_saint.py
@@ -0,0 +1,126 @@
from unittest.mock import patch

import pytest

from common.licenses import LicenseInfo
from providers.provider_api_scripts.saint import SaintDataIngester


@pytest.fixture
def ingester():
    with patch("providers.provider_api_scripts.saint.Variable") as mock_var:
        mock_var.get.side_effect = lambda key, default_var=None, **kwargs: {
            "INGESTION_LIMIT": 0,
            "SKIPPED_INGESTION_ERRORS": {},
            "ENVIRONMENT": "local",
            "SHOULD_VERBOSE_LOG": [],
            "API_KEY_SAINT": "test_key",
        }.get(key, default_var)
        yield SaintDataIngester()


@pytest.mark.parametrize(
    "previous, expected_result",
    [
        pytest.param(
            None,
            {
                "page": 1,
                "pageSize": 100,
                "api_key": "test_key",
            },
            id="default_response",
        ),
        pytest.param(
            {"page": 42, "pageSize": 100, "api_key": "dummy"},
            {"page": 43, "pageSize": 100, "api_key": "dummy"},
            id="basic_increment",
        ),
    ],
)
def test_get_next_query_params(previous, expected_result, ingester):
    actual_result = ingester.get_next_query_params(previous)
    assert actual_result == expected_result


@pytest.mark.parametrize(
    "response_json, expected",
    [
        pytest.param(
            {"data": [{"id": 1}, {"id": 2}]},
            [{"id": 1}, {"id": 2}],
            id="happy_path",
        ),
        pytest.param({}, None, id="empty_dict"),
        pytest.param(None, None, id="None"),
    ],
)
def test_get_batch_data(response_json, expected, ingester):
    actual = ingester.get_batch_data(response_json)
    assert actual == expected


@pytest.mark.parametrize(
    "record, expected_data",
    [
        pytest.param({}, None, id="empty_dict"),
        pytest.param(
            {
                "id": 123,
                "title": "A nice POI",
                "PrimaryImage": {
                    "url": "https://saint.tech/images/123.jpg",
                    "width": 800,
                    "height": 600,
                    "license": {"url": "https://creativecommons.org/licenses/by/4.0/"},
                },
            },
            {
                "foreign_landing_url": "https://saint.tech/poi/123",
                "url": "https://saint.tech/images/123.jpg",
                "license_info": LicenseInfo(
                    license="by",
                    version="4.0",
                    url="https://creativecommons.org/licenses/by/4.0/",
                    raw_url="https://creativecommons.org/licenses/by/4.0/",
                ),
                "foreign_identifier": "123",
                "title": "A nice POI",
                "creator": "IMG Saxony-Anhalt",
                "creator_url": "https://saint.tech/en",
                "width": 800,
                "height": 600,
            },
            id="happy_path",
        ),
        pytest.param(
            {"id": 123, "title": "No image POI"},
            None,
            id="no_image",
        ),
        pytest.param(
            {
                "id": 123,
                "PrimaryImage": {
                    "url": "https://saint.tech/images/123.jpg",
                },
            },
            None,
            id="no_license",
        ),
        pytest.param(
            {
                "id": 123,
                "PrimaryImage": {
                    "url": "https://saint.tech/images/123.jpg",
                    "license": None,
                },
            },
            None,
            id="null_license",
        ),
    ],
)
def test_get_record_data(record, expected_data, ingester):
    actual_data = ingester.get_record_data(record)
    assert actual_data == expected_data
14 changes: 14 additions & 0 deletions documentation/catalog/reference/DAGs.md
@@ -123,6 +123,7 @@ The following are DAGs grouped by their primary tag:
| `nypl_workflow` | `@monthly` | `False` | image |
| [`phylopic_workflow`](#phylopic_workflow) | `@weekly` | `False` | image |
| [`rawpixel_workflow`](#rawpixel_workflow) | `@monthly` | `False` | image |
+| [`saint_workflow`](#saint_workflow) | `@monthly` | `False` | image |
| [`science_museum_workflow`](#science_museum_workflow) | `@monthly` | `False` | image |
| [`smithsonian_workflow`](#smithsonian_workflow) | `@weekly` | `False` | image |
| [`smk_workflow`](#smk_workflow) | `@monthly` | `False` | image |
@@ -185,6 +186,7 @@ The following is documentation associated with each DAG (where available):
1. [`report_pending_reported_media`](#report_pending_reported_media)
1. [`rotate_db_snapshots`](#rotate_db_snapshots)
1. [`rotate_envfiles`](#rotate_envfiles)
+1. [`saint_workflow`](#saint_workflow)
1. [`science_museum_workflow`](#science_museum_workflow)
1. [`smithsonian_workflow`](#smithsonian_workflow)
1. [`smk_workflow`](#smk_workflow)
@@ -1097,6 +1099,18 @@ template and "task def" as an abbreviation for task definition.

----

+### `saint_workflow`
+
+Content Provider: SAiNT (IMG Saxony-Anhalt)
+
+ETL Process: Use the API to identify all CC licensed media.
+
+Output: TSV file containing the media and the respective meta-data.
+
+Notes: https://saint.tech/api/docs
+
+----
+
### `science_museum_workflow`

Content Provider: Science Museum
2 changes: 1 addition & 1 deletion ingestion_server/Dockerfile
@@ -33,7 +33,7 @@ RUN apt-get update \
COPY Pipfile Pipfile.lock /

# Install Python dependencies system-wide (uses the active virtualenv)
-RUN pipenv install --system --deploy --dev
+RUN pipenv install --system --dev

####################
# Ingestion server #
7 changes: 6 additions & 1 deletion justfile
@@ -134,7 +134,12 @@ precommit:

# Run pre-commit to lint and reformat files
lint hook="" *files="": precommit
-python3 pre-commit.pyz run {{ hook }} {{ if files == "" { "--all-files" } else { "--files" } }} {{ files }}
+#!/usr/bin/env bash
+if ! command -v docker &> /dev/null; then
+echo "Docker not found, skipping Docker-based hooks..."
+SKIP="${SKIP:+$SKIP,}actionlint-docker,shfmt-docker,hadolint-docker"
+fi
+SKIP="$SKIP" python3 pre-commit.pyz run {{ hook }} {{ if files == "" { "--all-files" } else { "--files" } }} {{ files }}

# Run codeowners validator locally. Only enable experimental hooks if there are no uncommitted changes.
lint-codeowners checks="stable":