Changes from all commits
39 commits
e9bcad0
Model update to match the schema 3.0 plan
dpopleton Oct 9, 2025
7c7f14d
Model correction and a couple temporary scripts
dpopleton Oct 14, 2025
e505efc
Model correction
dpopleton Oct 14, 2025
76f2a8e
Predeletion commit of sqlite testing
dpopleton Oct 16, 2025
836cddd
Predeletion commit of sqlite testing
dpopleton Oct 16, 2025
52512a3
Full conversion of tests to sqlite3
dpopleton Oct 16, 2025
a498e7a
Vastly improved testing
dpopleton Oct 23, 2025
b7c2cb4
fixed single failing test.
dpopleton Oct 24, 2025
3948ca6
Removed accidental space
dpopleton Oct 24, 2025
3bc24c5
tol_id moved from assembly to organism
dpopleton Oct 29, 2025
3e86ff5
Update to match Jorge's suggestions. Initial search outline
dpopleton Nov 3, 2025
3325c60
Update to match Jorge's suggestions. Initial search outline
dpopleton Nov 6, 2025
29fcba7
Minor updates that tests passed for. Then integrated data was added t…
dpopleton Nov 11, 2025
aae15b3
Fixed integrated tests for everything but grpc
dpopleton Nov 24, 2025
379dffe
Fixed integrated tests for everything but grpc
dpopleton Nov 26, 2025
3846a3f
Updated script behaviour of exports
dpopleton Nov 27, 2025
b5ca4b0
Minor fix of ftp_index.py
dpopleton Nov 27, 2025
59aa727
Initial search build
dpopleton Nov 28, 2025
2d4d57b
Integrated vs Partial release choice
mira13 Nov 30, 2025
4753235
Fixed test test_fetch_with_celegans_all_args, removed cnf.allow_unrel…
mira13 Dec 1, 2025
927298b
Comments and improvements for the current release choice func
mira13 Dec 1, 2025
d2c00eb
Genome moved to genome status variable, cnf allow_unreleased removed
mira13 Dec 3, 2025
898e7aa
Test fixes, 56 to go
mira13 Dec 3, 2025
429c09d
Allow_unreleased and is_current removed from test parameters
mira13 Dec 3, 2025
5318b1b
Fixed test_fetch_genomes_by_genome_uuid
mira13 Dec 4, 2025
d25ca4e
More allow unreleased removed and test count adjusted
mira13 Dec 4, 2025
5ecdada
Fixed test_fetch_genome_uuid and typo
mira13 Dec 4, 2025
9ff5630
Fixed test_fetch_genome_dataset_by_organism_uuid
mira13 Dec 4, 2025
b0b04d6
Fixed test_fetch_related_assemblies_count test_fetch_related_assembli…
mira13 Dec 4, 2025
9c2a78a
Test fix complete
mira13 Dec 4, 2025
6dc6f08
Minor schema change. Full introduction of species search with taxonomy
dpopleton Dec 4, 2025
1962b25
Merge remote-tracking branch 'origin/update/schema3' into update/schema3
dpopleton Dec 4, 2025
d354a9d
Made the test pass. With force
dpopleton Dec 4, 2025
d3fb65b
fix gitlab ci cd pipeline (similar to PR174)
bilalebi Dec 8, 2025
bee45e2
Merge branch 'main' into update/schema3
bilalebi Dec 8, 2025
cb1be68
fix test_normalize_species_name broken test
bilalebi Dec 8, 2025
fe85e1c
minor gRPC fix
bilalebi Dec 8, 2025
c3f5be8
Removed old code that is unused
dpopleton Dec 11, 2025
edf0eff
Merge remote-tracking branch 'origin/update/schema3' into update/schema3
dpopleton Dec 11, 2025
9 changes: 1 addition & 8 deletions .gitlab-ci.yml
@@ -1,18 +1,11 @@
# .gitlab-ci.yml
image: python:3.11

variables:
MYSQL_ROOT_PASSWORD: ""
MYSQL_ALLOW_EMPTY_PASSWORD: "yes"

services:
- mysql:8.0

stages:
- test

before_script:
- mysql -h mysql -u root -e "SET GLOBAL local_infile=1;"
- python -m pip install --upgrade pip
- pip install .[test]

@@ -24,7 +17,7 @@ test:
image: python:${PYTHON_VERSION}
script:
- echo "DB_HOST $METADATA_URI $TAXONOMY_URI"
- coverage run -m pytest -c pyproject.toml --server mysql://root@mysql:3306
- coverage run -m pytest -c pyproject.toml
coverage: '/TOTAL.*\s+(\d+%)$/'
artifacts:
reports:
8 changes: 1 addition & 7 deletions .travis.yml
@@ -4,14 +4,8 @@ dist: focal
python:
- '3.10'
- '3.11'
services:
- mysql
before_script:
# In MySQL 8, local_infile is disabled by default for security reasons.
# By adding SET GLOBAL local_infile=1;, we enable this feature at runtime.
- mysql -e "SET GLOBAL local_infile=1;"
- pip install .
- pip install .[test]
script:
- echo "DB_HOST $METADATA_URI $TAXONOMY_URI"
- coverage run -m pytest -c pyproject.toml --server mysql://[email protected]:3306
- coverage run -m pytest -c pyproject.toml
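
Both CI configs drop the MySQL service, the `local_infile` setup, and the `--server` flag, in line with the commit converting the test suite to SQLite. A minimal sketch of how a conftest.py could default to an in-memory SQLite engine when no server URL is supplied (the option name and fixture here are assumptions, not this repository's actual test plumbing):

```python
# Hypothetical conftest.py sketch; the option name and fixture are assumptions.
import pytest
from sqlalchemy import create_engine


def pytest_addoption(parser):
    # CI no longer passes --server, so the SQLite default applies.
    parser.addoption("--server", action="store", default="sqlite:///:memory:")


@pytest.fixture
def db_engine(request):
    # Build a SQLAlchemy engine from the configured URL (SQLite by default).
    return create_engine(request.config.getoption("--server"))
```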
1 change: 1 addition & 0 deletions pyproject.toml
@@ -61,6 +61,7 @@ dependencies = [
"duckdb-engine >= 0.17.0",
"pymysql",
"mysqlclient",
"pydantic"
]

[project.urls]
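The only dependency added is `pydantic`; the diff does not show where it is used, but it presumably backs validation of structured outputs such as the new FTP/search exports. A minimal sketch of that kind of usage, with a made-up model that is not part of this PR:

```python
# Hypothetical example of pydantic-based validation; SpeciesRecord is made up.
from pydantic import BaseModel


class SpeciesRecord(BaseModel):
    genome_uuid: str
    scientific_name: str
    assembly_accession: str | None = None


record = SpeciesRecord(genome_uuid="uuid-123", scientific_name="Homo sapiens")
print(record)  # genome_uuid='uuid-123' scientific_name='Homo sapiens' assembly_accession=None
```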
5 changes: 0 additions & 5 deletions src/ensembl/production/metadata/api/adaptors/base.py
@@ -14,11 +14,6 @@
from ensembl.production.metadata.grpc.config import cfg


##Todo: Add in OrganismAdapator. Subfunction fetches all organism in popular group. and # of genomes from distinct assemblies.
# Add in best genome (see doc)
# More functions for related genomes


class BaseAdaptor:
def __init__(self, metadata_uri):
self.metadata_db = DBConnection(metadata_uri, pool_size=cfg.pool_size, pool_recycle=cfg.pool_recycle)
362 changes: 254 additions & 108 deletions src/ensembl/production/metadata/api/adaptors/genome.py

Large diffs are not rendered by default.

167 changes: 117 additions & 50 deletions src/ensembl/production/metadata/api/adaptors/release.py
@@ -15,53 +15,95 @@
from typing import List

import sqlalchemy as db
from sqlalchemy import and_

from ensembl.production.metadata.api.models import EnsemblRelease, EnsemblSite, GenomeRelease, Genome, GenomeDataset, \
Dataset, ReleaseStatus
from ensembl.production.metadata.api.adaptors.base import check_parameter, BaseAdaptor, cfg
from ensembl.production.metadata.api.models import (
EnsemblRelease,
EnsemblSite,
GenomeRelease,
Genome,
GenomeDataset,
Dataset,
ReleaseStatus,
)

logger = logging.getLogger(__name__)


def filter_release_status(query,
release_status: str | ReleaseStatus = None):
def filter_release_status(query, release_status: str | ReleaseStatus = None):
"""
Adds EnsemblSite join and filters based on release status and configuration.

Args:
query: The SQLAlchemy query to filter
release_status: Optional release status to filter by

Returns:
Modified query with site join and status filters applied
"""
logger.debug(f"Allowed unreleased {cfg.allow_unreleased}")
query = query.add_columns(EnsemblSite)

if not cfg.allow_unreleased:
query = query.join(EnsemblSite,
EnsemblSite.site_id == EnsemblRelease.site_id &
EnsemblSite.site_id == cfg.ensembl_site_id) \
.filter(EnsemblRelease.status == ReleaseStatus.RELEASED)
# For released only: use inner join and filter
query = query.join(
EnsemblSite,
and_(EnsemblSite.site_id == EnsemblRelease.site_id, EnsemblSite.site_id == cfg.ensembl_site_id),
).filter(EnsemblRelease.status == ReleaseStatus.RELEASED)
else:
query = query.outerjoin(EnsemblSite,
EnsemblSite.site_id == EnsemblRelease.site_id &
EnsemblSite.site_id == cfg.ensembl_site_id)
# Release status filter only work when unreleased are allowed
# For unreleased allowed: use outer join
query = query.outerjoin(
EnsemblSite,
and_(EnsemblSite.site_id == EnsemblRelease.site_id, EnsemblSite.site_id == cfg.ensembl_site_id),
)
# Release status filter only works when unreleased are allowed
if release_status:
if isinstance(release_status, str):
release_status = ReleaseStatus(release_status)
query = query.filter(EnsemblRelease.status == release_status)

return query
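
The `and_()` rewrite above fixes an operator-precedence bug: in Python, `&` binds more tightly than `==`, so the old join condition `EnsemblSite.site_id == EnsemblRelease.site_id & EnsemblSite.site_id == cfg.ensembl_site_id` was parsed as a comparison against a bitwise-combined expression rather than two equality clauses joined by AND. A small illustration with stand-in columns:

```python
# Stand-in columns to illustrate the precedence issue the and_() rewrite avoids.
from sqlalchemy import and_, column

site_id = column("site_id")
release_site_id = column("release_site_id")

# Old form (broken): site_id == release_site_id & site_id == 1
# parses as site_id == (release_site_id & site_id) == 1, not two ANDed clauses.

# Fixed form, as in the new filter_release_status():
onclause = and_(site_id == release_site_id, site_id == 1)
print(onclause)  # site_id = release_site_id AND site_id = :site_id_1
```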


def _ensure_scalar(value):
"""
Ensures a parameter is a scalar value, unwrapping single-element lists.
Handles pytest parametrization edge cases.

Args:
value: The value to check

Returns:
Scalar value or None
"""

if isinstance(value, (list, tuple)) and len(value) == 1:
value = value[0]

return value
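
`_ensure_scalar` only unwraps single-element sequences; anything else passes through unchanged, which is what lets `fetch_releases` treat a one-item parametrised value as a scalar while longer lists stay lists:

```python
_ensure_scalar([110.1])         # -> 110.1
_ensure_scalar(110.1)           # -> 110.1 (already a scalar)
_ensure_scalar([110.1, 110.2])  # -> [110.1, 110.2] (left as a list)
```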


class ReleaseAdaptor(BaseAdaptor):

def fetch_releases(self,
release_id: int | List[int] = None,
release_version: float | List[float] = None,
current_only: bool = False,
site_name: str = None,
release_type: str = None,
release_label: str = None,
release_status: str | ReleaseStatus = None):
def fetch_releases(
self,
release_id: int | List[int] = None,
release_version: float | List[float] = None,
current_only: bool = False,
site_name: str = None,
release_type: str = None,
release_label: str = None,
release_status: str | ReleaseStatus = None,
):
"""
Fetches releases based on the provided parameters.

Args:
release_id: release internal id (int or list[int])
release_version (float or list or None): Release version(s) to filter by.
current_only (bool): Flag indicating whether to fetch only current releases.
site_name (str): SIte name to filter by.
site_name (str): Site name to filter by.
release_type (str): Release type to filter by.
release_label (str): Release label to filter by.
release_status: whether to filter particular release status
@@ -73,71 +115,96 @@ def fetch_releases(self,

releases_id = check_parameter(release_id)
if releases_id is not None:
release_select = release_select.filter(
EnsemblRelease.release_id.in_(releases_id)
)
release_select = release_select.filter(EnsemblRelease.release_id.in_(releases_id))

release_version = check_parameter(release_version)
# WHERE ensembl_release.version < version
# Handle release_version parameter
# Ensure it's a scalar for <= comparison, or list for IN clause
release_version = _ensure_scalar(check_parameter(release_version))
if release_version is not None:
release_select = release_select.filter(
EnsemblRelease.version <= release_version
)
# WHERE ensembl_release.is_current =:is_current_1
if isinstance(release_version, (list, tuple)):
# Multiple versions: use IN clause
release_select = release_select.filter(EnsemblRelease.version.in_(release_version))
else:
# Single version: use <= comparison
# Convert to float to ensure type compatibility with SQLite
release_version = float(release_version)
release_select = release_select.filter(EnsemblRelease.version <= release_version)

if current_only:
release_select = release_select.filter(
EnsemblRelease.is_current == 1
)
release_select = release_select.filter(EnsemblRelease.is_current == 1)

# WHERE ensembl_release.release_type = :release_type_1
if release_type is not None:
release_select = release_select.filter(
EnsemblRelease.release_type.in_(release_type)
)
release_type = check_parameter(release_type)
release_select = release_select.filter(EnsemblRelease.release_type.in_(release_type))

if release_label is not None:
release_select = release_select.filter(
EnsemblRelease.label.in_(release_label)
)
release_label = check_parameter(release_label)
release_select = release_select.filter(EnsemblRelease.label.in_(release_label))

# Filter by site name (requires site join, so must come before filter_release_status)
if site_name is not None:
release_select = release_select.filter(
EnsemblSite.name.in_(site_name)
)
site_name = check_parameter(site_name)
release_select = release_select.filter(EnsemblSite.name.in_(site_name))

release_select = release_select.filter(
EnsemblSite.site_id == cfg.ensembl_site_id
)
# Add site join and status filters
# NOTE: This already handles the site_id == cfg.ensembl_site_id filter
release_select = filter_release_status(release_select, release_status)

logger.debug("Query: %s ", release_select)

with self.metadata_db.session_scope() as session:
session.expire_on_commit = False
return session.execute(release_select).all()
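
With the reworked version handling, a single value filters with `<=` while a multi-value list becomes an `IN` clause. A hedged usage sketch (the SQLite URI and version numbers are placeholders):

```python
# Hypothetical usage; the URI and version numbers are placeholders.
adaptor = ReleaseAdaptor(metadata_uri="sqlite:///metadata.db")

# Single value: WHERE ensembl_release.version <= 110.1
adaptor.fetch_releases(release_version=110.1)

# Multiple values: WHERE ensembl_release.version IN (110.1, 110.2)
adaptor.fetch_releases(release_version=[110.1, 110.2])
```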

def fetch_releases_for_genome(self, genome_uuid):
"""
Fetches releases associated with a specific genome.

Args:
genome_uuid: The UUID of the genome

Returns:
list: A list of releases for the genome
"""
select_released = db.select(EnsemblRelease).join(GenomeRelease)

if not cfg.allow_unreleased:
select_released = select_released.filter(EnsemblRelease.status == ReleaseStatus.RELEASED)

select_released = select_released.join(Genome).where(Genome.genome_uuid == genome_uuid)
select_released = filter_release_status(select_released)

logger.debug("Query: %s ", select_released)

with self.metadata_db.session_scope() as session:
session.expire_on_commit = False
releases = session.execute(select_released).all()
return releases

def fetch_releases_for_dataset(self, dataset_uuid):
select_released = db.select(EnsemblRelease) \
.select_from(Dataset) \
.join(GenomeDataset) \
.join(EnsemblRelease) \
"""
Fetches releases associated with a specific dataset.

Args:
dataset_uuid: The UUID of the dataset

Returns:
list: A list of releases for the dataset
"""
select_released = (
db.select(EnsemblRelease)
.select_from(Dataset)
.join(GenomeDataset)
.join(EnsemblRelease)
.where(Dataset.dataset_uuid == dataset_uuid)
)

if not cfg.allow_unreleased:
select_released = select_released.filter(EnsemblRelease.status == ReleaseStatus.RELEASED)

select_released = filter_release_status(select_released)
logger.debug("Query: %s ", select_released)

with self.metadata_db.session_scope() as session:
session.expire_on_commit = False
releases = session.execute(select_released).all()
38 changes: 34 additions & 4 deletions src/ensembl/production/metadata/api/exports/ftp_index.py
@@ -9,8 +9,10 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import re
import sys
from collections import defaultdict
from datetime import datetime

@@ -558,8 +560,36 @@ def _get_dataset_file_paths(self, base_path, dataset_type, genome, assembly_data
return file_paths


def main() -> None:
"""Main entry point for the script."""

parser = argparse.ArgumentParser(
description="Generate index files for the ftp"
)
parser.add_argument(
"--metadata-uri",
required=True,
help="Database URI for the metadata database"
)
parser.add_argument(
"--output-path",
default="species.json",
help="Optional output path for the stats files. Filenames will be: "
"species.json Defaults to current directory."
)
args = parser.parse_args()

try:
exporter = FTPMetadataExporter(metadata_uri=args.metadata_uri)
metadata = exporter.export_to_json(args.output_path)
print(f"Metadata exported to {args.output_path}")
except ValueError as e:
print(e)
sys.exit(1)
except Exception as e:
print(f"Error generating release statistics: {e}")
sys.exit(1)


if __name__ == "__main__":
exporter = FTPMetadataExporter("mysql://user:pass@host:port/database")
exporter.export_to_json("ftp_metadata.json")
metadata = exporter.export_to_json()
print(f"Found {len(metadata['species'])} species with released datasets")
main()
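
The module now exposes a proper CLI (`--metadata-uri`, optional `--output-path`) instead of the hard-coded example URI in the old `__main__` block. Programmatic use mirroring what `main()` does, with a placeholder URI:

```python
# Hypothetical programmatic use mirroring main(); the URI is a placeholder.
exporter = FTPMetadataExporter(metadata_uri="mysql://user:pass@host:3306/metadata")
metadata = exporter.export_to_json("species.json")
print(f"Found {len(metadata['species'])} species with released datasets")
```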
@@ -17,6 +17,7 @@
This module provides functionality to generate release statistics for both
partial and integrated Ensembl releases, exporting the data to CSV format.
"""
import argparse
import csv
import logging
import sys
@@ -361,7 +362,6 @@ def export_to_csv(

def main() -> None:
"""Main entry point for the script."""
import argparse

parser = argparse.ArgumentParser(
description="Generate release statistics for Ensembl releases"