Skip to content

feat: initial support for data-platform#430

Open
sg-s wants to merge 12 commits intomainfrom
data-platform-api
Open

feat: initial support for data-platform#430
sg-s wants to merge 12 commits intomainfrom
data-platform-api

Conversation

@sg-s
Copy link
Collaborator

@sg-s sg-s commented Feb 3, 2026

changes

  • basic support for data-platform
  • ability to register a Ligand
  • ability to register a protein
  • ability to search for a Ligand
  • ability to search for a protein
  • ability to construct a Ligand from ID
  • ability to construct a Protein from ID

will not implement this pr

  • ability to list projects
  • ability to create a project
  • ability to check if a project exists

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds initial support for the data-platform API, enabling users to search and query various entities (ligands, proteins, etc.) through the DeepOriginClient.

Changes:

  • Added a new Data API wrapper class with methods for health checks, model listing, and entity searching
  • Integrated the Data wrapper into DeepOriginClient
  • Added comprehensive test coverage and mock server endpoints for the new functionality

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/platform/data.py New Data API wrapper class with methods for searching entities, ligands, and proteins, plus health and model listing endpoints
src/platform/client.py Integration of Data wrapper into DeepOriginClient initialization
tests/test_data.py Comprehensive test suite for data platform functionality including health checks, search operations, and error handling
tests/mock_server/server.py Mock endpoints for data platform API including health, search, and model listing
docs/platform/ref/data.md User-facing documentation for the Data API with usage examples
mkdocs.yaml Added data reference to documentation navigation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +70 to +90
def test_search_proteins_lv1():
"""Test searching proteins using convenience method."""
client = DeepOriginClient()
response = client.data.search_proteins()

assert isinstance(response, dict), "Expected a dictionary response"
assert "data" in response, "Expected 'data' key in response"
assert isinstance(response["data"], list), "Expected 'data' to be a list"


def test_search_proteins_molecular_weight():
"""Test searching proteins with molecular weight filters."""
client = DeepOriginClient()
response = client.data.search_proteins(
min_molecular_weight=250, max_molecular_weight=550
)

assert isinstance(response, dict), "Expected a dictionary response"
assert "data" in response, "Expected 'data' key in response"
assert isinstance(response["data"], list), "Expected 'data' to be a list"

Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no test coverage for the pdb_id parameter in search_proteins. Consider adding a test that verifies the pdb_id filter works correctly.

Copilot uses AI. Check for mistakes.
Comment on lines 223 to 235
def search_proteins(
self,
*,
cursor: str | None = None,
pdb_id: str | None = None,
min_molecular_weight: float | int | None = None,
max_molecular_weight: float | int | None = None,
sequence: str | None = None,
limit: int | None = None,
offset: int | None = None,
select: list[str] | None = None,
sort: dict[str, str] | None = None,
) -> dict:
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The search_proteins method does not accept a filter parameter like search_ligands does (line 157). Users cannot pass custom filters to search_proteins, which limits flexibility. Consider adding a filter parameter to maintain consistency with search_ligands and allow users to specify custom filter criteria beyond the convenience parameters (pdb_id, molecular_weight, sequence).

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ maybe, later. not sure if the filter syntax is usable

Comment on lines +48 to +56
def test_search_ligands_lv1():
"""Test searching ligands using convenience method."""
client = DeepOriginClient()
response = client.data.search_ligands()

assert isinstance(response, dict), "Expected a dictionary response"
assert "data" in response, "Expected 'data' key in response"
assert isinstance(response["data"], list), "Expected 'data' to be a list"

Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no test coverage for the filter parameter in search_ligands. The test_search_ligands_lv1 function only tests the basic case without filters, and test_search_ligands_molecular_weight only tests the molecular weight convenience parameters. Consider adding a test that verifies the filter parameter works correctly, similar to how the search method is tested in test_search_entity_lv1.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

self,
*,
cursor: str | None = None,
filter: dict[str, Any] | None = None,
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter name 'filter' in search_ligands is inconsistent with 'filter_dict' used in the base search() method. While both names work, using 'filter_dict' consistently across all methods would improve API consistency and make it clearer that the parameter expects a dictionary. Consider renaming 'filter' to 'filter_dict' for consistency.

Suggested change
filter: dict[str, Any] | None = None,
filter_dict: dict[str, Any] | None = None,

Copilot uses AI. Check for mistakes.
Comment on lines +240 to +253
def search_proteins(
self,
*,
cursor: str | None = None,
pdb_id: str | None = None,
file_path: str | None = None,
min_molecular_weight: float | int | None = None,
max_molecular_weight: float | int | None = None,
sequence: str | None = None,
limit: int | None = None,
offset: int | None = None,
select: list[str] | None = None,
sort: dict[str, str] | None = None,
) -> dict:
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The search_proteins method lacks a 'filter' parameter for additional filtering, while search_ligands has this parameter. This creates an API inconsistency between the two similar convenience methods. Users may need to pass custom filters for proteins similar to how they can for ligands. Consider adding a 'filter' parameter to search_proteins for consistency and flexibility.

Copilot uses AI. Check for mistakes.
Comment on lines +1317 to +1323
it returns the existing protein data instead of creating a new one.

Args:
client: DeepOriginClient instance. If None, uses DeepOriginClient.get().

Returns:
Dictionary containing the created or existing protein data from the data platform.
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring states "Returns: Dictionary containing the created or existing protein data from the data platform" but the method actually returns None. The docstring should be corrected to indicate that this method returns None and instead updates self.id as a side effect.

Suggested change
it returns the existing protein data instead of creating a new one.
Args:
client: DeepOriginClient instance. If None, uses DeepOriginClient.get().
Returns:
Dictionary containing the created or existing protein data from the data platform.
it updates the current instance with the existing protein's ID instead of
creating a new one.
Args:
client: DeepOriginClient instance. If None, uses DeepOriginClient.get().
Returns:
None. As a side effect, uploads the protein (if necessary) and updates
``self.id`` with the ID of the existing or newly created protein record.

Copilot uses AI. Check for mistakes.
Comment on lines 993 to 1057
@beartype
def sync(self, client: Optional[DeepOriginClient] = None) -> None:
"""Sync the ligand to the data platform.

This method uploads the ligand file to remote storage (if available) and creates a ligand
record in the data platform. If a ligand with the same canonical_smiles already exists,
it returns the existing ligand data instead of creating a new one.

Args:
client: DeepOriginClient instance. If None, uses DeepOriginClient.get().

Note:
If the ligand was created from a SMILES string without an SDF file, only the SMILES
will be used for syncing (no file upload will occur).
"""
if client is None:
client = DeepOriginClient.get()

# If ligand has a file_path, upload it to remote storage
# (Note: ligands in the data platform are identified by canonical_smiles, not file_path)
if self.file_path is not None:
# Upload the ligand file first
self.upload(client=client)

# Search for existing ligands by canonical_smiles
response = client.data.search_ligands(canonical_smiles=self.canonical_smiles)
data = response["data"]

# If a ligand with this canonical_smiles already exists, update self.id and return
if data:
existing_ligand = data[0]
if "id" in existing_ligand:
self.id = existing_ligand["id"]
return

# No existing ligand found, create a new one
# Prepare parameters for create_ligand
# Note: canonical_smiles is read-only and computed by the platform
kwargs: dict[str, Any] = {
"smiles": self.smiles if self.smiles is not None else self.canonical_smiles,
}

# Add optional fields if available
if self.name is not None:
kwargs["name"] = self.name

# Add computed molecular properties if mol is available
if self.mol is not None:
try:
kwargs["formal_charge"] = self.formal_charge
kwargs["molecular_weight"] = self.molecular_weight
kwargs["hbond_donor_count"] = self.hbond_donor_count
kwargs["hbond_acceptor_count"] = self.hbond_acceptor_count
kwargs["rotatable_bond_count"] = self.rotatable_bond_count
kwargs["tpsa"] = self.tpsa
except Exception:
# If property computation fails, continue without those properties
pass

# Call create_ligand through the client
result = client.data.create_ligand(**kwargs)

# Update self.id with the newly created ligand's ID
if "data" in result and "id" in result["data"]:
self.id = result["data"]["id"]
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newly added sync() method for Ligand is not covered by tests. Given that other methods in the same file have test coverage, this method should have tests to verify it correctly uploads files, searches for existing ligands by canonical_smiles, creates new ligands when needed, and properly updates the self.id attribute.

Copilot uses AI. Check for mistakes.
Comment on lines +1311 to +1364
@beartype
def sync(self, client: Optional[DeepOriginClient] = None) -> None:
"""Sync the protein to the data platform.

This method uploads the protein file to remote storage and creates a protein
record in the data platform. If a protein with the same file_path already exists,
it returns the existing protein data instead of creating a new one.

Args:
client: DeepOriginClient instance. If None, uses DeepOriginClient.get().

Returns:
Dictionary containing the created or existing protein data from the data platform.
"""
if client is None:
client = DeepOriginClient.get()

# Upload the protein file first
self.upload(client=client)

# Use the remote path as the file_path
file_path = self._remote_path

# Search for existing proteins with the same file_path
response = client.data.search_proteins(file_path=file_path)
data = response["data"]

# If a protein with this file_path already exists, return the first one
if data:
existing_protein = data[0]
# Update self.id with the existing protein's ID
if "id" in existing_protein:
self.id = existing_protein["id"]
return

# No existing protein found, create a new one
# Prepare parameters for create_protein
kwargs: dict[str, Any] = {
"file_path": file_path,
}

# Pass pdb_id if available
if self.pdb_id is not None:
kwargs["pdb_id"] = self.pdb_id

kwargs["protein_length"] = self.length
kwargs["protein_name"] = self.name

# Call create_protein through the client
result = client.data.create_protein(**kwargs)

# Update self.id with the newly created protein's ID
if "data" in result and "id" in result["data"]:
self.id = result["data"]["id"]
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newly added sync() method for Protein is not covered by tests. Given that other methods in the same file have test coverage, this method should have tests to verify it correctly uploads files, searches for existing proteins by file_path, creates new proteins when needed, and properly updates the self.id attribute.

Copilot uses AI. Check for mistakes.
if self.pdb_id is not None:
kwargs["pdb_id"] = self.pdb_id

kwargs["protein_length"] = self.length
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sync() method unconditionally calls self.length (line 1356), which in turn accesses self.sequence, which requires self.file_path to be non-None. However, the Protein class allows file_path to be Optional[Path]. If a Protein instance is created without a file_path, calling sync() will fail with an error when trying to compute the length. Consider adding a check to ensure file_path is not None before calling self.length, or handle the case where length cannot be computed.

Suggested change
kwargs["protein_length"] = self.length
# Only compute and include protein_length when a local file_path is available
if getattr(self, "file_path", None) is not None:
kwargs["protein_length"] = self.length

Copilot uses AI. Check for mistakes.
Comment on lines +33 to +46
ligand = client.data.create_ligand(
project_id="\\x0011223344556677",
canonical_smiles="CCOc1ccc2nc(S(=O)(=O)N3CCN(CC3)C)c(N)c2c1",
inchi_key="BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
inchi="InChI=1S/C20H24N4O4S/.../h1-4,6-9H,5,10-14H2,(H,22,23)",
smiles="CCOc1ccc2nc(S(=O)(=O)N3CCN(CC3)C)c(N)c2c1",
name="Compound-12345",
formal_charge=0,
hbond_donor_count=1,
hbond_acceptor_count=6,
rotatable_bond_count=5,
tpsa=85.12,
molecular_weight=447.5,
)
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation example shows parameters that are not accepted by the create_ligand method. The method does not accept 'canonical_smiles' or 'inchi_key' or 'inchi' as parameters - these are computed by the platform and returned in the response. Only 'smiles' should be passed as input. The example should be updated to only include valid parameters.

Copilot uses AI. Check for mistakes.
Comment on lines +1371 to +1374
@property
def length(self) -> int:
"""get the length of the protein structure"""
return sum([len(seq) for seq in self.sequence])
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newly added length property for Protein is not covered by tests. Given that other properties in the same file have test coverage, this property should have tests to verify it correctly sums the lengths of all sequences in the protein structure.

Copilot uses AI. Check for mistakes.
Comment on lines +134 to +137
filter_dict = {"deleted": False}
else:
filter_dict = filter_dict.copy()
filter_dict["deleted"] = False
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The search() method always sets filter_dict["deleted"] = False, which prevents users from searching for deleted entities even if they explicitly pass "deleted": True in their filter. While this might be the intended behavior for most use cases, it removes flexibility and could be unexpected. Consider either documenting this behavior clearly or allowing users to explicitly search for deleted entities when needed.

Suggested change
filter_dict = {"deleted": False}
else:
filter_dict = filter_dict.copy()
filter_dict["deleted"] = False
# By default, exclude deleted entities when no filter is provided.
filter_dict = {"deleted": False}
else:
# Work on a copy and only set a default for "deleted" if the caller
# did not explicitly provide it, allowing explicit searches for
# deleted entities when needed.
filter_dict = filter_dict.copy()
filter_dict.setdefault("deleted", False)

Copilot uses AI. Check for mistakes.
This class manages the remote path and provides an upload method to ensure that the entity's file is uploaded to the remote storage if it does not already exist there. It uses the DeepOrigin FilesClient for remote file operations.
"""

id: str | None = field(default=None, kw_only=True)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all entities have a ID now. this is the ID in the data platform

Comment on lines +994 to +998
def sync(self, client: Optional[DeepOriginClient] = None) -> None:
"""Sync the ligand to the data platform.

This method uploads the ligand file to remote storage (if available) and creates a ligand
record in the data platform. If a ligand with the same canonical_smiles already exists,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syncing syncs with the data platform/ufa

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments