Conversation
Pull request overview
This PR adds initial support for the data-platform API, enabling users to search and query various entities (ligands, proteins, etc.) through the DeepOriginClient.
Changes:
- Added a new DataAPI wrapper class with methods for health checks, model listing, and entity searching
- Integrated the Data wrapper into DeepOriginClient
- Added comprehensive test coverage and mock server endpoints for the new functionality
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/platform/data.py | New Data API wrapper class with methods for searching entities, ligands, and proteins, plus health and model listing endpoints |
| src/platform/client.py | Integration of Data wrapper into DeepOriginClient initialization |
| tests/test_data.py | Comprehensive test suite for data platform functionality including health checks, search operations, and error handling |
| tests/mock_server/server.py | Mock endpoints for data platform API including health, search, and model listing |
| docs/platform/ref/data.md | User-facing documentation for the Data API with usage examples |
| mkdocs.yaml | Added data reference to documentation navigation |
```python
def test_search_proteins_lv1():
    """Test searching proteins using convenience method."""
    client = DeepOriginClient()
    response = client.data.search_proteins()

    assert isinstance(response, dict), "Expected a dictionary response"
    assert "data" in response, "Expected 'data' key in response"
    assert isinstance(response["data"], list), "Expected 'data' to be a list"


def test_search_proteins_molecular_weight():
    """Test searching proteins with molecular weight filters."""
    client = DeepOriginClient()
    response = client.data.search_proteins(
        min_molecular_weight=250, max_molecular_weight=550
    )

    assert isinstance(response, dict), "Expected a dictionary response"
    assert "data" in response, "Expected 'data' key in response"
    assert isinstance(response["data"], list), "Expected 'data' to be a list"
```
There is no test coverage for the pdb_id parameter in search_proteins. Consider adding a test that verifies the pdb_id filter works correctly.
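A minimal sketch of such a test, mirroring the assertions used by the existing tests; the PDB ID below is a placeholder and the mock server may need a matching fixture:

```python
def test_search_proteins_pdb_id():
    """Test searching proteins filtered by PDB ID."""
    client = DeepOriginClient()
    # "1ABC" is a placeholder PDB ID; use one the mock server knows about.
    response = client.data.search_proteins(pdb_id="1ABC")

    assert isinstance(response, dict), "Expected a dictionary response"
    assert "data" in response, "Expected 'data' key in response"
    assert isinstance(response["data"], list), "Expected 'data' to be a list"
```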
```python
    def search_proteins(
        self,
        *,
        cursor: str | None = None,
        pdb_id: str | None = None,
        min_molecular_weight: float | int | None = None,
        max_molecular_weight: float | int | None = None,
        sequence: str | None = None,
        limit: int | None = None,
        offset: int | None = None,
        select: list[str] | None = None,
        sort: dict[str, str] | None = None,
    ) -> dict:
```
The search_proteins method does not accept a filter parameter like search_ligands does (line 157). Users cannot pass custom filters to search_proteins, which limits flexibility. Consider adding a filter parameter to maintain consistency with search_ligands and allow users to specify custom filter criteria beyond the convenience parameters (pdb_id, molecular_weight, sequence).
^ Maybe later; not sure if the filter syntax is usable.
```python
def test_search_ligands_lv1():
    """Test searching ligands using convenience method."""
    client = DeepOriginClient()
    response = client.data.search_ligands()

    assert isinstance(response, dict), "Expected a dictionary response"
    assert "data" in response, "Expected 'data' key in response"
    assert isinstance(response["data"], list), "Expected 'data' to be a list"
```
There is no test coverage for the filter parameter in search_ligands. The test_search_ligands_lv1 function only tests the basic case without filters, and test_search_ligands_molecular_weight only tests the molecular weight convenience parameters. Consider adding a test that verifies the filter parameter works correctly, similar to how the search method is tested in test_search_entity_lv1.
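A minimal sketch of such a test; the filter key/value pair is illustrative, since the exact filter syntax accepted by the data platform is not shown in this diff:

```python
def test_search_ligands_filter():
    """Test searching ligands with an explicit filter dictionary."""
    client = DeepOriginClient()
    # The filter contents are illustrative; adjust to fields the data platform exposes.
    response = client.data.search_ligands(filter={"name": "Compound-12345"})

    assert isinstance(response, dict), "Expected a dictionary response"
    assert "data" in response, "Expected 'data' key in response"
    assert isinstance(response["data"], list), "Expected 'data' to be a list"
```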
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
src/platform/data.py (outdated)
```python
        self,
        *,
        cursor: str | None = None,
        filter: dict[str, Any] | None = None,
```
The parameter name 'filter' in search_ligands is inconsistent with 'filter_dict' used in the base search() method. While both names work, using 'filter_dict' consistently across all methods would improve API consistency and make it clearer that the parameter expects a dictionary. Consider renaming 'filter' to 'filter_dict' for consistency.
Suggested change:

```diff
-        filter: dict[str, Any] | None = None,
+        filter_dict: dict[str, Any] | None = None,
```
```python
    def search_proteins(
        self,
        *,
        cursor: str | None = None,
        pdb_id: str | None = None,
        file_path: str | None = None,
        min_molecular_weight: float | int | None = None,
        max_molecular_weight: float | int | None = None,
        sequence: str | None = None,
        limit: int | None = None,
        offset: int | None = None,
        select: list[str] | None = None,
        sort: dict[str, str] | None = None,
    ) -> dict:
```
The search_proteins method lacks a 'filter' parameter for additional filtering, while search_ligands has this parameter. This creates an API inconsistency between the two similar convenience methods. Users may need to pass custom filters for proteins similar to how they can for ligands. Consider adding a 'filter' parameter to search_proteins for consistency and flexibility.
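If this does get added later, the change could be as small as mirroring the keyword from search_ligands; how the convenience parameters and a user-supplied filter are merged is up to the existing search() implementation, which this hunk does not show, so the body below is only a placeholder:

```python
    def search_proteins(
        self,
        *,
        cursor: str | None = None,
        pdb_id: str | None = None,
        file_path: str | None = None,
        min_molecular_weight: float | int | None = None,
        max_molecular_weight: float | int | None = None,
        sequence: str | None = None,
        filter: dict[str, Any] | None = None,  # new: mirrors search_ligands
        limit: int | None = None,
        offset: int | None = None,
        select: list[str] | None = None,
        sort: dict[str, str] | None = None,
    ) -> dict:
        # Merge `filter` with the convenience-parameter filters before delegating
        # to the base search() method (merging strategy left to the implementation).
        ...
```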
```python
        it returns the existing protein data instead of creating a new one.

        Args:
            client: DeepOriginClient instance. If None, uses DeepOriginClient.get().

        Returns:
            Dictionary containing the created or existing protein data from the data platform.
```
The docstring states "Returns: Dictionary containing the created or existing protein data from the data platform" but the method actually returns None. The docstring should be corrected to indicate that this method returns None and instead updates self.id as a side effect.
Suggested change:

```diff
-        it returns the existing protein data instead of creating a new one.
-
-        Args:
-            client: DeepOriginClient instance. If None, uses DeepOriginClient.get().
-
-        Returns:
-            Dictionary containing the created or existing protein data from the data platform.
+        it updates the current instance with the existing protein's ID instead of
+        creating a new one.
+
+        Args:
+            client: DeepOriginClient instance. If None, uses DeepOriginClient.get().
+
+        Returns:
+            None. As a side effect, uploads the protein (if necessary) and updates
+            ``self.id`` with the ID of the existing or newly created protein record.
```
```python
    @beartype
    def sync(self, client: Optional[DeepOriginClient] = None) -> None:
        """Sync the ligand to the data platform.

        This method uploads the ligand file to remote storage (if available) and creates a ligand
        record in the data platform. If a ligand with the same canonical_smiles already exists,
        it returns the existing ligand data instead of creating a new one.

        Args:
            client: DeepOriginClient instance. If None, uses DeepOriginClient.get().

        Note:
            If the ligand was created from a SMILES string without an SDF file, only the SMILES
            will be used for syncing (no file upload will occur).
        """
        if client is None:
            client = DeepOriginClient.get()

        # If ligand has a file_path, upload it to remote storage
        # (Note: ligands in the data platform are identified by canonical_smiles, not file_path)
        if self.file_path is not None:
            # Upload the ligand file first
            self.upload(client=client)

        # Search for existing ligands by canonical_smiles
        response = client.data.search_ligands(canonical_smiles=self.canonical_smiles)
        data = response["data"]

        # If a ligand with this canonical_smiles already exists, update self.id and return
        if data:
            existing_ligand = data[0]
            if "id" in existing_ligand:
                self.id = existing_ligand["id"]
            return

        # No existing ligand found, create a new one
        # Prepare parameters for create_ligand
        # Note: canonical_smiles is read-only and computed by the platform
        kwargs: dict[str, Any] = {
            "smiles": self.smiles if self.smiles is not None else self.canonical_smiles,
        }

        # Add optional fields if available
        if self.name is not None:
            kwargs["name"] = self.name

        # Add computed molecular properties if mol is available
        if self.mol is not None:
            try:
                kwargs["formal_charge"] = self.formal_charge
                kwargs["molecular_weight"] = self.molecular_weight
                kwargs["hbond_donor_count"] = self.hbond_donor_count
                kwargs["hbond_acceptor_count"] = self.hbond_acceptor_count
                kwargs["rotatable_bond_count"] = self.rotatable_bond_count
                kwargs["tpsa"] = self.tpsa
            except Exception:
                # If property computation fails, continue without those properties
                pass

        # Call create_ligand through the client
        result = client.data.create_ligand(**kwargs)

        # Update self.id with the newly created ligand's ID
        if "data" in result and "id" in result["data"]:
            self.id = result["data"]["id"]
```
The newly added sync() method for Ligand is not covered by tests. Given that other methods in the same file have test coverage, this method should have tests to verify it correctly uploads files, searches for existing ligands by canonical_smiles, creates new ligands when needed, and properly updates the self.id attribute.
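A minimal sketch of one such test, using a mocked client so no real requests are made. The mock is routed through DeepOriginClient.get() to avoid the @beartype check on the client argument; Ligand.from_smiles is a placeholder for the SDK's actual constructor, and the imports follow whatever the existing test module already uses:

```python
from unittest.mock import MagicMock, patch


def test_ligand_sync_creates_new_record():
    """sync() should create a ligand when no match exists and set self.id."""
    # Ligand.from_smiles(...) is a placeholder; use the SDK's actual constructor.
    ligand = Ligand.from_smiles("CCO")

    mock_client = MagicMock()
    mock_client.data.search_ligands.return_value = {"data": []}  # no existing ligand
    mock_client.data.create_ligand.return_value = {"data": {"id": "lig-123"}}

    # Route sync() through the mocked client instead of a real connection.
    with patch.object(DeepOriginClient, "get", return_value=mock_client):
        ligand.sync()

    mock_client.data.create_ligand.assert_called_once()
    assert ligand.id == "lig-123"
```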
```python
    @beartype
    def sync(self, client: Optional[DeepOriginClient] = None) -> None:
        """Sync the protein to the data platform.

        This method uploads the protein file to remote storage and creates a protein
        record in the data platform. If a protein with the same file_path already exists,
        it returns the existing protein data instead of creating a new one.

        Args:
            client: DeepOriginClient instance. If None, uses DeepOriginClient.get().

        Returns:
            Dictionary containing the created or existing protein data from the data platform.
        """
        if client is None:
            client = DeepOriginClient.get()

        # Upload the protein file first
        self.upload(client=client)

        # Use the remote path as the file_path
        file_path = self._remote_path

        # Search for existing proteins with the same file_path
        response = client.data.search_proteins(file_path=file_path)
        data = response["data"]

        # If a protein with this file_path already exists, return the first one
        if data:
            existing_protein = data[0]
            # Update self.id with the existing protein's ID
            if "id" in existing_protein:
                self.id = existing_protein["id"]
            return

        # No existing protein found, create a new one
        # Prepare parameters for create_protein
        kwargs: dict[str, Any] = {
            "file_path": file_path,
        }

        # Pass pdb_id if available
        if self.pdb_id is not None:
            kwargs["pdb_id"] = self.pdb_id

        kwargs["protein_length"] = self.length
        kwargs["protein_name"] = self.name

        # Call create_protein through the client
        result = client.data.create_protein(**kwargs)

        # Update self.id with the newly created protein's ID
        if "data" in result and "id" in result["data"]:
            self.id = result["data"]["id"]
```
The newly added sync() method for Protein is not covered by tests. Given that other methods in the same file have test coverage, this method should have tests to verify it correctly uploads files, searches for existing proteins by file_path, creates new proteins when needed, and properly updates the self.id attribute.
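A minimal sketch of one such test, again with a mocked client routed through DeepOriginClient.get(). Protein.from_file and the fixture path are placeholders for whatever constructor and test file the suite already uses, and upload() is patched out on the assumption that _remote_path can be derived without a real upload:

```python
from unittest.mock import MagicMock, patch


def test_protein_sync_reuses_existing_record():
    """sync() should reuse an existing protein record and set self.id."""
    # Protein.from_file(...) and the fixture path are placeholders.
    protein = Protein.from_file("tests/fixtures/example.pdb")

    mock_client = MagicMock()
    mock_client.data.search_proteins.return_value = {"data": [{"id": "prot-42"}]}

    # Avoid a real upload and a real connection.
    with (
        patch.object(Protein, "upload", return_value=None),
        patch.object(DeepOriginClient, "get", return_value=mock_client),
    ):
        protein.sync()

    mock_client.data.create_protein.assert_not_called()
    assert protein.id == "prot-42"
```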
```python
        if self.pdb_id is not None:
            kwargs["pdb_id"] = self.pdb_id

        kwargs["protein_length"] = self.length
```
The sync() method unconditionally calls self.length (line 1356), which in turn accesses self.sequence, which requires self.file_path to be non-None. However, the Protein class allows file_path to be Optional[Path]. If a Protein instance is created without a file_path, calling sync() will fail with an error when trying to compute the length. Consider adding a check to ensure file_path is not None before calling self.length, or handle the case where length cannot be computed.
| kwargs["protein_length"] = self.length | |
| # Only compute and include protein_length when a local file_path is available | |
| if getattr(self, "file_path", None) is not None: | |
| kwargs["protein_length"] = self.length |
```python
ligand = client.data.create_ligand(
    project_id="\\x0011223344556677",
    canonical_smiles="CCOc1ccc2nc(S(=O)(=O)N3CCN(CC3)C)c(N)c2c1",
    inchi_key="BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    inchi="InChI=1S/C20H24N4O4S/.../h1-4,6-9H,5,10-14H2,(H,22,23)",
    smiles="CCOc1ccc2nc(S(=O)(=O)N3CCN(CC3)C)c(N)c2c1",
    name="Compound-12345",
    formal_charge=0,
    hbond_donor_count=1,
    hbond_acceptor_count=6,
    rotatable_bond_count=5,
    tpsa=85.12,
    molecular_weight=447.5,
)
```
The documentation example shows parameters that are not accepted by the create_ligand method. The method does not accept 'canonical_smiles' or 'inchi_key' or 'inchi' as parameters - these are computed by the platform and returned in the response. Only 'smiles' should be passed as input. The example should be updated to only include valid parameters.
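A corrected sketch of the example, keeping only input parameters (project_id is carried over from the original example; canonical_smiles, inchi, and inchi_key are computed by the platform and returned in the response):

```python
ligand = client.data.create_ligand(
    project_id="\\x0011223344556677",
    smiles="CCOc1ccc2nc(S(=O)(=O)N3CCN(CC3)C)c(N)c2c1",
    name="Compound-12345",
    formal_charge=0,
    hbond_donor_count=1,
    hbond_acceptor_count=6,
    rotatable_bond_count=5,
    tpsa=85.12,
    molecular_weight=447.5,
)
```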
```python
    @property
    def length(self) -> int:
        """get the length of the protein structure"""
        return sum([len(seq) for seq in self.sequence])
```
The newly added length property for Protein is not covered by tests. Given that other properties in the same file have test coverage, this property should have tests to verify it correctly sums the lengths of all sequences in the protein structure.
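A minimal sketch of such a test; the constructor and fixture path are placeholders for whatever the existing protein tests use:

```python
def test_protein_length():
    """The length property should equal the summed chain lengths."""
    # Protein.from_file(...) and the fixture path are placeholders.
    protein = Protein.from_file("tests/fixtures/example.pdb")

    expected = sum(len(seq) for seq in protein.sequence)
    assert protein.length == expected
    assert protein.length > 0
```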
```python
            filter_dict = {"deleted": False}
        else:
            filter_dict = filter_dict.copy()
            filter_dict["deleted"] = False
```
The search() method always sets filter_dict["deleted"] = False, which prevents users from searching for deleted entities even if they explicitly pass "deleted": True in their filter. While this might be the intended behavior for most use cases, it removes flexibility and could be unexpected. Consider either documenting this behavior clearly or allowing users to explicitly search for deleted entities when needed.
Suggested change:

```diff
-            filter_dict = {"deleted": False}
-        else:
-            filter_dict = filter_dict.copy()
-            filter_dict["deleted"] = False
+            # By default, exclude deleted entities when no filter is provided.
+            filter_dict = {"deleted": False}
+        else:
+            # Work on a copy and only set a default for "deleted" if the caller
+            # did not explicitly provide it, allowing explicit searches for
+            # deleted entities when needed.
+            filter_dict = filter_dict.copy()
+            filter_dict.setdefault("deleted", False)
```
```python
    This class manages the remote path and provides an upload method to ensure that the entity's file is uploaded to the remote storage if it does not already exist there. It uses the DeepOrigin FilesClient for remote file operations.
    """

    id: str | None = field(default=None, kw_only=True)
```
All entities have an ID now; this is the ID in the data platform.
```python
    def sync(self, client: Optional[DeepOriginClient] = None) -> None:
        """Sync the ligand to the data platform.

        This method uploads the ligand file to remote storage (if available) and creates a ligand
        record in the data platform. If a ligand with the same canonical_smiles already exists,
```
syncing syncs with the data platform/ufa
Will not implement in this PR.