Split Vacancies and Foundation Defect Complexes Code #127

kavanase · 2025-06-09T19:48:39Z

See original commit names here: #119
Squashed and restarted after repo clean.

Mostly for tracking for now. Main TODO is to get automated split vacancy generation from ML-calculated database fully implemented. (General) complex defect symmetry and multiplicity analysis (& thus concentrations/thermodynamics) is now implemented, will also add automated complex defect structure recognition in parsing.

Key TODOs:

Split vacancy generation from database
(Efficient) complex defect structure recognition

coderabbitai · 2025-06-09T19:48:45Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This update introduces major new functionality for handling complex defect structures in crystalline materials. It adds modules for generating, classifying, and analyzing defect complexes, including split vacancies, and provides robust stenciling utilities for mapping relaxed defect structures into arbitrary supercells. Comprehensive documentation and extensive tests for these features are included.

Changes

File(s)	Change Summary
`doped/complexes.py`	New module for generating, classifying, and analyzing complex defects, including vacancy geometry classification, electrostatic energy calculation, symmetry-equivalent enumeration, and multiplicity determination.
`doped/utils/stenciling.py`	New module for reconstructing relaxed defect structures in arbitrary supercells with supporting utility functions for site matching, symmetry placement, and physical validation.
`doped/generation.py`	Added methods to `DefectsGenerator` for split vacancy identification and generation using electrostatics and geometry, with placeholder for database retrieval.
`doped/core.py`	Added helper function `_get_single_valence_oxi_states` to extract oxidation states from various input types.
`doped/utils/configurations.py`	Removed legacy alias `_scan_sm_stol_till_match`.
`docs/doped.complexes.rst`	New Sphinx documentation file for the `doped.complexes` module, including all members and inheritance relationships.
`docs/doped.rst`	Updated to include `doped.complexes` in the list of submodules.
`pyproject.toml`	Updated `shakenbreak` dependency from `>=3.4` to `>=3.4.2` and revised comment.
`tests/test_complexes.py`	New tests for complex defect generation, symmetry, multiplicity, and molecule equivalence, especially for split vacancies and Ga2O3 structures.
`tests/test_stenciling.py`	New tests for verifying defect stenciling into larger or differently oriented supercells, including structure matching and physical validation.
`tests/data/Ga2O3_R3c_POSCAR`	New POSCAR file providing Ga2O3 R3c structure for test input.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DefectsGenerator
    participant ComplexesModule
    participant StencilingUtils
    participant StructureMatcher

    User->>DefectsGenerator: get_split_vacancies()
    DefectsGenerator->>ComplexesModule: generate_complex_from_defect_sites(...)
    ComplexesModule->>ComplexesModule: classify_vacancy_geometry(...)
    ComplexesModule->>ComplexesModule: get_equivalent_complex_defect_sites_in_primitive(...)
    ComplexesModule->>ComplexesModule: get_complex_defect_multiplicity(...)
    DefectsGenerator->>User: Return split vacancy candidates

    User->>StencilingUtils: get_defect_in_supercell(defect_entry, target_supercell)
    StencilingUtils->>StencilingUtils: Place defect site via symmetry
    StencilingUtils->>StencilingUtils: Generate super-supercell
    StencilingUtils->>StencilingUtils: Stencil sites to target supercell
    StencilingUtils->>StructureMatcher: Match and validate structure
    StencilingUtils->>User: Return stenciled defect and bulk supercells

Poem

In crystal fields where defects dwell,
New tools emerge to probe and tell—
Of split vacancies, symmetry's song,
And stenciled sites where atoms belong.
With tests and docs, the code now leaps—
Through lattices, a rabbit peeps! 🐇✨

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch split_vacancies

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 16

🧹 Nitpick comments (10)

doped/core.py (1)

2616-2632: New helper is unreferenced—DRY opportunity
The logic for extracting single‐valence oxidation states is still duplicated in _from_pmg_defect. Replace those inline dict comprehensions with calls to _get_single_valence_oxi_states to avoid code duplication and improve maintainability.
tests/test_stenciling.py (3)
20-29: Consider using pathlib for more robust file operations.

The current implementation works but could be simplified and made more robust using pathlib.
-def if_present_rm(path):
-    """
-    Remove the file/folder if it exists.
-    """
-    if os.path.exists(path):
-        if os.path.isfile(path):
-            os.remove(path)
-        elif os.path.isdir(path):
-            shutil.rmtree(path)
+from pathlib import Path
+
+def if_present_rm(path):
+    """
+    Remove the file/folder if it exists.
+    """
+    p = Path(path)
+    if p.exists():
+        if p.is_file():
+            p.unlink()
+        elif p.is_dir():
+            shutil.rmtree(p)
61-66: Track the TODO items for comprehensive test coverage.

The TODO comments indicate planned additions for extrinsic defects and split vacancies testing, which would enhance test coverage of the new complex defect functionality.

Would you like me to create GitHub issues to track these TODO items for future implementation?

119-127: Consider extracting file I/O testing to a separate utility.

The pattern of writing to file, reading back, and comparing could be extracted to a reusable test utility method to reduce duplication across tests.
def assert_structure_roundtrip(self, structure, filepath, reference):
    """Test that structure survives POSCAR write/read cycle."""
    structure.to(fmt="vasp", filename=filepath)
    reloaded = Structure.from_file(filepath)
    assert reloaded == reference
    if_present_rm(filepath)
tests/test_complexes.py (2)

58-60: Address the TODO comment for incomplete test setup.

This commented-out code appears to be setting up a specific split vacancy configuration but is incomplete. Consider either completing the implementation or removing it if no longer needed.

Would you like me to help complete this test setup or open an issue to track this task?

153-160: Complete or remove the commented test method.

The test_Ga2O3_generate_complex_from_defect_sites method is commented out with a TODO marker. This suggests incomplete test coverage for the generate_complex_from_defect_sites function.

Would you like me to help implement this test method or create an issue to track its completion?

doped/complexes.py (2)

164-164: Avoid accessing private attributes directly.

Using structure._charge directly accesses a private attribute, which violates encapsulation principles and could break if the internal implementation changes.

Consider using a public API method or property if available, or working with the pymatgen team to expose this functionality properly.

688-692: Track TODO items in issue tracker.

There are several TODO comments that should be properly tracked:

Future complex generation functionality (line 688-689)

DOI updates when published (line 690)

Missing tests (line 691)

Would you like me to create GitHub issues to track these TODO items?

doped/utils/stenciling.py (2)

154-166: Address TODO comments for production readiness.

Multiple TODO items indicate this code may not be fully production-ready:

Missing tests (line 155-156)

Unimplemented symmetry analysis feature (lines 157-160)

Missing documentation (lines 161-165)

Would you like me to help create unit tests or open issues to track these items?

853-878: Consider extracting nested function for clarity.

The nested _check_other_sites function could be moved to module level as a private function to improve readability and testability.

This would make the main function logic clearer and allow the helper to be tested independently if needed.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad0eee0 and 7e75028.

⛔ Files ignored due to path filters (7)

tests/data/Split_Vacancies/Ba3Al2(SiO4)3-mp-14049.json.gz is excluded by !**/*.gz
tests/data/Split_Vacancies/CsCd(PO3)3-mp-13738.json.gz is excluded by !**/*.gz
tests/data/Split_Vacancies/Ga2O3-mp-1243.json.gz is excluded by !**/*.gz
tests/data/Split_Vacancies/Ga2O3-mp-886.json.gz is excluded by !**/*.gz
tests/data/Split_Vacancies/GeI2-mp-1540935.json.gz is excluded by !**/*.gz
tests/data/Split_Vacancies/Li3Ti4O8-mp-752879.json.gz is excluded by !**/*.gz
tests/data/Split_Vacancies/Na2Mg(NiO2)2-mp-2217469.json.gz is excluded by !**/*.gz

📒 Files selected for processing (11)

docs/doped.complexes.rst (1 hunks)
docs/doped.rst (1 hunks)
doped/complexes.py (1 hunks)
doped/core.py (1 hunks)
doped/generation.py (3 hunks)
doped/utils/configurations.py (0 hunks)
doped/utils/stenciling.py (1 hunks)
pyproject.toml (1 hunks)
tests/data/Ga2O3_R3c_POSCAR (1 hunks)
tests/test_complexes.py (1 hunks)
tests/test_stenciling.py (1 hunks)

💤 Files with no reviewable changes (1)

doped/utils/configurations.py

🧰 Additional context used

🪛 Pylint (3.3.7)

tests/test_complexes.py

[refactor] 281-281: Too many local variables (24/15)

(R0914)

[refactor] 403-403: Too many local variables (28/15)

(R0914)

[refactor] 542-542: Too many local variables (31/15)

(R0914)

tests/test_stenciling.py

[refactor] 44-44: Too many instance attributes (10/7)

(R0902)

doped/generation.py

[refactor] 2465-2465: Too many local variables (16/15)

(R0914)

[refactor] 2539-2585: Too many nested blocks (7/5)

(R1702)

doped/complexes.py

[refactor] 238-238: Too many arguments (9/5)

(R0913)

[refactor] 238-238: Too many positional arguments (9/5)

(R0917)

[refactor] 238-238: Too many local variables (36/15)

(R0914)

[error] 366-373: Unexpected keyword argument 'symm_ops' in function call

(E1123)

[error] 366-373: Unexpected keyword argument 'dist_tol' in function call

(E1123)

[refactor] 567-567: Too many arguments (8/5)

(R0913)

[refactor] 567-567: Too many positional arguments (8/5)

(R0917)

doped/utils/stenciling.py

[refactor] 44-44: Too many arguments (6/5)

(R0913)

[refactor] 44-44: Too many positional arguments (6/5)

(R0917)

[refactor] 44-44: Too many local variables (36/15)

(R0914)

[error] 48-48: Value 'np.ndarray' is unsubscriptable

(E1136)

[refactor] 44-44: Too many statements (56/50)

(R0915)

[error] 348-348: Value 'np.ndarray' is unsubscriptable

(E1136)

[refactor] 385-385: Too many arguments (6/5)

(R0913)

[refactor] 385-385: Too many positional arguments (6/5)

(R0917)

[refactor] 385-385: Too many local variables (32/15)

(R0914)

[refactor] 385-385: Too many branches (14/12)

(R0912)

[refactor] 385-385: Too many statements (54/50)

(R0915)

[error] 576-576: Value 'np.ndarray' is unsubscriptable

(E1136)

[refactor] 913-913: Too many local variables (23/15)

(R0914)

[error] 1025-1025: Value 'np.ndarray' is unsubscriptable

(E1136)

[error] 1097-1097: Value 'np.ndarray' is unsubscriptable

(E1136)

🪛 Ruff (0.11.9)

tests/test_stenciling.py

67-67: Function name test_Se_20_Å_supercell contains a non-ASCII character

(PLC2401)

⏰ Context from checks skipped due to timeout of 90000ms (20)

GitHub Check: test (ubuntu-latest, 3.13, 10)
GitHub Check: test (ubuntu-latest, 3.13, 7)
GitHub Check: test (ubuntu-latest, 3.13, 4)
GitHub Check: test (ubuntu-latest, 3.13, 6)
GitHub Check: test (ubuntu-latest, 3.13, 9)
GitHub Check: test (ubuntu-latest, 3.13, 3)
GitHub Check: test (ubuntu-latest, 3.10, 9)
GitHub Check: test (ubuntu-latest, 3.13, 8)
GitHub Check: test (ubuntu-latest, 3.13, 1)
GitHub Check: test (ubuntu-latest, 3.13, 5)
GitHub Check: test (ubuntu-latest, 3.10, 6)
GitHub Check: test (ubuntu-latest, 3.10, 7)
GitHub Check: test (ubuntu-latest, 3.10, 2)
GitHub Check: test (ubuntu-latest, 3.13, 2)
GitHub Check: test (ubuntu-latest, 3.10, 4)
GitHub Check: test (ubuntu-latest, 3.10, 8)
GitHub Check: test (ubuntu-latest, 3.10, 10)
GitHub Check: test (ubuntu-latest, 3.10, 3)
GitHub Check: test (ubuntu-latest, 3.10, 1)
GitHub Check: test (ubuntu-latest, 3.10, 5)

🔇 Additional comments (11)

docs/doped.complexes.rst (1)

1-7: Documentation addition looks good
The Sphinx automodule directive correctly exposes doped.complexes with members, undocumented items, and inheritance. No issues found.

pyproject.toml (1)

26-26: LGTM!

The shakenbreak version update to >=3.4.2 is appropriate for the StructureMatcher efficiency improvements needed by the new complex defect functionality.

docs/doped.rst (1)

20-20: Documentation correctly updated.

The new doped.complexes submodule is properly added to the documentation structure in alphabetical order.

tests/data/Ga2O3_R3c_POSCAR (1)

1-39: Valid POSCAR test data file.

The Ga2O3 R3c structure file is properly formatted with correct atom counts, fractional coordinates, and helpful oxidation state annotations for defect complex testing.

tests/test_stenciling.py (1)

93-110: Well-structured test with proper assertions and cleanup.

The test implementation includes:

Proper warning handling and verification

Structure matching with tight tolerances

Minimum distance validation

File cleanup after assertions

Good test practices demonstrated here.

tests/test_complexes.py (5)

1-10: LGTM!

The module docstring and imports are well-organized and appropriate for a test module.

40-49: LGTM!

The helper function correctly handles file and directory removal with appropriate existence checks.

62-116: Well-designed test for periodicity-breaking edge case.

This test effectively validates that the defect multiplicities are calculated correctly even with periodicity-breaking supercells, and demonstrates where the standard pymatgen approach fails. Good test coverage!

162-252: Comprehensive molecule matching tests.

Excellent test coverage for molecule matching functionality, including translation, rotation, and permutation invariance. The tests properly validate both the custom are_equivalent_molecules function and compare it with pymatgen's KabschMatcher behavior.

254-269: LGTM!

The helper functions are well-implemented. The rotation matrix calculation is mathematically correct.
doped/utils/stenciling.py (1)

1016-1018:
Details

❓ Verification inconclusive

Resolve TODO about removed check.

The TODO comment indicates uncertainty about why a check was removed in commit 40f129f. This should be investigated and resolved.

🏁 Script executed:
#!/bin/bash
# Check git history for context on why the check was removed
git log -p --grep="40f129f" || git show 40f129f
Length of output: 0

I’ll dig deeper into the Git history to find the context for when and why _check_min_dist was removed.

🏁 Script executed:
#!/bin/bash
# Verify if commit 40f129f exists and show its details; otherwise, show recent commits touching the file
if git rev-parse --verify 40f129f >/dev/null 2>&1; then
  git show --stat 40f129f
else
  echo "⚠️ Commit 40f129f not found in history"
  echo
  echo "Listing the most recent commits affecting doped/utils/stenciling.py:"
  git log -n 20 --pretty=format:"%h  %ci  %s" -- doped/utils/stenciling.py
fi
Length of output: 274

I’ll locate the exact file path and inspect its recent Git history for any related removals of _check_min_dist.

🏁 Script executed:
#!/bin/bash
# Find the file path for stenciling.py and show its recent commit history
FILE_PATH=$(fd stenciling.py | head -n1)
if [ -n "$FILE_PATH" ]; then
  echo "Found file at: $FILE_PATH"
  echo "Showing last 20 commits touching $FILE_PATH:"
  git log -n 20 --pretty=format:"%h  %ci  %s" -- "$FILE_PATH"
else
  echo "❌ stenciling.py not found in repo"
fi
Length of output: 328
Investigate rationale behind _check_min_dist call
We couldn’t locate commit 40f129f in the current history, so it’s unclear why this check was removed. Please confirm whether _check_min_dist should be kept or can safely be dropped, then update or remove this TODO accordingly.

• File: doped/utils/stenciling.py
Lines: 1016–1018
    # TODO: This check was removed before (in `40f129f`), not sure if for a proper reason -- check and
    #  remove this note
    _check_min_dist(Structure.from_sites(bulk_outer_cell_sites), min_dist_tol, warning=False)

coderabbitai · 2025-06-09T19:58:10Z

doped/core.py

+def _get_single_valence_oxi_states(bulk_oxi_states: Structure | Composition | dict = False) -> dict:
+    if isinstance(bulk_oxi_states, Structure):
+        single_valence_oxi_states = {
+            el.symbol: el.oxi_state for el in bulk_oxi_states.composition.elements
+        }
+    elif isinstance(bulk_oxi_states, Composition):
+        single_valence_oxi_states = {el.symbol: el.oxi_state for el in bulk_oxi_states.elements}
+    elif isinstance(bulk_oxi_states, dict):
+        single_valence_oxi_states = bulk_oxi_states
+    else:
+        raise TypeError(
+            f"Input bulk_oxi_states must be a pymatgen Structure, Composition or dict, not "
+            f"{type(bulk_oxi_states)}."
+        )
+
+    return single_valence_oxi_states
+


🛠️ Refactor suggestion

Type hint should match default False
The signature declares bulk_oxi_states: Structure | Composition | dict but defaults it to False (a bool). This mismatch will always trigger the TypeError arm for the default. Either include bool in the annotation or switch the default to None (and handle None explicitly) to accurately reflect allowed input types.

🤖 Prompt for AI Agents

In doped/core.py around lines 2616 to 2632, the function _get_single_valence_oxi_states has a type hint indicating bulk_oxi_states can be Structure, Composition, or dict, but the default value is False (a bool), causing a type mismatch and always triggering the TypeError. To fix this, change the default value to None and update the type hint to include Optional for these types. Also, add explicit handling for None in the function to avoid the TypeError for the default case.

coderabbitai · 2025-06-09T19:58:10Z

tests/test_stenciling.py

+        #  consistency with other parts of the code
+        # TODO: Useful test case could be trying to stencil split vacancies in new supercells...
+
+    def test_Se_20_Å_supercell(self):


⚠️ Potential issue

Remove non-ASCII character from function name.

The function name contains the Ångström symbol (Å) which can cause issues in some environments. Consider using ASCII-only names.

-def test_Se_20_Å_supercell(self): +def test_Se_20_A_supercell(self):

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def test_Se_20_Å_supercell(self):

- def test_Se_20_Å_supercell(self):

+ def test_Se_20_A_supercell(self):

🧰 Tools

🪛 Ruff (0.11.9)

67-67: Function name test_Se_20_Å_supercell contains a non-ASCII character

(PLC2401)

🤖 Prompt for AI Agents

In tests/test_stenciling.py at line 67, the test function name contains a non-ASCII character (Å). Rename the function to use only ASCII characters, for example by replacing the Å symbol with 'A' or 'Angstrom' to ensure compatibility across all environments.

coderabbitai · 2025-06-09T19:58:10Z

tests/test_complexes.py

+    def test_symmetry_multiplicity_C2m_Ga2O3(self):
+        """
+        Test symmetry and multiplicity analysis for split vacancies in C2/m
+        Ga2O3.
+
+        Reference values have been manually checked and stored in the info
+        dicts.
+
+        For C2/m Ga2O3, we have no periodicity breaking. Ga site symmetries are
+        all Cs, while we have Cs, C2h, C2 and C1 interstitial site symmetries.
+        """
+        info_dict = self.C2m_Ga2O3_split_vac_info_dict
+        bulk_sga = get_sga(info_dict["bulk_supercell"])
+        bulk_supercell_symm_ops = bulk_sga.get_symmetry_operations()
+        bulk_prim = get_primitive_structure(info_dict["bulk_supercell"])
+        supercell_over_prim_factor = len(info_dict["bulk_supercell"]) / len(bulk_prim)
+        molecule_dict = {}
+
+        ml_calculated_split_vac_dicts = [
+            (energy, subdict)
+            for cation_dict in info_dict.values()
+            if isinstance(cation_dict, dict) and "split_vacancies_energy_dict" in cation_dict
+            for energy, subdict in cation_dict["split_vacancies_energy_dict"].items()
+            if self.energy_key in subdict
+        ]
+
+        for energy, subdict in ml_calculated_split_vac_dicts:
+            orig_split_vacancy = generate_complex_from_defect_sites(
+                info_dict["bulk_supercell"],
+                vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                interstitial_sites=[subdict["interstitial_site"]],
+            )
+            orig_split_vacancy_symm = point_symmetry_from_structure(
+                orig_split_vacancy, info_dict["bulk_supercell"], relaxed=True, verbose=False
+            )
+            assert orig_split_vacancy_symm == subdict["unrelaxed_split_vac_symm"]
+
+            for v_idx in [1, 2]:
+                vac_symm = point_symmetry_from_site(
+                    subdict[f"vac_defect_{v_idx}_site"], info_dict["bulk_supercell"]
+                )
+                assert vac_symm == "Cs"  # only Cs Ga sites in C2/m Ga2O3
+                assert vac_symm == subdict[f"vac_{v_idx}_symm"]
+
+            int_symm = point_symmetry_from_site(subdict["interstitial_site"], info_dict["bulk_supercell"])
+            assert int_symm in [
+                "Cs",
+                "C2h",
+                "C2",
+                "C1",
+            ]  # unrelaxed interstitial sites in C2/m Ga2O3
+            assert int_symm == subdict["int_symm"]
+            relaxed_symm = point_symmetry_from_structure(
+                Structure.from_ase_atoms(subdict[self.energy_key]["struct_atoms"]),
+                info_dict["bulk_supercell"],
+                verbose=False,
+            )
+            assert relaxed_symm == subdict["relaxed_split_vac_symm"]
+
+            comp_mult = get_complex_defect_multiplicity(
+                bulk_supercell=info_dict["bulk_supercell"],
+                vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                interstitial_sites=[subdict["interstitial_site"]],
+            )
+            assert comp_mult == subdict["multiplicity"]
+            comp_mult = get_complex_defect_multiplicity(
+                bulk_supercell=info_dict["bulk_supercell"],
+                vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                interstitial_sites=[subdict["interstitial_site"]],
+                primitive_structure=bulk_prim,
+                supercell_symm_ops=bulk_supercell_symm_ops,
+            )  # same answer with efficiency options
+            assert comp_mult == subdict["multiplicity"]
+            supercell_comp_mult = get_complex_defect_multiplicity(
+                bulk_supercell=info_dict["bulk_supercell"],
+                vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                interstitial_sites=[subdict["interstitial_site"]],
+                primitive_structure=bulk_prim,
+                supercell_symm_ops=bulk_supercell_symm_ops,
+                primitive_cell_multiplicity=False,
+            )
+            assert supercell_comp_mult == subdict["multiplicity"] * supercell_over_prim_factor
+
+            int_obj = Interstitial(info_dict["bulk_supercell"], subdict["interstitial_site"])
+            v1_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_1_site"])
+            v2_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_2_site"])
+
+            assert int_obj.multiplicity == subdict["int_multiplicity"] * supercell_over_prim_factor
+            assert v1_obj.multiplicity == subdict["vac_1_multiplicity"] * supercell_over_prim_factor
+            assert v2_obj.multiplicity == subdict["vac_2_multiplicity"] * supercell_over_prim_factor
+
+            point_symm_mult_dict = {
+                "Cs": 2,
+                "C2h": 1,
+                "C2": 2,
+                "C1": 4,
+            }
+            for v_idx, v_obj in enumerate([v1_obj, v2_obj]):
+                assert (
+                    v_obj.multiplicity
+                    == point_symm_mult_dict[subdict[f"vac_{v_idx+1}_symm"]] * supercell_over_prim_factor
+                )
+            assert (
+                int_obj.multiplicity
+                == point_symm_mult_dict[subdict["int_symm"]] * supercell_over_prim_factor
+            )
+
+            # this isn't universally true for complex defects, but is for the split vacancies:
+            assert (
+                comp_mult
+                <= (int_obj.multiplicity * v1_obj.multiplicity * v2_obj.multiplicity)
+                / supercell_over_prim_factor**3
+            )
+
+            # in C2/m Ga2O3, each split vacancy has the same multiplicity as the interstitial site
+            assert comp_mult == int_obj.multiplicity / supercell_over_prim_factor
+
+            raw_complex_defect_sites = [
+                subdict["vac_defect_1_site"],
+                subdict["vac_defect_2_site"],
+                subdict["interstitial_site"],
+            ]
+            molecule_dict[round(float(energy), 2)] = Molecule(
+                [site.species for site in raw_complex_defect_sites],
+                info_dict["bulk_supercell"].lattice.get_cartesian_coords(
+                    [
+                        site.frac_coords + raw_complex_defect_sites[0].distance_and_image(site)[1]
+                        for site in raw_complex_defect_sites
+                    ]
+                ),
+            )
+
+        # Here the `relaxed` -3538.32 structure goes to the -3540.98 structure, so can use this in future
+        # tests of relaxed complex defect site determinations! (TODO)
+        for key, mol in molecule_dict.items():
+            for other_key, other_mol in molecule_dict.items():
+                if key != other_key:
+                    assert not are_equivalent_molecules(mol, other_mol)
+


🛠️ Refactor suggestion

Consider breaking down this complex test method.

With 28 local variables, this method is difficult to follow. Similar refactoring as suggested above would improve maintainability.

🧰 Tools

🪛 Pylint (3.3.7)

[refactor] 403-403: Too many local variables (28/15)

(R0914)

🤖 Prompt for AI Agents

In tests/test_complexes.py from lines 403 to 541, the test_symmetry_multiplicity_C2m_Ga2O3 method is overly complex with 28 local variables, making it hard to follow. Refactor by splitting this large test into smaller focused helper methods or sub-tests, each handling a specific part of the logic such as symmetry checks, multiplicity assertions, and molecule equivalence tests. This modular approach will improve readability and maintainability without changing the test logic.

coderabbitai · 2025-06-09T19:58:10Z

tests/test_complexes.py

+    def test_symmetry_multiplicity_R3c_Ga2O3(self):
+        """
+        Test symmetry and multiplicity analysis for split vacancies in R-3c
+        Ga2O3.
+
+        Reference values have been manually checked and stored in the info
+        dicts.
+
+        For R-3c Ga2O3, we have periodicity breaking in the generated supercell
+        which breaks the rotational (but not inversion) symmetries, so C3i ->
+        Ci, C2 -> C1. Useful test case for future stenciling to avoid
+        periodicity breaking.
+        """
+        info_dict = self.R3c_Ga2O3_split_vac_info_dict
+        bulk_sga = get_sga(info_dict["bulk_supercell"])
+        bulk_supercell_symm_ops = bulk_sga.get_symmetry_operations()
+        bulk_prim = get_primitive_structure(info_dict["bulk_supercell"])
+        supercell_over_prim_factor = len(info_dict["bulk_supercell"]) / len(bulk_prim)
+        molecule_dict = {}
+
+        ml_calculated_split_vac_dicts = [
+            (energy, subdict)
+            for cation_dict in info_dict.values()
+            if isinstance(cation_dict, dict) and "split_vacancies_energy_dict" in cation_dict
+            for energy, subdict in cation_dict["split_vacancies_energy_dict"].items()
+            if self.energy_key in subdict
+        ]
+
+        for energy, subdict in ml_calculated_split_vac_dicts:
+            orig_split_vacancy = generate_complex_from_defect_sites(
+                info_dict["bulk_supercell"],
+                vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                interstitial_sites=[subdict["interstitial_site"]],
+            )
+            orig_split_vacancy_symm = point_symmetry_from_structure(
+                orig_split_vacancy, info_dict["bulk_supercell"], relaxed=True, verbose=False
+            )
+            assert orig_split_vacancy_symm == subdict["unrelaxed_split_vac_symm"]
+
+            for v_idx in [1, 2]:
+                vac_symm = point_symmetry_from_site(
+                    subdict[f"vac_defect_{v_idx}_site"], info_dict["bulk_supercell"]
+                )
+                assert vac_symm == "C3"  # only C3 Ga sites in R-3c Ga2O3
+                assert vac_symm == subdict[f"vac_{v_idx}_symm"]
+
+            int_symm = point_symmetry_from_site(subdict["interstitial_site"], info_dict["bulk_supercell"])
+            assert int_symm in [
+                "C2",
+                "C3i",
+                "D3",
+            ]  # unrelaxed interstitial sites in R-3c Ga2O3
+            assert int_symm == subdict["int_symm"]
+            relaxed_symm = point_symmetry_from_structure(
+                Structure.from_ase_atoms(subdict[self.energy_key]["struct_atoms"]),
+                info_dict["bulk_supercell"],
+                verbose=False,
+            )
+            assert relaxed_symm == subdict["relaxed_split_vac_symm"]
+
+            comp_mult = get_complex_defect_multiplicity(
+                bulk_supercell=info_dict["bulk_supercell"],
+                vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                interstitial_sites=[subdict["interstitial_site"]],
+            )
+            assert comp_mult == subdict["multiplicity"]
+            comp_mult = get_complex_defect_multiplicity(
+                bulk_supercell=info_dict["bulk_supercell"],
+                vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                interstitial_sites=[subdict["interstitial_site"]],
+                primitive_structure=bulk_prim,
+                supercell_symm_ops=bulk_supercell_symm_ops,
+            )  # same answer with efficiency options
+            assert comp_mult == subdict["multiplicity"]
+            supercell_comp_mult = get_complex_defect_multiplicity(
+                bulk_supercell=info_dict["bulk_supercell"],
+                vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                interstitial_sites=[subdict["interstitial_site"]],
+                primitive_structure=bulk_prim,
+                supercell_symm_ops=bulk_supercell_symm_ops,
+                primitive_cell_multiplicity=False,
+            )
+            assert supercell_comp_mult == subdict["multiplicity"] * supercell_over_prim_factor
+
+            int_obj = Interstitial(info_dict["bulk_supercell"], subdict["interstitial_site"])
+            v1_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_1_site"])
+            v2_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_2_site"])
+
+            assert int_obj.multiplicity == subdict["int_multiplicity"] * supercell_over_prim_factor
+            assert v1_obj.multiplicity == subdict["vac_1_multiplicity"] * supercell_over_prim_factor
+            assert v2_obj.multiplicity == subdict["vac_2_multiplicity"] * supercell_over_prim_factor
+
+            # this isn't universally true for complex defects, but is for the split vacancies:
+            assert (
+                comp_mult
+                <= (int_obj.multiplicity * v1_obj.multiplicity * v2_obj.multiplicity)
+                / supercell_over_prim_factor**3
+            )
+
+            raw_complex_defect_sites = [
+                subdict["vac_defect_1_site"],
+                subdict["vac_defect_2_site"],
+                subdict["interstitial_site"],
+            ]
+            molecule_dict[round(float(energy), 2)] = Molecule(
+                [site.species for site in raw_complex_defect_sites],
+                info_dict["bulk_supercell"].lattice.get_cartesian_coords(
+                    [
+                        site.frac_coords + raw_complex_defect_sites[0].distance_and_image(site)[1]
+                        for site in raw_complex_defect_sites
+                    ]
+                ),
+            )
+
+        # in this case, two of the molecules are actually equivalent, with slightly different ES energies
+        # because of the periodicity breaking in the generated supercell:
+        for key, mol in molecule_dict.items():
+            if key in [-4245.99, -4246.28]:
+                assert are_equivalent_molecules(molecule_dict[-4245.99], mol)
+            else:
+                assert not are_equivalent_molecules(molecule_dict[-4245.99], mol)
+


🛠️ Refactor suggestion

Refactor to reduce method complexity.

This test method has 24 local variables (pylint warns about >15). Consider extracting helper methods to improve readability and maintainability:

Extract molecule creation logic into a helper method

Extract symmetry validation logic into a separate method

Consider using a dataclass or named tuple for the subdict data structure

Example refactoring approach:

def _validate_site_symmetries(self, subdict, bulk_supercell): """Validate vacancy and interstitial site symmetries.""" for v_idx in [1, 2]: vac_symm = point_symmetry_from_site( subdict[f"vac_defect_{v_idx}_site"], bulk_supercell ) assert vac_symm == "C3" # only C3 Ga sites in R-3c Ga2O3 assert vac_symm == subdict[f"vac_{v_idx}_symm"] int_symm = point_symmetry_from_site(subdict["interstitial_site"], bulk_supercell) assert int_symm in ["C2", "C3i", "D3"] assert int_symm == subdict["int_symm"] return int_symm def _create_molecule_from_sites(self, raw_sites, lattice): """Create molecule from defect sites.""" return Molecule( [site.species for site in raw_sites], lattice.get_cartesian_coords([ site.frac_coords + raw_sites[0].distance_and_image(site)[1] for site in raw_sites ]) )

🧰 Tools

🪛 Pylint (3.3.7)

[refactor] 281-281: Too many local variables (24/15)

(R0914)

🤖 Prompt for AI Agents

In tests/test_complexes.py from lines 281 to 402, the test_symmetry_multiplicity_R3c_Ga2O3 method is overly complex with 24 local variables, making it hard to read and maintain. Refactor by extracting the molecule creation logic into a helper method that takes defect sites and lattice to return a Molecule instance, and extract the symmetry validation logic into a separate helper method that asserts vacancy and interstitial site symmetries. Additionally, consider using a dataclass or named tuple to represent the subdict data structure for clarity and easier access to its fields.

coderabbitai · 2025-06-09T19:58:10Z

tests/test_complexes.py

+    def test_symmetry_multiplicity_tricky_structures(self):
+        """
+        Test symmetry and multiplicity analysis for split vacancies in some
+        complex structures.
+
+        Some of these had issues with initial symmetry/multiplicity analysis
+        prior to robustness updates. Here we test that the analysis proceeds
+        fine and matches expected values.
+        """
+        database_folder = f"{data_dir}/Split_Vacancies"
+        formula_mpid_files = [i for i in os.listdir(database_folder) if "Ga2O3" not in i]
+        for formula_mpid_file in formula_mpid_files:
+            info_dict = loadfn(f"{database_folder}/{formula_mpid_file}")
+            bulk_sga = get_sga(info_dict["bulk_supercell"])
+            bulk_supercell_symm_ops = bulk_sga.get_symmetry_operations()
+            bulk_prim = get_primitive_structure(info_dict["bulk_supercell"])
+            supercell_over_prim_factor = len(info_dict["bulk_supercell"]) / len(bulk_prim)
+            molecule_dict = {}
+
+            ml_calculated_split_vac_dicts = [
+                (energy, subdict)
+                for cation_dict in info_dict.values()
+                if isinstance(cation_dict, dict) and "split_vacancies_energy_dict" in cation_dict
+                for energy, subdict in cation_dict["split_vacancies_energy_dict"].items()
+                if self.energy_key in subdict
+            ]
+            if "GeI2" not in formula_mpid_file and "Li3Ti4O8" not in formula_mpid_file:
+                # take a random subset of max 10:
+                ml_calculated_split_vac_dicts = random.sample(
+                    ml_calculated_split_vac_dicts, min(10, len(ml_calculated_split_vac_dicts))
+                )
+            for energy, subdict in ml_calculated_split_vac_dicts:
+                orig_split_vacancy = generate_complex_from_defect_sites(
+                    info_dict["bulk_supercell"],
+                    vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                    interstitial_sites=[subdict["interstitial_site"]],
+                )
+                orig_split_vacancy_symm = point_symmetry_from_structure(
+                    orig_split_vacancy, info_dict["bulk_supercell"], relaxed=True, verbose=False
+                )
+                assert orig_split_vacancy_symm == subdict["unrelaxed_split_vac_symm"]
+
+                for v_idx in [1, 2]:
+                    vac_symm = point_symmetry_from_site(
+                        subdict[f"vac_defect_{v_idx}_site"], info_dict["bulk_supercell"]
+                    )
+                    assert vac_symm == subdict[f"vac_{v_idx}_symm"]
+
+                int_symm = point_symmetry_from_site(
+                    subdict["interstitial_site"], info_dict["bulk_supercell"]
+                )
+                assert int_symm == subdict["int_symm"]
+                relaxed_symm = point_symmetry_from_structure(
+                    Structure.from_ase_atoms(subdict[self.energy_key]["struct_atoms"]),
+                    info_dict["bulk_supercell"],
+                    verbose=False,
+                )
+                assert relaxed_symm == subdict["relaxed_split_vac_symm"]
+
+                comp_mult = get_complex_defect_multiplicity(
+                    bulk_supercell=info_dict["bulk_supercell"],
+                    vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                    interstitial_sites=[subdict["interstitial_site"]],
+                )  # this breaks if we get a multiplicity less than expected minimum
+                assert comp_mult == subdict["multiplicity"]
+                comp_mult = get_complex_defect_multiplicity(
+                    bulk_supercell=info_dict["bulk_supercell"],
+                    vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                    interstitial_sites=[subdict["interstitial_site"]],
+                    primitive_structure=bulk_prim,
+                    supercell_symm_ops=bulk_supercell_symm_ops,
+                )  # same answer with efficiency options
+                assert comp_mult == subdict["multiplicity"]
+                supercell_comp_mult = get_complex_defect_multiplicity(
+                    bulk_supercell=info_dict["bulk_supercell"],
+                    vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
+                    interstitial_sites=[subdict["interstitial_site"]],
+                    primitive_structure=bulk_prim,
+                    supercell_symm_ops=bulk_supercell_symm_ops,
+                    primitive_cell_multiplicity=False,
+                )
+                assert supercell_comp_mult == subdict["multiplicity"] * supercell_over_prim_factor
+
+                int_obj = Interstitial(info_dict["bulk_supercell"], subdict["interstitial_site"])
+                v1_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_1_site"])
+                v2_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_2_site"])
+
+                assert int_obj.multiplicity == subdict["int_multiplicity"] * supercell_over_prim_factor
+                assert v1_obj.multiplicity == subdict["vac_1_multiplicity"] * supercell_over_prim_factor
+                assert v2_obj.multiplicity == subdict["vac_2_multiplicity"] * supercell_over_prim_factor
+
+                for v_idx, v_obj in enumerate([v1_obj, v2_obj]):
+                    assert (
+                        v_obj.multiplicity
+                        == subdict[f"vac_{v_idx+1}_multiplicity"] * supercell_over_prim_factor
+                    )
+                assert int_obj.multiplicity == subdict["int_multiplicity"] * supercell_over_prim_factor
+
+                # this isn't universally true for complex defects, but is for the split vacancies:
+                assert (
+                    comp_mult
+                    <= (int_obj.multiplicity * v1_obj.multiplicity * v2_obj.multiplicity)
+                    / supercell_over_prim_factor**3
+                )
+                assert comp_mult >= int_obj.multiplicity / supercell_over_prim_factor
+
+                raw_complex_defect_sites = [
+                    subdict["vac_defect_1_site"],
+                    subdict["vac_defect_2_site"],
+                    subdict["interstitial_site"],
+                ]
+                molecule_dict[round(float(energy), 2)] = Molecule(
+                    [site.species for site in raw_complex_defect_sites],
+                    info_dict["bulk_supercell"].lattice.get_cartesian_coords(
+                        [
+                            site.frac_coords + raw_complex_defect_sites[0].distance_and_image(site)[1]
+                            for site in raw_complex_defect_sites
+                        ]
+                    ),
+                )
+
+            eq_mol_count = 0
+            for key, mol in molecule_dict.items():
+                for other_key, other_mol in molecule_dict.items():
+                    if (  # same cations, different molecules
+                        sorted(mol.atomic_numbers) == sorted(other_mol.atomic_numbers) and key != other_key
+                    ) and are_equivalent_molecules(mol, other_mol):
+                        eq_mol_count += 1
+
+            if "GeI2" not in formula_mpid_file and "Li3Ti4O8" not in formula_mpid_file:
+                assert eq_mol_count == 0
+            elif "GeI2" in formula_mpid_file:
+                assert eq_mol_count == 8
+            elif "Li3Ti4O8" in formula_mpid_file:
+                assert eq_mol_count == 80
+            # Note that a lot of these have matching equivalent relaxed complexes, so would make good test


🛠️ Refactor suggestion

Extract common test logic to reduce complexity.

This method has 31 local variables and significant duplication with the previous test methods. Consider creating a base test method that handles the common validation logic.

Consider creating a shared validation method:

def _validate_split_vacancy_data(self, info_dict, energy_key, expected_eq_mol_count=0): """Common validation logic for split vacancy data.""" bulk_sga = get_sga(info_dict["bulk_supercell"]) bulk_supercell_symm_ops = bulk_sga.get_symmetry_operations() bulk_prim = get_primitive_structure(info_dict["bulk_supercell"]) # Extract and validate split vacancy data ml_calculated_split_vac_dicts = self._extract_ml_split_vac_dicts(info_dict, energy_key) molecule_dict = {} for energy, subdict in ml_calculated_split_vac_dicts: self._validate_single_split_vacancy(subdict, info_dict, bulk_prim, bulk_supercell_symm_ops) molecule_dict[round(float(energy), 2)] = self._create_molecule_from_sites( [subdict["vac_defect_1_site"], subdict["vac_defect_2_site"], subdict["interstitial_site"]], info_dict["bulk_supercell"].lattice ) # Validate molecule equivalences eq_mol_count = self._count_equivalent_molecules(molecule_dict) assert eq_mol_count == expected_eq_mol_count

🧰 Tools

🪛 Pylint (3.3.7)

[refactor] 542-542: Too many local variables (31/15)

(R0914)

🤖 Prompt for AI Agents

In tests/test_complexes.py lines 542 to 677, the test_symmetry_multiplicity_tricky_structures method is overly complex with many local variables and duplicated logic from other tests. Refactor by extracting the common validation logic into a separate helper method, for example _validate_split_vacancy_data, which takes info_dict, energy_key, and expected_eq_mol_count as parameters. Move the symmetry, multiplicity, and molecule equivalence checks into this helper, and call it from the test method to reduce duplication and improve readability.

coderabbitai · 2025-06-09T19:58:11Z

doped/complexes.py

+            for frac_coords in get_equiv_frac_coords_in_primitive(
+                site.frac_coords,
+                primitive_structure,
+                bulk_supercell,
+                symm_ops=supercell_symm_ops,  # only small speed up (due to caching)
+                symprec=symprec,
+                dist_tol=dist_tol,
+            )


💡 Verification agent

🧩 Analysis chain

Verify that get_equiv_frac_coords_in_primitive accepts the keyword arguments.

Static analysis indicates that symm_ops and dist_tol may be unexpected keyword arguments for this function.

🏁 Script executed:

#!/bin/bash # Search for the function definition to verify accepted parameters ast-grep --pattern 'def get_equiv_frac_coords_in_primitive($$$)'

Length of output: 15931

Invalid keyword arguments passed to get_equiv_frac_coords_in_primitive

The signature of get_equiv_frac_coords_in_primitive (see doped/utils/symmetry.py:1223) is:

def get_equiv_frac_coords_in_primitive( frac_coords, primitive, supercell, symprec=0.01, dist_tol_factor=1.0, equiv_coords=True, return_symprec_and_dist_tol_factor=False, fixed_symprec_and_dist_tol_factor=False, verbose=False, ) -> ...

It does not accept symm_ops or dist_tol. Please update the call in doped/complexes.py accordingly:

Remove the symm_ops=supercell_symm_ops argument.

Replace dist_tol=… with dist_tol_factor=… (or drop it if you intend to use the default).

Location:

File: doped/complexes.py

Lines: ~366–373

Suggested diff:

- for frac_coords in get_equiv_frac_coords_in_primitive( - site.frac_coords, - primitive_structure, - bulk_supercell, - symm_ops=supercell_symm_ops, - symprec=symprec, - dist_tol=dist_tol, - ) + for frac_coords in get_equiv_frac_coords_in_primitive( + site.frac_coords, + primitive_structure, + bulk_supercell, + symprec=symprec, + dist_tol_factor=dist_tol_factor, # if you want to override the default + )

🧰 Tools

🪛 Pylint (3.3.7)

[error] 366-373: Unexpected keyword argument 'symm_ops' in function call

(E1123)

[error] 366-373: Unexpected keyword argument 'dist_tol' in function call

(E1123)

🤖 Prompt for AI Agents

In doped/complexes.py around lines 366 to 373, the call to get_equiv_frac_coords_in_primitive incorrectly passes symm_ops and dist_tol as keyword arguments, which are not accepted by the function. Remove the symm_ops=supercell_symm_ops argument entirely and replace dist_tol=dist_tol with dist_tol_factor=dist_tol if you want to specify it, or omit it to use the default value. Adjust the call accordingly to match the function signature in doped/utils/symmetry.py.

coderabbitai · 2025-06-09T19:58:11Z

doped/utils/stenciling.py

+
+def _convert_defect_neighbours_to_X(
+    defect_supercell: Structure,
+    defect_position: np.ndarray[float],


⚠️ Potential issue

Fix all remaining numpy array type hints.

Multiple functions have the same invalid numpy type hint syntax.

Apply this pattern throughout the file:

# Line 576 - defect_position: np.ndarray[float], + defect_position: np.ndarray, # Line 1025 - defect_frac_coords: np.ndarray[float] | list[float] | None = None, + defect_frac_coords: np.ndarray | list[float] | None = None, # Line 1097 - lattice: Lattice, cart_coords: np.ndarray[float] | list[float], tol: float = 1e-5 + lattice: Lattice, cart_coords: np.ndarray | list[float], tol: float = 1e-5

Also applies to: 1025-1025, 1097-1097

🧰 Tools

🪛 Pylint (3.3.7)

[error] 576-576: Value 'np.ndarray' is unsubscriptable

(E1136)

🤖 Prompt for AI Agents

In doped/utils/stenciling.py at lines 576, 1025, and 1097, the numpy array type hints use an invalid syntax like np.ndarray[float]. Replace these with the correct typing syntax by importing numpy.typing as npt and using npt.NDArray with the appropriate dtype, for example npt.NDArray[np.float64]. Update all such type hints in these lines accordingly.

coderabbitai · 2025-06-09T19:58:11Z

doped/utils/stenciling.py

+def get_defect_in_supercell(
+    defect_entry: DefectEntry,
+    target_supercell: Structure,
+    check_bulk: bool = True,
+    target_frac_coords: np.ndarray[float] | list[float] | bool = True,
+    edge_tol: float = 1,
+    min_dist_tol_factor: float = 0.99,
+) -> tuple[Structure, Structure]:
+    """
+    Re-generate a relaxed defect structure in a different supercell.
+
+    This function takes the relaxed defect structure of the input
+    ``DefectEntry`` (from ``DefectEntry.defect_supercell``) and re-generates it
+    in the ``target_supercell`` structure, and the closest possible position to
+    ``target_frac_coords`` (default is the supercell centre = [0.5, 0.5, 0.5]),
+    also providing the corresponding bulk supercell (which should be the same
+    for each generated defect supercell given the same ``target_supercell`` and
+    base supercell for ``defect_entry``, see note below).
+
+    ``target_supercell`` should be the same host crystal structure, just with
+    different supercell dimensions, having the same lattice parameters and bond
+    lengths.
+
+    Note: This function does `not` guarantee that the generated defect
+    supercell atomic position basis exactly matches that of
+    ``target_supercell``, which may have come from a different primitive
+    structure definition (e.g. CdTe with
+    ``{"Cd": [0,0,0], "Te": [0.25,0.25,0.25]}`` vs
+    ``{"Cd": [0,0,0], "Te": [0.75,0.75,0.75]}``). The generated supercell
+    `will` have the exact same lattice/cell definition with fully
+    symmetry-equivalent atom positions, but if the actual position basis
+    differs then this can cause issues with parsing finite-size corrections
+    (which rely on site-matched potentials). This is perfectly fine if it
+    occurs, just will require the use of a matching bulk/reference supercell
+    when parsing (rather than the input ``target_supercell``) -- ``doped`` will
+    also throw a warning about this when parsing if a non-matching bulk
+    supercell is used anyway. This function will automatically check if the
+    position basis in the generated supercell differs from that of
+    ``target_supercell``, printing a warning if so (unless ``check_bulk`` is
+    ``False``) and returning the corresponding bulk supercell which should be
+    used for parsing defect calculations with the generated supercell. Of
+    course, if generating multiple defects in the same ``target_supercell``,
+    only one such bulk supercell calculation should be required (should
+    correspond to the same bulk supercell in each case).
+
+    Briefly, this function works by:
+
+    - Translating the defect site to the centre of the original supercell.
+    - Identifying a super-supercell which fully encompasses the target
+      supercell (regardless of orientation).
+    - Generate this super-supercell, using one copy of the original defect
+      supercell (``DefectEntry.defect_supercell``), and the rest of the sites
+      (outside of the original defect supercell box, with the defect translated
+      to the centre) are populated using the bulk supercell
+      (``DefectEntry.bulk_supercell``).
+    - Translate the defect site in this super-supercell to the Cartesian
+      coordinates of the centre of ``target_supercell``, then stencil out all
+      sites in the ``target_supercell`` portion of the super-supercell,
+      accounting for possible site displacements in the relaxed defect
+      supercell (e.g. if ``target_supercell`` has a different shape and does
+      not fully encompass the original defect supercell).
+      This is done by scanning over possible combinations of sites near the
+      boundary regions of the ``target_supercell`` portion, and identifying the
+      combination which maximises the minimum inter-atomic distance in the new
+      supercell (i.e. the most bulk-like arrangement).
+    - Re-orient this new stenciled supercell to match the orientation and site
+      positions of ``target_supercell``.
+    - If ``target_frac_coords`` is not ``False``, scan over all symmetry
+      operations of ``target_supercell`` and apply that which places the defect
+      site closest to ``target_frac_coords``.
+
+    Args:
+        defect_entry (DefectEntry):
+            A ``DefectEntry`` object for which to re-generate the relaxed
+            structure (taken from ``DefectEntry.defect_supercell``) in the
+            ``target_supercell`` lattice.
+        target_supercell (Structure):
+            The supercell structure to re-generate the relaxed defect structure
+            in.
+        check_bulk (bool):
+            Whether to check if the generated defect/bulk supercells have
+            different atomic position bases to ``target_supercell`` (as
+            described above) -- if so, a warning will be printed (unless
+            ``check_bulk`` is ``False``). Default is ``True``.
+        target_frac_coords (Union[np.ndarray[float], list[float], bool]):
+            The fractional coordinates to target for defect placement in the
+            new supercell. If just set to ``True`` (default), will try to place
+            the defect nearest to the centre of the superset cell (i.e.
+            ``target_frac_coords = [0.5, 0.5, 0.5]``), as is default in
+            ``doped`` defect generation. Note that defect placement is harder
+            in this case than in generation with ``DefectsGenerator``, as we
+            are not starting from primitive cells and we are working with
+            relaxed geometries.
+        edge_tol (float):
+            A tolerance (in Angstrom) for site displacements at the edge of the
+            stenciled supercell, when determining the best match of sites to
+            stencil out in the new supercell (of ``target_supercell``
+            dimension). Default is 1 Angstrom, and then this is sequentially
+            increased up to 4.5 Angstrom if the initial scan fails.
+        min_dist_tol_factor (float):
+            Tolerance factor when checking the minimum interatomic distance in
+            the stenciled defect supercell, as a factor of the minimum distance
+            in the original ``DefectEntry.defect_supercell``. Default is 0.99.
+
+    Returns:
+        tuple[Structure, Structure]:
+            The re-generated defect supercell in the ``target_supercell``
+            lattice, and the corresponding bulk/reference supercell for the
+            generated defect supercell (see explanations above).
+    """
+    # Note to self; using Pycharm breakpoints throughout is likely easiest for debugging (TODO)
+    # TODO: Tests!! (At least one of each defect, Se good test case, then at least one or two with
+    #  >unary compositions and extrinsic substitution/interstitial)
+    # TODO: We should now be able to use these functions (without the final re-orientation step,
+    #  for speed) to determine the point symmetries of relaxed defects in non-symmetry-conserving
+    #  supercells, by stenciling into a small symmetry-conserving cell and getting the point symmetry
+    #  for that -- will do!
+    # TODO: When releasing, note that this code is useful for supercell size convergence tests, and
+    # accelerating `ShakeNBreak` etc. If/when adding, make sure to link in `SnB` docs as well. Should have
+    # some notes in docs/tutorials about this functionality, maybe with a quick example (e.g. Se results).
+    # Likewise, the displacements functions are useful as a validation / check of supercell size
+    # convergence, and for quantifying the strain / distortion introduced by a certain defect
+
+    pbar = tqdm(
+        total=100, bar_format="{desc}{percentage:.1f}%|{bar}| [{elapsed},  {rate_fmt}{postfix}]"
+    )  # tqdm progress bar. 100% is completion
+    pbar.set_description("Getting super-supercell (relaxed defect + bulk sites)")
+
+    bulk_mismatch_warning = False
+
+    try:
+        orig_supercell = _get_defect_supercell(defect_entry)
+        orig_min_dist = min_dist(orig_supercell)
+        min_dist_tol = orig_min_dist * min_dist_tol_factor
+        orig_bulk_supercell = _get_bulk_supercell(defect_entry)
+        orig_defect_frac_coords = defect_entry.sc_defect_frac_coords
+        target_supercell = target_supercell.copy()
+        bulk_min_bond_length = min_dist(orig_bulk_supercell)
+        bulk_min_dist_tol = bulk_min_bond_length * min_dist_tol_factor
+        target_frac_coords = [0.5, 0.5, 0.5] if target_frac_coords is True else target_frac_coords
+
+        # ensure no oxidation states (for easy composition matching later)
+        for struct in [orig_supercell, orig_bulk_supercell, target_supercell]:
+            struct.remove_oxidation_states()
+
+        # first translate both orig supercells to put defect in the middle, to aid initial stenciling:
+        orig_def_to_centre = np.array([0.5, 0.5, 0.5]) - orig_defect_frac_coords
+        orig_supercell = translate_structure(orig_supercell, orig_def_to_centre, frac_coords=True)
+        trans_orig_bulk_supercell = translate_structure(
+            orig_bulk_supercell, orig_def_to_centre, frac_coords=True
+        )
+
+        # get big_supercell, which is expanded version of orig_supercell so that _each_ lattice vector is
+        # now bigger than the _largest_ lattice vector in target_supercell (so that target_supercell is
+        # fully encompassed by big_supercell):
+        superset_matrix, big_supercell, big_supercell_with_X, orig_supercell_with_X = (  # supa-set!!
+            _get_superset_matrix_and_supercells(orig_supercell, target_supercell, [0.5, 0.5, 0.5])
+        )
+        big_bulk_supercell = orig_bulk_supercell * superset_matrix  # get big bulk supercell
+
+        # this big_supercell is with the defect now repeated in it, but we want it with just one defect,
+        # so only take the first repeated defect supercell, and then the rest of the sites from the
+        # expanded bulk supercell:
+        # only keep atoms in big supercell which are within the original supercell bounds:
+        big_defect_supercell = _get_matching_sites_from_s1_then_s2(
+            orig_supercell_with_X,
+            big_supercell_with_X,
+            trans_orig_bulk_supercell * superset_matrix,
+            min_dist_tol,
+        )
+
+        # translate structure to put defect at the centre of the big supercell (w/frac_coords)
+        big_supercell_defect_site = next(s for s in big_defect_supercell.sites if s.specie.symbol == "X")
+        def_to_centre = np.array([0.5, 0.5, 0.5]) - big_supercell_defect_site.frac_coords
+        big_defect_supercell = translate_structure(big_defect_supercell, def_to_centre, frac_coords=True)
+
+        pbar.update(20)  # 20% of progress bar
+        pbar.set_description("Getting sites in border region")
+
+        # get all atoms in big supercell within the cartesian bounds of the target_supercell supercell:
+        new_defect_supercell = _stencil_target_cell_from_big_cell(
+            big_defect_supercell,
+            target_supercell,
+            bulk_min_bond_length=bulk_min_bond_length,
+            min_dist_tol=min_dist_tol,
+            edge_tol=edge_tol,
+            pbar=pbar,
+        )
+        new_bulk_supercell = _stencil_target_cell_from_big_cell(
+            big_bulk_supercell,
+            target_supercell,
+            bulk_min_bond_length=bulk_min_bond_length,
+            min_dist_tol=bulk_min_dist_tol,
+            edge_tol=1e-3,
+            pbar=None,
+        )  # shouldn't need `edge_tol`, should be much faster than defect supercell stencil
+
+        pbar.update(15)  # 55% of progress bar
+        pbar.set_description("Ensuring matching orientation w/target_supercell")
+
+        # now we just do get s2 like s1 to get the orientation right:
+        defect_site = next(site for site in new_defect_supercell if site.specie.symbol == "X")
+
+        # Note: this function is typically the main bottleneck in this workflow. We have already
+        # optimised the underlying ``StructureMatcher`` workflow in many ways (caching,
+        # fast structure/site/composition comparisons, skipping comparison of defect neighbourhood to
+        # reduce requisite ``stol`` etc; being many orders of magnitude faster than the base
+        # ``pymatgen`` ``StructureMatcher``), however the ``_cart_dists()`` function call is
+        # still quite expensive, especially with large structures with significant noise in the atomic
+        # positions...
+        new_defect_supercell_w_defect_neighbours_as_X = _convert_defect_neighbours_to_X(
+            new_defect_supercell, defect_site.frac_coords, coords_are_cartesian=False
+        )
+        with warnings.catch_warnings():
+            warnings.filterwarnings("ignore", message="Not all sites have property orig_species")
+            # first we orient the generated _bulk_ supercell to match the ``target_supercell``,
+            # to try ensure consistency in the generated supercells
+            oriented_new_bulk_supercell = orient_s2_like_s1(  # speed should be >= defect orienting
+                target_supercell,
+                new_bulk_supercell,
+                verbose=False,
+                allow_subset=True,
+            )
+            # return new_defect_supercell, _round_struct_coords(oriented_new_bulk_supercell,
+            #                                                   dist_precision=0.01, to_unit_cell=True)
+            oriented_new_defect_supercell_w_defect_neighbours_as_X = orient_s2_like_s1(
+                oriented_new_bulk_supercell,
+                new_defect_supercell_w_defect_neighbours_as_X,
+                verbose=False,
+                ignored_species=["X"],  # ignore X site
+                allow_subset=True,  # allow defect supercell composition to differ
+            )
+            oriented_new_defect_supercell = _convert_X_back_to_orig_species(
+                oriented_new_defect_supercell_w_defect_neighbours_as_X
+            )
+            oriented_new_defect_site = next(  # get defect site and remove X from sites
+                oriented_new_defect_supercell.pop(i)
+                for i, site in enumerate(oriented_new_defect_supercell.sites)
+                if site.specie.symbol == "X"
+            )
+
+            pbar.update(35)  # 90% of progress bar
+
+            if target_frac_coords is not False:
+                pbar.set_description("Placing defect closest to target_frac_coords")
+
+                target_symm_op = _scan_symm_ops_to_place_site_closest_to_frac_coords(
+                    target_supercell, oriented_new_defect_site, target_frac_coords
+                )
+                oriented_new_defect_supercell = get_clean_structure(
+                    apply_symm_op_to_struct(  # apply symm_op to structure
+                        target_symm_op,
+                        oriented_new_defect_supercell,
+                        fractional=True,
+                        rotate_lattice=False,
+                    ),
+                )  # reordered inputs in updated doped
+                if check_bulk:
+                    bulk_mismatch_warning = not check_atom_mapping_far_from_defect(
+                        target_supercell,
+                        oriented_new_bulk_supercell,
+                        oriented_new_defect_site.frac_coords,
+                        warning=False,
+                    )
+
+                oriented_new_bulk_supercell = get_clean_structure(
+                    apply_symm_op_to_struct(
+                        target_symm_op,
+                        oriented_new_bulk_supercell,
+                        fractional=True,
+                        rotate_lattice=False,
+                    ),
+                )
+
+        pbar.update(pbar.total - pbar.n)  # set to 100% of progress bar
+
+    except Exception as e:
+        pbar.close()
+        raise e
+
+    finally:
+        pbar.close()
+
+    if bulk_mismatch_warning:  # print warning after closing pbar; cleaner
+        warnings.warn(
+            "Note that the atomic position basis of the generated defect/bulk supercell "
+            "differs from that of the ``target_supercell``. This is likely fine, "
+            "and just due to differences in (symmetry-equivalent) primitive cell definitions "
+            "(e.g. {'Cd': [0,0,0], 'Te': [0.25,0.25,0.25]} vs "
+            "{'Cd(': [0,0,0], 'Te': [0.75,0.75,0.75]}``) -- see ``get_defect_in_supercell`` "
+            "docstring for more info. For accurate finite-size charge corrections when "
+            "parsing, a matching bulk and defect supercell should be used, and so the "
+            "matching bulk supercell for the generated defect supercell (also returned "
+            "by this function) should be used for its reference host cell calculation.",
+        )
+
+    _check_min_dist(oriented_new_defect_supercell, min_dist_tol)  # check distances are reasonable
+    _check_min_dist(oriented_new_bulk_supercell, bulk_min_dist_tol)
+    return oriented_new_defect_supercell, oriented_new_bulk_supercell
+


🛠️ Refactor suggestion

Consider breaking down this complex function.

With 56 statements and 36 local variables, this function is difficult to understand and maintain. The extensive inline comments and progress bar updates suggest natural boundaries for extraction.

Consider extracting these logical sections into separate functions:

Super-supercell generation (lines 175-220)

Stenciling logic (lines 224-240)

Orientation matching (lines 245-284)

Defect placement (lines 287-317)

This would improve testability and make the overall flow clearer.

🧰 Tools

🪛 Pylint (3.3.7)

[refactor] 44-44: Too many arguments (6/5)

(R0913)

[refactor] 44-44: Too many positional arguments (6/5)

(R0917)

[refactor] 44-44: Too many local variables (36/15)

(R0914)

[error] 48-48: Value 'np.ndarray' is unsubscriptable

(E1136)

[refactor] 44-44: Too many statements (56/50)

(R0915)

🤖 Prompt for AI Agents

In doped/utils/stenciling.py around lines 44 to 343, the get_defect_in_supercell function is overly long and complex with 56 statements and 36 local variables, making it hard to understand and maintain. Refactor by extracting the following logical sections into separate helper functions: super-supercell generation (lines 175-220), stenciling logic (lines 224-240), orientation matching (lines 245-284), and defect placement (lines 287-317). Replace the corresponding code blocks in the main function with calls to these new functions to improve readability, testability, and maintainability.

coderabbitai · 2025-06-09T19:58:11Z

doped/utils/stenciling.py

+def _scan_symm_ops_to_place_site_closest_to_frac_coords(
+    symm_ops: Structure | Sequence[SymmOp],
+    site: PeriodicSite,
+    target_frac_coords: np.ndarray[float] | list[float] | None = None,


⚠️ Potential issue

Fix invalid numpy array type hint syntax.

Same issue as above with invalid numpy type hint.

- target_frac_coords: np.ndarray[float] | list[float] | None = None, + target_frac_coords: np.ndarray | list[float] | None = None,

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

target_frac_coords: np.ndarray[float] | list[float] | None = None,

target_frac_coords: np.ndarray | list[float] | None = None,

🧰 Tools

🪛 Pylint (3.3.7)

[error] 348-348: Value 'np.ndarray' is unsubscriptable

(E1136)

🤖 Prompt for AI Agents

In doped/utils/stenciling.py at line 348, the type hint for a numpy array is incorrectly specified as np.ndarray[float]. Replace this with the correct syntax using numpy.typing.NDArray, such as numpy.typing.NDArray[np.float64], or use a more general type hint compatible with numpy arrays. Adjust the import statements if necessary to include numpy.typing for proper type hinting.

coderabbitai · 2025-06-09T19:58:12Z

doped/utils/stenciling.py

+    defect_entry: DefectEntry,
+    target_supercell: Structure,
+    check_bulk: bool = True,
+    target_frac_coords: np.ndarray[float] | list[float] | bool = True,


⚠️ Potential issue

Fix invalid numpy array type hint syntax.

The type hint np.ndarray[float] is not valid Python syntax. NumPy arrays don't support direct subscript notation in type hints.

- target_frac_coords: np.ndarray[float] | list[float] | bool = True, + target_frac_coords: np.ndarray | list[float] | bool = True,

For more specific typing, consider using numpy.typing.NDArray[np.float64] with appropriate imports.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

target_frac_coords: np.ndarray[float] | list[float] | bool = True,

target_frac_coords: np.ndarray | list[float] | bool = True,

🧰 Tools

🪛 Pylint (3.3.7)

[error] 48-48: Value 'np.ndarray' is unsubscriptable

(E1136)

🤖 Prompt for AI Agents

In doped/utils/stenciling.py at line 48, the type hint np.ndarray[float] is invalid syntax. Replace it with numpy.typing.NDArray[np.float64] after importing NDArray from numpy.typing and np.float64 from numpy to correctly specify the array type in the type hint.

…ching, and use `dist_tol_factor`)

…ions to `doped.complexes`

…on efficient

… in defect site parsing functions

…efore...)

…defect_type_and_site_indices` and move unrelaxed defect structure code out to `defect_from_structures`

… complex defects

… make `site_tol` apply element-wise

…e `None`

…generation, and test

kavanase changed the base branch from main to develop June 9, 2025 19:48

coderabbitai bot reviewed Jun 9, 2025

View reviewed changes

kavanase force-pushed the split_vacancies branch 2 times, most recently from b304253 to f0d7be7 Compare June 26, 2025 22:21

kavanase force-pushed the split_vacancies branch from f0d7be7 to f6dce12 Compare July 3, 2025 21:13

kavanase force-pushed the split_vacancies branch from 6ef1f97 to 3631bc8 Compare August 12, 2025 04:47

kavanase force-pushed the split_vacancies branch from 3631bc8 to 03042cd Compare October 15, 2025 20:55

kavanase added 22 commits October 16, 2025 14:44

Fresh split vacancies code

d6ef1dd

Remove handling for older pymatgen (outside supported window now)

7941520

Rebase to latest develop (cut symm_ops usage, unnecessary with ca…

0467a3a

…ching, and use `dist_tol_factor`)

Add additional kwargs to DefectsGenerator

45a0ba4

Update get_split_vacancies, full draft added

8dcc4f5

Add molecule_from_sites and get_split_vacancies_by_geometry funct…

89b6e26

…ions to `doped.complexes`

Use EwaldMinimizer for fast electrostatic evaluation

f295a18

Expand EwaldMinimizer usage and make electrostatic complex generati…

3151ed1

…on efficient

Move get_split_vacancies_from_electrostatics to doped.complexes

9428c70

Some cleanup and updates

6c2d9d0

Move get_split_vacancies to doped.complexes

38f700d

Add DOI and update paper list

03ee652

Efficiency and robustness improvements, and reducing code redundancy,…

9fa5a6f

… in defect site parsing functions

Remove symm_ops usages from test_complexes (thought we did this b…

1ffec30

…efore...)

Refactor get_defect_type_site_idxs_and_unrelaxed_structure to `get_…

e47b4ca

…defect_type_and_site_indices` and move unrelaxed defect structure code out to `defect_from_structures`

Further streamlining of get_defect_type_and_site_indices, ready for…

0566ae2

… complex defects

Add use_oxi_states option to get_defect_type_and_site_indices and…

df814d7

… make `site_tol` apply element-wise

Use get_defect_type_and_site_indices in classify_vacancy_geometry

52aa444

Use get_defect_type_and_site_indices in classify_vacancy_geometry

ef63b2a

Add use_rms option and update oxi-state handling

76da3b2

Guard DefectEntry.__eq__() case where sc_entry or bulk_entry ar…

7604823

…e `None`

Minor redundancy updates (leftover from pmg defects)

2d6401d

kavanase added 3 commits October 16, 2025 14:44

Add warning when input defect sites are the same, for complex defect …

1e629a7

…generation, and test

Complex defect multiplicity updates

4cf597f

Rebase on to latest develop

1a6a8d7

kavanase force-pushed the split_vacancies branch from 03042cd to 1a6a8d7 Compare October 16, 2025 18:45

	def test_Se_20_Å_supercell(self):
	- def test_Se_20_Å_supercell(self):
	+ def test_Se_20_A_supercell(self):

	target_frac_coords: np.ndarray[float] \| list[float] \| None = None,
	target_frac_coords: np.ndarray \| list[float] \| None = None,

Split Vacancies and Foundation Defect Complexes Code #127

Are you sure you want to change the base?

Split Vacancies and Foundation Defect Complexes Code #127

Uh oh!

Conversation

kavanase commented Jun 9, 2025 • edited by romani-95 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Jun 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kavanase commented Jun 9, 2025 •

edited by romani-95

Loading

coderabbitai bot commented Jun 9, 2025 •

edited

Loading