Skip to content

Conversation

@kavanase
Copy link
Member

@kavanase kavanase commented Jun 9, 2025

See original commit names here: #119
Squashed and restarted after repo clean.

Mostly for tracking for now. Main TODO is to get automated split vacancy generation from ML-calculated database fully implemented. (General) complex defect symmetry and multiplicity analysis (& thus concentrations/thermodynamics) is now implemented, will also add automated complex defect structure recognition in parsing.

Key TODOs:

  • Split vacancy generation from database
  • (Efficient) complex defect structure recognition

@coderabbitai
Copy link

coderabbitai bot commented Jun 9, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This update introduces major new functionality for handling complex defect structures in crystalline materials. It adds modules for generating, classifying, and analyzing defect complexes, including split vacancies, and provides robust stenciling utilities for mapping relaxed defect structures into arbitrary supercells. Comprehensive documentation and extensive tests for these features are included.

Changes

File(s) Change Summary
doped/complexes.py New module for generating, classifying, and analyzing complex defects, including vacancy geometry classification, electrostatic energy calculation, symmetry-equivalent enumeration, and multiplicity determination.
doped/utils/stenciling.py New module for reconstructing relaxed defect structures in arbitrary supercells with supporting utility functions for site matching, symmetry placement, and physical validation.
doped/generation.py Added methods to DefectsGenerator for split vacancy identification and generation using electrostatics and geometry, with placeholder for database retrieval.
doped/core.py Added helper function _get_single_valence_oxi_states to extract oxidation states from various input types.
doped/utils/configurations.py Removed legacy alias _scan_sm_stol_till_match.
docs/doped.complexes.rst New Sphinx documentation file for the doped.complexes module, including all members and inheritance relationships.
docs/doped.rst Updated to include doped.complexes in the list of submodules.
pyproject.toml Updated shakenbreak dependency from >=3.4 to >=3.4.2 and revised comment.
tests/test_complexes.py New tests for complex defect generation, symmetry, multiplicity, and molecule equivalence, especially for split vacancies and Ga2O3 structures.
tests/test_stenciling.py New tests for verifying defect stenciling into larger or differently oriented supercells, including structure matching and physical validation.
tests/data/Ga2O3_R3c_POSCAR New POSCAR file providing Ga2O3 R3c structure for test input.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DefectsGenerator
    participant ComplexesModule
    participant StencilingUtils
    participant StructureMatcher

    User->>DefectsGenerator: get_split_vacancies()
    DefectsGenerator->>ComplexesModule: generate_complex_from_defect_sites(...)
    ComplexesModule->>ComplexesModule: classify_vacancy_geometry(...)
    ComplexesModule->>ComplexesModule: get_equivalent_complex_defect_sites_in_primitive(...)
    ComplexesModule->>ComplexesModule: get_complex_defect_multiplicity(...)
    DefectsGenerator->>User: Return split vacancy candidates

    User->>StencilingUtils: get_defect_in_supercell(defect_entry, target_supercell)
    StencilingUtils->>StencilingUtils: Place defect site via symmetry
    StencilingUtils->>StencilingUtils: Generate super-supercell
    StencilingUtils->>StencilingUtils: Stencil sites to target supercell
    StencilingUtils->>StructureMatcher: Match and validate structure
    StencilingUtils->>User: Return stenciled defect and bulk supercells
Loading

Poem

In crystal fields where defects dwell,
New tools emerge to probe and tell—
Of split vacancies, symmetry's song,
And stenciled sites where atoms belong.
With tests and docs, the code now leaps—
Through lattices, a rabbit peeps! 🐇✨

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch split_vacancies

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@kavanase kavanase changed the base branch from main to develop June 9, 2025 19:48
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

🧹 Nitpick comments (10)
doped/core.py (1)

2616-2632: New helper is unreferenced—DRY opportunity
The logic for extracting single‐valence oxidation states is still duplicated in _from_pmg_defect. Replace those inline dict comprehensions with calls to _get_single_valence_oxi_states to avoid code duplication and improve maintainability.

tests/test_stenciling.py (3)

20-29: Consider using pathlib for more robust file operations.

The current implementation works but could be simplified and made more robust using pathlib.

-def if_present_rm(path):
-    """
-    Remove the file/folder if it exists.
-    """
-    if os.path.exists(path):
-        if os.path.isfile(path):
-            os.remove(path)
-        elif os.path.isdir(path):
-            shutil.rmtree(path)
+from pathlib import Path
+
+def if_present_rm(path):
+    """
+    Remove the file/folder if it exists.
+    """
+    p = Path(path)
+    if p.exists():
+        if p.is_file():
+            p.unlink()
+        elif p.is_dir():
+            shutil.rmtree(p)

61-66: Track the TODO items for comprehensive test coverage.

The TODO comments indicate planned additions for extrinsic defects and split vacancies testing, which would enhance test coverage of the new complex defect functionality.

Would you like me to create GitHub issues to track these TODO items for future implementation?


119-127: Consider extracting file I/O testing to a separate utility.

The pattern of writing to file, reading back, and comparing could be extracted to a reusable test utility method to reduce duplication across tests.

def assert_structure_roundtrip(self, structure, filepath, reference):
    """Test that structure survives POSCAR write/read cycle."""
    structure.to(fmt="vasp", filename=filepath)
    reloaded = Structure.from_file(filepath)
    assert reloaded == reference
    if_present_rm(filepath)
tests/test_complexes.py (2)

58-60: Address the TODO comment for incomplete test setup.

This commented-out code appears to be setting up a specific split vacancy configuration but is incomplete. Consider either completing the implementation or removing it if no longer needed.

Would you like me to help complete this test setup or open an issue to track this task?


153-160: Complete or remove the commented test method.

The test_Ga2O3_generate_complex_from_defect_sites method is commented out with a TODO marker. This suggests incomplete test coverage for the generate_complex_from_defect_sites function.

Would you like me to help implement this test method or create an issue to track its completion?

doped/complexes.py (2)

164-164: Avoid accessing private attributes directly.

Using structure._charge directly accesses a private attribute, which violates encapsulation principles and could break if the internal implementation changes.

Consider using a public API method or property if available, or working with the pymatgen team to expose this functionality properly.


688-692: Track TODO items in issue tracker.

There are several TODO comments that should be properly tracked:

  • Future complex generation functionality (line 688-689)
  • DOI updates when published (line 690)
  • Missing tests (line 691)

Would you like me to create GitHub issues to track these TODO items?

doped/utils/stenciling.py (2)

154-166: Address TODO comments for production readiness.

Multiple TODO items indicate this code may not be fully production-ready:

  • Missing tests (line 155-156)
  • Unimplemented symmetry analysis feature (lines 157-160)
  • Missing documentation (lines 161-165)

Would you like me to help create unit tests or open issues to track these items?


853-878: Consider extracting nested function for clarity.

The nested _check_other_sites function could be moved to module level as a private function to improve readability and testability.

This would make the main function logic clearer and allow the helper to be tested independently if needed.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad0eee0 and 7e75028.

⛔ Files ignored due to path filters (7)
  • tests/data/Split_Vacancies/Ba3Al2(SiO4)3-mp-14049.json.gz is excluded by !**/*.gz
  • tests/data/Split_Vacancies/CsCd(PO3)3-mp-13738.json.gz is excluded by !**/*.gz
  • tests/data/Split_Vacancies/Ga2O3-mp-1243.json.gz is excluded by !**/*.gz
  • tests/data/Split_Vacancies/Ga2O3-mp-886.json.gz is excluded by !**/*.gz
  • tests/data/Split_Vacancies/GeI2-mp-1540935.json.gz is excluded by !**/*.gz
  • tests/data/Split_Vacancies/Li3Ti4O8-mp-752879.json.gz is excluded by !**/*.gz
  • tests/data/Split_Vacancies/Na2Mg(NiO2)2-mp-2217469.json.gz is excluded by !**/*.gz
📒 Files selected for processing (11)
  • docs/doped.complexes.rst (1 hunks)
  • docs/doped.rst (1 hunks)
  • doped/complexes.py (1 hunks)
  • doped/core.py (1 hunks)
  • doped/generation.py (3 hunks)
  • doped/utils/configurations.py (0 hunks)
  • doped/utils/stenciling.py (1 hunks)
  • pyproject.toml (1 hunks)
  • tests/data/Ga2O3_R3c_POSCAR (1 hunks)
  • tests/test_complexes.py (1 hunks)
  • tests/test_stenciling.py (1 hunks)
💤 Files with no reviewable changes (1)
  • doped/utils/configurations.py
🧰 Additional context used
🪛 Pylint (3.3.7)
tests/test_complexes.py

[refactor] 281-281: Too many local variables (24/15)

(R0914)


[refactor] 403-403: Too many local variables (28/15)

(R0914)


[refactor] 542-542: Too many local variables (31/15)

(R0914)

tests/test_stenciling.py

[refactor] 44-44: Too many instance attributes (10/7)

(R0902)

doped/generation.py

[refactor] 2465-2465: Too many local variables (16/15)

(R0914)


[refactor] 2539-2585: Too many nested blocks (7/5)

(R1702)

doped/complexes.py

[refactor] 238-238: Too many arguments (9/5)

(R0913)


[refactor] 238-238: Too many positional arguments (9/5)

(R0917)


[refactor] 238-238: Too many local variables (36/15)

(R0914)


[error] 366-373: Unexpected keyword argument 'symm_ops' in function call

(E1123)


[error] 366-373: Unexpected keyword argument 'dist_tol' in function call

(E1123)


[refactor] 567-567: Too many arguments (8/5)

(R0913)


[refactor] 567-567: Too many positional arguments (8/5)

(R0917)

doped/utils/stenciling.py

[refactor] 44-44: Too many arguments (6/5)

(R0913)


[refactor] 44-44: Too many positional arguments (6/5)

(R0917)


[refactor] 44-44: Too many local variables (36/15)

(R0914)


[error] 48-48: Value 'np.ndarray' is unsubscriptable

(E1136)


[refactor] 44-44: Too many statements (56/50)

(R0915)


[error] 348-348: Value 'np.ndarray' is unsubscriptable

(E1136)


[refactor] 385-385: Too many arguments (6/5)

(R0913)


[refactor] 385-385: Too many positional arguments (6/5)

(R0917)


[refactor] 385-385: Too many local variables (32/15)

(R0914)


[refactor] 385-385: Too many branches (14/12)

(R0912)


[refactor] 385-385: Too many statements (54/50)

(R0915)


[error] 576-576: Value 'np.ndarray' is unsubscriptable

(E1136)


[refactor] 913-913: Too many local variables (23/15)

(R0914)


[error] 1025-1025: Value 'np.ndarray' is unsubscriptable

(E1136)


[error] 1097-1097: Value 'np.ndarray' is unsubscriptable

(E1136)

🪛 Ruff (0.11.9)
tests/test_stenciling.py

67-67: Function name test_Se_20_Å_supercell contains a non-ASCII character

(PLC2401)

⏰ Context from checks skipped due to timeout of 90000ms (20)
  • GitHub Check: test (ubuntu-latest, 3.13, 10)
  • GitHub Check: test (ubuntu-latest, 3.13, 7)
  • GitHub Check: test (ubuntu-latest, 3.13, 4)
  • GitHub Check: test (ubuntu-latest, 3.13, 6)
  • GitHub Check: test (ubuntu-latest, 3.13, 9)
  • GitHub Check: test (ubuntu-latest, 3.13, 3)
  • GitHub Check: test (ubuntu-latest, 3.10, 9)
  • GitHub Check: test (ubuntu-latest, 3.13, 8)
  • GitHub Check: test (ubuntu-latest, 3.13, 1)
  • GitHub Check: test (ubuntu-latest, 3.13, 5)
  • GitHub Check: test (ubuntu-latest, 3.10, 6)
  • GitHub Check: test (ubuntu-latest, 3.10, 7)
  • GitHub Check: test (ubuntu-latest, 3.10, 2)
  • GitHub Check: test (ubuntu-latest, 3.13, 2)
  • GitHub Check: test (ubuntu-latest, 3.10, 4)
  • GitHub Check: test (ubuntu-latest, 3.10, 8)
  • GitHub Check: test (ubuntu-latest, 3.10, 10)
  • GitHub Check: test (ubuntu-latest, 3.10, 3)
  • GitHub Check: test (ubuntu-latest, 3.10, 1)
  • GitHub Check: test (ubuntu-latest, 3.10, 5)
🔇 Additional comments (11)
docs/doped.complexes.rst (1)

1-7: Documentation addition looks good
The Sphinx automodule directive correctly exposes doped.complexes with members, undocumented items, and inheritance. No issues found.

pyproject.toml (1)

26-26: LGTM!

The shakenbreak version update to >=3.4.2 is appropriate for the StructureMatcher efficiency improvements needed by the new complex defect functionality.

docs/doped.rst (1)

20-20: Documentation correctly updated.

The new doped.complexes submodule is properly added to the documentation structure in alphabetical order.

tests/data/Ga2O3_R3c_POSCAR (1)

1-39: Valid POSCAR test data file.

The Ga2O3 R3c structure file is properly formatted with correct atom counts, fractional coordinates, and helpful oxidation state annotations for defect complex testing.

tests/test_stenciling.py (1)

93-110: Well-structured test with proper assertions and cleanup.

The test implementation includes:

  • Proper warning handling and verification
  • Structure matching with tight tolerances
  • Minimum distance validation
  • File cleanup after assertions

Good test practices demonstrated here.

tests/test_complexes.py (5)

1-10: LGTM!

The module docstring and imports are well-organized and appropriate for a test module.


40-49: LGTM!

The helper function correctly handles file and directory removal with appropriate existence checks.


62-116: Well-designed test for periodicity-breaking edge case.

This test effectively validates that the defect multiplicities are calculated correctly even with periodicity-breaking supercells, and demonstrates where the standard pymatgen approach fails. Good test coverage!


162-252: Comprehensive molecule matching tests.

Excellent test coverage for molecule matching functionality, including translation, rotation, and permutation invariance. The tests properly validate both the custom are_equivalent_molecules function and compare it with pymatgen's KabschMatcher behavior.


254-269: LGTM!

The helper functions are well-implemented. The rotation matrix calculation is mathematically correct.

doped/utils/stenciling.py (1)

1016-1018:

Details

❓ Verification inconclusive

Resolve TODO about removed check.

The TODO comment indicates uncertainty about why a check was removed in commit 40f129f. This should be investigated and resolved.


🏁 Script executed:

#!/bin/bash
# Check git history for context on why the check was removed
git log -p --grep="40f129f" || git show 40f129f

Length of output: 0


I’ll dig deeper into the Git history to find the context for when and why _check_min_dist was removed.


🏁 Script executed:

#!/bin/bash
# Verify if commit 40f129f exists and show its details; otherwise, show recent commits touching the file
if git rev-parse --verify 40f129f >/dev/null 2>&1; then
  git show --stat 40f129f
else
  echo "⚠️ Commit 40f129f not found in history"
  echo
  echo "Listing the most recent commits affecting doped/utils/stenciling.py:"
  git log -n 20 --pretty=format:"%h  %ci  %s" -- doped/utils/stenciling.py
fi

Length of output: 274


I’ll locate the exact file path and inspect its recent Git history for any related removals of _check_min_dist.


🏁 Script executed:

#!/bin/bash
# Find the file path for stenciling.py and show its recent commit history
FILE_PATH=$(fd stenciling.py | head -n1)
if [ -n "$FILE_PATH" ]; then
  echo "Found file at: $FILE_PATH"
  echo "Showing last 20 commits touching $FILE_PATH:"
  git log -n 20 --pretty=format:"%h  %ci  %s" -- "$FILE_PATH"
else
  echo "❌ stenciling.py not found in repo"
fi

Length of output: 328


Investigate rationale behind _check_min_dist call
We couldn’t locate commit 40f129f in the current history, so it’s unclear why this check was removed. Please confirm whether _check_min_dist should be kept or can safely be dropped, then update or remove this TODO accordingly.

• File: doped/utils/stenciling.py
Lines: 1016–1018

    # TODO: This check was removed before (in `40f129f`), not sure if for a proper reason -- check and
    #  remove this note
    _check_min_dist(Structure.from_sites(bulk_outer_cell_sites), min_dist_tol, warning=False)

Comment on lines +2616 to +2639
def _get_single_valence_oxi_states(bulk_oxi_states: Structure | Composition | dict = False) -> dict:
if isinstance(bulk_oxi_states, Structure):
single_valence_oxi_states = {
el.symbol: el.oxi_state for el in bulk_oxi_states.composition.elements
}
elif isinstance(bulk_oxi_states, Composition):
single_valence_oxi_states = {el.symbol: el.oxi_state for el in bulk_oxi_states.elements}
elif isinstance(bulk_oxi_states, dict):
single_valence_oxi_states = bulk_oxi_states
else:
raise TypeError(
f"Input bulk_oxi_states must be a pymatgen Structure, Composition or dict, not "
f"{type(bulk_oxi_states)}."
)

return single_valence_oxi_states

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Type hint should match default False
The signature declares bulk_oxi_states: Structure | Composition | dict but defaults it to False (a bool). This mismatch will always trigger the TypeError arm for the default. Either include bool in the annotation or switch the default to None (and handle None explicitly) to accurately reflect allowed input types.

🤖 Prompt for AI Agents
In doped/core.py around lines 2616 to 2632, the function
_get_single_valence_oxi_states has a type hint indicating bulk_oxi_states can be
Structure, Composition, or dict, but the default value is False (a bool),
causing a type mismatch and always triggering the TypeError. To fix this, change
the default value to None and update the type hint to include Optional for these
types. Also, add explicit handling for None in the function to avoid the
TypeError for the default case.

# consistency with other parts of the code
# TODO: Useful test case could be trying to stencil split vacancies in new supercells...

def test_Se_20_Å_supercell(self):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove non-ASCII character from function name.

The function name contains the Ångström symbol (Å) which can cause issues in some environments. Consider using ASCII-only names.

-def test_Se_20_Å_supercell(self):
+def test_Se_20_A_supercell(self):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_Se_20_Å_supercell(self):
- def test_Se_20_Å_supercell(self):
+ def test_Se_20_A_supercell(self):
🧰 Tools
🪛 Ruff (0.11.9)

67-67: Function name test_Se_20_Å_supercell contains a non-ASCII character

(PLC2401)

🤖 Prompt for AI Agents
In tests/test_stenciling.py at line 67, the test function name contains a
non-ASCII character (Å). Rename the function to use only ASCII characters, for
example by replacing the Å symbol with 'A' or 'Angstrom' to ensure compatibility
across all environments.

Comment on lines 403 to 564
def test_symmetry_multiplicity_C2m_Ga2O3(self):
"""
Test symmetry and multiplicity analysis for split vacancies in C2/m
Ga2O3.
Reference values have been manually checked and stored in the info
dicts.
For C2/m Ga2O3, we have no periodicity breaking. Ga site symmetries are
all Cs, while we have Cs, C2h, C2 and C1 interstitial site symmetries.
"""
info_dict = self.C2m_Ga2O3_split_vac_info_dict
bulk_sga = get_sga(info_dict["bulk_supercell"])
bulk_supercell_symm_ops = bulk_sga.get_symmetry_operations()
bulk_prim = get_primitive_structure(info_dict["bulk_supercell"])
supercell_over_prim_factor = len(info_dict["bulk_supercell"]) / len(bulk_prim)
molecule_dict = {}

ml_calculated_split_vac_dicts = [
(energy, subdict)
for cation_dict in info_dict.values()
if isinstance(cation_dict, dict) and "split_vacancies_energy_dict" in cation_dict
for energy, subdict in cation_dict["split_vacancies_energy_dict"].items()
if self.energy_key in subdict
]

for energy, subdict in ml_calculated_split_vac_dicts:
orig_split_vacancy = generate_complex_from_defect_sites(
info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
)
orig_split_vacancy_symm = point_symmetry_from_structure(
orig_split_vacancy, info_dict["bulk_supercell"], relaxed=True, verbose=False
)
assert orig_split_vacancy_symm == subdict["unrelaxed_split_vac_symm"]

for v_idx in [1, 2]:
vac_symm = point_symmetry_from_site(
subdict[f"vac_defect_{v_idx}_site"], info_dict["bulk_supercell"]
)
assert vac_symm == "Cs" # only Cs Ga sites in C2/m Ga2O3
assert vac_symm == subdict[f"vac_{v_idx}_symm"]

int_symm = point_symmetry_from_site(subdict["interstitial_site"], info_dict["bulk_supercell"])
assert int_symm in [
"Cs",
"C2h",
"C2",
"C1",
] # unrelaxed interstitial sites in C2/m Ga2O3
assert int_symm == subdict["int_symm"]
relaxed_symm = point_symmetry_from_structure(
Structure.from_ase_atoms(subdict[self.energy_key]["struct_atoms"]),
info_dict["bulk_supercell"],
verbose=False,
)
assert relaxed_symm == subdict["relaxed_split_vac_symm"]

comp_mult = get_complex_defect_multiplicity(
bulk_supercell=info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
)
assert comp_mult == subdict["multiplicity"]
comp_mult = get_complex_defect_multiplicity(
bulk_supercell=info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
primitive_structure=bulk_prim,
supercell_symm_ops=bulk_supercell_symm_ops,
) # same answer with efficiency options
assert comp_mult == subdict["multiplicity"]
supercell_comp_mult = get_complex_defect_multiplicity(
bulk_supercell=info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
primitive_structure=bulk_prim,
supercell_symm_ops=bulk_supercell_symm_ops,
primitive_cell_multiplicity=False,
)
assert supercell_comp_mult == subdict["multiplicity"] * supercell_over_prim_factor

int_obj = Interstitial(info_dict["bulk_supercell"], subdict["interstitial_site"])
v1_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_1_site"])
v2_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_2_site"])

assert int_obj.multiplicity == subdict["int_multiplicity"] * supercell_over_prim_factor
assert v1_obj.multiplicity == subdict["vac_1_multiplicity"] * supercell_over_prim_factor
assert v2_obj.multiplicity == subdict["vac_2_multiplicity"] * supercell_over_prim_factor

point_symm_mult_dict = {
"Cs": 2,
"C2h": 1,
"C2": 2,
"C1": 4,
}
for v_idx, v_obj in enumerate([v1_obj, v2_obj]):
assert (
v_obj.multiplicity
== point_symm_mult_dict[subdict[f"vac_{v_idx+1}_symm"]] * supercell_over_prim_factor
)
assert (
int_obj.multiplicity
== point_symm_mult_dict[subdict["int_symm"]] * supercell_over_prim_factor
)

# this isn't universally true for complex defects, but is for the split vacancies:
assert (
comp_mult
<= (int_obj.multiplicity * v1_obj.multiplicity * v2_obj.multiplicity)
/ supercell_over_prim_factor**3
)

# in C2/m Ga2O3, each split vacancy has the same multiplicity as the interstitial site
assert comp_mult == int_obj.multiplicity / supercell_over_prim_factor

raw_complex_defect_sites = [
subdict["vac_defect_1_site"],
subdict["vac_defect_2_site"],
subdict["interstitial_site"],
]
molecule_dict[round(float(energy), 2)] = Molecule(
[site.species for site in raw_complex_defect_sites],
info_dict["bulk_supercell"].lattice.get_cartesian_coords(
[
site.frac_coords + raw_complex_defect_sites[0].distance_and_image(site)[1]
for site in raw_complex_defect_sites
]
),
)

# Here the `relaxed` -3538.32 structure goes to the -3540.98 structure, so can use this in future
# tests of relaxed complex defect site determinations! (TODO)
for key, mol in molecule_dict.items():
for other_key, other_mol in molecule_dict.items():
if key != other_key:
assert not are_equivalent_molecules(mol, other_mol)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider breaking down this complex test method.

With 28 local variables, this method is difficult to follow. Similar refactoring as suggested above would improve maintainability.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 403-403: Too many local variables (28/15)

(R0914)

🤖 Prompt for AI Agents
In tests/test_complexes.py from lines 403 to 541, the
test_symmetry_multiplicity_C2m_Ga2O3 method is overly complex with 28 local
variables, making it hard to follow. Refactor by splitting this large test into
smaller focused helper methods or sub-tests, each handling a specific part of
the logic such as symmetry checks, multiplicity assertions, and molecule
equivalence tests. This modular approach will improve readability and
maintainability without changing the test logic.

Comment on lines 281 to 429
def test_symmetry_multiplicity_R3c_Ga2O3(self):
"""
Test symmetry and multiplicity analysis for split vacancies in R-3c
Ga2O3.
Reference values have been manually checked and stored in the info
dicts.
For R-3c Ga2O3, we have periodicity breaking in the generated supercell
which breaks the rotational (but not inversion) symmetries, so C3i ->
Ci, C2 -> C1. Useful test case for future stenciling to avoid
periodicity breaking.
"""
info_dict = self.R3c_Ga2O3_split_vac_info_dict
bulk_sga = get_sga(info_dict["bulk_supercell"])
bulk_supercell_symm_ops = bulk_sga.get_symmetry_operations()
bulk_prim = get_primitive_structure(info_dict["bulk_supercell"])
supercell_over_prim_factor = len(info_dict["bulk_supercell"]) / len(bulk_prim)
molecule_dict = {}

ml_calculated_split_vac_dicts = [
(energy, subdict)
for cation_dict in info_dict.values()
if isinstance(cation_dict, dict) and "split_vacancies_energy_dict" in cation_dict
for energy, subdict in cation_dict["split_vacancies_energy_dict"].items()
if self.energy_key in subdict
]

for energy, subdict in ml_calculated_split_vac_dicts:
orig_split_vacancy = generate_complex_from_defect_sites(
info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
)
orig_split_vacancy_symm = point_symmetry_from_structure(
orig_split_vacancy, info_dict["bulk_supercell"], relaxed=True, verbose=False
)
assert orig_split_vacancy_symm == subdict["unrelaxed_split_vac_symm"]

for v_idx in [1, 2]:
vac_symm = point_symmetry_from_site(
subdict[f"vac_defect_{v_idx}_site"], info_dict["bulk_supercell"]
)
assert vac_symm == "C3" # only C3 Ga sites in R-3c Ga2O3
assert vac_symm == subdict[f"vac_{v_idx}_symm"]

int_symm = point_symmetry_from_site(subdict["interstitial_site"], info_dict["bulk_supercell"])
assert int_symm in [
"C2",
"C3i",
"D3",
] # unrelaxed interstitial sites in R-3c Ga2O3
assert int_symm == subdict["int_symm"]
relaxed_symm = point_symmetry_from_structure(
Structure.from_ase_atoms(subdict[self.energy_key]["struct_atoms"]),
info_dict["bulk_supercell"],
verbose=False,
)
assert relaxed_symm == subdict["relaxed_split_vac_symm"]

comp_mult = get_complex_defect_multiplicity(
bulk_supercell=info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
)
assert comp_mult == subdict["multiplicity"]
comp_mult = get_complex_defect_multiplicity(
bulk_supercell=info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
primitive_structure=bulk_prim,
supercell_symm_ops=bulk_supercell_symm_ops,
) # same answer with efficiency options
assert comp_mult == subdict["multiplicity"]
supercell_comp_mult = get_complex_defect_multiplicity(
bulk_supercell=info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
primitive_structure=bulk_prim,
supercell_symm_ops=bulk_supercell_symm_ops,
primitive_cell_multiplicity=False,
)
assert supercell_comp_mult == subdict["multiplicity"] * supercell_over_prim_factor

int_obj = Interstitial(info_dict["bulk_supercell"], subdict["interstitial_site"])
v1_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_1_site"])
v2_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_2_site"])

assert int_obj.multiplicity == subdict["int_multiplicity"] * supercell_over_prim_factor
assert v1_obj.multiplicity == subdict["vac_1_multiplicity"] * supercell_over_prim_factor
assert v2_obj.multiplicity == subdict["vac_2_multiplicity"] * supercell_over_prim_factor

# this isn't universally true for complex defects, but is for the split vacancies:
assert (
comp_mult
<= (int_obj.multiplicity * v1_obj.multiplicity * v2_obj.multiplicity)
/ supercell_over_prim_factor**3
)

raw_complex_defect_sites = [
subdict["vac_defect_1_site"],
subdict["vac_defect_2_site"],
subdict["interstitial_site"],
]
molecule_dict[round(float(energy), 2)] = Molecule(
[site.species for site in raw_complex_defect_sites],
info_dict["bulk_supercell"].lattice.get_cartesian_coords(
[
site.frac_coords + raw_complex_defect_sites[0].distance_and_image(site)[1]
for site in raw_complex_defect_sites
]
),
)

# in this case, two of the molecules are actually equivalent, with slightly different ES energies
# because of the periodicity breaking in the generated supercell:
for key, mol in molecule_dict.items():
if key in [-4245.99, -4246.28]:
assert are_equivalent_molecules(molecule_dict[-4245.99], mol)
else:
assert not are_equivalent_molecules(molecule_dict[-4245.99], mol)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Refactor to reduce method complexity.

This test method has 24 local variables (pylint warns about >15). Consider extracting helper methods to improve readability and maintainability:

  • Extract molecule creation logic into a helper method
  • Extract symmetry validation logic into a separate method
  • Consider using a dataclass or named tuple for the subdict data structure

Example refactoring approach:

def _validate_site_symmetries(self, subdict, bulk_supercell):
    """Validate vacancy and interstitial site symmetries."""
    for v_idx in [1, 2]:
        vac_symm = point_symmetry_from_site(
            subdict[f"vac_defect_{v_idx}_site"], bulk_supercell
        )
        assert vac_symm == "C3"  # only C3 Ga sites in R-3c Ga2O3
        assert vac_symm == subdict[f"vac_{v_idx}_symm"]
    
    int_symm = point_symmetry_from_site(subdict["interstitial_site"], bulk_supercell)
    assert int_symm in ["C2", "C3i", "D3"]
    assert int_symm == subdict["int_symm"]
    return int_symm

def _create_molecule_from_sites(self, raw_sites, lattice):
    """Create molecule from defect sites."""
    return Molecule(
        [site.species for site in raw_sites],
        lattice.get_cartesian_coords([
            site.frac_coords + raw_sites[0].distance_and_image(site)[1]
            for site in raw_sites
        ])
    )
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 281-281: Too many local variables (24/15)

(R0914)

🤖 Prompt for AI Agents
In tests/test_complexes.py from lines 281 to 402, the
test_symmetry_multiplicity_R3c_Ga2O3 method is overly complex with 24 local
variables, making it hard to read and maintain. Refactor by extracting the
molecule creation logic into a helper method that takes defect sites and lattice
to return a Molecule instance, and extract the symmetry validation logic into a
separate helper method that asserts vacancy and interstitial site symmetries.
Additionally, consider using a dataclass or named tuple to represent the subdict
data structure for clarity and easier access to its fields.

Comment on lines 542 to 696
def test_symmetry_multiplicity_tricky_structures(self):
"""
Test symmetry and multiplicity analysis for split vacancies in some
complex structures.
Some of these had issues with initial symmetry/multiplicity analysis
prior to robustness updates. Here we test that the analysis proceeds
fine and matches expected values.
"""
database_folder = f"{data_dir}/Split_Vacancies"
formula_mpid_files = [i for i in os.listdir(database_folder) if "Ga2O3" not in i]
for formula_mpid_file in formula_mpid_files:
info_dict = loadfn(f"{database_folder}/{formula_mpid_file}")
bulk_sga = get_sga(info_dict["bulk_supercell"])
bulk_supercell_symm_ops = bulk_sga.get_symmetry_operations()
bulk_prim = get_primitive_structure(info_dict["bulk_supercell"])
supercell_over_prim_factor = len(info_dict["bulk_supercell"]) / len(bulk_prim)
molecule_dict = {}

ml_calculated_split_vac_dicts = [
(energy, subdict)
for cation_dict in info_dict.values()
if isinstance(cation_dict, dict) and "split_vacancies_energy_dict" in cation_dict
for energy, subdict in cation_dict["split_vacancies_energy_dict"].items()
if self.energy_key in subdict
]
if "GeI2" not in formula_mpid_file and "Li3Ti4O8" not in formula_mpid_file:
# take a random subset of max 10:
ml_calculated_split_vac_dicts = random.sample(
ml_calculated_split_vac_dicts, min(10, len(ml_calculated_split_vac_dicts))
)
for energy, subdict in ml_calculated_split_vac_dicts:
orig_split_vacancy = generate_complex_from_defect_sites(
info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
)
orig_split_vacancy_symm = point_symmetry_from_structure(
orig_split_vacancy, info_dict["bulk_supercell"], relaxed=True, verbose=False
)
assert orig_split_vacancy_symm == subdict["unrelaxed_split_vac_symm"]

for v_idx in [1, 2]:
vac_symm = point_symmetry_from_site(
subdict[f"vac_defect_{v_idx}_site"], info_dict["bulk_supercell"]
)
assert vac_symm == subdict[f"vac_{v_idx}_symm"]

int_symm = point_symmetry_from_site(
subdict["interstitial_site"], info_dict["bulk_supercell"]
)
assert int_symm == subdict["int_symm"]
relaxed_symm = point_symmetry_from_structure(
Structure.from_ase_atoms(subdict[self.energy_key]["struct_atoms"]),
info_dict["bulk_supercell"],
verbose=False,
)
assert relaxed_symm == subdict["relaxed_split_vac_symm"]

comp_mult = get_complex_defect_multiplicity(
bulk_supercell=info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
) # this breaks if we get a multiplicity less than expected minimum
assert comp_mult == subdict["multiplicity"]
comp_mult = get_complex_defect_multiplicity(
bulk_supercell=info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
primitive_structure=bulk_prim,
supercell_symm_ops=bulk_supercell_symm_ops,
) # same answer with efficiency options
assert comp_mult == subdict["multiplicity"]
supercell_comp_mult = get_complex_defect_multiplicity(
bulk_supercell=info_dict["bulk_supercell"],
vacancy_sites=[subdict["vac_defect_1_site"], subdict["vac_defect_2_site"]],
interstitial_sites=[subdict["interstitial_site"]],
primitive_structure=bulk_prim,
supercell_symm_ops=bulk_supercell_symm_ops,
primitive_cell_multiplicity=False,
)
assert supercell_comp_mult == subdict["multiplicity"] * supercell_over_prim_factor

int_obj = Interstitial(info_dict["bulk_supercell"], subdict["interstitial_site"])
v1_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_1_site"])
v2_obj = Vacancy(info_dict["bulk_supercell"], subdict["vac_defect_2_site"])

assert int_obj.multiplicity == subdict["int_multiplicity"] * supercell_over_prim_factor
assert v1_obj.multiplicity == subdict["vac_1_multiplicity"] * supercell_over_prim_factor
assert v2_obj.multiplicity == subdict["vac_2_multiplicity"] * supercell_over_prim_factor

for v_idx, v_obj in enumerate([v1_obj, v2_obj]):
assert (
v_obj.multiplicity
== subdict[f"vac_{v_idx+1}_multiplicity"] * supercell_over_prim_factor
)
assert int_obj.multiplicity == subdict["int_multiplicity"] * supercell_over_prim_factor

# this isn't universally true for complex defects, but is for the split vacancies:
assert (
comp_mult
<= (int_obj.multiplicity * v1_obj.multiplicity * v2_obj.multiplicity)
/ supercell_over_prim_factor**3
)
assert comp_mult >= int_obj.multiplicity / supercell_over_prim_factor

raw_complex_defect_sites = [
subdict["vac_defect_1_site"],
subdict["vac_defect_2_site"],
subdict["interstitial_site"],
]
molecule_dict[round(float(energy), 2)] = Molecule(
[site.species for site in raw_complex_defect_sites],
info_dict["bulk_supercell"].lattice.get_cartesian_coords(
[
site.frac_coords + raw_complex_defect_sites[0].distance_and_image(site)[1]
for site in raw_complex_defect_sites
]
),
)

eq_mol_count = 0
for key, mol in molecule_dict.items():
for other_key, other_mol in molecule_dict.items():
if ( # same cations, different molecules
sorted(mol.atomic_numbers) == sorted(other_mol.atomic_numbers) and key != other_key
) and are_equivalent_molecules(mol, other_mol):
eq_mol_count += 1

if "GeI2" not in formula_mpid_file and "Li3Ti4O8" not in formula_mpid_file:
assert eq_mol_count == 0
elif "GeI2" in formula_mpid_file:
assert eq_mol_count == 8
elif "Li3Ti4O8" in formula_mpid_file:
assert eq_mol_count == 80
# Note that a lot of these have matching equivalent relaxed complexes, so would make good test
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Extract common test logic to reduce complexity.

This method has 31 local variables and significant duplication with the previous test methods. Consider creating a base test method that handles the common validation logic.

Consider creating a shared validation method:

def _validate_split_vacancy_data(self, info_dict, energy_key, expected_eq_mol_count=0):
    """Common validation logic for split vacancy data."""
    bulk_sga = get_sga(info_dict["bulk_supercell"])
    bulk_supercell_symm_ops = bulk_sga.get_symmetry_operations()
    bulk_prim = get_primitive_structure(info_dict["bulk_supercell"])
    
    # Extract and validate split vacancy data
    ml_calculated_split_vac_dicts = self._extract_ml_split_vac_dicts(info_dict, energy_key)
    molecule_dict = {}
    
    for energy, subdict in ml_calculated_split_vac_dicts:
        self._validate_single_split_vacancy(subdict, info_dict, bulk_prim, bulk_supercell_symm_ops)
        molecule_dict[round(float(energy), 2)] = self._create_molecule_from_sites(
            [subdict["vac_defect_1_site"], subdict["vac_defect_2_site"], subdict["interstitial_site"]],
            info_dict["bulk_supercell"].lattice
        )
    
    # Validate molecule equivalences
    eq_mol_count = self._count_equivalent_molecules(molecule_dict)
    assert eq_mol_count == expected_eq_mol_count
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 542-542: Too many local variables (31/15)

(R0914)

🤖 Prompt for AI Agents
In tests/test_complexes.py lines 542 to 677, the
test_symmetry_multiplicity_tricky_structures method is overly complex with many
local variables and duplicated logic from other tests. Refactor by extracting
the common validation logic into a separate helper method, for example
_validate_split_vacancy_data, which takes info_dict, energy_key, and
expected_eq_mol_count as parameters. Move the symmetry, multiplicity, and
molecule equivalence checks into this helper, and call it from the test method
to reduce duplication and improve readability.

Comment on lines 366 to 432
for frac_coords in get_equiv_frac_coords_in_primitive(
site.frac_coords,
primitive_structure,
bulk_supercell,
symm_ops=supercell_symm_ops, # only small speed up (due to caching)
symprec=symprec,
dist_tol=dist_tol,
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Verify that get_equiv_frac_coords_in_primitive accepts the keyword arguments.

Static analysis indicates that symm_ops and dist_tol may be unexpected keyword arguments for this function.


🏁 Script executed:

#!/bin/bash
# Search for the function definition to verify accepted parameters
ast-grep --pattern 'def get_equiv_frac_coords_in_primitive($$$)'

Length of output: 15931


Invalid keyword arguments passed to get_equiv_frac_coords_in_primitive

The signature of get_equiv_frac_coords_in_primitive (see doped/utils/symmetry.py:1223) is:

def get_equiv_frac_coords_in_primitive(
    frac_coords,
    primitive,
    supercell,
    symprec=0.01,
    dist_tol_factor=1.0,
    equiv_coords=True,
    return_symprec_and_dist_tol_factor=False,
    fixed_symprec_and_dist_tol_factor=False,
    verbose=False,
) -> ...

It does not accept symm_ops or dist_tol. Please update the call in doped/complexes.py accordingly:

  • Remove the symm_ops=supercell_symm_ops argument.
  • Replace dist_tol=… with dist_tol_factor=… (or drop it if you intend to use the default).

Location:

  • File: doped/complexes.py
  • Lines: ~366–373

Suggested diff:

- for frac_coords in get_equiv_frac_coords_in_primitive(
-     site.frac_coords,
-     primitive_structure,
-     bulk_supercell,
-     symm_ops=supercell_symm_ops,
-     symprec=symprec,
-     dist_tol=dist_tol,
- )
+ for frac_coords in get_equiv_frac_coords_in_primitive(
+     site.frac_coords,
+     primitive_structure,
+     bulk_supercell,
+     symprec=symprec,
+     dist_tol_factor=dist_tol_factor,  # if you want to override the default
+ )
🧰 Tools
🪛 Pylint (3.3.7)

[error] 366-373: Unexpected keyword argument 'symm_ops' in function call

(E1123)


[error] 366-373: Unexpected keyword argument 'dist_tol' in function call

(E1123)

🤖 Prompt for AI Agents
In doped/complexes.py around lines 366 to 373, the call to
get_equiv_frac_coords_in_primitive incorrectly passes symm_ops and dist_tol as
keyword arguments, which are not accepted by the function. Remove the
symm_ops=supercell_symm_ops argument entirely and replace dist_tol=dist_tol with
dist_tol_factor=dist_tol if you want to specify it, or omit it to use the
default value. Adjust the call accordingly to match the function signature in
doped/utils/symmetry.py.


def _convert_defect_neighbours_to_X(
defect_supercell: Structure,
defect_position: np.ndarray[float],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix all remaining numpy array type hints.

Multiple functions have the same invalid numpy type hint syntax.

Apply this pattern throughout the file:

# Line 576
-    defect_position: np.ndarray[float],
+    defect_position: np.ndarray,

# Line 1025  
-    defect_frac_coords: np.ndarray[float] | list[float] | None = None,
+    defect_frac_coords: np.ndarray | list[float] | None = None,

# Line 1097
-    lattice: Lattice, cart_coords: np.ndarray[float] | list[float], tol: float = 1e-5
+    lattice: Lattice, cart_coords: np.ndarray | list[float], tol: float = 1e-5

Also applies to: 1025-1025, 1097-1097

🧰 Tools
🪛 Pylint (3.3.7)

[error] 576-576: Value 'np.ndarray' is unsubscriptable

(E1136)

🤖 Prompt for AI Agents
In doped/utils/stenciling.py at lines 576, 1025, and 1097, the numpy array type
hints use an invalid syntax like np.ndarray[float]. Replace these with the
correct typing syntax by importing numpy.typing as npt and using npt.NDArray
with the appropriate dtype, for example npt.NDArray[np.float64]. Update all such
type hints in these lines accordingly.

Comment on lines 44 to 339
def get_defect_in_supercell(
defect_entry: DefectEntry,
target_supercell: Structure,
check_bulk: bool = True,
target_frac_coords: np.ndarray[float] | list[float] | bool = True,
edge_tol: float = 1,
min_dist_tol_factor: float = 0.99,
) -> tuple[Structure, Structure]:
"""
Re-generate a relaxed defect structure in a different supercell.
This function takes the relaxed defect structure of the input
``DefectEntry`` (from ``DefectEntry.defect_supercell``) and re-generates it
in the ``target_supercell`` structure, and the closest possible position to
``target_frac_coords`` (default is the supercell centre = [0.5, 0.5, 0.5]),
also providing the corresponding bulk supercell (which should be the same
for each generated defect supercell given the same ``target_supercell`` and
base supercell for ``defect_entry``, see note below).
``target_supercell`` should be the same host crystal structure, just with
different supercell dimensions, having the same lattice parameters and bond
lengths.
Note: This function does `not` guarantee that the generated defect
supercell atomic position basis exactly matches that of
``target_supercell``, which may have come from a different primitive
structure definition (e.g. CdTe with
``{"Cd": [0,0,0], "Te": [0.25,0.25,0.25]}`` vs
``{"Cd": [0,0,0], "Te": [0.75,0.75,0.75]}``). The generated supercell
`will` have the exact same lattice/cell definition with fully
symmetry-equivalent atom positions, but if the actual position basis
differs then this can cause issues with parsing finite-size corrections
(which rely on site-matched potentials). This is perfectly fine if it
occurs, just will require the use of a matching bulk/reference supercell
when parsing (rather than the input ``target_supercell``) -- ``doped`` will
also throw a warning about this when parsing if a non-matching bulk
supercell is used anyway. This function will automatically check if the
position basis in the generated supercell differs from that of
``target_supercell``, printing a warning if so (unless ``check_bulk`` is
``False``) and returning the corresponding bulk supercell which should be
used for parsing defect calculations with the generated supercell. Of
course, if generating multiple defects in the same ``target_supercell``,
only one such bulk supercell calculation should be required (should
correspond to the same bulk supercell in each case).
Briefly, this function works by:
- Translating the defect site to the centre of the original supercell.
- Identifying a super-supercell which fully encompasses the target
supercell (regardless of orientation).
- Generate this super-supercell, using one copy of the original defect
supercell (``DefectEntry.defect_supercell``), and the rest of the sites
(outside of the original defect supercell box, with the defect translated
to the centre) are populated using the bulk supercell
(``DefectEntry.bulk_supercell``).
- Translate the defect site in this super-supercell to the Cartesian
coordinates of the centre of ``target_supercell``, then stencil out all
sites in the ``target_supercell`` portion of the super-supercell,
accounting for possible site displacements in the relaxed defect
supercell (e.g. if ``target_supercell`` has a different shape and does
not fully encompass the original defect supercell).
This is done by scanning over possible combinations of sites near the
boundary regions of the ``target_supercell`` portion, and identifying the
combination which maximises the minimum inter-atomic distance in the new
supercell (i.e. the most bulk-like arrangement).
- Re-orient this new stenciled supercell to match the orientation and site
positions of ``target_supercell``.
- If ``target_frac_coords`` is not ``False``, scan over all symmetry
operations of ``target_supercell`` and apply that which places the defect
site closest to ``target_frac_coords``.
Args:
defect_entry (DefectEntry):
A ``DefectEntry`` object for which to re-generate the relaxed
structure (taken from ``DefectEntry.defect_supercell``) in the
``target_supercell`` lattice.
target_supercell (Structure):
The supercell structure to re-generate the relaxed defect structure
in.
check_bulk (bool):
Whether to check if the generated defect/bulk supercells have
different atomic position bases to ``target_supercell`` (as
described above) -- if so, a warning will be printed (unless
``check_bulk`` is ``False``). Default is ``True``.
target_frac_coords (Union[np.ndarray[float], list[float], bool]):
The fractional coordinates to target for defect placement in the
new supercell. If just set to ``True`` (default), will try to place
the defect nearest to the centre of the superset cell (i.e.
``target_frac_coords = [0.5, 0.5, 0.5]``), as is default in
``doped`` defect generation. Note that defect placement is harder
in this case than in generation with ``DefectsGenerator``, as we
are not starting from primitive cells and we are working with
relaxed geometries.
edge_tol (float):
A tolerance (in Angstrom) for site displacements at the edge of the
stenciled supercell, when determining the best match of sites to
stencil out in the new supercell (of ``target_supercell``
dimension). Default is 1 Angstrom, and then this is sequentially
increased up to 4.5 Angstrom if the initial scan fails.
min_dist_tol_factor (float):
Tolerance factor when checking the minimum interatomic distance in
the stenciled defect supercell, as a factor of the minimum distance
in the original ``DefectEntry.defect_supercell``. Default is 0.99.
Returns:
tuple[Structure, Structure]:
The re-generated defect supercell in the ``target_supercell``
lattice, and the corresponding bulk/reference supercell for the
generated defect supercell (see explanations above).
"""
# Note to self; using Pycharm breakpoints throughout is likely easiest for debugging (TODO)
# TODO: Tests!! (At least one of each defect, Se good test case, then at least one or two with
# >unary compositions and extrinsic substitution/interstitial)
# TODO: We should now be able to use these functions (without the final re-orientation step,
# for speed) to determine the point symmetries of relaxed defects in non-symmetry-conserving
# supercells, by stenciling into a small symmetry-conserving cell and getting the point symmetry
# for that -- will do!
# TODO: When releasing, note that this code is useful for supercell size convergence tests, and
# accelerating `ShakeNBreak` etc. If/when adding, make sure to link in `SnB` docs as well. Should have
# some notes in docs/tutorials about this functionality, maybe with a quick example (e.g. Se results).
# Likewise, the displacements functions are useful as a validation / check of supercell size
# convergence, and for quantifying the strain / distortion introduced by a certain defect

pbar = tqdm(
total=100, bar_format="{desc}{percentage:.1f}%|{bar}| [{elapsed}, {rate_fmt}{postfix}]"
) # tqdm progress bar. 100% is completion
pbar.set_description("Getting super-supercell (relaxed defect + bulk sites)")

bulk_mismatch_warning = False

try:
orig_supercell = _get_defect_supercell(defect_entry)
orig_min_dist = min_dist(orig_supercell)
min_dist_tol = orig_min_dist * min_dist_tol_factor
orig_bulk_supercell = _get_bulk_supercell(defect_entry)
orig_defect_frac_coords = defect_entry.sc_defect_frac_coords
target_supercell = target_supercell.copy()
bulk_min_bond_length = min_dist(orig_bulk_supercell)
bulk_min_dist_tol = bulk_min_bond_length * min_dist_tol_factor
target_frac_coords = [0.5, 0.5, 0.5] if target_frac_coords is True else target_frac_coords

# ensure no oxidation states (for easy composition matching later)
for struct in [orig_supercell, orig_bulk_supercell, target_supercell]:
struct.remove_oxidation_states()

# first translate both orig supercells to put defect in the middle, to aid initial stenciling:
orig_def_to_centre = np.array([0.5, 0.5, 0.5]) - orig_defect_frac_coords
orig_supercell = translate_structure(orig_supercell, orig_def_to_centre, frac_coords=True)
trans_orig_bulk_supercell = translate_structure(
orig_bulk_supercell, orig_def_to_centre, frac_coords=True
)

# get big_supercell, which is expanded version of orig_supercell so that _each_ lattice vector is
# now bigger than the _largest_ lattice vector in target_supercell (so that target_supercell is
# fully encompassed by big_supercell):
superset_matrix, big_supercell, big_supercell_with_X, orig_supercell_with_X = ( # supa-set!!
_get_superset_matrix_and_supercells(orig_supercell, target_supercell, [0.5, 0.5, 0.5])
)
big_bulk_supercell = orig_bulk_supercell * superset_matrix # get big bulk supercell

# this big_supercell is with the defect now repeated in it, but we want it with just one defect,
# so only take the first repeated defect supercell, and then the rest of the sites from the
# expanded bulk supercell:
# only keep atoms in big supercell which are within the original supercell bounds:
big_defect_supercell = _get_matching_sites_from_s1_then_s2(
orig_supercell_with_X,
big_supercell_with_X,
trans_orig_bulk_supercell * superset_matrix,
min_dist_tol,
)

# translate structure to put defect at the centre of the big supercell (w/frac_coords)
big_supercell_defect_site = next(s for s in big_defect_supercell.sites if s.specie.symbol == "X")
def_to_centre = np.array([0.5, 0.5, 0.5]) - big_supercell_defect_site.frac_coords
big_defect_supercell = translate_structure(big_defect_supercell, def_to_centre, frac_coords=True)

pbar.update(20) # 20% of progress bar
pbar.set_description("Getting sites in border region")

# get all atoms in big supercell within the cartesian bounds of the target_supercell supercell:
new_defect_supercell = _stencil_target_cell_from_big_cell(
big_defect_supercell,
target_supercell,
bulk_min_bond_length=bulk_min_bond_length,
min_dist_tol=min_dist_tol,
edge_tol=edge_tol,
pbar=pbar,
)
new_bulk_supercell = _stencil_target_cell_from_big_cell(
big_bulk_supercell,
target_supercell,
bulk_min_bond_length=bulk_min_bond_length,
min_dist_tol=bulk_min_dist_tol,
edge_tol=1e-3,
pbar=None,
) # shouldn't need `edge_tol`, should be much faster than defect supercell stencil

pbar.update(15) # 55% of progress bar
pbar.set_description("Ensuring matching orientation w/target_supercell")

# now we just do get s2 like s1 to get the orientation right:
defect_site = next(site for site in new_defect_supercell if site.specie.symbol == "X")

# Note: this function is typically the main bottleneck in this workflow. We have already
# optimised the underlying ``StructureMatcher`` workflow in many ways (caching,
# fast structure/site/composition comparisons, skipping comparison of defect neighbourhood to
# reduce requisite ``stol`` etc; being many orders of magnitude faster than the base
# ``pymatgen`` ``StructureMatcher``), however the ``_cart_dists()`` function call is
# still quite expensive, especially with large structures with significant noise in the atomic
# positions...
new_defect_supercell_w_defect_neighbours_as_X = _convert_defect_neighbours_to_X(
new_defect_supercell, defect_site.frac_coords, coords_are_cartesian=False
)
with warnings.catch_warnings():
warnings.filterwarnings("ignore", message="Not all sites have property orig_species")
# first we orient the generated _bulk_ supercell to match the ``target_supercell``,
# to try ensure consistency in the generated supercells
oriented_new_bulk_supercell = orient_s2_like_s1( # speed should be >= defect orienting
target_supercell,
new_bulk_supercell,
verbose=False,
allow_subset=True,
)
# return new_defect_supercell, _round_struct_coords(oriented_new_bulk_supercell,
# dist_precision=0.01, to_unit_cell=True)
oriented_new_defect_supercell_w_defect_neighbours_as_X = orient_s2_like_s1(
oriented_new_bulk_supercell,
new_defect_supercell_w_defect_neighbours_as_X,
verbose=False,
ignored_species=["X"], # ignore X site
allow_subset=True, # allow defect supercell composition to differ
)
oriented_new_defect_supercell = _convert_X_back_to_orig_species(
oriented_new_defect_supercell_w_defect_neighbours_as_X
)
oriented_new_defect_site = next( # get defect site and remove X from sites
oriented_new_defect_supercell.pop(i)
for i, site in enumerate(oriented_new_defect_supercell.sites)
if site.specie.symbol == "X"
)

pbar.update(35) # 90% of progress bar

if target_frac_coords is not False:
pbar.set_description("Placing defect closest to target_frac_coords")

target_symm_op = _scan_symm_ops_to_place_site_closest_to_frac_coords(
target_supercell, oriented_new_defect_site, target_frac_coords
)
oriented_new_defect_supercell = get_clean_structure(
apply_symm_op_to_struct( # apply symm_op to structure
target_symm_op,
oriented_new_defect_supercell,
fractional=True,
rotate_lattice=False,
),
) # reordered inputs in updated doped
if check_bulk:
bulk_mismatch_warning = not check_atom_mapping_far_from_defect(
target_supercell,
oriented_new_bulk_supercell,
oriented_new_defect_site.frac_coords,
warning=False,
)

oriented_new_bulk_supercell = get_clean_structure(
apply_symm_op_to_struct(
target_symm_op,
oriented_new_bulk_supercell,
fractional=True,
rotate_lattice=False,
),
)

pbar.update(pbar.total - pbar.n) # set to 100% of progress bar

except Exception as e:
pbar.close()
raise e

finally:
pbar.close()

if bulk_mismatch_warning: # print warning after closing pbar; cleaner
warnings.warn(
"Note that the atomic position basis of the generated defect/bulk supercell "
"differs from that of the ``target_supercell``. This is likely fine, "
"and just due to differences in (symmetry-equivalent) primitive cell definitions "
"(e.g. {'Cd': [0,0,0], 'Te': [0.25,0.25,0.25]} vs "
"{'Cd(': [0,0,0], 'Te': [0.75,0.75,0.75]}``) -- see ``get_defect_in_supercell`` "
"docstring for more info. For accurate finite-size charge corrections when "
"parsing, a matching bulk and defect supercell should be used, and so the "
"matching bulk supercell for the generated defect supercell (also returned "
"by this function) should be used for its reference host cell calculation.",
)

_check_min_dist(oriented_new_defect_supercell, min_dist_tol) # check distances are reasonable
_check_min_dist(oriented_new_bulk_supercell, bulk_min_dist_tol)
return oriented_new_defect_supercell, oriented_new_bulk_supercell

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider breaking down this complex function.

With 56 statements and 36 local variables, this function is difficult to understand and maintain. The extensive inline comments and progress bar updates suggest natural boundaries for extraction.

Consider extracting these logical sections into separate functions:

  1. Super-supercell generation (lines 175-220)
  2. Stenciling logic (lines 224-240)
  3. Orientation matching (lines 245-284)
  4. Defect placement (lines 287-317)

This would improve testability and make the overall flow clearer.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 44-44: Too many arguments (6/5)

(R0913)


[refactor] 44-44: Too many positional arguments (6/5)

(R0917)


[refactor] 44-44: Too many local variables (36/15)

(R0914)


[error] 48-48: Value 'np.ndarray' is unsubscriptable

(E1136)


[refactor] 44-44: Too many statements (56/50)

(R0915)

🤖 Prompt for AI Agents
In doped/utils/stenciling.py around lines 44 to 343, the get_defect_in_supercell
function is overly long and complex with 56 statements and 36 local variables,
making it hard to understand and maintain. Refactor by extracting the following
logical sections into separate helper functions: super-supercell generation
(lines 175-220), stenciling logic (lines 224-240), orientation matching (lines
245-284), and defect placement (lines 287-317). Replace the corresponding code
blocks in the main function with calls to these new functions to improve
readability, testability, and maintainability.

def _scan_symm_ops_to_place_site_closest_to_frac_coords(
symm_ops: Structure | Sequence[SymmOp],
site: PeriodicSite,
target_frac_coords: np.ndarray[float] | list[float] | None = None,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix invalid numpy array type hint syntax.

Same issue as above with invalid numpy type hint.

-    target_frac_coords: np.ndarray[float] | list[float] | None = None,
+    target_frac_coords: np.ndarray | list[float] | None = None,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
target_frac_coords: np.ndarray[float] | list[float] | None = None,
target_frac_coords: np.ndarray | list[float] | None = None,
🧰 Tools
🪛 Pylint (3.3.7)

[error] 348-348: Value 'np.ndarray' is unsubscriptable

(E1136)

🤖 Prompt for AI Agents
In doped/utils/stenciling.py at line 348, the type hint for a numpy array is
incorrectly specified as np.ndarray[float]. Replace this with the correct syntax
using numpy.typing.NDArray, such as numpy.typing.NDArray[np.float64], or use a
more general type hint compatible with numpy arrays. Adjust the import
statements if necessary to include numpy.typing for proper type hinting.

defect_entry: DefectEntry,
target_supercell: Structure,
check_bulk: bool = True,
target_frac_coords: np.ndarray[float] | list[float] | bool = True,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix invalid numpy array type hint syntax.

The type hint np.ndarray[float] is not valid Python syntax. NumPy arrays don't support direct subscript notation in type hints.

-    target_frac_coords: np.ndarray[float] | list[float] | bool = True,
+    target_frac_coords: np.ndarray | list[float] | bool = True,

For more specific typing, consider using numpy.typing.NDArray[np.float64] with appropriate imports.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
target_frac_coords: np.ndarray[float] | list[float] | bool = True,
target_frac_coords: np.ndarray | list[float] | bool = True,
🧰 Tools
🪛 Pylint (3.3.7)

[error] 48-48: Value 'np.ndarray' is unsubscriptable

(E1136)

🤖 Prompt for AI Agents
In doped/utils/stenciling.py at line 48, the type hint np.ndarray[float] is
invalid syntax. Replace it with numpy.typing.NDArray[np.float64] after importing
NDArray from numpy.typing and np.float64 from numpy to correctly specify the
array type in the type hint.

@kavanase kavanase force-pushed the split_vacancies branch 2 times, most recently from b304253 to f0d7be7 Compare June 26, 2025 22:21
…defect_type_and_site_indices` and move unrelaxed defect structure code out to `defect_from_structures`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants