Commit 2db49af

Merge branch 'materials-genomics-content' into mg-merge
2 parents: 59789eb + 6d0a667; commit 2db49af

17 files changed: +567 -107 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -178,3 +178,4 @@ cython_debug/
178 178
.pypirc
179 179

180 180
/.quarto/
181+
.worktrees/

materials_genomics/01_intro/unit1_content_50slides.md

Lines changed: 14 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -3,16 +3,23 @@
3 3
## Unit theme
4 4
**Materials data as a design space: from datasets to discovery loops**
5 5

6-
## Core source mapping (book-priority aligned)
7-
- **Neuer (2024)**: model framing, explainability, criticism of pure black-box use.
8-
- **Sandfeld (2024)**: Ch. 2.1–2.3 (data science in materials, domain knowledge, data→information→knowledge).
9-
- **McClarren (2021)**: Ch. 1 (ML landscape; validation basics; Bayesian view).
10-
- **Murphy (2012)**: Ch. 1 (task definitions, model selection language).
11-
- **Bishop (2006)**: Ch. 1 (probabilistic framing and model-selection mindset).
6+
## Book-backed content summary (for this unit)
7+
- Materials genomics treats composition and crystal structure as a searchable design space rather than a fixed list of known compounds.
8+
- Databases and simulations provide candidate structures, target properties, and method metadata that together define the learning problem.
9+
- Model choice in scientific discovery must remain tied to domain knowledge, uncertainty, and explainability rather than pure benchmark accuracy.
10+
- Materials discovery tasks combine regression, ranking, classification, and screening logic within one validation-aware workflow.
11+
- Provenance, dataset bias, and leakage determine whether a discovery claim is scientifically defensible.
12+
13+
## Source anchors used
14+
- Neuer 1.1-1.3
15+
- Sandfeld 2.1-2.3
16+
- McClarren Ch1
17+
- Murphy Ch1
18+
- Bishop Ch1
12 19

13 20
---
14 21

15-
## Slide-by-slide content (target: 50)
22+
## 50-slide scaffold
16 23

17 24
### Block A — Why materials genomics? (Slides 1–8)
18 25
1. **Title + course role in program**

materials_genomics/01_intro/unit1_plan.md

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,3 +39,36 @@ By the end of this unit, students can:
39 39
## Assessment alignment
40 40
- Written exam: conceptual precision (not coding trivia)
41 41
- Students should be able to defend model/data choices scientifically.
42+
43+
## Required chapter files
44+
- Neuer:
45+
- `neuer-machine-learning-for-engineers/markdown/01-data-as-the-basis-of-models.qmd` (1.1-1.3)
46+
- Sandfeld:
47+
- `sandfeld-materials-data-science/markdown/04-part-i-introduction-and-foundations.qmd` (2.1-2.3)
48+
- McClarren:
49+
- `mcclarren-machine-learning-for-engineers/markdown/01-the-landscape-of-machine-learning-supervised-and-unsupervised-learning-optimization-and-other-topics.qmd`
50+
- Bishop:
51+
- `bishop-pattern-recognition-and-machine-learning-2006/markdown/05-introduction.qmd`
52+
- Murphy:
53+
- `murphy-machine-learning-a-probabilistic-perspective-2012/markdown/08-introduction.qmd`
54+
55+
## Cross-book summary target
56+
- Start from Neuer's distinction between white-box, grey-box, and black-box models and explain why materials genomics cannot treat screening as a pure black-box exercise.
57+
- Use Sandfeld to define materials data science as a domain-knowledge-guided workflow from data to information to knowledge.
58+
- Use McClarren, Bishop, and Murphy only to stabilize the ML vocabulary: task definition, model selection, validation, and scientific interpretation.
59+
- Keep the focus on databases, discovery loops, and validity criteria rather than on algorithm derivations.
60+
- Exclude detailed probability theory and optimization proofs; they belong to MFML.
61+
62+
## 50-slide strategy
63+
- Slides 1-8: course position, genomics analogy, learning objectives, discovery bottleneck.
64+
- Slides 9-18: design-space framing, PSPP graph, targets, surrogate-model logic.
65+
- Slides 19-30: database landscape, data objects, provenance, bias, and leakage.
66+
- Slides 31-41: regression/classification/ranking tasks, grouped validation, uncertainty-aware decisions.
67+
- Slides 42-50: exercise handoff, reporting checklist, exam-relevant summary statements.
68+
69+
## Website summary update
70+
- Heading: `#### Week 1 – What is Materials Genomics? (14.04.2026)`
71+
- Add a short summary emphasizing:
72+
- materials genomes as searchable composition-structure spaces,
73+
- databases plus simulations as the data substrate,
74+
- the need for validation, uncertainty, and domain knowledge in discovery claims.

materials_genomics/02_crystal_structure_fundamentals/unit2_content_50slides.md

Lines changed: 16 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,20 @@
1-
# Materials Genomics Unit 2 — 50-Slide Scaffold Pack
1+
# Materials Genomics Unit 2 — 50-Slide Teaching Scaffold (book-backed)
2 2

3-
## Slide-by-slide scaffold
3+
## Book-backed content summary (for this unit)
4+
- Crystal structures become ML-ready only after careful choices about lattice, basis, coordinate systems, and periodic representation.
5+
- Symmetry reduces redundant degrees of freedom but also creates canonicalization and leakage challenges when equivalent structures appear multiple times.
6+
- CIF-like containers mix geometry, chemistry, and metadata; these fields must be parsed into structured representations without discarding provenance.
7+
- Low-dimensional organization of structural data helps students see why crystal families and prototypes cluster, but these projections are not substitutes for crystallographic reasoning.
8+
- The unit prepares students for descriptor and graph models by clarifying what information a crystal representation must preserve.
9+
10+
## Source anchors used
11+
- Neuer 1.2.3, 1.2.7, 5.2
12+
- Sandfeld 3.3
13+
- McClarren Ch4
14+
- Bishop Ch12
15+
- Murphy Ch19
16+
17+
## 50-slide scaffold
4 18

5 19
1. **Title: Crystal Structure Fundamentals**
6 20
- Unit scope and role.

materials_genomics/02_crystal_structure_fundamentals/unit2_plan.md

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,3 +35,37 @@ By the end of Unit 2 students can:
35 35
2. Sandfeld: materials data science + domain integration
36 36
3. McClarren: practical ML task structure
37 37
4. Murphy/Bishop: validation and probabilistic interpretation
38+
39+
## Required chapter files
40+
- Neuer:
41+
- `neuer-machine-learning-for-engineers/markdown/01-data-as-the-basis-of-models.qmd` (1.2.3, 1.2.7)
42+
- `neuer-machine-learning-for-engineers/markdown/05-unsupervised-learning.qmd` (5.2)
43+
- Sandfeld:
44+
- `sandfeld-materials-data-science/markdown/04-part-i-introduction-and-foundations.qmd` (3.3, structured/tabular data)
45+
- McClarren:
46+
- `mcclarren-machine-learning-for-engineers/markdown/04-finding-structure-within-a-data-set-data-reduction-and-clustering.qmd`
47+
- Bishop:
48+
- `bishop-pattern-recognition-and-machine-learning-2006/markdown/16-continuous-latent-variables.qmd`
49+
- Murphy:
50+
- `murphy-machine-learning-a-probabilistic-perspective-2012/markdown/19-latent-linear-models.qmd`
51+
52+
## Cross-book summary target
53+
- Use Sandfeld and Neuer to explain how crystal structures must be represented as structured data objects before any ML method can act on them.
54+
- Use McClarren, Murphy, and Bishop only to motivate low-dimensional structure, covariance views, and representation choices, not to teach crystallography itself.
55+
- Keep the domain core on lattice, basis, periodicity, symmetry, and invariance requirements for ML-ready crystal data.
56+
- Explain why representation choices change both model accuracy and leakage risk.
57+
- Exclude formal derivations of PCA and latent-variable models; students already meet the mathematics in MFML.
58+
59+
## 50-slide strategy
60+
- Slides 1-10: structural vocabulary, lattices, bases, unit cells, coordinate systems.
61+
- Slides 11-22: primitive vs conventional cells, symmetry, periodicity, invariances.
62+
- Slides 23-34: CIF/POSCAR-style encodings, tabularization, low-dimensional structure intuition.
63+
- Slides 35-44: data-quality risks, polymorph/prototype leakage, grouped splits.
64+
- Slides 45-50: CIF-to-feature exercise setup and exam checklist.
65+
66+
## Website summary update
67+
- Heading: `#### Week 2 – Simulation methods as data generators (21.04.2026)`
68+
- Add or revise the summary so Week 2 bridges simulation outputs to ML-ready crystal representations:
69+
- structures as data objects with periodic constraints,
70+
- symmetry and coordinate choices,
71+
- low-dimensional organization of crystal data.

materials_genomics/03_materials_databases/unit3_plan.md

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,3 +28,37 @@ By the end of Unit 3, students can:
28 28
- compare two methodological choices under identical split protocol
29 29
- perform one structured failure analysis and mitigation proposal
30 30
- produce a short report with claims, evidence, and limitations
31+
32+
## Required chapter files
33+
- Neuer:
34+
- `neuer-machine-learning-for-engineers/markdown/01-data-as-the-basis-of-models.qmd` (1.3)
35+
- `neuer-machine-learning-for-engineers/markdown/04-supervised-learning.qmd` (4.2.2, 4.2.3, 4.4.1)
36+
- Sandfeld:
37+
- `sandfeld-materials-data-science/markdown/04-part-i-introduction-and-foundations.qmd` (2.2, 4.5)
38+
- McClarren:
39+
- `mcclarren-machine-learning-for-engineers/markdown/01-the-landscape-of-machine-learning-supervised-and-unsupervised-learning-optimization-and-other-topics.qmd`
40+
- Bishop:
41+
- `bishop-pattern-recognition-and-machine-learning-2006/markdown/07-linear-models-for-regression.qmd`
42+
- Murphy:
43+
- `murphy-machine-learning-a-probabilistic-perspective-2012/markdown/14-linear-regression.qmd`
44+
45+
## Cross-book summary target
46+
- Use Neuer to frame database targets as supervised-learning objects with explicit train/test logic and careful normalization.
47+
- Use Sandfeld to connect materials datasets to provenance, domain context, and structured records rather than isolated numbers.
48+
- Use Bishop and Murphy selectively for regression language around targets, residuals, and fair comparison.
49+
- Keep the unit centered on real materials records: composition, structure, calculation metadata, formation energy, and energy above hull.
50+
- Exclude extensive regression derivations; the priority is scientific meaning of database fields, convex-hull logic, and reproducible queries.
51+
52+
## 50-slide strategy
53+
- Slides 1-8: database ecosystem and why database choice changes conclusions.
54+
- Slides 9-20: schema elements, identifiers, composition/structure/provenance fields.
55+
- Slides 21-32: formation energy, energy above hull, convex hulls, metastability, synthesis caveats.
56+
- Slides 33-42: confounders from functionals, cutoffs, duplicate structures, normalization choices.
57+
- Slides 43-50: API query workflow, reproducible snapshotting, exercise briefing.
58+
59+
## Website summary update
60+
- Heading: `#### Week 3 – Atomistic and electronic simulations (DFT, MD, MC) (28.04.2026)`
61+
- Add a summary emphasizing:
62+
- which atomistic simulations populate materials databases,
63+
- how thermodynamic targets are constructed from those calculations,
64+
- why method metadata and reference states matter for ML.
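
Editor's illustration of the Unit 3 thermodynamic targets (formation energy, convex hull, energy above hull): a minimal sketch for a hypothetical binary system A-B. The phase names, the energies, and the function names `lower_hull` and `energy_above_hull` are assumptions made for illustration and do not come from the course material. Real workflows build hulls in multi-component composition space with dedicated tooling; this only shows the logic in one dimension.

```python
def _cross(o, a, b):
    """2D cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(points):
    """Lower convex hull of (x_B, formation_energy) points, left to right.

    Assumes the elemental references (0.0, 0.0) and (1.0, 0.0) are included.
    """
    # Keep only the lowest-energy entry at each composition.
    best = {}
    for x, e in points:
        if x not in best or e < best[x]:
            best[x] = e
    hull = []
    for p in sorted(best.items()):
        # Drop the last hull point while it lies on or above the new chord.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def energy_above_hull(x, e_form, hull):
    """Vertical distance from (x, e_form) to the hull, same units as e_form."""
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            e_hull = e0 + (e1 - e0) * (x - x0) / (x1 - x0)
            return e_form - e_hull
    raise ValueError("composition lies outside the hull range")


# Hypothetical formation energies (arbitrary units per atom), not real data.
entries = {
    "A": (0.0, 0.0),
    "A3B": (0.25, -0.3),
    "AB": (0.5, -1.0),
    "AB3": (0.75, -0.6),
    "B": (1.0, 0.0),
}
hull = lower_hull(entries.values())
```

In this toy data, `A3B` sits above the tie line between `A` and `AB`, so it has a positive energy above hull, while `AB3` lies on the hull. The reference-state point from the plan shows up directly: shifting the elemental reference energies changes every formation energy and therefore the hull.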

materials_genomics/04_classical_descriptors/unit4_plan.md

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,3 +28,38 @@ By the end of Unit 4, students can:
28 28
- compare two methodological choices under identical split protocol
29 29
- perform one structured failure analysis and mitigation proposal
30 30
- produce a short report with claims, evidence, and limitations
31+
32+
## Required chapter files
33+
- Neuer:
34+
- `neuer-machine-learning-for-engineers/markdown/05-unsupervised-learning.qmd` (5.5)
35+
- Sandfeld:
36+
- `sandfeld-materials-data-science/markdown/04-part-i-introduction-and-foundations.qmd` (2.2, 2.4)
37+
- `sandfeld-materials-data-science/markdown/06-part-iii-classical-machine-learning.qmd` (feature matrices, regression context)
38+
- McClarren:
39+
- `mcclarren-machine-learning-for-engineers/markdown/04-finding-structure-within-a-data-set-data-reduction-and-clustering.qmd`
40+
- `mcclarren-machine-learning-for-engineers/markdown/08-unsupervised-learning-with-neural-networks-autoencoders.qmd`
41+
- Bishop:
42+
- `bishop-pattern-recognition-and-machine-learning-2006/markdown/16-continuous-latent-variables.qmd`
43+
- Murphy:
44+
- `murphy-machine-learning-a-probabilistic-perspective-2012/markdown/19-latent-linear-models.qmd`
45+
46+
## Cross-book summary target
47+
- Use Sandfeld to motivate descriptor engineering through domain knowledge, feature matrices, and curse-of-dimensionality effects.
48+
- Use Neuer plus McClarren to explain why autoencoder-style learned representations become attractive when hand-crafted features saturate.
49+
- Use Bishop and Murphy only for latent-variable intuition, not for deep theory.
50+
- Keep the materials focus on descriptor families such as Magpie and matminer, invariance requirements, and failure modes from multicollinearity or missing nonlocal physics.
51+
- Exclude architecture-specific training detail; that belongs in later neural-network units.
52+
53+
## 50-slide strategy
54+
- Slides 1-10: descriptor purpose, chemistry and structure feature families, invariance requirements.
55+
- Slides 11-22: Magpie/matminer examples, scaling, normalization, correlation, and sparsity.
56+
- Slides 23-34: where classical descriptors succeed, where they fail, and why.
57+
- Slides 35-44: transition to learned representations, latent-variable intuition, transferability.
58+
- Slides 45-50: descriptor-vs-learned-representation exercise and summary.
59+
60+
## Website summary update
61+
- Heading: `#### Week 4 – Continuum simulations, thermodynamics, and stability (05.05.2026)`
62+
- Add a summary that links stability concepts to representation choices:
63+
- descriptors as compressed carriers of chemistry and structure,
64+
- interpretable but limited hand-crafted features,
65+
- motivation for moving toward learned representations in later weeks.
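
Editor's illustration of the Unit 4 idea of descriptors as compressed carriers of chemistry: a minimal sketch of composition-weighted statistics in the spirit of Magpie-style features. It is not the actual Magpie or matminer feature set. The tiny `ATOMIC_NUMBER` table, the function names, and the choice of statistics are assumptions for illustration, and the parser handles only simple formulas without parentheses.

```python
import re

# Small illustrative lookup table; a real descriptor set would use many
# elemental properties for the full periodic table.
ATOMIC_NUMBER = {"O": 8, "Na": 11, "Cl": 17, "Ti": 22, "Fe": 26}


def parse_formula(formula):
    """Return atomic fractions for a simple formula such as 'Fe2O3'."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}


def composition_features(formula):
    """Fraction-weighted mean and range of atomic number, plus element count."""
    fractions = parse_formula(formula)
    z_values = [ATOMIC_NUMBER[s] for s in fractions]
    mean_z = sum(frac * ATOMIC_NUMBER[s] for s, frac in fractions.items())
    return {
        "mean_Z": mean_z,
        "range_Z": max(z_values) - min(z_values),
        "n_elements": len(fractions),
    }


features = composition_features("Fe2O3")
```

Because the features depend only on atomic fractions, `Fe2O3` and `Fe4O6` map to the same vector, which is one of the invariance requirements the plan mentions. The same property illustrates a limitation: polymorphs with identical composition are indistinguishable to such a descriptor.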

materials_genomics/05_graph_based_rep/unit5_plan.md

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,3 +28,36 @@ By the end of Unit 5, students can:
28 28
- compare two methodological choices under identical split protocol
29 29
- perform one structured failure analysis and mitigation proposal
30 30
- produce a short report with claims, evidence, and limitations
31+
32+
## Required chapter files
33+
- Neuer:
34+
- `neuer-machine-learning-for-engineers/markdown/04-supervised-learning.qmd` (4.5.1-4.5.5)
35+
- Sandfeld:
36+
- `sandfeld-materials-data-science/markdown/04-part-i-introduction-and-foundations.qmd` (2.2, 3.3)
37+
- McClarren:
38+
- `mcclarren-machine-learning-for-engineers/markdown/05-feed-forward-neural-networks.qmd`
39+
- Bishop:
40+
- `bishop-pattern-recognition-and-machine-learning-2006/markdown/09-neural-networks.qmd`
41+
- Murphy:
42+
- `murphy-machine-learning-a-probabilistic-perspective-2012/markdown/35-deep-learning.qmd`
43+
44+
## Cross-book summary target
45+
- Use Neuer, McClarren, Bishop, and Murphy to supply the neural-network language needed to explain message passing as a learned nonlinear update rule.
46+
- Use Sandfeld to keep the materials focus on structure encoding, domain constraints, and why graph objects are natural for crystals.
47+
- Emphasize node, edge, and global attributes; periodic boundary conditions; and readout functions for property prediction.
48+
- Compare graph models to descriptor baselines without getting lost in architecture trivia.
49+
- Exclude advanced GNN math and equivariant formalism beyond intuition.
50+
51+
## 50-slide strategy
52+
- Slides 1-10: why crystals become graphs, atoms/bonds/global state, periodicity.
53+
- Slides 11-22: cutoff construction, edge features, invariance and equivariance intuition.
54+
- Slides 23-34: message passing, CGCNN, MEGNet, SchNet-style intuition.
55+
- Slides 35-44: over-smoothing, readout choices, transferability, shortcut-learning failures.
56+
- Slides 45-50: graph-construction exercise and recap.
57+
58+
## Website summary update
59+
- Heading: `#### Week 6 – Graph-based crystal representations (19.05.2026)`
60+
- Add a summary covering:
61+
- crystals as periodic graphs,
62+
- message passing as a structure-property learning mechanism,
63+
- the role of cutoffs, invariances, and grouped evaluation.
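
Editor's illustration of the Unit 5 cutoff construction for periodic graphs: a minimal sketch that lists directed edges between atoms and their periodic images within a distance cutoff. The function name `periodic_edges`, its parameters, and the example cells are assumptions for illustration. The brute-force image search is only valid when `n_images` is large enough to cover the cutoff for the given cell; production code typically relies on a tested neighbor-list routine instead.

```python
import itertools
import math


def periodic_edges(lattice, frac_coords, cutoff, n_images=1):
    """Return directed edges (i, j, image_shift, distance) within `cutoff`.

    lattice: three lattice vectors as rows (Cartesian, same length unit as cutoff).
    frac_coords: fractional coordinates of the atoms in the cell.
    """
    def to_cart(frac):
        return tuple(
            sum(frac[k] * lattice[k][d] for k in range(3)) for d in range(3)
        )

    cart = [to_cart(f) for f in frac_coords]
    edges = []
    for shift in itertools.product(range(-n_images, n_images + 1), repeat=3):
        offset = to_cart(shift)
        for i, r_i in enumerate(cart):
            for j, r_j in enumerate(cart):
                if i == j and shift == (0, 0, 0):
                    continue  # an atom is not its own neighbor in the home cell
                image = tuple(r_j[k] + offset[k] for k in range(3))
                distance = math.dist(r_i, image)
                if distance <= cutoff:
                    edges.append((i, j, shift, distance))
    return edges


# Simple cubic cell with one atom: each atom sees six periodic images of itself.
cubic = [(3.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 3.0)]
cubic_edges = periodic_edges(cubic, [(0.0, 0.0, 0.0)], cutoff=3.1)
```

Storing the image shift on each edge keeps periodic copies distinguishable, which matters when edge features such as distances or directions feed a message-passing model. The cutoff is a modeling choice: changing it changes the graph and therefore what the model can learn.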

materials_genomics/06_local_atomic_envs/unit6_plan.md

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,3 +28,36 @@ By the end of Unit 6, students can:
28 28
- compare two methodological choices under identical split protocol
29 29
- perform one structured failure analysis and mitigation proposal
30 30
- produce a short report with claims, evidence, and limitations
31+
32+
## Required chapter files
33+
- Neuer:
34+
- `neuer-machine-learning-for-engineers/markdown/06-physics-informed-learning.qmd` (6.2, 6.3)
35+
- Sandfeld:
36+
- `sandfeld-materials-data-science/markdown/04-part-i-introduction-and-foundations.qmd` (2.2, 3.3)
37+
- McClarren:
38+
- `mcclarren-machine-learning-for-engineers/markdown/02-linear-models-for-regression-and-classification.qmd`
39+
- Bishop:
40+
- `bishop-pattern-recognition-and-machine-learning-2006/markdown/10-kernel-methods.qmd`
41+
- Murphy:
42+
- `murphy-machine-learning-a-probabilistic-perspective-2012/markdown/21-kernels.qmd`
43+
44+
## Cross-book summary target
45+
- Use Neuer and Sandfeld to stress that local-environment features are a form of domain-guided data enrichment rather than arbitrary preprocessing.
46+
- Use Bishop and Murphy to motivate kernel and similarity language for SOAP-like descriptors.
47+
- Keep the materials content on coordination numbers, Voronoi views, atom-centered descriptors, and aggregation from local to material-level features.
48+
- Show local descriptors as the bridge between interpretable classical features and learned graph representations.
49+
- Exclude full kernel derivations and spherical-harmonic details.
50+
51+
## 50-slide strategy
52+
- Slides 1-10: local vs global structure, neighbor shells, coordination environments.
53+
- Slides 11-22: bond-length/bond-angle views, Voronoi tessellations, atom-centered features.
54+
- Slides 23-34: SOAP/ACSF intuition, kernel similarity, aggregation to material-level vectors.
55+
- Slides 35-44: defects, noise sensitivity, local-environment failure modes, transfer limits.
56+
- Slides 45-50: descriptor computation exercise and recap.
57+
58+
## Website summary update
59+
- Heading: `#### Week 7 – Local atomic environments (26.05.2026)`
60+
- Add a summary covering:
61+
- local descriptors as ML-ready fingerprints,
62+
- Voronoi/SOAP intuition,
63+
- the bridge from interpretable environments to richer learned representations.

materials_genomics/07_regression_and_generalization_in_materials_data/unit7_plan.md

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,3 +28,37 @@ By the end of Unit 7, students can:
28 28
- compare two methodological choices under identical split protocol
29 29
- perform one structured failure analysis and mitigation proposal
30 30
- produce a short report with claims, evidence, and limitations
31+
32+
## Required chapter files
33+
- Neuer:
34+
- `neuer-machine-learning-for-engineers/markdown/04-supervised-learning.qmd` (4.2.2, 4.2.3, 4.5.9)
35+
- Sandfeld:
36+
- `sandfeld-materials-data-science/markdown/06-part-iii-classical-machine-learning.qmd` (12-13)
37+
- McClarren:
38+
- `mcclarren-machine-learning-for-engineers/markdown/02-linear-models-for-regression-and-classification.qmd`
39+
- `mcclarren-machine-learning-for-engineers/markdown/03-decision-trees-and-random-forests-for-regression-and-classification.qmd`
40+
- Bishop:
41+
- `bishop-pattern-recognition-and-machine-learning-2006/markdown/07-linear-models-for-regression.qmd`
42+
- Murphy:
43+
- `murphy-machine-learning-a-probabilistic-perspective-2012/markdown/14-linear-regression.qmd`
44+
45+
## Cross-book summary target
46+
- Use Neuer to define regression tasks, train/test logic, and overfitting control.
47+
- Use Sandfeld, Bishop, and Murphy for regression geometry, feature matrices, basis functions, and the bias-variance trade-off.
48+
- Use McClarren for baseline model families that work well in engineering settings.
49+
- Keep the unit centered on fair comparison of materials-property predictors under grouped chemistry-aware validation.
50+
- Exclude extended statistical proofs and Bayesian derivations; keep the discussion operational.
51+
52+
## 50-slide strategy
53+
- Slides 1-10: target selection, regression framing, baseline metrics.
54+
- Slides 11-22: linear and regularized models, basis expansions, interpretability.
55+
- Slides 23-34: tree/ensemble baselines, grouped splits, cross-validation, residual analysis.
56+
- Slides 35-44: bias-variance, overfitting signatures, OOD behavior in chemical space.
57+
- Slides 45-50: baseline-comparison exercise and summary.
58+
59+
## Website summary update
60+
- Heading: `#### Week 8 – Regression and generalization in materials data (02.06.2026)`
61+
- Add a summary covering:
62+
- regression as empirical-risk minimization for materials targets,
63+
- grouped validation and generalization gaps,
64+
- why chemistry-aware splits matter more than raw accuracy.
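
Editor's illustration of the Unit 7 grouped, chemistry-aware validation: a minimal sketch that assigns whole chemical systems to either the training or the test set, so that no system contributes to both. Grouping by element set is one possible choice assumed here; the plan does not specify the grouping key. The function names and the example formulas are hypothetical.

```python
import random
import re
from collections import defaultdict


def chemical_system(formula):
    """Group key: sorted element set, e.g. 'Fe2O3' and 'FeO' both give 'Fe-O'."""
    return "-".join(sorted(set(re.findall(r"[A-Z][a-z]?", formula))))


def grouped_split(formulas, test_fraction=0.25, seed=0):
    """Split record indices so each chemical system is entirely train or test."""
    groups = defaultdict(list)
    for index, formula in enumerate(formulas):
        groups[chemical_system(formula)].append(index)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # reproducible shuffle of group keys
    n_test = max(1, round(test_fraction * len(keys)))
    test_keys = set(keys[:n_test])

    train = sorted(i for k in keys if k not in test_keys for i in groups[k])
    test = sorted(i for k in test_keys for i in groups[k])
    return train, test


formulas = ["Fe2O3", "FeO", "Fe3O4", "NaCl", "TiO2", "Ti2O3", "MgO", "LiF"]
train_idx, test_idx = grouped_split(formulas)
```

A purely random split of the same records could place `FeO` in training and `Fe2O3` in testing, which tends to inflate apparent accuracy because the model has already seen the Fe-O system. Holding out whole groups gives a more honest estimate of performance on unseen chemistry, at the cost of a test-set size that varies with group sizes.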
