
Refactor Sem multi-group support#317

Open
alyst wants to merge 43 commits into StructuralEquationModels:devel from alyst:refactor_sem_terms

Conversation

Contributor

@alyst alyst commented Mar 9, 2026

This is the largest remaining part of #193, and it changes some interfaces.

Refactoring of the SEM types

  • AbstractLoss is the base type for all loss functions
  • SemLoss{O,I} <: AbstractLoss is the base type for all SEM losses; it now requires observed::O and implied::I fields
  • Since the SemLoss constructor is always given observed and implied (positionally), the meanstructure keyword is gone -- a loss should always respect the implied specification.
  • LossTerm is a thin wrapper around AbstractLoss that adds an optional id and an optional weight to a loss term
  • Sem is a container of LossTerm objects (accessible via loss_terms(sem), or loss_term(sem, id)), so it can handle multiple SEM terms (accessible via sem_terms(sem) -- a subset of loss_terms(sem) -- or sem_term(sem, id)).
    It replaces both the old Sem and SemEnsemble.
  • AbstractSingleSem, AbstractSemCollection and SemEnsemble are gone.
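As a rough sketch (illustrative only -- the actual package definitions carry more fields and constraints than shown here), the refactored hierarchy described above relates like:

```julia
# Illustrative sketch of the refactored hierarchy, NOT the actual package source
abstract type AbstractSem end
abstract type AbstractLoss end                   # base type for all loss functions
abstract type SemLoss{O, I} <: AbstractLoss end  # SEM losses; concrete subtypes
                                                 # carry observed::O and implied::I

# thin wrapper adding an optional id and an optional weight to any loss
struct LossTerm{L <: AbstractLoss, I <: Union{Symbol, Nothing}, W}
    loss::L
    id::I
    weight::W
end

# Sem is now just a container of loss terms,
# replacing both the old Sem and SemEnsemble
struct Sem{T <: Tuple} <: AbstractSem
    loss_terms::T
end
```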

Method changes

Multi-term SEMs can be created like:

model = Sem(
    :Pasteur => SemML(obs_g1, RAMSymbolic(specification_g1)),
    :Grant_White => SemML(obs_g2, RAM(specification_g2)),
    ...
)

Or with weights specified:

model = Sem(
    :Pasteur => SemML(obs_g1, RAMSymbolic(specification_g1)) => 0.5,
    :Grant_White => SemML(obs_g2, RAM(specification_g2)) => 0.6,
)
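Given the accessors listed above, the terms of such a model can presumably be retrieved like this (a usage sketch, not verified against the final API):

```julia
loss_terms(model)           # all LossTerm objects of the model
sem_terms(model)            # the subset of terms that are SEM losses
sem_term(model, :Pasteur)   # look up a single SEM term by its id
```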

The new Sem() and loss-term constructors rely less on keyword arguments and more on positional arguments, but some keyword support remains.

  • update_observed!() was removed. It was only used by replace_observed(),
    and otherwise in-place model modification with unclear semantics is error-prone.
  • replace_observed(sem, data) was simplified by removing support for additional keywords and the requirement to pass the SEM specification.
    It only creates a copy of the given Sem with the observed data replaced,
    but with the implied and loss definitions intact.
    Changing the observed variables is not supported -- that is use-case specific,
    and users should implement it in their own code.
  • check_single_lossfun() was renamed to check_same_semterm_type(), as
    that better describes what it does. If the check is successful, it returns the specific
    subtype of SemLoss.
  • bootstrap() and se_bootstrap() use a bootstrap!(acc::BootstrapAccumulator, ...)
    function to reduce code duplication
  • bootstrap() returns BootstrapResult{T} for better type inference
  • fit_measures() now also accepts a vector of functions, and includes CFI by default (DEFAULT_FIT_MEASURES constant)
  • test_fitmeasures() was tweaked to reduce repetitive code: it calculates the requested subset of fit measures, compares this subset against the lavaan references, and checks for measures that cannot be applied to the given loss types (SemWLS).
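For example, the updated fit_measures() interface could presumably be used like this (a sketch; the fitting call and the measure functions shown are assumed names, not verified against this PR):

```julia
# hypothetical usage; `sem_fit` and the measure functions are assumed names
fitted = sem_fit(model)
fit_measures(fitted)                   # uses DEFAULT_FIT_MEASURES (now includes CFI)
fit_measures(fitted, [AIC, BIC, CFI])  # explicit vector of measure functions
```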

@alyst alyst changed the base branch from main to devel March 9, 2026 22:48
@alyst alyst force-pushed the refactor_sem_terms branch from 3c39941 to 32cea82 Compare March 11, 2026 20:31
Alexey Stukalov and others added 3 commits March 21, 2026 11:03
- for SemImplied require spec::SemSpec as positional
- for SemLossFunction require implied argument
@alyst alyst force-pushed the refactor_sem_terms branch 2 times, most recently from eb039a2 to 88a1ff0 Compare March 23, 2026 07:33
@alyst alyst changed the title Refactor Sem mult-group support Refactor Sem multi-group support Mar 23, 2026
@alyst alyst marked this pull request as ready for review March 23, 2026 08:12
@alyst alyst force-pushed the refactor_sem_terms branch from 88a1ff0 to 0406f29 Compare March 23, 2026 17:51
Comment thread docs/src/tutorials/collection/collection.md Outdated
@alyst
Contributor Author

alyst commented Mar 24, 2026

@Maximilian-Stefan-Ernst It might be a nice idea to use Copilot for catching typos and incorrect sentences, but also potential bugs.
I cannot select Copilot as a reviewer -- I'm not exactly sure why; it may be an organization/repository-level setting, or my status in the repository.
But I'm also fine if SEM.jl is kept AI-free :)

Comment thread src/frontend/fit/fitmeasures/chi2.jl Outdated
Comment thread src/frontend/fit/fitmeasures/chi2.jl Outdated
Comment thread src/frontend/fit/fitmeasures/chi2.jl Outdated
Comment thread src/frontend/specification/Sem.jl Outdated
Comment thread src/loss/abstract.jl
@alyst alyst force-pushed the refactor_sem_terms branch from 0406f29 to 0096211 Compare March 26, 2026 00:41
@alyst alyst force-pushed the refactor_sem_terms branch from 33b9243 to 293c88b Compare March 31, 2026 00:57
@codecov

codecov Bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 73.12000% with 168 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.59%. Comparing base (1f8d2a9) to head (25dd38c).
⚠️ Report is 78 commits behind head on devel.

Files with missing lines Patch % Lines
src/frontend/specification/Sem.jl 60.82% 105 Missing ⚠️
src/frontend/fit/summary.jl 0.00% 21 Missing ⚠️
src/frontend/fit/standard_errors/bootstrap.jl 82.81% 11 Missing ⚠️
src/frontend/finite_diff.jl 60.00% 8 Missing ⚠️
src/objective_gradient_hessian.jl 84.61% 6 Missing ⚠️
src/frontend/pretty_printing.jl 0.00% 5 Missing ⚠️
src/frontend/fit/fitmeasures/fit_measures.jl 0.00% 3 Missing ⚠️
src/additional_functions/simulation.jl 90.90% 1 Missing ⚠️
src/additional_functions/start_val/start_fabin3.jl 80.00% 1 Missing ⚠️
src/frontend/fit/fitmeasures/CFI.jl 94.44% 1 Missing ⚠️
... and 6 more
Additional details and impacted files
@@            Coverage Diff             @@
##            devel     #317      +/-   ##
==========================================
+ Coverage   71.83%   72.59%   +0.75%     
==========================================
  Files          51       57       +6     
  Lines        2223     2485     +262     
==========================================
+ Hits         1597     1804     +207     
- Misses        626      681      +55     

☔ View full report in Codecov by Sentry.

@alyst
Contributor Author

alyst commented Mar 31, 2026

@Maximilian-Stefan-Ernst I've added a kwargs mechanism, and specifically a recompute_observed_state kwarg (true by default) for the replace_observed() method.
It is designed as a universal parameter (not specific to a particular SemLoss) that specifies whether the elements of the Sem object have to update the internal state associated with the observed data to match the new observed, or should preserve the state associated with the original observed.
For SemWLS, it means updating the weight matrix using the new observed.
But it looks like for bootstrap with SemWLS it has to be recompute_observed_state=true (as it was originally) for the tests to pass.

I've also fixed the observed initialization for ensemble SEMs, which fixed the CFI failure, so now all the tests pass.

Comment thread src/loss/abstract.jl
Comment on lines +60 to +72
function replace_observed(loss::SemLoss, data::Union{AbstractMatrix, DataFrame}; kwargs...)
    old_obs = SEM.observed(loss)
    new_observed =
        typeof(old_obs).name.wrapper(data = data, observed_vars = observed_vars(old_obs))
    return replace_observed(loss, new_observed; kwargs...)
end

# non-SEM loss terms are unchanged
replace_observed(loss::AbstractLoss, ::Any; kwargs...) = loss

# LossTerm: delegate to inner loss
replace_observed(term::LossTerm, data; kwargs...) =
    LossTerm(replace_observed(loss(term), data; kwargs...), id(term), weight(term))
Collaborator

Maybe we can define these defaults only for losses in the package. If someone implements a new loss, they would have to define a method for replace_observed -- this adds some overhead, but would act as a safeguard against someone calling it on their own loss term (that might need updating) and getting wrong results.

Contributor Author

@alyst alyst Apr 10, 2026

Those are nice ideas!

The default replace_observed(loss::SemLoss, ...) (L60) actually just calls the constructor for that SemLoss, which should take care of all loss-specific updates in most situations.
That's why I thought it was a good default.

But if you think replace_observed() might require special logic more often, I can update the PR:

  • L60 would be renamed to default_replace_observed(loss::SemLoss)
  • a fallback replace_observed(loss::SemLoss) would throw an error instructing the user to implement replace_observed() for the concrete type, and suggesting SEM.default_replace_observed() as a candidate in simple cases
  • for SEM.jl losses that don't need special handling (SemML etc.), replace_observed(loss::SemML, ...) = default_replace_observed(loss, ...)
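Sketched in code, the proposed fallback pattern might look roughly like this (hypothetical names taken from the bullets above; not part of the PR):

```julia
# generic implementation: rebuild the loss via its constructor (hypothetical)
default_replace_observed(loss::SemLoss, new_observed; kwargs...) =
    typeof(loss).name.wrapper(new_observed, implied(loss); kwargs...)

# fallback: force authors of new losses to make an explicit choice
replace_observed(loss::SemLoss, new_observed; kwargs...) = error(
    "replace_observed() is not implemented for $(typeof(loss)); " *
    "SEM.default_replace_observed() may be suitable in simple cases",
)

# opt-in for package losses that need no special handling
replace_observed(loss::SemML, new_observed; kwargs...) =
    default_replace_observed(loss, new_observed; kwargs...)
```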

Collaborator

That sounds great!

One thing: I think for SemML we actually can/should have a specialized method, because nothing has to be updated if the observed variables don't change (which is now disallowed), and hessianeval should stay the same for the new model - I'll make a comment.

Comment thread src/frontend/specification/Sem.jl Outdated
Comment thread src/loss/ML/ML.jl
@@ -225,22 +235,3 @@ function non_posdef_return(par)
return typemax(eltype(par))
end
end
Collaborator

Suggested change
end
end
replace_observed(loss::SemML, ::SemObserved; kwargs...) = loss

Contributor Author

Actually, all SEM loss terms (<:SemLoss) have to reference observed and implied.
Since replace_observed() replaces the observed object in the loss object, we have to create a new loss object, so the default replace_observed(loss::SemLoss, new_observed::SemObserved) method defined in abstract.jl should be fine here.

The semantics of replace_observed(model::Sem) is to create a copy of the SEM model.
Since this is a public method, it could be called in different contexts, and allowing different models to share the same subobjects can create problems (e.g. SemML can have its own internal state, preallocated matrices etc., so updating the parameters of the old and the new models may result in unexpected behaviour).
In general, replace_observed() should not return the same object.
The idea was to keep it simple at the expense of some potential overhead for the bootstrap use case (many matrices created for the bootstrap copies of the model), but Julia's GC should handle that well, and intermediate matrices are created by evaluate!() in any case.

Collaborator

Ah yes, my code suggestion does not make sense.

Re sharing subobjects

this is something we cannot really avoid at the moment, because after calling replace_observed on a model, the implied types share memory, and therefore it is not safe to fit models created this way in parallel. I think there are two options:

  1. copying everything when calling replace_observed, including the implied objects
  2. copying as little as possible and expecting the user to deepcopy models if they want to fit them in parallel.

I prefer option 2, because it gives more fine-grained control and is better for (serial) bootstrap.

Re SemML specifically

In line with the above, I would prefer having SemML not copy its internal matrices, and also keep the hessianeval from the original model.

Contributor Author

I think the design issue here is that at the moment there is no separation between storing the SEM model definition and storing the transient state required to calculate covariance matrices, objectives, gradients etc.
I.e. all intermediate matrices would be stored in SemImpliedState subtypes (managed by the specific subtype of SemImplied, e.g. RAMSymbolicState managed by RAMSymbolic etc.), keeping SemImplied objects and their fields truly immutable.
The same goes for SemLoss subtypes.
There could be pools of these states, and for calculations evaluate!() would acquire an object from the pool as needed and release it back upon completion.
So semb = replace_observed(sema, newobserved) would create a shallow copy of sema that shares its implied terms (and their state pools) with semb;
it would also create new loss terms that only differ from the old ones by newobserved, but share the pools of internal states with sema etc.
Otherwise I don't see how to cleanly manage it in the long run.
It was less of a problem before bootstrap became a more prominent feature, but for bootstrap, with its potentially parallelized computations, managing states becomes important.

Having said that, I have implemented similar approaches in the past -- they reduce the stress on the GC in terms of the number of allocated objects and the amount of memory.
But it does not result in dramatic (x2) performance improvements, because Julia's GC essentially does the same thing as the proposed design, just at a lower level:
even if replace_observed(sema) deep-copies sema, the copy is short-lived and gets efficiently collected once it is no longer used.

Collaborator

Okay, I think a clean separation between states and the rest of the model is the clean solution in the long run -- thank you for pointing that out, I had not thought about it before. For now, since the implied types are not copied, I would suggest we stick to having the user deepcopy models when needed, and we can have SemML keep its internal matrices as well as the hessianeval.

Contributor Author

@Maximilian-Stefan-Ernst I've added a couple of commits based on this discussion:

  • now all SemLoss subtypes have a 3-positional-argument constructor MySemLoss(observed, implied, refloss::MySemLoss = nothing; kwargs...).
    observed and implied were there already. refloss is the new optional 3rd argument: if it is provided, the new term uses the settings/internal state of this reference loss object, unless kwargs... override that.
    The docs are also updated.
    This is used by replace_observed(loss, new_observed), but in the future it will also allow changing the implied model while preserving the loss term settings (like vech or approx_hessian), e.g. when adding or removing observed variables.
  • the semantics of replace_observed(loss, new_observed) is to create a new loss term that shares the internal state and implied with the original loss object.
    So repeated replace_observed() calls during bootstrap should not allocate new internal states.
  • replace_observed(sem, new_observed) was updated to make sure the types of the loss terms and parameters are preserved
  • both parallel and serial bootstrap deep-copy the Sem object (at initialization) to avoid overwriting its internal state
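With the reference-loss constructor described above, replace_observed(loss, new_observed) presumably reduces to something like the following (an illustrative sketch, not the actual implementation):

```julia
# sketch: rebuild the loss with the new observed data, passing the old loss
# as the reference so its settings and internal state are reused
function replace_observed(loss::L, new_observed::SemObserved; kwargs...) where {L <: SemLoss}
    return L.name.wrapper(new_observed, implied(loss), loss; kwargs...)
end
```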

I think that was the last remaining unresolved discussion for this PR.

@Maximilian-Stefan-Ernst
Collaborator

Thanks a lot again, @alyst! Once the last open comments are resolved, I would be happy to merge.

I tried to activate Copilot, but since this repo is part of an organization, it seems to me like I would have to get Copilot for business, which is quite expensive. Not sure if there is another way.

Comment thread src/frontend/specification/Sem.jl Outdated
_subtype_info(::Sem) = ""
_subtype_info(::SemFiniteDiff) = " : Finite Difference Approximation"

function Base.show(io::IO, sem::AbstractSem)
Contributor Author

Some general considerations for show(sem::AbstractSem):

  • Julia recommends making show() output as close to code as possible, with the more verbose human-readable version implemented by show(io::IO, ::MIME"text/plain", ::T).
    The current show() version does not show the actual type info, which I think is important for debugging etc.
    Given that there is also a details() method, maybe it's better to make show() more "coder-friendly" with more type information etc., and details() more "reader-friendly" or "math-friendly" with more details and more natural language. Also, we could define show(io::IO, ::MIME"text/plain", sem) = details(io, sem).
  • in line with that, _subtype_info() could be used for details(), and show() could just show the actual type
  • it would be nice to show the total number of parameters (nparams()). I think it is an important aspect of the model that affects computations.
  • loss terms are printed as - Loss functions. I am personally more in favor of the "term" term, because it matches their role in the scalar SEM objective (elements of a sum), whereas functions suggests that there is some vector-based multi-objective.

Collaborator

Okay, as I understand it, the output of show(io::IO, ::MIME"text/plain", ::T) is getting printed to the REPL. What is the actual use case of the other show method then?

I think I would keep the name Loss functions since it is a widely used name in numerical optimization. Wikipedia writes

In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function)[1] is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event.

so I think it should be clear that the output of the individual loss functions are real numbers.

Contributor Author

the output of show(io::IO, ::MIME"text/plain", ::T) is getting printed to the REPL. What is the actual use case of the other show method then?

Yes, the REPL uses show(io::IO, ::MIME"text/plain", ::T), but if you call show(obj) explicitly, it will output a more terse, code-compatible version, and print(obj) uses it as well:

julia> a = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> show(a)
[1 2; 3 4]
julia> @show a;
a = [1 2; 3 4]

julia> print(a)
[1 2; 3 4]

julia> b = Dict(:x => 1, :y => 2)
Dict{Symbol, Int64} with 2 entries:
  :y => 2
  :x => 1

julia> show(b)
Dict(:y => 2, :x => 1)

So whenever user programmatically prints the SEM object, it will use the default show().

I am not advocating for a fully "code-like" output, but most show() implementations, including the text/plain variants, provide the actual type of the object.
That is very helpful, because it allows the user to search the documentation.

I think I would keep the name Loss functions since it is a widely used name in numerical optimization.

I think we fully agree that what is being printed are individual "loss functions".
In fact, when "function" is obvious from the context, it is often dropped (as is also done in the Wikipedia article).
My suggestion to use "terms" is to:

  • highlight that these are the components of the Sem objective function, which is a weighted sum of loss functions -- rather than just a collection of loss functions
  • align with the API, because it already uses LossTerm, sem_terms(), loss_terms() etc.

Collaborator

Ah, I did not know about this distinction between the show methods. I added a method that is more in line with this idea, and with what you previously had:

julia> show(model)
Structural Equation Model (Sem)
- 31 parameters
- Loss Functions:
  - SemML (75 samples, 11 observed, 3 latent variables) w=1

Let me know if you think that's a reasonable show method, or feel free to suggest something different -- I am not currently using the explicit show call, so I don't have strong opinions on how it should look.

I would keep details separate from show(io::IO, ::MIME"text/plain", ::T), because I would potentially add more information to the details call in the future, and then it might be quite annoying to have all that information printed every time a model is shown.

I can see that in the context of multiple loss functions the term "term" is actually a bit better, but I would keep Loss Function for now to stay consistent with the previous version.

@alyst
Contributor Author

alyst commented Apr 13, 2026

Thanks a lot again, @alyst! Once the last open comments are resolved, I would be happy to merge.

@Maximilian-Stefan-Ernst Thanks for the thorough review, as always!
I've fixed the SemWLS comment and responded to the replace_observed(loss::SemML, ...) one.
I've also added some comments about show(::Sem) -- let me know what you think.

I tried to activate Copilot, but since this repo is part of an organization, it seems to me like I would have to get Copilot for business, which is quite expensive. Not sure if there is another way.

I think you are right. Unfortunately, it looks like there's no way to enable Copilot just for the repository so that 3rd-party collaborators with their own Copilot subscription could use it.

@alyst alyst force-pushed the refactor_sem_terms branch from 2e45c7c to 25dd38c Compare May 4, 2026 04:01