Skip to content

Conversation

@marienz
Copy link

@marienz marienz commented Oct 28, 2025

This fixes the issue described in #345: using an Agenix secret as hashedPasswordFile with userborn enabled does not work.

Tested locally.

Please let me know if you prefer a different approach for installSecrets causing a non-zero exit status (the added "true" is admittedly a hack but it's pretty easy to reintroduce the bug otherwise...)

modules/age.nix Outdated
systemd.services.agenix-chown = mkIf sysusersEnabled {
wantedBy = [ "sysinit.target" ];
# Change ownership and group after users and groups are made.
after = [ "systemd-sysusers.service" ];
Copy link

@thattomperson thattomperson Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i understand correctly; shouldn't this be after = [ "agenix-install-secrets.service" ];?

I did not understand correctly

what you have is correct 👍

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...actually I think you have a point :)

I think what I have is correct as long as systemd-sysusers.service is actually enabled. Since we're currently in a mkIf sysusersEnabled block, that's probably fine.

But if these units are ever used without systemd-sysusers.service, the before/after settings will have no effect. In that case I'd expect systemd to run agenix-chown in parallel with agenix-install-secrets, which... would not go over well.

So I'll add a commit to explicitly make agenix-chown go "after" agenix-install-secrets.

@thattomperson
Copy link

Thanks @marienz, I thought i was going insane 😅

@marienz marienz force-pushed the main branch 2 times, most recently from bbcc1b4 to 9ea8e6f Compare October 31, 2025 12:29
@marienz
Copy link
Author

marienz commented Oct 31, 2025

Bah, I missed another problem (fixed with the commit I just added): we need to restart the new chown service to chown new agenix generations...

Fix seems to work but may not be the best (a systemd expert I am not).

This fixes referencing an agenix secret from a user hashedPasswordFile.

nixpkgs makes systemd-sysusers.service an alias for userborn.service.
With all of agenix-install-secrets.service running after that, userborn
fails to read hashedPasswordFile if that is an agenix secret.

Fix this by taking the same approach the activation script took: chown
secrets after systemd-sysusers, install them before.

The chown script depends on _agenix_generation being set, which it still
is from newGeneration when all scripts are part of the same activation
script but is not when running as a separate service, so break that out.

End the systemd scripts in "true". This is necessary for agenix-install,
because it otherwise ends in

```
(( _agenix_generation > 1 )) && {
echo "[agenix] removing old secrets (generation $(( _agenix_generation - 1 )))..."
rm -rf "/run/agenix.d/$(( _agenix_generation - 1 ))"
}
```

...which leaves the exit status of the last command non-zero if this is
the first generation. This has no effect for snippets included in the
Nix activation script, but if it's the end of one of our systemd units
it makes it fail.

It is currently not necessary for the chown service, but since it is
easy to introduce a similar problem unintentionally, end that in "true"
too.

Fixes ryantm#345
...after agenix-install-secrets, even if systemd-sysusers.service is
disabled.

Currently we should only use these units if systemd-sysusers.service is
used (directly or by having userborn aliased to it). But better safe
than sorry: with no ordering constraint systemd may run both in
parallel...
If agenix-install-secrets.service is restarted, it will otherwise
install a new generation whose secrets are not chown'd.

(It would be better to chown them before the /run/agenix symlink is
updated, but I'm not sure how to do that cleanly, and this should be
equivalent to what the non-systemd scripts do.)
@marienz
Copy link
Author

marienz commented Dec 20, 2025

I don't know why CI failed there: the force push was a noop rebase (to pick up the checkin and revert of decrypt-parallel, just to make the status of this PR more green), it passed before, it still passes locally, and CI is not logging enough to diagnose (just "last 10 log lines" of a VM test that may have gotten stuck on something).

I've been running this locally and not noticed any issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants