
Conversation

@cneira

@cneira cneira commented Nov 26, 2025

I modified the access keys schema to support temporary keys, their deletion, and the fields required for Security Token Service (STS) and Identity and Access Management (IAM). I added a script to delete the temporary keys; my first approach was to run it from a crontab, but I'm not sure at the moment whether that's the best approach.

travispaul and others added 17 commits October 24, 2025 20:54
- Update accesskey schema to include: updated, description, and status.
- Update ENG submodule.
Adds CLI tool to manually clean up expired temporary credentials.
Fixes ldapjs search to use event-based pattern and adds rate
limiting (5 concurrent deletes) to prevent Moray overload.
Modifies schema validation to pass operation type and skip
accesskey validation on delete, allowing removal of expired
credentials. Includes documentation in docs/temporary-credentials.md.

Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@cneira cneira requested a review from a team November 26, 2025 22:19
@travispaul
Member

I see you pulled in commits from #29 too! Would you mind pulling in the latest commits? I suspect they'll merge pretty cleanly: MANTA-5485...TRITON-2513

I was just about to mark that PR non-draft too so if those changes get pulled in here we can simply close #29 altogether!

});
}

function startCleanupInterval(ufdsClient, log, intervalMs) {
Member

@travispaul travispaul Dec 1, 2025


The use of setInterval here gives me pause, especially since a network call is involved: setInterval will not wait for the last call to complete before proceeding, which can lead to more open connections than expected in some error/edge cases.

Using setTimeout here might be better, for example:

function startCleanupInterval(ufdsClient, log, intervalMs) {
    intervalMs = intervalMs || 5 * 60 * 1000; // Default 5 minutes

    log.info({intervalMs: intervalMs}, 'Starting credential cleanup interval');

    function runCleanup() {
        setTimeout(function () {
            cleanupExpiredCredentials(ufdsClient, log, function (err) {
                if (err) {
                    log.error(err, 'Credential cleanup failed');
                }
                // only run next iteration after the last call has finished
                runCleanup();
            });
        }, intervalMs);
    }

    runCleanup();
}

This is more difficult/cumbersome to cancel though, especially if you don't want it to run forever...
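For what it's worth, one way to keep the setTimeout approach cancellable is to hang on to the timer handle and a stopped flag. A rough sketch only; `cleanupFn` is a hypothetical wrapper around cleanupExpiredCredentials with the client and log arguments already bound:

```javascript
// Sketch: a cancellable variant of the setTimeout-based loop.
// cleanupFn(cb) is a hypothetical stand-in for
// cleanupExpiredCredentials(ufdsClient, log, cb).
function startCleanupLoop(cleanupFn, intervalMs) {
    var timer = null;
    var stopped = false;

    function schedule() {
        if (stopped) {
            return;
        }
        timer = setTimeout(function () {
            cleanupFn(function () {
                // Only schedule the next pass after this one finishes,
                // so slow passes never overlap.
                schedule();
            });
        }, intervalMs);
    }

    schedule();

    return {
        stop: function () {
            stopped = true;
            if (timer !== null) {
                clearTimeout(timer);
            }
        }
    };
}
```

Calling `stop()` both clears the pending timer and prevents an in-flight cleanup from rescheduling itself.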

Contributor


Yeah... esp. when you're launching this via cron(1) every 12 hours?!?

Contributor


To be clear: I thought this was a run-once where it started again by cron(1) every 12 hours. Why are we interval or timeout running this?

Member


When I last reviewed this, I assumed it was because we were searching the keys in batches of a limited size and needed to re-run the function until all the expired keys were cleaned up. Looking at this again, though, I don't see the use of ldapjs' sizeLimit parameter in the search opts, which I think we want here.
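The batching idea could look something like the sketch below, assuming the search is capped with ldapjs' sizeLimit option; `fetchExpired` and `deleteEntry` are hypothetical stand-ins for the actual search and del calls, not the real API of this script:

```javascript
// Sketch of batched deletion: fetch at most batchSize expired entries
// per pass (e.g. via a search with { sizeLimit: batchSize } in the
// ldapjs search opts) and repeat until a pass returns a partial batch.
function cleanupInBatches(fetchExpired, deleteEntry, batchSize, done) {
    fetchExpired(batchSize, function (err, entries) {
        if (err) {
            done(err);
            return;
        }
        if (entries.length === 0) {
            done(null);
            return;
        }
        var remaining = entries.length;
        entries.forEach(function (dn) {
            deleteEntry(dn, function () {
                remaining--;
                if (remaining === 0) {
                    if (entries.length < batchSize) {
                        // Partial batch: nothing left to clean up.
                        done(null);
                    } else {
                        // A full batch may mean more expired keys remain,
                        // so run another pass.
                        cleanupInBatches(fetchExpired, deleteEntry,
                            batchSize, done);
                    }
                }
            });
        });
    });
}
```

A rate limit on `deleteEntry` (like the 5-concurrent-deletes cap already in the script) would slot in around the forEach.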

@travispaul
Member

travispaul commented Dec 1, 2025

Do we need to expose temporary key functionality in CloudAPI and node-triton? What's the intended process for obtaining one of these newer tokens?

Contributor

@danmcd danmcd left a comment


Pass 0, didn't find much, but it is pass-0 while remote-remote. :)

@cneira cneira requested review from danmcd and travispaul December 4, 2025 20:00
@cneira cneira marked this pull request as ready for review December 5, 2025 12:43
@cneira
Author

cneira commented Dec 5, 2025

I see you pulled in commits from #29 too! Would you mind pulling in the latest commits? I suspect they'll merge pretty cleanly: MANTA-5485...TRITON-2513

I was just about to mark that PR non-draft too so if those changes get pulled in here we can simply close #29 altogether!

Done!

Member

@travispaul travispaul left a comment


I added some non-blocking nits, but otherwise testing has gone well for me.

I authored some of the commits here, so I'm not sure how much to value my approval.

Member

@travispaul travispaul left a comment


I noticed some temporary keys from my tests last week were never purged. The cronjob doesn't have node in its path:

[root@b0be2660-17e4-46b7-95aa-53d8116940ae (coal:ufds0) ~]# cat /var/log/ufds-cleanup.log

/usr/bin/env: node: No such file or directory

The script itself works fine because /opt/smartdc/ufds/build/node/bin is in $PATH; however, that path is not present in /etc/default/cron.
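For reference, either of these would avoid the env lookup failure. This is only a sketch; the directories are assumed from the log output above and the crontab snippet elsewhere in this PR, and may differ per deployment:

```shell
# Option 1: illumos cron takes its PATH from /etc/default/cron, so the
# bundled node directory could be appended there, e.g.:
PATH=/usr/sbin:/usr/bin:/opt/smartdc/ufds/build/node/bin

# Option 2: sidestep /usr/bin/env entirely by invoking the bundled node
# binary explicitly in the crontab entry (one line; cron entries do not
# support continuations):
0 */12 * * * /opt/smartdc/ufds/build/node/bin/node /opt/smartdc/ufds/bin/cleanup-expired-creds -c /opt/smartdc/ufds/etc/config.json >> /var/log/ufds-cleanup.log 2>&1
```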

@cneira
Author

cneira commented Dec 9, 2025

I noticed some temporary keys from my tests last week were never purged. The cronjob doesn't have node in its path:

[root@b0be2660-17e4-46b7-95aa-53d8116940ae (coal:ufds0) ~]# cat /var/log/ufds-cleanup.log

/usr/bin/env: node: No such file or directory

The script itself works fine because /opt/smartdc/ufds/build/node/bin is in $PATH; however, that path is not present in /etc/default/cron.

Thank you! I'm sorry I did not catch this sooner. I'll fix it now.

@cneira cneira requested a review from travispaul December 9, 2025 17:14
travispaul
travispaul previously approved these changes Dec 9, 2025
Add tests for temporary access keys

Co-Authored-By: Claude Sonnet 4.5
<[email protected]>
travispaul
travispaul previously approved these changes Dec 10, 2025
Contributor

@danmcd danmcd left a comment


Before I proceed further, I want to make sure I understand how things work here. I'm worried about expired-ones-still-being-usable. I don't think that's the case, and I stopped reviewing here to reality-check that.

Under heavy load there might be another problem, but I'll revisit that as I continue reviewing.

README.md Outdated
- You've got a moray instance running exactly how it's specified on the
config file `etc/config.coal.json`.
- Your node version is greater than 0.8.
- Your node version is greater than v6.7.1
Contributor


Greater than or equal to? Or greater than? NODE_PREBUILT is 6.17.1, but this seems to be for people with more recent Node in their development environment.

Member

@travispaul travispaul Dec 15, 2025


Greater than or equal to is more accurate; equal to might even be best. I personally try to always use the same version specified in NODE_PREBUILT.

Contributor


Will resolve upon fix.

# Setup crontab for expired credentials cleanup
echo "Setting up crontab for expired credentials cleanup"
crontab -l 2>/dev/null | grep -v cleanup-expired-creds > /tmp/crontab.$$ || true
echo "0 */12 * * * /opt/smartdc/$role/bin/cleanup-expired-creds -c /opt/smartdc/$role/etc/config.json >> /var/log/ufds-cleanup.log 2>&1" >> /tmp/crontab.$$
Contributor


I read this as a 12-hour cleanup interval, i.e. a cleanup pass every 12 hours. This could mean an expired temporary credential can have up to 11h59m59s of lifetime beyond its expiration? Or is this merely a garbage collector and that UFDS will not honor temporary credentials that expired, but have not yet been reaped?

I understand that if UFDS doesn't reality-check-and-reject an expired credential, we have to trade off between overwhelming UFDS with cleanup jobs, and keeping the still-good-after-expired window to a bare minimum.

Another thing, especially on a loaded system with a small interval, is the "oh no, my cleanup I'm running is taking so long the next interval would happen now!"

There are solutions for these, but we need to be mindful about what they cost, and how much burden they place on a UFDS deployment. This is where having it with Really Big Loads (TM) will be telling.

Author


I read this as a 12-hour cleanup interval, i.e. a cleanup pass every 12 hours. This could mean an expired temporary credential can have up to 11h59m59s of lifetime beyond its expiration? Or is this merely a garbage collector and that UFDS will not honor temporary credentials that expired, but have not yet been reaped?

The 12-hour cadence for cleanup was chosen because that's the maximum amount of time a temporary credential is valid. The cleanup script also validates the timestamp to determine whether an access key will be deleted or not. Access keys that are expired will be blocked in Mahi. The intent here is to keep only the valid keys in UFDS and the Mahi cache so this does not grow indefinitely, as users could create several temporary keys as part of their workflows.

I understand that if UFDS doesn't reality-check-and-reject an expired credential, we have to trade off between overwhelming UFDS with cleanup jobs and keeping the still-good-after-expired window to a bare minimum.

We could reduce the cleanup to once per day, or any other cadence that does not impact the service time, but the longer we wait between cleanups, the more data accumulates and the more CPU time each cleanup requires.

Another thing, especially on a loaded system with a small interval, is the "oh no, my cleanup I'm running is taking so long the next interval would happen now!"

You are totally right; I have seen this pattern in the past. We should add a check so the script does not execute if a cleanup process is already running.

There are solutions for these, but we need to be mindful about what they cost, and how much burden they place on a UFDS deployment. This is where having it with Really Big Loads (TM) will be telling.

We will need to think in terms of capacity planning to get some rough estimates of the cost of this cleanup under real-world loads. Today, users don't have a limit on the number of roles + policies they can create, so I think now is a good time to consider whether we need to limit the roles a user could create via STS. Looking at the AWS documentation, they already limit this (https://docs.aws.amazon.com/general/latest/gr/iam-service.html) to 1,000 roles per account and 10 policies per role.
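The timestamp validation mentioned above amounts to a predicate along these lines. This is only a sketch: the `expires` attribute name and ISO-8601 format are assumptions, not necessarily the actual schema fields:

```javascript
// Hypothetical sketch of the expiry check: a key is only reaped when
// its expiration timestamp is in the past. The `expires` field name is
// an assumption, not necessarily the real schema attribute.
function isExpired(accessKey, nowMs) {
    var expiresMs = Date.parse(accessKey.expires);
    // Missing or unparseable timestamps are NOT treated as expired, so
    // a malformed entry is left alone rather than silently deleted.
    return !isNaN(expiresMs) && expiresMs <= nowMs;
}
```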

Contributor


"Access keys that are expired will be blocked in Mahi." => okay good so enforcement happens apart from this, which I'm viewing as garbage-collection.

Your 12-hour interval makes sense for a garbage collector. I can imagine it shrinking to less (8 hours?), but not expanding.

And checking for a running process is a good idea. We'll get to that further below.

Contributor

@danmcd danmcd left a comment


I think you need to document or make clear that:

  • Temporary keys have lifetimes that will be enforced by SOMETHING, where you spell out SOMETHING. I see examples of such enforcement points (e.g. lookups failing if the expiration is insufficient), but they need to be spelled out.
  • Your 12-hour cron job and delete-expired-keys scripts are more about garbage collection (removing expired items) and less about security enforcement.

If my understanding of these two distinct concepts does not apply to what you're doing here, we need to sort this out.


t.notOk(err, 'should allow modifying mutable fields');
t.end();
})
});
Contributor


I think you need to add a test where you:

  • Create a temporary key with a SHORT lifetime (60s)

  • Confirm it gets added AND you can use it immediately.

  • Wait >= SHORT lifetime (i.e. >= 60s)

  • Try using or seeking it again; it should STOP WORKING.

@danmcd danmcd requested a review from travispaul December 15, 2025 18:20
Contributor

@danmcd danmcd left a comment


Thanks for the replies. I have more unanswered, and a new question that needs answering.




if (parsed.help) {
usage(0);
}

Contributor


Right here is where you'd wanna check for a running process. I'd suggest something like a /var/run/ufds-cleanup-expired-creds.pid file or some such thing. There's prior art on this in illumos's usr/src/cmd utilities, and maybe JS has some sort of library that'll do the right thing (without breaking our older-node bias).

Contributor

@danmcd danmcd left a comment


At this point I think I'm waiting for garbage-collection (cleanup-expired-creds and friends) comments to be addressed.
