Skip to content

III-6975 Fix replay of role created#2153

Open
grubolsch wants to merge 3 commits intomasterfrom
III-6975/fix-replay-role-added
Open

III-6975 Fix replay of role created#2153
grubolsch wants to merge 3 commits intomasterfrom
III-6975/fix-replay-role-added

Conversation

@grubolsch
Copy link
Contributor

Fixed

  • When trying to replay a role created, you got a duplicate key error. Has a unit test to prove it works.

Ticket: https://jira.uitdatabank.be/browse/III-6975

@lucwollants
Copy link
Contributor

Nice find, but I'm not sure we need to handle it in this way. There are even more projections that are stored inside MySQL tables that have the same issue. There used to be a page on Confluence about this, with a list of tables to clear.

It's a known issue that we need to clear the read models before doing a replay. We also need to clean Redis before a replay.

@grubolsch
Copy link
Contributor Author

I am worried about this bug, I am wondering how many other events there are we cannot replay.

@grubolsch
Copy link
Contributor Author

I think each event should be able to be replayed individually, so I don't see the problem in this bugfix.

@bertramakers
Copy link
Contributor

bertramakers commented Feb 11, 2026

There are even more projections that are stored inside MySQL tables that have the same issue. There used to be a page on Confluence about this, with a list of tables to clear.

It's a known issue that we need to clear the read models before doing a replay. We also need to clean Redis before a replay.

This is technically correct, if we do a complete replay we need to clear the read models first because a lot of them are not programmed to handle duplicate key constraints. This "purge" is handled with a specific command for the MySQL tables, so we don't need to keep a list of those tables in Confluence and truncate them manually:

class PurgeModelCommand extends Command

This specific PR will not fix that for complete replays because this only makes one read model flexible enough to handle duplicates. Additionally I think we will always need to do complete replays on an environment separate from production, because they take too long and the read models would be outdated until the replay is completely finished. So it probably doesn't really matter that the read models need to be emptied in the case of complete replays, except that the person doing the replay needs to remember to run that purge command first (on the replay instance).

In the context of the ticket however it seems like we only need to replay a specific role or specific list of roles. There are two ways we can handle this:

  1. Update the relevant projector(s) to better handle the duplicate key constraint so we can replay those role(s) without errors - aka the suggested changes in this PR
  2. Delete the relevant row(s) of those role(s) and then do the replay of those role(s)

I would try to compare the risks and benefits of these two regardless of the situation of complete replays.

Option 1:

  • ✅ No manual work in the production DB = less/zero chance of accidentally deleting or altering incorrect rows
  • ⚠️ Make sure to double check that this change does not introduce unintended side effects in the read model aside from replays. In other words, does it make sense that it also updates an existing role when a new one is created outside of a replay? This could cover up other bugs that cause new roles to be created with existing IDs and incorrectly overwrites data. Maybe it would be safer to first catch the exception, and then check if we're doing a replay before either updating or re-throwing the exception? We should have a flag on domain messages to indicate whether they are sent from a replay or not:
    class DomainMessageIsReplayed implements DomainMessageSpecificationInterface
    (Double-check that this still works as expected though!)
  • ⚠️ Probably requires thorough testing on acc or test before pushing this to prod, which might take longer to achieve your goal compared to manually deleting the specific rows and re-running the replay command without pushing code changes

Option 2:

  • ⚠️ Manual DB changes on prod = high risk
  • ✅ No code changes = 0% chance of new bugs or unintended side effects in the new code
  • ✅ Probably faster than deploying and testing code changes

If time is not an important factor I would be inclined to go for option 1 just to avoid manual changes to the prod DB, BUT with the caveat that we need to be sure that there are no unintended side effects.

As a bonus option 1 could also be useful in the context of complete replays to get rid of the purge command completely in the long run, but I'm not sure yet if that would be a real improvement because I imagine that @willaerk has included that in some kind of script he runs for replays and it's probably not something he has to actively think about. If it is a manual step however, which could be overlooked, it could be helpful to make the projectors smart enough to handle duplicate keys in the context of replays so we can skip that purge step completely.

@bertramakers
Copy link
Contributor

bertramakers commented Feb 11, 2026

After some more investigation with @grubolsch we discovered the actual issue is not with the role read model, but with the "user roles" read model in Redis. This is used to show the roles of a user on the user detail page in UDB, and also for the permission checks it seems. The projector for that read model does not take RoleDeleted into account:

class UserRolesProjector extends RoleProjector

So this PR will not solve the specific issue mentioned in the ticket. TBD if it's still useful or not to invest time in this particular PR (feedback + deploy + testing). Probably not high priority.

As a quick fix for the actual problem, @grubolsch is going to check with @willaerk how to manually clean up the roles of the user that is experiencing issues.

The complete fix is quite involved:

  • UserRolesProjector needs to take RoleDeleted into account, and it will need to use either the user_roles table or JSON projection of a role to find all users related to the deleted role, so it will also be crucial that the UserRolesProjector processes the RoleDeleted before that relationship info is deleted
  • All RoleDeleted messages will need to be replayed so they are correctly processed by the UserRolesProjector

Or alternatively, the code needs to be reworked to e.g. only work with a single user_roles table and not have 3 read models for the user & roles relationships.

While important to fix correctly, either solution will take too much time to fix the current situation with the user that is not able to work in UDB because their amount of roles is too big and the permission checks are broken.

@lucwollants
Copy link
Contributor

We have some nice follow up tickets https://jira.publiq.be/browse/III-7029 and https://jira.publiq.be/browse/III-7030, so I would wait with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants