Skip to content

chore: unknown flags #9837

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 7, 2025
Merged

chore: unknown flags #9837

merged 3 commits into from
May 7, 2025

Conversation

nunogois
Copy link
Member

https://linear.app/unleash/issue/2-3406/hold-unknown-flags-in-memory-and-show-them-in-the-ui-somehow

This PR introduces a suggestion for a “unknown flags” feature.

When clients report metrics for flags that don’t exist in Unleash (e.g. due to typos), we now track a limited set of these unknown flag names along with the appnames that reported them. The goal is to help users identify and clean up incorrect flag usage across their apps.

We store up to 10 unknown flag + appName combinations, keeping only the most recent reports. Data is collected in-memory and flushed periodically to the DB, with deduplication and merging to ensure we don’t exceed the cap even across pods.

We were especially careful to make this implementation defensive, as unknown flags could be reported in very high volumes. Writes are batched, deduplicated, and hard-capped to avoid DB pressure.

No UI has been added yet — this is backend-only for now and intended as a step toward better visibility into client misconfigurations.

I would suggest starting with a simple banner that opens a dialog showing the list of unknown flags and which apps reported them.

image

Copy link

vercel bot commented Apr 24, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
unleash-docs ⬜️ Ignored (Inspect) Visit Preview May 7, 2025 10:39am

Copy link
Contributor

github-actions bot commented Apr 24, 2025

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@nunogois nunogois requested a review from Copilot April 24, 2025 23:29
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces an “unknown flags” feature to track and store unknown flag names reported by clients, including their source applications. The key changes are the addition of a new migration for the unknown_flags table, updates to the store, service, and routing layers, and corresponding tests and fixtures to support the in-memory tracking and periodic flushing of this data.

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/test/fixtures/store.ts Added FakeUnknownFlagsStore and integrated it into the store.
src/server-dev.ts Enabled the new reportUnknownFlags flag in the config.
src/migrations/20250424185110-unknown-flags.js Added migration to create the unknown_flags table.
src/lib/types/* Updated store, service, and experimental types to include unknown flags.
src/lib/services/index.ts Instantiated and passed the UnknownFlagsService in services creation.
src/lib/routes/admin-api/metrics.ts Added a new route with UnknownFlagsController for retrieving unknown flags.
src/lib/features/scheduler/schedule-services.ts Scheduled flush and clear tasks for the unknownFlagsService.
src/lib/features/metrics/unknown-flags/* Added store, service, controller, and fake store for unknown flags.
src/lib/features/metrics/client-metrics/metrics-service-v2.ts Enhanced filtering logic and registration of unknown flags in client metrics reporting.
src/lib/features/metrics/client-metrics/metrics-service-v2.test.ts Updated tests to incorporate the UnknownFlagsService.

Copy link
Member

@chriswk chriswk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good. Have an approval! Small musing: Have we stopped using dbTimer? I notice the last couple of stores there is no timer being used

@github-project-automation github-project-automation bot moved this from New to Approved PRs in Issues and PRs Apr 25, 2025
Copy link
Contributor

@gastonfournier gastonfournier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks promising

Comment on lines +27 to +38
await this.db.transaction(async (tx) => {
await tx(TABLE).delete();
if (flags.length > 0) {
const rows = flags.map((flag) => ({
name: flag.name,
app_name: flag.appName,
seen_at: flag.seenAt,
}));
await tx(TABLE).insert(rows);
}
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is to keep only 10 when we store... the latest 10... but I'm not convinced with the delete all... cause it needs to lock the whole table...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s true that multiple pods might flush around the same time, which means concurrent reads and writes are possible. That said, the current approach is designed to handle that safely. Each pod fetches the full current state from the database before flushing, merges it with its local in-memory cache, deduplicates by name + appName, and keeps only the 10 most recent entries. So even if two pods flush close together, the end result is still consistent — just depending on which write lands last.

The DELETE + INSERT is intentional. Since the table is capped at 10 rows and flushes are infrequent, this approach keeps the logic simple and avoids the need for upserts or row-level diffs.

Also worth noting: the only reason we’re storing this in the DB at all is to ensure persistence across pod restarts and availability across pods. If this were purely local state, the in-memory cache would be enough.

If the feature grows in scope, we can revisit the update strategy, but for now this tradeoff feels right.


async clear(hoursAgo: number): Promise<void> {
return this.db(TABLE)
.whereRaw(`seen_at <= NOW() - INTERVAL '${hoursAgo} hours'`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think seen at is not the best to clear flags, because it might just keep being updated if it's a frequently accessed flag... maybe use created at

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think seen_at is the right field here, since what we care about is whether the flag has been reported recently. If an unknown flag keeps getting reported, it’s still relevant — so we want to keep it in the list. Using created_at would remove flags that are still active just because they were first seen a while ago, which feels like the wrong signal for this use case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea of created_at might still be valuable. It will tell you the "age" or how long ago this flag was first seen (if we don't remove them). Which makes me think... maybe we should not delete all and recreate... instead, what if we make a count on the table and we only insert up to 10-count?

We should still delete flags that stopped appearing in the "unknown" group from the table, so it adds some complexity, so let's leave it as an idea here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not convinced of what created_at really gives us here. What's the benefit you're seeing? How valuable is it to know when an unknown flag was first seen? And are the first 10 really more valuable than the last 10?

I'll leave it for now, but I'm curious about your point of view here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to leave it for now. I'm thinking of differentiating the 2 scenarios we discussed:

  1. You made a typo or accidentally referenced a flag that doesn't exist. This will stay in code for a long time and therefore the difference between created_at (or when it was first registered as unknown) and seen_at should be large.
  2. Your normal workflow implies releasing the flag to prod first and then creating it in Unleash. In this case, the difference between created_at and last_seen should be at max a few hours to 1 or 2 days.

Keeping the created_at as "first time the flag was registered as unknown" comes with challenges, which is why I'm ok dropping this, but I hope my point is clearer now.

@nunogois
Copy link
Member Author

nunogois commented May 5, 2025

The code looks good. Have an approval! Small musing: Have we stopped using dbTimer? I notice the last couple of stores there is no timer being used

Can't find dbTimer in our source code, but I think you mean the timer some stores have for some of their methods. I can add it here, if you think it adds value.

@nunogois nunogois force-pushed the chore-unknown-flags branch from 293609c to 67aa130 Compare May 6, 2025 16:55
@nunogois
Copy link
Member Author

nunogois commented May 6, 2025

Discussed with @gastonfournier and we're interested in gathering some data before we invest more time on this feature (e.g., UI).

67aa130 adds a new gauge for the total of unknown flags seen in the last 24 hours, if any (max of 10).

image

reportUnknownFlags is also no longer dependent on filterExistingFlagNames. We've also confirmed that unknown flags includes invalid flags:

image

@nunogois nunogois requested a review from gastonfournier May 6, 2025 17:05
Copy link
Contributor

@gastonfournier gastonfournier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I'm still thinking on what the behavior should be... Also, what'd happen when multiple instances try to write to the db, what happens if an instance has seen a flag that doesn't exist in Unleash but another instance has not seen it (running the flush in a single instance would miss this unknown flag)?

Overall, this will be a nice first step, we don't have to have all the answers atm.

Comment on lines +168 to +170
const validatedToggleNames = await this.filterValidToggleNames(
toggleNamesToValidate,
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking: filtering valid toggle names seems to be a responsibility of unknown flag service more than metrics-service. I know there are practical reasons to keeping this logic here, so not something I'm asking to change but just a reflection

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was already a responsibility of this method before this PR: https://github.com/Unleash/unleash/blob/b322afb097e00f71be1429c378153c97442b724e/src/lib/features/metrics/client-metrics/metrics-service-v2.ts#L134C56-L134C57

I'm guessing because we're not interested in registering metrics for invalid flag names.

It doesn't make much sense to move it over to the new unknown flags service, at least for now, especially since it's behind a flag.

Comment on lines +199 to +203
schedulerService.schedule(
unknownFlagsService.flush.bind(unknownFlagsService),
minutesToMilliseconds(2),
'flushUnknownFlags',
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea, cause I didn't see us using this feature in our scheduler service, and you may have a good use case here:

Suggested change
schedulerService.schedule(
unknownFlagsService.flush.bind(unknownFlagsService),
minutesToMilliseconds(2),
'flushUnknownFlags',
);
const minutes = 2;
schedulerService.schedule(
jobService.singleInstance('unknownFlagsFlushJob',
unknownFlagsService.flush.bind(unknownFlagsService),
minutes
),
minutesToMilliseconds(minutes),
'flushUnknownFlags',
);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we don't want this to be a single instance job, right?

Each Unleash instance (pod) will receive unknown flags as part of metrics and store them in their own cache, so it makes sense that each Unleash instance (pod) has their own job to flush that cache into the DB.

If anything we could do it for the "clear", but I don't think doing that gives us much either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wanted to comment on the clear one :D but yeah, maybe it's not going to give us much, it was just because I didn't see us using it and it came up in some conversations last week with either @kwasniew @sjaanus or @sastromskis. That we should have the ability of electing a leader for one of these scheduled jobs and I remember we had that already

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I didn't remember / wasn't aware of it being a thing in our codebase until now, so thank you for the reminder.

@nunogois
Copy link
Member Author

nunogois commented May 7, 2025

Also, what'd happen when multiple instances try to write to the db, what happens if an instance has seen a flag that doesn't exist in Unleash but another instance has not seen it (running the flush in a single instance would miss this unknown flag)?

That should be covered by the fact that we're merging the local unknown flags cache with the ones already registered in the DB when we flush: https://github.com/Unleash/unleash/pull/9837/files#diff-544e3c7c0bac8e29a5c0954c0b046e580ad6b11eca94dcb097fac0eedf7a8b7eR61-R79

@gastonfournier
Copy link
Contributor

Also, what'd happen when multiple instances try to write to the db, what happens if an instance has seen a flag that doesn't exist in Unleash but another instance has not seen it (running the flush in a single instance would miss this unknown flag)?

That should be covered by the fact that we're merging the local unknown flags cache with the ones already registered in the DB when we flush: https://github.com/Unleash/unleash/pull/9837/files#diff-544e3c7c0bac8e29a5c0954c0b046e580ad6b11eca94dcb097fac0eedf7a8b7eR61-R79

Does this mean we never clean? Imagine a flag 'I_do_not_exist', you add it to the db. When is it deleted? If the answer is the clear job every 24hs, that works for me :)

@nunogois
Copy link
Member Author

nunogois commented May 7, 2025

Does this mean we never clean? Imagine a flag 'I_do_not_exist', you add it to the db. When is it deleted? If the answer is the clear job every 24hs, that works for me :)

Yes, it is. We automatically clean old unknown flags due to our 10 max logic and the 24h clean up job that cleans unknown flags seen >24h ago.

@nunogois nunogois force-pushed the chore-unknown-flags branch from 67aa130 to 36cf216 Compare May 7, 2025 10:39
@nunogois nunogois merged commit eb238f5 into main May 7, 2025
11 checks passed
@nunogois nunogois deleted the chore-unknown-flags branch May 7, 2025 10:48
@github-project-automation github-project-automation bot moved this from Approved PRs to Done in Issues and PRs May 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants