-
Notifications
You must be signed in to change notification settings - Fork 327
[HMA] Add Cache to banks to keep Content Type Counts #1816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[HMA] Add Cache to banks to keep Content Type Counts #1816
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be easier to talk through live - let me know if you want to chat through on discord. I think this metric is useful but am not yet sure about the implementation.
Top level comments:
- I have an impression based on the lifecycle of the equivalent tool at Meta that we are much better off explaining the contents of the bank in terms of photos/videos/text/etc than signal counts (though it's possible both are separately useful). How do you feel about doing a rollup on content type in addition to or instead of signal count?
- Should we look into a way where we could calculate the content of the bank use postgres counts? I'm hesitant on this approach since it can lead to live counts skewing until the next indexer run, though I think at least we could be confident we avoid other issues from trying to maintain a counter.
@@ -23,6 +24,15 @@ | |||
<tr id="bank_item_{{bank['name']}}"> | |||
<th scope="row" style="background-color: transparent;">{{ loop.index }}</th> | |||
<td style="background-color: transparent;">{{ bank['name'] }}</td> | |||
<td style="background-color: transparent;"> | |||
{% if bank['content_type_counts'] %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the signal counts are interesting to engineers, I think we might want to express this in terms of the underlying content that these signals can match.
E.g. rather than "100 pdq", I'm suggesting "100 photos".
This isn't 100% analogous, but I think it will better map to people's mental model, especially non-technical folks. We can add a tooltip to explain that this is an analogue on hover if we wanted to.
If you do have a strong feeling that you want to keep this to signal counts, then we should use "Signal" instead of "Content"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought was that Signals and content may differ. You may enable signals later so old content may not be on all signal types.
I do think content makes sense, but that was the reasoning behind using signal over content type, I did the change after thinking over the name so definitively need to clean-up naming lol.
@@ -48,7 +51,7 @@ def bank_show_by_name(bank_name: str): | |||
bank = storage.get_bank(bank_name) | |||
if not bank: | |||
abort(404, f"bank '{bank_name}' not found") | |||
return jsonify(bank) | |||
return {"name": bank.name, "matching_enabled_ratio": bank.matching_enabled_ratio} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not also return your new signal counts? They are cheap to fetch, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right.
@@ -60,7 +63,11 @@ def bank_create(): | |||
enabled_ratio = flask_utils.str_to_type(data["enabled_ratio"], float) | |||
elif "enabled" in data: | |||
enabled_ratio = 1.0 if flask_utils.str_to_bool(data["enabled"]) else 0.0 | |||
return jsonify(bank_create_impl(name, enabled_ratio)), 201 | |||
bank = bank_create_impl(name, enabled_ratio) | |||
return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(applies to the other edited APIs) Consider explicitly typing the return, similar to what other endpoints are doing - or at least documenting what we expect this to return in the docstring
@@ -296,6 +296,8 @@ class BankConfig: | |||
# 0.0-1.0 - what percentage of contents should be | |||
# considered a match? Seeded by target content | |||
matching_enabled_ratio: float | |||
# Cache of content type counts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: This is currently implemented as signal type counts
@@ -349,6 +351,7 @@ class BankContentIterationItem: | |||
signal_val: str | |||
bank_content_id: int | |||
bank_content_timestamp: int | |||
bank_name: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How expensive is it to fetch this in order to add it to the output?
.join(database.BankContent) | ||
.join(database.Bank) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(answer to previous question) - two additional joins - does this appreciably impact the iteration speed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have index on the tables, but there's definitively an impact. I'll have to calculate at 10M records tho 🤔
Summary
I'm adding a new attribute to banks to keep track of the content they contain alongside the type of content.
This cache will be updated after each index operation to make sure the information is up to date as we build the index to prevent race conditions when adding new content.
Test Plan
Tested locally
And added to the hash bank page.