Skip to content

[HMA] Add Cache to banks to keep Content Type Counts #1816

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

juanmrad
Copy link
Collaborator

@juanmrad juanmrad commented Apr 25, 2025

Summary

I'm adding a new attribute to banks to keep track of the content they contain alongside the type of content.

This cache will be updated after each index operation to make sure the information is up to date as we build the index to prevent race conditions when adding new content.

Test Plan

Tested locally

And added to the hash bank page.

Screenshot 2025-04-24 at 8 53 25 PM

@github-actions github-actions bot added the hma Items related to the hasher-matcher-actioner system label Apr 25, 2025
Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be easier to talk through live - let me know if you want to chat through on discord. I think this metric is useful but am not yet sure about the implementation.

Top level comments:

  1. I have an impression based on the lifecycle of the equivalent tool at Meta that we are much better off explaining the contents of the bank in terms of photos/videos/text/etc than signal counts (though it's possible both are separately useful). How do you feel about doing a rollup on content type in addition to or instead of signal count?
  2. Should we look into a way where we could calculate the content of the bank use postgres counts? I'm hesitant on this approach since it can lead to live counts skewing until the next indexer run, though I think at least we could be confident we avoid other issues from trying to maintain a counter.

@@ -23,6 +24,15 @@
<tr id="bank_item_{{bank['name']}}">
<th scope="row" style="background-color: transparent;">{{ loop.index }}</th>
<td style="background-color: transparent;">{{ bank['name'] }}</td>
<td style="background-color: transparent;">
{% if bank['content_type_counts'] %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the signal counts are interesting to engineers, I think we might want to express this in terms of the underlying content that these signals can match.

E.g. rather than "100 pdq", I'm suggesting "100 photos".

This isn't 100% analogous, but I think it will better map to people's mental model, especially non-technical folks. We can add a tooltip to explain that this is an analogue on hover if we wanted to.

If you do have a strong feeling that you want to keep this to signal counts, then we should use "Signal" instead of "Content"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was that Signals and content may differ. You may enable signals later so old content may not be on all signal types.

I do think content makes sense, but that was the reasoning behind using signal over content type, I did the change after thinking over the name so definitively need to clean-up naming lol.

@@ -48,7 +51,7 @@ def bank_show_by_name(bank_name: str):
bank = storage.get_bank(bank_name)
if not bank:
abort(404, f"bank '{bank_name}' not found")
return jsonify(bank)
return {"name": bank.name, "matching_enabled_ratio": bank.matching_enabled_ratio}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not also return your new signal counts? They are cheap to fetch, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right.

@@ -60,7 +63,11 @@ def bank_create():
enabled_ratio = flask_utils.str_to_type(data["enabled_ratio"], float)
elif "enabled" in data:
enabled_ratio = 1.0 if flask_utils.str_to_bool(data["enabled"]) else 0.0
return jsonify(bank_create_impl(name, enabled_ratio)), 201
bank = bank_create_impl(name, enabled_ratio)
return {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(applies to the other edited APIs) Consider explicitly typing the return, similar to what other endpoints are doing - or at least documenting what we expect this to return in the docstring

@@ -296,6 +296,8 @@ class BankConfig:
# 0.0-1.0 - what percentage of contents should be
# considered a match? Seeded by target content
matching_enabled_ratio: float
# Cache of content type counts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: This is currently implemented as signal type counts

@@ -349,6 +351,7 @@ class BankContentIterationItem:
signal_val: str
bank_content_id: int
bank_content_timestamp: int
bank_name: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How expensive is it to fetch this in order to add it to the output?

Comment on lines +621 to +622
.join(database.BankContent)
.join(database.Bank)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(answer to previous question) - two additional joins - does this appreciably impact the iteration speed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have index on the tables, but there's definitively an impact. I'll have to calculate at 10M records tho 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed hma Items related to the hasher-matcher-actioner system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants