Skip to content

Conversation

guusdk
Copy link
Member

@guusdk guusdk commented Jan 14, 2025

This commit provides a basic implementation of XEP-0377: Spam Reporting

The changes include:

  • persistent storage of spam reports
  • a provider implementation to allow other spam reporting persistence providers
  • an event listening mechanism
  • an optional notification of admins

@guusdk guusdk force-pushed the OF-2954_Spam-Reporting branch from abbd110 to d4f9b04 Compare January 14, 2025 22:18
@guusdk
Copy link
Member Author

guusdk commented Jan 14, 2025

This isn't quite ready yet, but seems to be functional.

Pending work:

  • database install script modification for all databases except HSQLDB
  • database update scripts for all databases
  • make the provider implementation non-static (it hard-codes to the 'default' provider now)
  • forward to originating server (note: warn about end-user consent to do this)
  • forward to xmppbl (note: warn about end-user consent to do this)
  • add support for including the original, offending stanza (eg: urn:xmpp:forward:0)

@guusdk guusdk force-pushed the OF-2954_Spam-Reporting branch 3 times, most recently from b6e74e7 to 83fb015 Compare January 15, 2025 10:22
reporter VARCHAR(1024) NOT NULL,
reported VARCHAR(1024) NOT NULL,
reason VARCHAR(255) NOT NULL,
created BIGINT NOT NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may use timestamp

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, but not for now: We've had trouble before finding a consistent way to represent a timestamp in all of the databases that we support, which is why we use a number instead. I'm not sure if this is still as impossible as it was in 2004, by the way, but I'd like Openfire to be consistent.

If we do change number for timestamp (which would be a good thing), we should do it for all columns that currently use a number.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good. If the migration is anyway planed and you know how it will be implemented, then for the new code it may better to start using it. Just my 2c.

reporter VARCHAR(1024) NOT NULL,
reported VARCHAR(1024) NOT NULL,
reason VARCHAR(255) NOT NULL,
created BIGINT NOT NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the db2 has the TIMESTAMP type. In other places of the file used char for date, which it very strange

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(as above: we currently never use a TIMESTAMP field, which is certainly something I'd like to improve on. If we do improve on that, we should do it consistently, everywhere).

reported NVARCHAR(1024) NOT NULL,
reason NVARCHAR(255) NOT NULL,
created INTEGER NOT NULL,
"raw" LONG VARCHAR NOT NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename the raw field?
It contains the "XML representation of the report" but maybe we just need only the reported message so that we can easily parse the spam and train spamd.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That column is not great, indeed. I'm in the middle of a refactoring of the database structure. I think this column will be dropped (and instead we'll have a table that can be JOINed with, to have zero-to-many reported stanzas. More on this later!

}

final List<JID> toBlocks = new ArrayList<>();
final Set<SpamReport> reports = new HashSet<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the SpamReport doesn't have equals() so the Set won't check for duplicates. Maybe we can use a List here?

In general the HashSet eats a lot of memory and if it's possible to avoid it and use the ArrayList then it will reduce pressure on GC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh. Resizing ArrayLists when they are populated can be costly too. All such arguments border on premature optimization, which I'd like to avoid.

Instead, I prefer to look at the semantics:

  • usage of Set, unlike List, suggests that the entities are all different.
  • usage of List, unlike Set, suggests that entities are ordered.

In that sense, Set is a better fit for this use-case than List.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you a right and even more: this is a typical usage in all Java projects on which I worked, so this is kind of "industrial standard".
Java, as a "common denominator language", is both bad at business logic semantics and on the Mechanical Sympathy.

The HashSet is very memory expensive, so that it can be even used for a DDoS. When it's used inside of a method like here this is more or less fine because it will be fried. But when the (Hash)Set is used in a field (e.g. SpamReport's Set<Text> context) then the eaten memory remains used. So I used a few times a HashSet to deduplicate and then conversion to an ArrayList for a long storage.
The Android has the ArraySet but the OpenJDK still doesn't offer it out of the box.

Please note: the SpamReport doesn't have the equals() methods so the Set here is useless now.

public synchronized Set<Text> getContext()
{
if (context == null) {
context = new HashSet<>(Text.allFromChildren(reportElement));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we have an allocation and copy, two heavy operations


public static List<Text> allFromChildren(final Element parentElement)
{
final List<Text> result = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can set an initial capacity to texts.size()

listeners.forEach(listener -> {
try {
listener.receivedSpamReport(report);
} catch (Throwable t) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe catch an Exception and not the Throwable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested in learning the rationale for that. I've always been going back and forth between the two without a clear reason. I sometimes wish to catch Error instances, I think - but maybe that's not always appropriate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 99% the Error would be the OutOfMemoryError. Generally this considered like a panic situation while an Exception is something that is recoverable.
A common practice is to catch Exception while leaving the Error only in critical situations https://stackoverflow.com/questions/2274102/difference-between-using-throwable-and-exception-in-a-try-catch

notification.setBody(body);
notification.getElement().add(reportElement);

XMPPServer.getInstance().getAdmins().forEach(jid -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor, style) you may use a for each loop that is easier to understand and faster to compile and analyze for linters

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style is in the eye of the beholder. :) The linter in Intellij is perfectly able to reason about this.

return null;
final List<StanzaID> sids = StanzaID.allFromChildren(packet.getElement());
return sids.stream()
.filter(stanzaID -> stanzaID.getBy().toString().equals(by))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the filter will be executed for all elements so this will work slower than previously.

@guusdk
Copy link
Member Author

guusdk commented Jan 15, 2025

Thanks for your feedback, @stokito! I've commented on most of them. Most of the others for me fall in similar categories of what I wrote elsewhere. This PR is still very much a work in progress, so do expect changes!

@guusdk
Copy link
Member Author

guusdk commented Jan 15, 2025

I have now applied a major refactoring.

@guusdk
Copy link
Member Author

guusdk commented Jan 15, 2025

Another thing that I can't get working yet is for Conversations (the client that I'm using to report spam) to include a stanza-ID that identifies a spam message.

This should be done by long-pressing a message from a stranger, as implemented here: https://codeberg.org/iNPUTmice/Conversations/src/branch/master/src/main/java/eu/siacs/conversations/ui/ConversationFragment.java#L1323-L1328

That fails for me. My theory is that Openfire doesn't include a stanza-ID to the messages. Does this need (changing in) the monitoring plugin?

@akrherz akrherz added this to the 5.1.0 milestone Mar 21, 2025
@akrherz
Copy link
Member

akrherz commented Mar 21, 2025

@guusdk I have assigned this to a 5.1.0 milestone as I suspect you are not intending it to be ready for 5.0.0 ?

guusdk added 7 commits April 16, 2025 12:20
This commit provides a basic implementation of XEP-0377: Spam Reporting

The changes include:
- persistent storage of spam reports
- a provider implementation to allow other spam reporting persistence providers
- an event listening mechanism
- an optional notification of admins
The XML element representation that's in a SpamReport should be a copy, and detached. This allows it to be used elsewhere, without one consumer's modifications affected another's.
When admins are notified of a new spam report, the report itself should be included. Supporting clients can then render it.
The data structure defined in XEP-0377 is only usable in context of IQ Blocking. It's not really usable for more generic incident reporting.

The refactoring moves the database structure towards a more re-usable structure:
- it no longer contains the XEP-0377 element as raw XML
- it references 0 to many stanzas in a different table
- the stanzas are minimally represented by a StanzaID, but may include the full XMPP stanza.
This adds a simple page to the admin console on which spam reports can be viewed.

The page is heavily based on the audit log viewer.
@guusdk guusdk force-pushed the OF-2954_Spam-Reporting branch from 8ec5fa3 to 4e38ff3 Compare April 16, 2025 10:33
The new version of the XEP introduces two flags that can be used to define if a spam report can be shared with others.

The admin console now has an additional details page for spam reports.
@guusdk guusdk force-pushed the OF-2954_Spam-Reporting branch from 39b5d0a to 543b1e7 Compare April 16, 2025 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants