Skip to content

OAK-12101 - Skip indexing of very long similarity tags#2747

Closed
antonhosgood wants to merge 10 commits intoapache:trunkfrom
antonhosgood:OAK-12101
Closed

OAK-12101 - Skip indexing of very long similarity tags#2747
antonhosgood wants to merge 10 commits intoapache:trunkfrom
antonhosgood:OAK-12101

Conversation

@antonhosgood
Copy link

Very long similarity tag names inflate index sizes and probably defeat the purpose of tags. This PR introduces an optional configurable value per index to configure how similarity tags should be filtered in indexSimilarityTag and indexDynamicBoost.

Comment on lines 751 to 758
private boolean isTagWithinLengthLimit(String value, String pname) {
int maxLength = definition.getSimilarityTagMaxLength();
if (maxLength < 0 || value.length() <= maxLength) {
return true;
}
if (!LOG_SILENCER.silence(LOG_KEY_SIMILARITY_TAG_SKIPPED)) {
log.warn("[{}] Skipping similarity tag for property {}. Value length {} exceeds maximum allowed length {}",
getIndexName(), pname, value.length(), maxLength);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just my view) I think it's less surprising if the logging itself is done outside of this method, that is in the caller. If it is needed in multiple places, that would mean code duplication or another method, which is also bad. An alternative would be to change the method name to eg. "isTagWithinLengthLimitOrLogWarning". Yes it's a long name... but it's less surprising.


private final Logger log = LoggerFactory.getLogger(getClass());

private static final LogSilencer LOG_SILENCER = new LogSilencer(Duration.ofMinutes(1).toMillis(), 4);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks good... alternatively, we could use the default silencer settings (15 min, 64 entries cache)

if (maxLength < 0 || value.length() <= maxLength) {
return true;
}
if (!LOG_SILENCER.silence(LOG_KEY_SIMILARITY_TAG_SKIPPED)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means only one key. I think it would be better to use the property name as the key:

Suggested change
if (!LOG_SILENCER.silence(LOG_KEY_SIMILARITY_TAG_SKIPPED)) {
if (!LOG_SILENCER.silence(pname)) {

If there are multiple properties indexed like that, each property name will have a separate silence period. I'm not sure if this happens in reality thought.

@reschke reschke self-requested a review February 26, 2026 09:46
Copy link
Contributor

@reschke reschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not merge until the Sonar checks have been successful.

See https://jackrabbit.apache.org/oak/docs/participating.html#pull-requests-prs

@thomasmueller
Copy link
Member

Could you open a new PR against the (new) OAK-12101 branch please, so that we can run Sonar tests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants