Skip to content

Conversation

@hannahbast
Copy link
Member

@hannahbast hannahbast commented Dec 6, 2025

Optionally add, for each triple with a literal object, and for each word in that literal, an internal triple <literal> ql:has-word "word". The number of occurrences of the word in the literal is stored in the graph . This can be activated by calling IndexBuilderMain with option --add-has-word-triples, or by setting ADD_HAS_WORD_TRIPLES = true in the [index] section of the Qleverfile.

These triples can be used for customized text search. To make this efficient, materialized views can be used. For example, to enable full-text search in the rdfs:label and skos:altLabel literals of all subjects in your dataset, giving weight 5 to the former and weight 2 to the latter, you could create this materialized view:

qlever materialized-view words "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX skos: <http://www.w3.org/2004/02/skos/core#> SELECT ?word ?subject ?score ?tf ?weight WHERE { { ?subject rdfs:label ?text BIND (5 AS ?weight) } UNION { ?subject skos:altLabel ?text BIND (2 AS ?weight) } GRAPH ?tf { ?text ql:has-word ?word } BIND (?tf * ?weight AS ?score) }"

This view can then be efficiently queried as follows

SELECT ?subject (SUM(?s1 + ?s2 + ?s3) AS ?score) WHERE {
  SERVICE view:words { [ view:column-word "keyword1" ; view:column-subject ?uri; view:column-score ?s1 ] }
  SERVICE view:words { [ view:column-word "keyword2" ; view:column-subject ?uri; view:column-score ?s2 ] }
  SERVICE view:words { [ view:column-word "keyword3" ; view:column-subject ?uri; view:column-score ?s3 ] }
}

Hannah Bast added 5 commits December 6, 2025 02:41
During parsing, for each triples with a literal object, and for each
word in that literal, add an internal triple `subject ql:has-word
"word"`.

TODO: This is currently done unconditionally, which makes it easier to
test (we don't need special options in the Qleverfile). Eventually,
there should be an option `--add-has-word-triples` to `IndexBuilderMain`
to enable this behavior. Tests are also still missing
@codecov
Copy link

codecov bot commented Dec 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.52%. Comparing base (60b103a) to head (e04c91d).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2579   +/-   ##
=======================================
  Coverage   91.51%   91.52%           
=======================================
  Files         478      478           
  Lines       41057    41088   +31     
  Branches     5463     5471    +8     
=======================================
+ Hits        37573    37604   +31     
+ Misses       1909     1908    -1     
- Partials     1575     1576    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Writing the position is more general. But computing the term frequencies
for each text-word pair is currently not efficient in QLever (it
requires too much memory and a GROUP BY with two variables is much
slower than a GROUP BY with one variable). Since we never needed
positions so far, but we do want term frequencies for scoring, let's
make this the default for now.
hannahbast added a commit to qlever-dev/qlever-control that referenced this pull request Dec 7, 2025
@hannahbast hannahbast changed the title Add ql:has-word triples to internal PSO&POS permutation Optionally add ql:has-word triples to internal PSO and POS permutations Dec 25, 2025
@sparql-conformance
Copy link

Overview

Number of Tests Passed ✅ Intended ✅ Failed ❌ Not tested
547 450 73 24 0

Conformance check passed ✅

No test result changes.

Details: https://qlever.dev/sparql-conformance-ui?cur=e04c91d5f744acf34ae7cc9d2d8b536a0b78ed28&prev=60b103ad3e616384ea7cc39b0428e9038d83ac38

@sonarqubecloud
Copy link

sonarqubecloud bot commented Jan 9, 2026

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds optional support for ql:has-word triples that enable keyword search in literals. For each literal in the dataset, internal triples of the form <literal> ql:has-word "word" are created for each distinct word, with term frequencies stored in the graph ID field. This feature can be activated via the --add-has-word-triples command-line option or by configuring the Qleverfile.

Key changes:

  • Added configuration option addHasWordTriples_ with default value true (noted as temporary for testing)
  • Renamed tripleToInternalRepresentation to processTriple and LangtagAndTriple to ProcessedTriple for clarity
  • Extended index building to tokenize literals, count word frequencies, and create corresponding internal triples

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/libqlever/Qlever.h Added addHasWordTriples_ configuration option to CommonConfig
src/libqlever/Qlever.cpp Propagate addHasWordTriples_ setting to index builder
src/index/IndexBuilderMain.cpp Added --add-has-word-triples command-line option
src/index/Index.h Added addHasWordTriples() accessor method
src/index/Index.cpp Implemented addHasWordTriples() accessor method
src/index/IndexImpl.h Added member variable addHasWordTriples_ and renamed method to processTriple
src/index/IndexImpl.cpp Implemented word tokenization in processTriple and logging for has-word triple count
src/index/IndexBuilderTypes.h Extended ProcessedTriple to include word frequencies; updated getIdMapLambdas to create has-word triples
src/global/Constants.h Added HAS_WORD_PREDICATE constant for the ql:has-word IRI
src/parser/TripleComponent.h Minor comment style change (triple-slash to double-slash)
test/util/IndexTestHelpers.h Added addHasWordTriples field to test configuration
test/util/IndexTestHelpers.cpp Set default test value for addHasWordTriples to false
test/IndexTest.cpp Renamed test to processTriple; updated expectations for internal triple counts with has-word triples enabled

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +359 to +360
ad_utility::triple_component::Literal::fromEscapedRdfLiteral(
absl::StrCat("\"", word, "\""))});
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using Literal::literalWithoutQuotes(word) instead of manually wrapping the word in quotes and calling fromEscapedRdfLiteral. Since the tokenized words only contain alphanumeric characters (no special characters requiring escaping), using literalWithoutQuotes would be more direct and clearer in intent.

Suggested change
ad_utility::triple_component::Literal::fromEscapedRdfLiteral(
absl::StrCat("\"", word, "\""))});
ad_utility::triple_component::Literal::literalWithoutQuotes(
word)});

Copilot uses AI. Check for mistakes.
Comment on lines +68 to +70
// NOTE: While testing, this defaults to true. Eventually, it should default
// to false.
bool addHasWordTriples_ = true;
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment indicates this defaults to true "while testing" and "eventually, it should default to false." This suggests a temporary state during development. Consider whether the default should be changed to false before merging, or if a TODO/issue should track this decision. Having a default of true could unexpectedly increase index sizes for users who don't explicitly set this option.

Suggested change
// NOTE: While testing, this defaults to true. Eventually, it should default
// to false.
bool addHasWordTriples_ = true;
// NOTE: This defaults to false to avoid increasing index sizes
// unexpectedly. Enable explicitly if keyword search in literals is needed.
bool addHasWordTriples_ = false;

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@RobinTF RobinTF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly a lot of comments

serializer | arg.iriOrLiteral_;
serializer | arg.isExternal_;
}
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this struct moved in the first place? I don't think anything changed here, right?

// In each map, assign the first IDs to the special IRIs `ql:langtag` and
// `ql:has-word`.
//
// NOTE: This is not necessary for functionality, but certain unit tests
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the TODO removed?

Comment on lines +324 to +326
// NOTE: There is similar code in `DeltaTriples::makeInternalTriples`
// for adding these internal triples for update triples. If you change
// this code, you probably also have to change that one.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How difficult would it be to make this also work with UPDATEs?
As you know we're planning to implement the internal triples ql::langtag asymmetrically, so that they are added when a literal is added but never removed. Would this also work here? Or is there a problem if these triples exist if the original entry was already removed?

// Third, if applicable, add a `ql:has-word` triple for each distinct word
// in the literal. We abuse the graph ID field to store the term
// frequency of the word in the literal.
if (!lt.wordFrequencies_.empty()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lambda is already very lengthy. Did you consider extracting this final part to a dedicated function?

// Update the counter for the number of ql:has-word triples.
if (numHasWordTriples != nullptr) {
numHasWordTriples->fetch_add(lt.wordFrequencies_.size(),
std::memory_order_relaxed);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason to use std::memory_order_relaxed here. I believe it is safe to do here, but it is very hard to reason about this memory order in comparison to the default memory order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants