Optionally add `ql:has-word` triples to internal PSO and POS permutations #2579

hannahbast · 2025-12-06T01:44:49Z

Optionally add, for each triple with a literal object, and for each word in that literal, an internal triple `<literal> ql:has-word "word"`. The number of occurrences of the word in the literal is stored in the graph ID field of that triple. This can be activated by calling `IndexBuilderMain` with option `--add-has-word-triples`, or by setting `ADD_HAS_WORD_TRIPLES = true` in the `[index]` section of the Qleverfile.
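To make the "one triple per distinct word, with its count" idea concrete, here is a minimal standalone sketch. It is not QLever's tokenizer: the helper name `countWordFrequencies`, the choice to split on non-alphanumeric ASCII characters, and the lowercasing are all assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <string_view>

// Hypothetical helper (NOT QLever's actual tokenizer): split `text` into
// maximal runs of ASCII alphanumeric characters, lowercase them, and count
// how often each distinct word occurs.
std::map<std::string, std::size_t> countWordFrequencies(
    std::string_view text) {
  std::map<std::string, std::size_t> frequencies;
  std::string current;
  // Record the word collected so far (if any) and start a new one.
  auto flush = [&]() {
    if (!current.empty()) {
      ++frequencies[current];
      current.clear();
    }
  };
  for (unsigned char c : text) {
    if (std::isalnum(c)) {
      current.push_back(static_cast<char>(std::tolower(c)));
    } else {
      flush();
    }
  }
  flush();
  return frequencies;
}
```

Under these assumptions, the literal "The quick fox, the lazy dog" yields five distinct words, with "the" counted twice. Each map entry would correspond to one `ql:has-word` triple, with the count going into the graph ID field.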

These triples can be used for customized text search. To make this efficient, materialized views can be used. For example, to enable full-text search in the rdfs:label and skos:altLabel literals of all subjects in your dataset, giving weight 5 to the former and weight 2 to the latter, you could create this materialized view:

qlever materialized-view words "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX skos: <http://www.w3.org/2004/02/skos/core#> SELECT ?word ?subject ?score ?tf ?weight WHERE { { ?subject rdfs:label ?text BIND (5 AS ?weight) } UNION { ?subject skos:altLabel ?text BIND (2 AS ?weight) } GRAPH ?tf { ?text ql:has-word ?word } BIND (?tf * ?weight AS ?score) }"

This view can then be efficiently queried as follows:

SELECT ?subject (SUM(?s1 + ?s2 + ?s3) AS ?score) WHERE {
  SERVICE view:words { [ view:column-word "keyword1" ; view:column-subject ?subject ; view:column-score ?s1 ] }
  SERVICE view:words { [ view:column-word "keyword2" ; view:column-subject ?subject ; view:column-score ?s2 ] }
  SERVICE view:words { [ view:column-word "keyword3" ; view:column-subject ?subject ; view:column-score ?s3 ] }
}
GROUP BY ?subject
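As a rough illustration of what this query computes, the following sketch performs the same conjunctive keyword match and score summation over an in-memory stand-in for the view. The `ViewRow` struct and `scoreSubjects` function are hypothetical, and the sketch assumes distinct keywords and at most one view row per (word, subject) pair. With several rows per pair, the SPARQL join would produce more combinations than this simplification does.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One row of a hypothetical in-memory stand-in for the `words` view.
struct ViewRow {
  std::string word;
  std::string subject;
  double score;  // tf * weight, as computed by the view definition
};

// Return, for every subject that matches ALL keywords, the sum of its scores.
// Assumes distinct keywords and at most one row per (word, subject) pair.
std::map<std::string, double> scoreSubjects(
    const std::vector<ViewRow>& view,
    const std::vector<std::string>& keywords) {
  std::map<std::string, double> total;
  std::map<std::string, std::size_t> matched;
  for (const auto& keyword : keywords) {
    for (const auto& row : view) {
      if (row.word == keyword) {
        total[row.subject] += row.score;
        ++matched[row.subject];
      }
    }
  }
  // Keep only subjects that matched every keyword (the join semantics).
  std::map<std::string, double> result;
  for (const auto& [subject, count] : matched) {
    if (count == keywords.size()) {
      result[subject] = total[subject];
    }
  }
  return result;
}
```

A linear scan per keyword is used only for brevity; the point of the materialized view is that the engine can look up the rows for a given word directly.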

During parsing, for each triple with a literal object, and for each word in that literal, add an internal triple `subject ql:has-word "word"`. TODO: This is currently done unconditionally, which makes it easier to test (we don't need special options in the Qleverfile). Eventually, there should be an option `--add-has-word-triples` to `IndexBuilderMain` to enable this behavior. Tests are also still missing.

codecov · 2025-12-06T18:58:39Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.52%. Comparing base (60b103a) to head (e04c91d).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #2579   +/-   ##
=======================================
  Coverage   91.51%   91.52%           
=======================================
  Files         478      478           
  Lines       41057    41088   +31     
  Branches     5463     5471    +8     
=======================================
+ Hits        37573    37604   +31     
+ Misses       1909     1908    -1     
- Partials     1575     1576    +1

Writing the position is more general. But computing the term frequencies for each text-word pair is currently not efficient in QLever (it requires too much memory and a GROUP BY with two variables is much slower than a GROUP BY with one variable). Since we never needed positions so far, but we do want term frequencies for scoring, let's make this the default for now.

This complements ad-freiburg/qlever#2579

sparql-conformance · 2026-01-09T20:11:41Z

Overview

Number of Tests	Passed ✅	Intended ✅	Failed ❌	Not tested
547	450	73	24	0

Conformance check passed ✅

No test result changes.

Details: https://qlever.dev/sparql-conformance-ui?cur=e04c91d5f744acf34ae7cc9d2d8b536a0b78ed28&prev=60b103ad3e616384ea7cc39b0428e9038d83ac38

sonarqubecloud · 2026-01-09T21:12:54Z

Quality Gate passed

Issues
10 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Copilot

Pull request overview

This PR adds optional support for ql:has-word triples that enable keyword search in literals. For each literal in the dataset, internal triples of the form <literal> ql:has-word "word" are created for each distinct word, with term frequencies stored in the graph ID field. This feature can be activated via the --add-has-word-triples command-line option or by configuring the Qleverfile.

Key changes:

Added configuration option addHasWordTriples_ with default value true (noted as temporary for testing)
Renamed tripleToInternalRepresentation to processTriple and LangtagAndTriple to ProcessedTriple for clarity
Extended index building to tokenize literals, count word frequencies, and create corresponding internal triples

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
src/libqlever/Qlever.h	Added `addHasWordTriples_` configuration option to `CommonConfig`
src/libqlever/Qlever.cpp	Propagate `addHasWordTriples_` setting to index builder
src/index/IndexBuilderMain.cpp	Added `--add-has-word-triples` command-line option
src/index/Index.h	Added `addHasWordTriples()` accessor method
src/index/Index.cpp	Implemented `addHasWordTriples()` accessor method
src/index/IndexImpl.h	Added member variable `addHasWordTriples_` and renamed method to `processTriple`
src/index/IndexImpl.cpp	Implemented word tokenization in `processTriple` and logging for has-word triple count
src/index/IndexBuilderTypes.h	Extended `ProcessedTriple` to include word frequencies; updated `getIdMapLambdas` to create has-word triples
src/global/Constants.h	Added `HAS_WORD_PREDICATE` constant for the `ql:has-word` IRI
src/parser/TripleComponent.h	Minor comment style change (triple-slash to double-slash)
test/util/IndexTestHelpers.h	Added `addHasWordTriples` field to test configuration
test/util/IndexTestHelpers.cpp	Set default test value for `addHasWordTriples` to `false`
test/IndexTest.cpp	Renamed test to `processTriple`; updated expectations for internal triple counts with has-word triples enabled

Copilot · 2026-01-09T22:51:11Z

src/index/IndexBuilderTypes.h

+              ad_utility::triple_component::Literal::fromEscapedRdfLiteral(
+                  absl::StrCat("\"", word, "\""))});


Consider using Literal::literalWithoutQuotes(word) instead of manually wrapping the word in quotes and calling fromEscapedRdfLiteral. Since the tokenized words only contain alphanumeric characters (no special characters requiring escaping), using literalWithoutQuotes would be more direct and clearer in intent.

Suggested change

-              ad_utility::triple_component::Literal::fromEscapedRdfLiteral(
-                  absl::StrCat("\"", word, "\""))});
+              ad_utility::triple_component::Literal::literalWithoutQuotes(
+                  word)});

Copilot · 2026-01-09T22:51:11Z

src/libqlever/Qlever.h

+  // NOTE: While testing, this defaults to true. Eventually, it should default
+  // to false.
+  bool addHasWordTriples_ = true;


The comment indicates this defaults to true "while testing" and "eventually, it should default to false." This suggests a temporary state during development. Consider whether the default should be changed to false before merging, or if a TODO/issue should track this decision. Having a default of true could unexpectedly increase index sizes for users who don't explicitly set this option.

Suggested change

-  // NOTE: While testing, this defaults to true. Eventually, it should default
-  // to false.
-  bool addHasWordTriples_ = true;
+  // NOTE: This defaults to false to avoid increasing index sizes
+  // unexpectedly. Enable explicitly if keyword search in literals is needed.
+  bool addHasWordTriples_ = false;

RobinTF

Mainly a lot of comments

RobinTF · 2026-01-09T23:45:15Z

src/index/IndexBuilderTypes.h

+    serializer | arg.iriOrLiteral_;
+    serializer | arg.isExternal_;
+  }
+};


Why is this struct moved in the first place? I don't think anything changed here, right?

RobinTF · 2026-01-09T23:49:08Z

src/index/IndexBuilderTypes.h

+    // In each map, assign the first IDs to the special IRIs `ql:langtag` and
+    // `ql:has-word`.
+    //
+    // NOTE: This is not necessary for functionality, but certain unit tests


Why is the TODO removed?

RobinTF · 2026-01-09T23:54:28Z

src/index/IndexBuilderTypes.h

+      // NOTE: There is similar code in `DeltaTriples::makeInternalTriples`
+      // for adding these internal triples for update triples. If you change
+      // this code, you probably also have to change that one.


How difficult would it be to make this also work with UPDATEs?
As you know, we're planning to implement the internal `ql:langtag` triples asymmetrically, so that they are added when a literal is added but never removed. Would this also work here? Or is there a problem if these triples still exist after the original entry has been removed?

RobinTF · 2026-01-09T23:57:09Z

src/index/IndexBuilderTypes.h

+      // Third, if applicable, add a `ql:has-word` triple for each distinct word
+      // in the literal. We abuse the graph ID field to store the term
+      // frequency of the word in the literal.
+      if (!lt.wordFrequencies_.empty()) {


This lambda is already very lengthy. Did you consider extracting this final part to a dedicated function?
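For illustration only, such an extracted helper could look roughly like the sketch below. All names here (`Id`, `InternalTriple`, `makeHasWordTriples`, the `wordToId` callable) are hypothetical and much simpler than the real types in `IndexBuilderTypes.h`. The sketch only shows the shape: one triple per distinct word, with the term frequency placed in the graph field.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for QLever's internal types.
using Id = std::uint64_t;

struct InternalTriple {
  Id subject;    // ID of the literal
  Id predicate;  // ID of `ql:has-word`
  Id object;     // ID of the word
  Id graph;      // reused to hold the term frequency
};

// Hypothetical helper: build one `ql:has-word` triple per distinct word of a
// literal. `wordToId` is any callable that maps a word to its ID.
template <typename WordToId>
std::vector<InternalTriple> makeHasWordTriples(
    Id literalId, Id hasWordId,
    const std::map<std::string, std::uint64_t>& wordFrequencies,
    WordToId wordToId) {
  std::vector<InternalTriple> triples;
  triples.reserve(wordFrequencies.size());
  for (const auto& [word, frequency] : wordFrequencies) {
    triples.push_back({literalId, hasWordId, wordToId(word), frequency});
  }
  return triples;
}
```

Passing the ID lookup as a callable keeps the helper free of the lambda's captured state, which would also make it straightforward to unit-test in isolation.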

RobinTF · 2026-01-10T00:02:13Z

src/index/IndexBuilderTypes.h

+        // Update the counter for the number of ql:has-word triples.
+        if (numHasWordTriples != nullptr) {
+          numHasWordTriples->fetch_add(lt.wordFrequencies_.size(),
+                                       std::memory_order_relaxed);


Is there a specific reason to use `std::memory_order_relaxed` here? I believe it is safe to do here, but this memory order is much harder to reason about than the default memory order.
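For reference, a small standalone sketch of the pattern in question, independent of the QLever code (the function name `countWithRelaxedAtomics` is made up for this example). With a pure counter, relaxed ordering still makes every `fetch_add` atomic; it only drops ordering guarantees relative to other memory operations, and the final value is read after the threads have been joined.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Each worker increments a shared counter `perThread` times using relaxed
// ordering. Every increment is still applied atomically; relaxed ordering
// only means that no ordering is imposed on OTHER memory operations.
std::size_t countWithRelaxedAtomics(std::size_t numThreads,
                                    std::size_t perThread) {
  std::atomic<std::size_t> counter{0};
  std::vector<std::thread> workers;
  workers.reserve(numThreads);
  for (std::size_t i = 0; i < numThreads; ++i) {
    workers.emplace_back([&counter, perThread] {
      for (std::size_t j = 0; j < perThread; ++j) {
        counter.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  // Completion of each thread synchronizes with the return of `join`, so the
  // load below observes all increments.
  for (auto& worker : workers) {
    worker.join();
  }
  return counter.load(std::memory_order_relaxed);
}
```

If the counter value were ever used to publish other data between threads, the default `std::memory_order_seq_cst` (or an acquire/release pair) would be the safer choice.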

Hannah Bast added 5 commits December 6, 2025 02:41

The subject of the ql:has-word triple should be the literal

f0389da

Add option --add-has-word-triples to IndexBuilderMain

5113120

Fix failing unit tests

75cf0c7

Add word positions (as graph) and log number of triples added

961377d

hannahbast mentioned this pull request Dec 7, 2025

Add option HAS_WORD_TRIPLES for qlever index command qlever-dev/qlever-control#221

Merged

hannahbast added a commit to qlever-dev/qlever-control that referenced this pull request Dec 7, 2025

Add option HAS_WORD_TRIPLES for qlever index command (#221)

ad083e0

This complements ad-freiburg/qlever#2579

hannahbast changed the title ~~Add ql:has-word triples to internal PSO&POS permutation~~ Optionally add ql:has-word triples to internal PSO and POS permutations Dec 25, 2025

Hannah Bast added 2 commits December 25, 2025 16:28

Cleaned up the code and many comments (also in related code)

0b4499d

Merge remote-tracking branch 'origin/master' into add-text-triples

e04c91d

hannahbast requested review from RobinTF and Copilot January 9, 2026 22:47

Copilot started reviewing on behalf of hannahbast January 9, 2026 22:47

Copilot AI reviewed Jan 9, 2026

RobinTF reviewed Jan 10, 2026
