-
Notifications
You must be signed in to change notification settings - Fork 107
Optionally add ql:has-word triples to internal PSO and POS permutations
#2579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
During parsing, for each triples with a literal object, and for each word in that literal, add an internal triple `subject ql:has-word "word"`. TODO: This is currently done unconditionally, which makes it easier to test (we don't need special options in the Qleverfile). Eventually, there should be an option `--add-has-word-triples` to `IndexBuilderMain` to enable this behavior. Tests are also still missing
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #2579 +/- ##
=======================================
Coverage 91.51% 91.52%
=======================================
Files 478 478
Lines 41057 41088 +31
Branches 5463 5471 +8
=======================================
+ Hits 37573 37604 +31
+ Misses 1909 1908 -1
- Partials 1575 1576 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Writing the position is more general. But computing the term frequencies for each text-word pair is currently not efficient in QLever (it requires too much memory and a GROUP BY with two variables is much slower than a GROUP BY with one variable). Since we never needed positions so far, but we do want term frequencies for scoring, let's make this the default for now.
This complements ad-freiburg/qlever#2579
ql:has-word triples to internal PSO&POS permutationql:has-word triples to internal PSO and POS permutations
Overview
Conformance check passed ✅No test result changes. |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds optional support for ql:has-word triples that enable keyword search in literals. For each literal in the dataset, internal triples of the form <literal> ql:has-word "word" are created for each distinct word, with term frequencies stored in the graph ID field. This feature can be activated via the --add-has-word-triples command-line option or by configuring the Qleverfile.
Key changes:
- Added configuration option
addHasWordTriples_with default valuetrue(noted as temporary for testing) - Renamed
tripleToInternalRepresentationtoprocessTripleandLangtagAndTripletoProcessedTriplefor clarity - Extended index building to tokenize literals, count word frequencies, and create corresponding internal triples
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/libqlever/Qlever.h | Added addHasWordTriples_ configuration option to CommonConfig |
| src/libqlever/Qlever.cpp | Propagate addHasWordTriples_ setting to index builder |
| src/index/IndexBuilderMain.cpp | Added --add-has-word-triples command-line option |
| src/index/Index.h | Added addHasWordTriples() accessor method |
| src/index/Index.cpp | Implemented addHasWordTriples() accessor method |
| src/index/IndexImpl.h | Added member variable addHasWordTriples_ and renamed method to processTriple |
| src/index/IndexImpl.cpp | Implemented word tokenization in processTriple and logging for has-word triple count |
| src/index/IndexBuilderTypes.h | Extended ProcessedTriple to include word frequencies; updated getIdMapLambdas to create has-word triples |
| src/global/Constants.h | Added HAS_WORD_PREDICATE constant for the ql:has-word IRI |
| src/parser/TripleComponent.h | Minor comment style change (triple-slash to double-slash) |
| test/util/IndexTestHelpers.h | Added addHasWordTriples field to test configuration |
| test/util/IndexTestHelpers.cpp | Set default test value for addHasWordTriples to false |
| test/IndexTest.cpp | Renamed test to processTriple; updated expectations for internal triple counts with has-word triples enabled |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ad_utility::triple_component::Literal::fromEscapedRdfLiteral( | ||
| absl::StrCat("\"", word, "\""))}); |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using Literal::literalWithoutQuotes(word) instead of manually wrapping the word in quotes and calling fromEscapedRdfLiteral. Since the tokenized words only contain alphanumeric characters (no special characters requiring escaping), using literalWithoutQuotes would be more direct and clearer in intent.
| ad_utility::triple_component::Literal::fromEscapedRdfLiteral( | |
| absl::StrCat("\"", word, "\""))}); | |
| ad_utility::triple_component::Literal::literalWithoutQuotes( | |
| word)}); |
| // NOTE: While testing, this defaults to true. Eventually, it should default | ||
| // to false. | ||
| bool addHasWordTriples_ = true; |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment indicates this defaults to true "while testing" and "eventually, it should default to false." This suggests a temporary state during development. Consider whether the default should be changed to false before merging, or if a TODO/issue should track this decision. Having a default of true could unexpectedly increase index sizes for users who don't explicitly set this option.
| // NOTE: While testing, this defaults to true. Eventually, it should default | |
| // to false. | |
| bool addHasWordTriples_ = true; | |
| // NOTE: This defaults to false to avoid increasing index sizes | |
| // unexpectedly. Enable explicitly if keyword search in literals is needed. | |
| bool addHasWordTriples_ = false; |
RobinTF
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly a lot of comments
| serializer | arg.iriOrLiteral_; | ||
| serializer | arg.isExternal_; | ||
| } | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this struct moved in the first place? I don't think anything changed here, right?
| // In each map, assign the first IDs to the special IRIs `ql:langtag` and | ||
| // `ql:has-word`. | ||
| // | ||
| // NOTE: This is not necessary for functionality, but certain unit tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the TODO removed?
| // NOTE: There is similar code in `DeltaTriples::makeInternalTriples` | ||
| // for adding these internal triples for update triples. If you change | ||
| // this code, you probably also have to change that one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How difficult would it be to make this also work with UPDATEs?
As you know we're planning to implement the internal triples ql::langtag asymmetrically, so that they are added when a literal is added but never removed. Would this also work here? Or is there a problem if these triples exist if the original entry was already removed?
| // Third, if applicable, add a `ql:has-word` triple for each distinct word | ||
| // in the literal. We abuse the graph ID field to store the term | ||
| // frequency of the word in the literal. | ||
| if (!lt.wordFrequencies_.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lambda is already very lengthy. Did you consider extracting this final part to a dedicated function?
| // Update the counter for the number of ql:has-word triples. | ||
| if (numHasWordTriples != nullptr) { | ||
| numHasWordTriples->fetch_add(lt.wordFrequencies_.size(), | ||
| std::memory_order_relaxed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason to use std::memory_order_relaxed here. I believe it is safe to do here, but it is very hard to reason about this memory order in comparison to the default memory order.



Optionally add, for each triple with a literal object, and for each word in that literal, an internal triple
<literal> ql:has-word "word". The number of occurrences of the word in the literal is stored in the graph . This can be activated by callingIndexBuilderMainwith option--add-has-word-triples, or by settingADD_HAS_WORD_TRIPLES = truein the[index]section of theQleverfile.These triples can be used for customized text search. To make this efficient, materialized views can be used. For example, to enable full-text search in the
rdfs:labelandskos:altLabelliterals of all subjects in your dataset, giving weight 5 to the former and weight 2 to the latter, you could create this materialized view:qlever materialized-view words "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX skos: <http://www.w3.org/2004/02/skos/core#> SELECT ?word ?subject ?score ?tf ?weight WHERE { { ?subject rdfs:label ?text BIND (5 AS ?weight) } UNION { ?subject skos:altLabel ?text BIND (2 AS ?weight) } GRAPH ?tf { ?text ql:has-word ?word } BIND (?tf * ?weight AS ?score) }"This view can then be efficiently queried as follows