-
Notifications
You must be signed in to change notification settings - Fork 107
First version of magic proxy service #2575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Includes `qlproxy_server.py` for end-to-end testing. This should eventually be removed again. For example, when that server is started with `python3 qlproxy_server.py --port 8888 verbose`, the the following should work in QLever: ```sparql PREFIX qlproxy: <https://qlever.cs.uni-freiburg.de/qlproxy/> SELECT ?result WHERE { VALUES (?x1 ?x2) { (1 3) (2 2) (3 1) } SERVICE qlproxy: { _:config qlproxy:endpoint <https://qlever.dev/api/proxy-test> ; qlproxy:payload_first ?x1 ; qlproxy:payload_second ?x2 ; qlproxy:result_res ?result ; qlproxy:param_op "add" . } } ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a new magic proxy service (SERVICE qlproxy:) for QLever that enables SPARQL queries to send payload bindings to remote endpoints and receive result bindings back. The implementation allows arithmetic and other operations to be delegated to external services, with payload and result variables mapped between QLever and the remote endpoint.
Key changes:
- Added
ProxyQueryparser to handleqlproxy:service configuration with endpoint URL, payload variables, result variables, and URL parameters - Implemented
ProxyOperationengine operation that serializes payload as SPARQL Results JSON, sends HTTP POST requests, and parses responses - Integrated proxy operation into the query planner with child operation support similar to
SpatialJoin
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/parser/ProxyQuery.h |
Defines ProxyQuery and ProxyConfiguration structures for parsing and validating qlproxy service configuration |
src/parser/ProxyQuery.cpp |
Implements parameter extraction and validation for endpoint, payload, result, and URL parameter specifications |
src/parser/MagicServiceIriConstants.h |
Adds QLPROXY_IRI constant for the qlproxy service IRI |
src/parser/GraphPatternOperation.h |
Adds ProxyQuery to the graph pattern operation variant |
src/parser/sparqlParser/SparqlQleverVisitor.h |
Declares visitProxyQuery method for SPARQL parser visitor |
src/parser/sparqlParser/SparqlQleverVisitor.cpp |
Implements visitor method to parse and validate qlproxy service clauses |
src/parser/CMakeLists.txt |
Adds ProxyQuery.cpp to parser build |
src/engine/ProxyOperation.h |
Defines ProxyOperation class with HTTP request handling and JSON serialization |
src/engine/ProxyOperation.cpp |
Implements proxy operation execution, including payload serialization, HTTP communication, and result parsing |
src/engine/QueryPlanner.h |
Declares methods for proxy operation planning and child addition |
src/engine/QueryPlanner.cpp |
Implements query planner logic to join proxy operations with child operations providing payload variables |
src/engine/CheckUsePatternTrick.cpp |
Adds ProxyQuery to operations where pattern trick is disabled |
src/engine/Service.h |
Removes bindingToTripleComponent method (moved to shared utility) |
src/engine/Service.cpp |
Refactors to use sparqlJsonBindingUtils::bindingToTripleComponent instead of member method |
src/engine/CMakeLists.txt |
Adds ProxyOperation.cpp to engine build and updates comment about C++20-dependent operations |
test/ProxyOperationTest.cpp |
Comprehensive tests for ProxyOperation including basic methods, multiple result variables, URL parameters, and error handling |
test/ServiceTest.cpp |
Updates bindingToTripleComponent test to use shared utility function |
test/CMakeLists.txt |
Adds ProxyOperationTest and SparqlJsonBindingUtilsTest to test suite |
qlproxy_server.py |
Python test server for end-to-end testing, performing arithmetic operations on received bindings |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| first = false; | ||
| // URL-encode the value (simple encoding for common characters). | ||
| url += name + "=" + |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter name should also be URL-encoded to handle special characters properly. Currently only the value is encoded. Consider using:
url += boost::urls::encode(name, boost::urls::unreserved_chars) + "=" +
boost::urls::encode(value, boost::urls::unreserved_chars);| url += name + "=" + | |
| url += boost::urls::encode(name, boost::urls::unreserved_chars) + "=" + |
qlproxy_server.py
Outdated
|
|
||
| # Build the response | ||
| response = { | ||
| "head": {"vars": ["res"]}, |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcoded variable name "res" doesn't match the example in the docstring (lines 16-19), which shows qlproxy:result_result ?sum, implying the result variable name should be "result". The server should either:
- Accept the variable names from the request's "head.vars" field and echo them back in the response, or
- Update the docstring example to use
qlproxy:result_res ?sumto match the hardcoded "res" name.
Option 1 is more flexible and correct for a general-purpose test server.
| std::string url = config_.endpoint_; | ||
| if (!config_.parameters_.empty()) { | ||
| url += "?"; | ||
| bool first = true; | ||
| for (const auto& [name, value] : config_.parameters_) { | ||
| if (!first) { | ||
| url += "&"; | ||
| } | ||
| first = false; | ||
| // URL-encode the value (simple encoding for common characters). | ||
| url += name + "=" + | ||
| boost::urls::encode(value, boost::urls::unreserved_chars); | ||
| } | ||
| } |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The URL string is built using repeated += operations which can cause multiple reallocations. For better performance with many parameters, consider using absl::StrCat or preallocating the string size, similar to:
std::vector<std::string> parts;
parts.reserve(config_.parameters_.size());
for (const auto& [name, value] : config_.parameters_) {
parts.push_back(absl::StrCat(name, "=",
boost::urls::encode(value, boost::urls::unreserved_chars)));
}
return absl::StrCat(config_.endpoint_, "?", absl::StrJoin(parts, "&"));
src/parser/ProxyQuery.cpp
Outdated
| } else { | ||
| throw ProxyException(absl::StrCat( | ||
| "Unsupported parameter `", predString, | ||
| "` in qlproxy service`. Supported parameters are: `<endpoint>`, " |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The backtick placement in this error message is incorrect. The string currently reads `in qlproxy service`. which places the backtick before "service`" oddly. It should either be:
in qlproxy service.(remove the misplaced backtick), or`in qlproxy service.`(move the closing backtick after the period)
Most likely the backtick should be removed entirely for consistency with the rest of the message.
| "` in qlproxy service`. Supported parameters are: `<endpoint>`, " | |
| "` in qlproxy service. Supported parameters are: `<endpoint>`, " |
src/engine/CMakeLists.txt
Outdated
| CartesianProductJoin.cpp TextIndexScanForWord.cpp TextIndexScanForEntity.cpp | ||
| TextLimit.cpp LazyGroupBy.cpp GroupByHashMapOptimization.cpp SpatialJoin.cpp | ||
| CountConnectedSubgraphs.cpp SpatialJoinAlgorithms.cpp PathSearch.cpp ExecuteUpdate.cpp | ||
| CountConnectedSubgraphs.cpp SpatialJoinAlgorithms.cpp PathSearch.cpp ProxyOperation.cpp ExecuteUpdate.cpp |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding ProxyOperation.cpp to the engine target enables the SERVICE qlproxy: feature, whose implementation sends HTTP/HTTPS requests to the user-specified qlproxy:endpoint URL with no host/port restrictions. On a publicly reachable SPARQL endpoint this lets an attacker craft queries that cause QLever to issue arbitrary HTTP(S) POSTs to internal services (e.g., 127.0.0.1, cloud metadata IPs, admin panels), which is a classic SSRF primitive. To mitigate this, gate SERVICE qlproxy: behind server-side configuration and/or enforce a strict allowlist for proxy targets (e.g., configured hostnames and safe ports, blocking private/reserved IP ranges) instead of accepting arbitrary external URLs from queries.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #2575 +/- ##
==========================================
- Coverage 91.24% 90.70% -0.54%
==========================================
Files 477 482 +5
Lines 40835 41212 +377
Branches 5436 5506 +70
==========================================
+ Hits 37260 37382 +122
- Misses 2028 2264 +236
- Partials 1547 1566 +19 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this this morning and it works very well as long as there is a single proxy service. If you try to combine multiple proxy services, it is quite easy to generate cryptic assertion failures:
PREFIX qlproxy: <https://qlever.cs.uni-freiburg.de/qlproxy/>
SELECT ?result2 WHERE {
VALUES (?x1 ?x2) { (1 3) (2 2) (3 1) }
SERVICE qlproxy: {
_:config qlproxy:endpoint <https://qlever.dev/api/proxy-test> ;
qlproxy:payload_first ?x1 ;
qlproxy:payload_second ?x2 ;
qlproxy:result_res ?result ;
qlproxy:param_op "add" .
}
SERVICE qlproxy: {
_:config qlproxy:endpoint <https://qlever.dev/api/proxy-test> ;
qlproxy:payload_first ?x1 ;
qlproxy:payload_second ?x2 ;
qlproxy:result_res ?result2 ;
qlproxy:param_op "add" .
}
}
results in an ERROR: Assertion 'res > 0' failed. in MultiColumnJoin.cpp. I guess that query doesn't make any sense because the variables ?x1 and ?x2 are not "passed through" by the proxy, but the error message should be better.
For the following query, I would naively expect a result of
8
8
8
but I am getting ERROR: Assertion 'variablesAreDisjoint' failed.:
PREFIX qlproxy: <https://qlever.cs.uni-freiburg.de/qlproxy/>
SELECT ?result2 WHERE {
VALUES (?x1 ?x2) { (1 3) (2 2) (3 1) }
SERVICE qlproxy: {
_:config qlproxy:endpoint <https://qlever.dev/api/proxy-test> ;
qlproxy:payload_first ?x1 ;
qlproxy:payload_second ?x2 ;
qlproxy:result_res ?result ;
qlproxy:param_op "add" .
}
SERVICE qlproxy: {
_:config qlproxy:endpoint <https://qlever.dev/api/proxy-test> ;
qlproxy:payload_first ?result ;
qlproxy:payload_second ?result ;
qlproxy:result_res ?result2 ;
qlproxy:param_op "add" .
}
}
However, I am not sure whether queries which include multiple SERVICE proxy calls like that even make any sense.
EDIT: in the query above, the following works:
PREFIX qlproxy: <https://qlever.cs.uni-freiburg.de/qlproxy/>
SELECT ?result2 WHERE {
{
SELECT * WHERE {
VALUES (?x1 ?x2) { (1 3) (2 2) (3 1) }
SERVICE qlproxy: {
_:config qlproxy:endpoint <https://qlever.dev/api/proxy-test> ;
qlproxy:payload_first ?x1 ;
qlproxy:payload_second ?x2 ;
qlproxy:result_res ?result ;
qlproxy:param_op "add" .
}
}
}
SERVICE qlproxy: {
_:config qlproxy:endpoint <https://qlever.dev/api/proxy-test> ;
qlproxy:payload_first ?result ;
qlproxy:payload_second ?result ;
qlproxy:result_res ?result2 ;
qlproxy:param_op "add" .
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| inputVariables_.emplace_back(std::move(paramName), std::move(var)); | ||
| } else if (predString == "output-row") { | ||
| // Special case: the row variable for joining. | ||
| Variable var = getVariable(predString, object); |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The row variable name extraction assumes that variable names always start with '?', but this may not be guaranteed depending on how variables are constructed. Consider adding a validation check to ensure the variable name has at least 2 characters before calling substr(1), or use a safer method that checks for the '?' prefix explicitly.
| Variable var = getVariable(predString, object); | |
| Variable var = getVariable(predString, object); | |
| // Validate that the variable name starts with '?' and has at least 2 characters. | |
| throwIf(var.name().size() < 2 || var.name()[0] != '?', | |
| absl::StrCat("The row variable name '", var.name(), | |
| "' is invalid. It must start with '?' and have at least one character after '?'.")); |
| if (rowIdx1Based < 1 || | ||
| static_cast<size_t>(rowIdx1Based) > childTable.size()) { |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code converts a signed int64_t to size_t for comparison without checking if rowIdx1Based is negative first. While line 289 checks if it's less than 1, the cast on line 290 could cause issues if rowIdx1Based is a very large negative number. Consider reordering the logic to check the negative case before casting, or use a safer comparison approach.
| if (rowIdx1Based < 1 || | |
| static_cast<size_t>(rowIdx1Based) > childTable.size()) { | |
| if (rowIdx1Based < 1 || rowIdx1Based > static_cast<int64_t>(childTable.size())) { |
| auto proxyOp = static_cast<Proxy*>(op.get()); | ||
|
|
||
| if (proxyOp->isConstructed()) { |
Copilot
AI
Dec 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using static_cast here is unsafe since there's no guarantee that the operation is actually a Proxy. The code already uses dynamic_pointer_cast above to check if it's a Proxy, but then uses static_cast on the raw pointer. Consider using dynamic_cast or storing the result from the earlier dynamic_pointer_cast to avoid potential undefined behavior.
| auto proxyOp = static_cast<Proxy*>(op.get()); | |
| if (proxyOp->isConstructed()) { | |
| auto proxyOpShared = std::dynamic_pointer_cast<Proxy>(op); | |
| if (!proxyOpShared) { | |
| // This should not happen; indicates a logic error. | |
| throw std::runtime_error("Expected a Proxy operation but got a different type."); | |
| } | |
| if (proxyOpShared->isConstructed()) { |
Overview
Conformance check passed ✅No test result changes. |
|


Implement a generic proxy service that sends arbitrary
sparql-results+jsoninput to a remote endpoint, receivessparql-results+jsonoutput from that endpoint, and joins this with the enclosing graph pattern. As usual for QLever's magic services, the configuration is provided via_:config ql:proxy:... ...triples. The input is provided viaql:proxy-<input> ?var?, where?varis a variable from the enclosing graph pattern, andinputis the name of the variable sent to the endpoint. Additionally, URL parameters can be specified via... qlproxy:param-<key> <value>triples; they will be added to the endpoint request. The output is provided viaql:proxy-<output> ?var?, whereoutputis the name of the variable sent back by the endpoint, and?varis the variable to which the respective results are bound in the query. The input sent to the endpoint is augmented by the variable specified byql:proxy:output-row, which provides the row index for each binding. The endpoint must provide a binding for that same variable for each solution it returns. The result of the service request is joined with the result of the enclosing graph pattern on this variable. Here is an example query. A Python scriptqlproxy_qgrams.pythat implements the endpoint for this example, is part of this PR.