-
Notifications
You must be signed in to change notification settings - Fork 381
VUFIND-1210: Use Solr JSON APIs #4991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
| $params = $params ?: new ParamBagBag(); | ||
| $params = ParamBagBag::from($params); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from should just create the object if needed, if a param indicates to do so
| $params = $params ?: new ParamBagBag(); | ||
| $params = ParamBagBag::from($params); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above re from
| * @var array | ||
| */ | ||
| protected $params = []; | ||
| protected $items = []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All Solr queries will now have a very large array params within the ParamBag. Having the variable also called $params makes debugging a nightmare. Could be called $contents or whatever else.
| * @license http://opensource.org/licenses/gpl-2.0.php GNU General Public License | ||
| * @link https://vufind.org | ||
| */ | ||
| class ParamBagBag extends ParamBag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this both extends ParamBag and will typically contain one as an actual param. That's confusing, but it keeps all the method signatures the same.
Also, ParamBag was (I think) exclusively used to ultimately generate request query parameters, whereas now it's being used to generate a JSON body. But again it seemed like "param" doesn't have to mean URL param.
Also, in theory I could have just changed the existing ParamBag to support values that are not strings -- the typing is confusing -- some methods limit it while some don't. But I didn't want to create a breaking change on such a widely used class.
Finally, the name is kind of silly, happy to change. NestedParamBag seemed wrong since this is not the nested one; it's the parent of a nested one. ParamBagThatAlsoContainsParamBagsSometimes is...worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this problem be solved with interfaces? I'd have to spend a little time analyzing how all the different backends use ParamBags, but one idea might be to have a generic ParamBagInterface that's implemented by a more specific QueryParamBag that limits to things that can be represented by query parameters... and then there could be a JsonParamBag that adds this better nesting support. And then we could type with interfaces in generic places and use more specific types where appropriate.
As I say, that's just a very broad sketch of an idea, and I'd need to do more study to come up with a more thoughtful idea -- but if you think it's a general direction worth exploring, let me know and I'll be happy to discuss further when I'm back in the office.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An interface might help (and even then it might still be something that both inherits and contains the basic ParamBag, as in QueryGroup).
But in addition to the two new issues I'm trying to address with the new class (n-dimensional nesting, output as JSON), I'm also not sure if the original intent of ParamBag (duplicate param names) is still needed. The JSON spec seems to allow-but-discourage this. If the Solr API will never need duplicate param names, and we're reasonably sure any other API needing nested params (currently 0) will never need it either, then this new ParamBagBag can just store its contents in an n-deep associative array and not inherit any of the original ParamBag behavior. So then the interface is even more important, because the two implementations would be totally separate.
But I think duplicate keys are still needed in Solr, at least there are some parameters (fields, filter, hl, stats, shards, boost) I have to investigate. So I don't want to simplify ParamBag that way until I get all the functionality working and see if it's then possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in theory I could have just changed the existing ParamBag to support values that are not strings -- the typing is confusing -- some methods limit it while some don't. But I didn't want to create a breaking change on such a widely used class.
From what I see the typing is just informal, i.e. via PHPDoc. I would rather remove this restriction than having an extra class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addendum: Of course then we need do some work on the ParamBag::request().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I confirmed that duplicate param names are definitely still needed -- and I have logic now to serialize that into an array, as Solr expects. So forget the idea of no longer using the base ParamBag functionality.
So, reasons discussed to extend ParamBag and/or create a shared interface:
- The need for json encoding. But that's not a good reason, we could just have the original ParamBag implement JsonSerializable::jsonSerialize() in addition to the existing
request. - The need to encapsulate other ParamBags. This requires a lot of methods, but doesn't actually mess with anything ParamBag is doing now. However as @dmj says we'd have to adjust
ParamBag::request()to support it. - The question of whether ParamBag's informal string typing is actually relied on anywhere, even though it's just PHPDocs for now. I'd hate to break that if so.
- The question of whether we'll have to support the legacy query parameters in addition to the json encoding, discussed here, and that requirement may come up in relation to DefaultParametersListener also. So I haven't implemented that yet, but it would be a pretty confusing addition to ParamBagBag with two internal sets of items, one to output in each method.
I don't know if any of those arguments are convincing, alone or together, and if not I can certainly add all that logic to the core ParamBag.
|
I considered implementing this with a flag to revert to the old method (query parameters) as needed. But the code changes are so significant that it would have been a real pain to support both. |
|
Basic functionality including basic faceting now works with the JSON APIs. Lots more to go. @demiankatz @sturkel89 I will eventually need a lot of testing help on this, there are a lot of VuFind features that use Solr that I've haven't used personally before so I can change the API but I don't really know what to expect, unless the integration tests are 100%. It's not ready yet though ... lots of TODOs in the code. I'm off on vacation until Jan 12th. Happy new year! |
| * | ||
| * @return array | ||
| */ | ||
| protected function jsonObject($items) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not implement JsonSerializable? That makes the intend clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thank you for the suggestion!
| public function json(): string | ||
| { | ||
| $jsonObject = $this->jsonObject($this->items); | ||
| return json_encode($jsonObject); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should deal with JSON errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. To start, I changed this to throw the error, which seems to be the usual practice in our codebase. Open to more specific handling if folks thing there is a useful fallback.
| if (!isset($parts[1])) { | ||
| continue; | ||
| } | ||
| // TODO This will need some config changes to know how to nest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending #5013 first.
| ); | ||
| $fields = $this->getFields($shards); | ||
| $specs = $this->getSearchSpecs($fields); | ||
| $this->backend->getQueryBuilder()->setSpecs($specs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic below is just a first draft, I don't like how it breaks the ParamBag abstraction completely in order to use array_filter. Either have to build some sort of filtering into ParamBag itself or just create a new ParamBag on the fly instead.
Also, I will need assistance actually testing this with shards.
| if (!empty($fieldPrefix)) { | ||
| $facetSet["f.{$facetField}.facet.prefix"] = $fieldPrefix; | ||
| } | ||
| $fieldMatches = $this->getFacetMatchesForField($facetField); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new Facet API does not support regex matches of this kind. The facet API was introduced here in 2015. matches was introduced two years later and I guess never made it into the new API.
Options
- Remove support for that feature. Annoying. Prefix though is still supported.
- Don't use the new Facets API. Possible but not the goal.
- Use the new Facets API but try to use a legacy URL query parameter for this bit. Assuming it works (would be odd to configure most of a facet via JSON but this one aspect via a query parameter), that means extending ParamBag further to support a combination of json and URL params. Doable but ugly.
- Contribute this as an enhancement to Solr. Certainly seems possible but will take some ramp-up on Solr and then waiting for a release.
- Discussing elsewhere with the Solr community. A fair place to start certainly but timing will still be slow.
| * | ||
| * @return array | ||
| */ | ||
| protected function getDefaultSearchParams(): array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of the TreeDataSource changes are tested yet; I don't have hierarchy data.
| if (!empty($fieldMatches)) { | ||
| $facetSet["f.{$facetField}.facet.matches"] = $fieldMatches; | ||
| } | ||
| // TODO Figure out date range field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need some help please understanding how to test this. In my testing, $this->facetConfig seems to exclude the pubDate facet thanks to logic in SideFacets, so the code below never gets executed. I see also @EreMaijala 's recent PR #4479 but I don't think that's involved. What am I missing?
|
I think I've reached a stopping point, pending answers to some of the questions above. |
| $params->setNested('params', 'spellcheck', 'false'); | ||
| $explainParams = new ParamBagBag([ | ||
| 'fields' => 'id,score', | ||
| 'facet' => 'true', // This field will need an update when I do the facet API upgrade |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting on #5017
| 'params' => new ParamBag([ | ||
| 'debug' => 'true', | ||
| 'indent' => 'true', | ||
| 'param' => 'query', // Is this change correct? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting on #5017
| $backendParams->addNested('params', 'facet.pivot', $pf); | ||
| $backendParams->setNested('params', 'facet', 'true'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must test this
Solr has a "new" (a decade old?) JSON-based query APIs, with JSON in the request body instead of a long series of query parameters. This new API in turn supports more complex faceting.
TODO