-
Notifications
You must be signed in to change notification settings - Fork 22
Add aggregation response #259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
It seems necessary to add a separate granularity level for aggregate data responses?!
Since this aggregate ... implementation uses standard Beacon requests and responses, counts should always reflect the entry type of the response; i.e. it should be the count of the `individuals` having `biosamples` with a certain histology if using `.../individuals/?...`. While certainly _possible_ it seems pretty confusing to have different entities counted. Also it is not intuitive since the response entity indicates _what_ should be reported on.
This commit modifies the AggregateResultsInstance to accommodate for different options: * ontology terms as aggregation terms, where the id corresponds to the value and only the count of its occurrences is relevant * alphanumeric types, where an id (e.g. age, sex...) can have different values (represented in a `distribution` object) Also, some prototype definitions for `distribution` are provided: * a `values` list * a `distincts` list * an `items` list where objects can be provided Examples are given. Other `DistributionItem` properties could be added (e.g. `average` ...). Also, the structure is open for some changes (e.g. just use a generic object instead of the `items` array, defined through `patternProperties` or such).
Update from Schema urgent fixes
This commit
* removes the `aggregated` granularity again
* changes much of the wording and naming from "aggragation" to "summary"
- this leaves "aggregate..." where it still makes sense (i.e. a summary from aggregating over items...)
- might be incomplete or overdoing it - testing here
* changed "summaryTerms" (now....) to be just of type string, not of necessarily corresponding to filter ids (though those are documented)
Updating main from develop, milestone 2.2.0
…eacon/beacon-v2 into add-aggregation-response
update add-aggregation-response from main
* `beaconAggregationTermsResults ... * `/aggregation_terms` endpoint and `beaconAggregationTermsResults` definition * responseAggregation into resultSetsResponse * re-adding responseAggregation into resultSetsResponse * (re)adding responseAggregation and resultsAggregation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for aggregated responses in Beacon by reintroducing an aggregate granularity, defining new response types, and exposing an endpoint to discover available aggregation terms.
- Reinstates
aggregategranularity and introducesbeaconAggregationResponseandbeaconAggregationResultssections. - Adds
aggregationTermsquery parameter, definesaggregationTermsschema, and creates/aggregation_termsendpoint. - Renames individual model property
measurementstomeasuresin schemas and examples.
Reviewed Changes
Copilot reviewed 31 out of 35 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| models/src/beacon-v2-default-model/individuals/examples/individual-MID-example.yaml | Rename measurements to measures in example |
| models/src/beacon-v2-default-model/individuals/defaultSchema.yaml | Rename measurements to measures in schema |
| framework/src/requests/aggregationTerms.yaml | Add aggregationTerms array schema and documentation |
| framework/src/responses/sections/beaconAggregationResults.yaml | Define AggregationResultsInstance schema and examples |
| framework/src/endpoints.yaml | Add /aggregation_terms endpoint configuration |
Comments suppressed due to low confidence (4)
framework/src/responses/sections/beaconAggregationResults.yaml:55
- The schema references
summaryTerms.yaml, which does not exist. Update the$refto point to the newly addedaggregationTerms.yaml(e.g.,../../requests/aggregationTerms.yaml#/$defs/AggregationTerm).
$ref: ../../requests/summaryTerms.yaml#/$defs/AggregationTerm
framework/src/responses/sections/beaconAggregationResults.yaml:89
- The
additionalItemskeyword only applies to arrays; for an object schema you should useadditionalProperties: trueif you intend to allow extra fields.
additionalItems: True
framework/src/responses/sections/beaconAggregationResults.yaml:86
- Moving
oneOfunderrequiredis invalid in JSON Schema. You should place a top-leveloneOfsibling topropertiesto enforce that eithercountorreportis present.
- oneOf:
framework/src/endpoints.yaml:145
- Consider adding unit or integration tests for the new
/aggregation_termsendpoint to verify successful responses and error handling.
/aggregation_terms:
This adds a `maturity: draft` label at the root of the aggregation documents. This does not account for: * maybe it should only be in the root documents * some of general schemas have now properties relying on the draft schemas
gsfk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just back at work and having a first look at this now.
Granularity
My original proposal didn't introduce an extra granularity, largely because I liked the idea of moving toward a more uniform response, and because you'd previously convinced me it wasn't needed. If, as we seem to have discussed, the general idea is that any granularity at all can return beaconResultsetsResponse, this could be a lot more clear in the spec.
From what I can see this pr allows summary/aggregation response from
beaconAggregationResponse and beaconResultsetsResponse but not from beaconCountResponse and beaconBooleanResponse. Implicitly for an implementer this looks like "count response can't send summary stats"... even though record granularity can.
Are there advantages to adding an extra granularity that we don't get by simply adding an optional responseAggregation field to all responses?
How do I request an aggregated response?
- request
granularity:aggregated& aggregation terms - request
granularity:record& aggregation terms
what about these?:
- request
granularity:aggregatedwithout aggregation terms - request
granularity:recordwithout aggregation terms
Should I expect some beacon-dependent summary stats from 3 and 4? Or just from 3?
It's clear that the PR considers this an error:
- request
granularity:count& aggregation terms
... unless the idea is that the count granularity can return ResultsetResponse instead of BeaconCountResponse.
Aggregation terms stringification
One of the persistent issues with filtering terms is a lack of spec or guidelines on how to convert alphanumeric filters (defined only as objects) to string, since some areas of the spec assume filters are strings. So I'm happy to see we're not recreating this issue with aggregation terms and suggesting a stringification right away. A few comments though:
- the stringification fix is only in the description and the example, not actually the spec
- it uses a walrus operator style colon (sex:=male instead of sex=male) which I doubt is uncontroversial
- ... shouldn't we fix this for filtering terms too?
Aggregation terms "report"
If I understand the description in aggregationTerms.yaml we still allow users to request a general sex overview without having to specify particular sexes in the request. But in beaconAggregationResults.yaml there is only one example of a "report" style response and it's quite complex. I understand that report style is an object and otherwise not specified, but if it's still possible to get something like {sex: {male: 123, female: 456}} I would suggest adding a simpler example.
Misc issues
- There are a few broken references to
beaconSummaryResults.yaml, this file doesn't exist. - The comment "aggregate bools for entire query" in
beaconBooleanResponseSection.yamlno longer applies (my previous pr allowed for bool-only aggregate stats, which is admittedly a little weird)
Finally, there were a couple things I didn't understand at all, although I suspect are not particularly relevant:
- why is "measurements" changed to "measures"?
- I don't understand any of the changes to files in /bin... in fact I don't understand why these files aren't gitignored. Aren't most of these just artifacts of docs generation?
zykonda
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me: the aggregation is applied to the final record set and has a clean separation from all other beacon concepts and items.
|
@zykonda Thanks for the confirmation!
I think this is/was a viable solution but after discussions and opinion from @jrambla it seems clearer to do it the other way around: specific granularity -> response instead of universal aggregation option.
Yes. Granularity is also sed to indicate security levels and an aggregation might expose more than a count. And we had modified Now one change which has to be added: it should be ... but this needs some JSON Schema refinement (i.e. an optional element where you report either the record level results OR their aggregation but not both). IMO not strictly necessary but makes sense (or we can just document it).
Yes, seems so:
Yes.
No; record is record ... Although a beacon always can downgrade the granularity and deliver an
Yes. A beacon can decide to have a default aggregation response.
No but see above.
3
Yes.
We have recently changed the required fields so that resultsetResponses are "count" or "boolean" compatible. So a resultsetResponse is just a format but there is no real relation between this format and the granularity. We had discussed things like "booleanResultsetResponse" ... but this clashes w/ the concept of mixed responses. Also IMO ... but this could still be done w/o breaking stuff (I'm in favour here but don't want to push this w/o being prompted).
Yes, should be defined globally. And I'm for
Sure; you could add that in principle but this particular example would be better a standard one: A better
Thanks - removed.
Gone.
Zombie from some merging of not fully updated branch I guess.
Please Ignore ... |
additional aggregation "record" example
There was a zombie reversal of measurements => measures, probably due to some early branch forking. Fixed here to align w/ main & develop.
|
Thanks for the response, this is mostly clear, with a few comments:
And a couple comments on the "report" examples
|
Sure. This might need a real schema / workflow diagram ...
Well... Typing on the train etc ... Fix coming.
No. Well. The example shows:
... and a |
... according to @gsfk
|
Sex is probaby not the best example for me to use, since one of my concerns is how to give a distribution for a category with no clear ontology mappings. The other obvious worry is that leaving the syntax of reports open will mean that everyone implements it differently, when presumably for clients or beacon networks we will want them as similar as possible. |
jrambla
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I'm missing example documents in the examples-sections and ezamples-fulldocuments folders.
- Adding the measurements > measures change here is not appropriate and not necessary
- Adding "maturity" just in this PR has not been discussed
- The "report" option needs a deeper definition of the purpose and a review of the approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't remove the part of the description that mentions details about the record granularity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what this refers to? Also, file is not part of the schema but generated.
Btw: last changes are 71f4045
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't remove the part of the description that mentions details about the record granularity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disagree. Data management strategy does not fit here; it is part of the model to define which fields are required and general amelioration strategies (including pagination, limit of content...) belong to a general documentation.
BTW: Code chgange comments should be made against .../src. not .../json.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- There is a reason for adding "maturity" just in this PR?
- Where is the expression syntax explained? (e,g,
"ageAtDiagnosis:<=P17Y")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Good point. While this is a very \good first case for introducing (sub-) schema/feature maturity levels this should probably happen in a separate PR.
- Nowhere since it follows the stringified filter concept https://docs.genomebeacons.org/filters/#alphanumerical-value-queries ; pointer can be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update:
- Removed in 71f4045
- Added pointer to the documentation here (future commit). However, we should add this definition (stringified, valued alphanumeric filter format) also to the
filteringTermsschema since there is some ambiguous use in spite of the documentation on the website.
* removes the `maturity: draft` labels * fixes the zombie appearances of old generated doc parts outside of the schema (e.g. `measures`...); not really relevant since those will be generated but to avoid confusion...
93a374b to
71f4045
Compare
|
@gsfk @dbujold @jrambla Based on the discussions yesterday I've done a number of clean-up and content changes; see the comments above. The main point (besides some naming and description fixes) is the re-introduction of In principle we could ditch the valued term => count; but then would lose the simplest aggregation cases and e.g. one could not simply pick a CURIE to get counts on but would need the concept; and if we only use valued terms then we lose the simpe wrapping of value distributions for a given concept (something that @gsfk and @jrambla emphasized). Note: In previous iterations we had also different options for distributions, e.g. mean or range ... this might need to be added for numeric values (pretty easy as specific properties to Note: I've squashed & force-pushed the distribution related commits so now this is 9682bed. Another note: Reminder that the Footnotes
|
This commit: * adds distributions to aggregation results, as object with one or more key:value pairs (values limited to integer ATM) * changes so far ill defined "entity" in aggregation response objects to `entryTypeId` (and defines this) * introduces `concept` as aggregation type and documents it as model path (in simple dot annotation) * adding distribution examples There are additional notes (with a TODO) about concept / distribution use.
cbd74f3 to
9682bed
Compare
This aligns the aggregationTerms definition for requests with the format of filteringTerms definition (object with id instead of string).
| }, | ||
| "report": { | ||
| "$ref": "#/$defs/Report" | ||
| } | ||
| }, | ||
| "required": [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty sure this should look like this:
"required": ["id"],
"oneOf": [
{ "required": [ "count" ] },
{ "required": [ "distribution" ] },
{ "required": [ "report" ] }
],
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fixed in af8aae5
|
If we are doing aggregation through its own granularity, shouldn't we make something like "available granularities" discoverable? |
@gsfk Good point (but not specifically related to this change). ATM only the default granularity is in |
This fixes required / oneOf in beaconAggregationResults (thanks to @gsfk !).
This represents the first pass at rewriting the aggregation response according to: * "Proposal A" in the discussion #238 (comment) * the following requirement checks, e.g. 1 and 2-dimensional data summaries used in the various dashboards (posted in the discussion) * the live implementation in Progenetix, now both powering a general data dashboard as well as dynamically generated summaries after searches (e.g. try the "CNV Example" in https://progenetix.org/search/ and remove the Glioblastoma filter ... takes 1-2mins). The examples have been moved to a separate document. A separate documentation page will describe more details.
* collapse `AggregationConcept` `scope`+`property` to `property: scope.property` to avoid confusion with the scope of the reporting
This changes the aggregation terms response to have a `results/aggregationTerms` property - similar to `filteringTerms`; in contrast to the previous simple listing of terms in `results`.
This is a refresh for adding an aggregation response to Beacon, based on @gsfk 's original #237 but as a restart after various discussions there, in #238 and at offline events (e.g. McGill workshop on 2025-07-09).
The main points of the PR are
aggregategranularitybeaconAggregationResponseas a dedicated response type for "only aggregated data" responsesbeaconAggregationResultssection (list w/AggregationResultsInstanceobjects) which is referenced asresponseAggregation(for beacon overall responses) orresultsAggregation(insideResultSetinstances)responseAggregationtobeaconResultsetsResponseandresultsAggregationtoResultSetinstancesaggregationTermsframework query parameter and aaggregationTermsdefinition/aggregation_termsendpoint andbeaconAggregationTermsResultsdefinitionThis is: