@mbaudis
Member

@mbaudis mbaudis commented Jul 17, 2025

This is a refresh for adding an aggregation response to Beacon, based on @gsfk's original #237 but as a restart after various discussions there, in #238 and at offline events (e.g. McGill workshop on 2025-07-09).

The main points of the PR are

  • reinstate aggregate granularity
  • create beaconAggregationResponse as a dedicated response type for "only aggregated data" responses
  • create a beaconAggregationResults section (list w/ AggregationResultsInstance objects) which is referenced as responseAggregation (for beacon overall responses) or resultsAggregation (inside ResultSet instances)
  • add responseAggregation to beaconResultsetsResponse and resultsAggregation to ResultSet instances
  • create an aggregationTerms framework query parameter and an aggregationTerms definition
  • create /aggregation_terms endpoint and beaconAggregationTermsResults definition

This is:

gsfk and others added 18 commits May 23, 2025 14:34
It seems necessary to add a separate granularity level for aggregate data responses?!
Since this aggregate ... implementation uses standard Beacon requests and responses, counts should always reflect the entry type of the response; i.e. it should be the count of the `individuals` having `biosamples` with a certain histology if using `.../individuals/?...`. While certainly _possible_ it seems pretty confusing to have different entities counted. Also it is not intuitive since the response entity indicates _what_ should be reported on.
This commit modifies the AggregateResultsInstance to accommodate different options:

* ontology terms as aggregation terms, where the id corresponds to the value and only the count of its occurrences is relevant
* alphanumeric types, where an id (e.g. age, sex...) can have different values (represented in a `distribution` object)

Also, some prototype definitions for `distribution` are provided:
* a `values` list
* a `distincts` list
* an `items` list where objects can be provided

Examples are given. Other `DistributionItem` properties could be added (e.g. `average` ...). Also, the structure is open for some changes (e.g. just use a generic object instead of the `items` array, defined through `patternProperties` or such).
This commit

* removes the `aggregated` granularity again
* changes much of the wording and naming from "aggregation" to "summary"
    - this leaves "aggregate..." where it still makes sense (i.e. a summary from aggregating over items...)
    - might be incomplete or overdoing it - testing here
* changed "summaryTerms" (now....) to be just of type string, not necessarily corresponding to filter ids (though those are documented)
…_change_measurement_to_measure"

This reverts commit 559bfe4, reversing
changes made to d227d24.
Updating main from develop, milestone 2.2.0
update add-aggregation-response from main
* `beaconAggregationTermsResults ...
* `/aggregation_terms` endpoint and `beaconAggregationTermsResults` definition
* responseAggregation into resultSetsResponse
* re-adding responseAggregation into resultSetsResponse
* (re)adding responseAggregation and resultsAggregation
@mbaudis mbaudis requested review from Copilot, dbujold, gsfk and jrambla July 17, 2025 14:07

Copilot AI left a comment

Pull Request Overview

This PR adds support for aggregated responses in Beacon by reintroducing an aggregate granularity, defining new response types, and exposing an endpoint to discover available aggregation terms.

  • Reinstates aggregate granularity and introduces beaconAggregationResponse and beaconAggregationResults sections.
  • Adds aggregationTerms query parameter, defines aggregationTerms schema, and creates /aggregation_terms endpoint.
  • Renames individual model property measurements to measures in schemas and examples.

Reviewed Changes

Copilot reviewed 31 out of 35 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
models/src/beacon-v2-default-model/individuals/examples/individual-MID-example.yaml Rename measurements to measures in example
models/src/beacon-v2-default-model/individuals/defaultSchema.yaml Rename measurements to measures in schema
framework/src/requests/aggregationTerms.yaml Add aggregationTerms array schema and documentation
framework/src/responses/sections/beaconAggregationResults.yaml Define AggregationResultsInstance schema and examples
framework/src/endpoints.yaml Add /aggregation_terms endpoint configuration
Comments suppressed due to low confidence (4)

framework/src/responses/sections/beaconAggregationResults.yaml:55

  • The schema references summaryTerms.yaml, which does not exist. Update the $ref to point to the newly added aggregationTerms.yaml (e.g., ../../requests/aggregationTerms.yaml#/$defs/AggregationTerm).
        $ref: ../../requests/summaryTerms.yaml#/$defs/AggregationTerm

framework/src/responses/sections/beaconAggregationResults.yaml:89

  • The additionalItems keyword only applies to arrays; for an object schema you should use additionalProperties: true if you intend to allow extra fields.
    additionalItems: True

framework/src/responses/sections/beaconAggregationResults.yaml:86

  • Moving oneOf under required is invalid in JSON Schema. You should place a top-level oneOf sibling to properties to enforce that either count or report is present.
      - oneOf:

framework/src/endpoints.yaml:145

  • Consider adding unit or integration tests for the new /aggregation_terms endpoint to verify successful responses and error handling.
  /aggregation_terms:

mbaudis added 2 commits July 17, 2025 15:15
This adds a `maturity: draft` label at the root of the aggregation documents. This does not account for:
* maybe it should only be in the root documents
* some of the general schemas now have properties relying on the draft schemas
Collaborator

@gsfk gsfk left a comment

Just back at work and having a first look at this now.

Granularity

My original proposal didn't introduce an extra granularity, largely because I liked the idea of moving toward a more uniform response, and because you'd previously convinced me it wasn't needed. If, as we seem to have discussed, the general idea is that any granularity at all can return beaconResultsetsResponse, this could be a lot more clear in the spec.

From what I can see this pr allows summary/aggregation response from beaconAggregationResponse and beaconResultsetsResponse but not from beaconCountResponse and beaconBooleanResponse. Implicitly for an implementer this looks like "count response can't send summary stats"... even though record granularity can.

Are there advantages to adding an extra granularity that we don't get by simply adding an optional responseAggregation field to all responses?

How do I request an aggregated response?

  1. request granularity:aggregated & aggregation terms
  2. request granularity:record & aggregation terms

what about these?:

  3. request granularity:aggregated without aggregation terms
  4. request granularity:record without aggregation terms

Should I expect some beacon-dependent summary stats from 3 and 4? Or just from 3?

It's clear that the PR considers this an error:

  5. request granularity:count & aggregation terms

... unless the idea is that the count granularity can return ResultsetResponse instead of BeaconCountResponse.

Aggregation terms stringification

One of the persistent issues with filtering terms is a lack of spec or guidelines on how to convert alphanumeric filters (defined only as objects) to string, since some areas of the spec assume filters are strings. So I'm happy to see we're not recreating this issue with aggregation terms and suggesting a stringification right away. A few comments though:

  • the stringification fix is only in the description and the example, not actually the spec
  • it uses a walrus operator style colon (sex:=male instead of sex=male) which I doubt is uncontroversial
  • ... shouldn't we fix this for filtering terms too?

Aggregation terms "report"

If I understand the description in aggregationTerms.yaml we still allow users to request a general sex overview without having to specify particular sexes in the request. But in beaconAggregationResults.yaml there is only one example of a "report" style response and it's quite complex. I understand that report style is an object and otherwise not specified, but if it's still possible to get something like {sex: {male: 123, female: 456}} I would suggest adding a simpler example.

Misc issues

  • There are a few broken references to beaconSummaryResults.yaml, this file doesn't exist.
  • The comment "aggregate bools for entire query" in beaconBooleanResponseSection.yaml no longer applies (my previous pr allowed for bool-only aggregate stats, which is admittedly a little weird)

Finally, there were a couple things I didn't understand at all, although I suspect are not particularly relevant:

  • why is "measurements" changed to "measures"?
  • I don't understand any of the changes to files in /bin... in fact I don't understand why these files aren't gitignored. Aren't most of these just artifacts of docs generation?

@zykonda zykonda left a comment

Looks good to me: the aggregation is applied to the final record set and has a clean separation from all other beacon concepts and items.

@mbaudis
Member Author

mbaudis commented Jul 23, 2025

@zykonda Thanks for the confirmation!
@gsfk Trying to answer your points:

My original proposal didn't introduce an extra granularity, largely because I liked the idea of moving toward a more uniform response, and because you'd previously convinced me it wasn't needed. If, as we seem to have discussed, the general idea is that any granularity at all can return beaconResultsetsResponse, this could be a lot more clear in the spec.

I think this is/was a viable solution but after discussions and opinion from @jrambla it seems clearer to do it the other way around: specific granularity -> response instead of universal aggregation option.

From what I can see this pr allows summary/aggregation response from beaconAggregationResponse and beaconResultsetsResponse but not from beaconCountResponse and beaconBooleanResponse. Implicitly for an implementer this looks like "count response can't send summary stats"... even though record granularity can.

Yes. Granularity is also used to indicate security levels and an aggregation might expose more than a count. And we had modified beaconResultsetsResponse to allow any granularity (e.g. for mixed granularities, aggregators or "counts per dataset" etc.) so result objects for aggregated should be there, too.

Now one change which has to be added: it should be

oneOf:
  - response
  - responseAggregation

... but this needs some JSON Schema refinement (i.e. an optional element where you report either the record level results OR their aggregation but not both). IMO not strictly necessary but makes sense (or we can just document it).

Are there advantages to adding an extra granularity that we don't get by simply adding an optional responseAggregation field to all responses?

Yes, seems so:

  • aggregated granularity was originally planned for Beacon but never really fleshed out
  • as above, if you can add aggregations everywhere you might go beyond "understood" implications of count

How do I request an aggregated response?

  1. request granularity:aggregated & aggregation terms

Yes.

  2. request granularity:record & aggregation terms

No; record is record ... Although a beacon can always downgrade the granularity and deliver an aggregated granularity (at resultSet level or globally).

what about these?:

  3. request granularity:aggregated without aggregation terms

Yes. A beacon can decide to have a default aggregation response.

  4. request granularity:record without aggregation terms

No but see above.

Should I expect some beacon-dependent summary stats from 3 and 4? Or just from 3?

3

It's clear that the PR considers this an error:
5. request granularity:count & aggregation terms

Yes.
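
Cases 1 to 5 above can be summarised as a small decision function. This is only a sketch of how a server might interpret a request: the function name `resolve_aggregation` and the returned labels are hypothetical and not part of the schema, and ignoring (rather than rejecting) aggregation terms on a `record` request is an assumption.

```python
from typing import Sequence


def resolve_aggregation(granularity: str, aggregation_terms: Sequence[str] = ()) -> str:
    """Map a request to an illustrative response kind, following cases 1-5.

    The returned labels are made up for this sketch; they are not schema values.
    """
    has_terms = len(aggregation_terms) > 0
    if granularity == "aggregated":
        # Case 1: the requested terms are aggregated.
        # Case 3: the beacon may fall back to its own default aggregation.
        return "requested-aggregation" if has_terms else "default-aggregation"
    if granularity == "record":
        # Cases 2 and 4: record stays record. Assumption: terms are ignored here;
        # a beacon may still downgrade the granularity on its own.
        return "records"
    if granularity == "count" and has_terms:
        # Case 5: treated as an error by this PR.
        return "error"
    # Anything else is answered at the requested granularity.
    return granularity
```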

... unless the idea is that the count granularity can return ResultsetResponse instead of BeaconCountResponse.

We have recently changed the required fields so that resultsetResponses are "count" or "boolean" compatible. So a resultsetResponse is just a format but there is no real relation between this format and the granularity. We had discussed things like "booleanResultsetResponse" ... but this clashes w/ the concept of mixed responses. Also IMO booleanResultSet ... was on the table and one could specify at this level instead of at the property level:

  ResultsetInstance:
    oneOf:
      - $ref: '#/$defs/BooleanResultSet'
      - $ref: '#/$defs/CountResultSet'
      - $ref: '#/$defs/AggregatedResultSet'
      - $ref: '#/$defs/RecordResultSet'

... but this could still be done w/o breaking stuff (I'm in favour here but don't want to push this w/o being prompted).

Aggregation terms stringification

One of the persistent issues with filtering terms is a lack of spec or guidelines on how to convert alphanumeric filters (defined only as objects) to string, since some areas of the spec assume filters are strings. So I'm happy to see we're not recreating this issue with aggregation terms and suggesting a stringification right away. A few comments though:

  • the stringification fix is only in the description and the example, not actually the spec
  • it uses a walrus operator style colon (sex:=male instead of sex=male) which I doubt is uncontroversial
  • ... shouldn't we fix this for filtering terms too?

Yes, should be defined globally. And I'm for := instead of = since for me this is actually : & =, where the colon is an additional "prefix separator", making this similar (and similarly parseable) to a CURIE; a "prefix" indicating the scope, and a "local part" composed of comparator & value which together indicate the match (in a CURIE such as NCIT:C3602 it is also NCIT & =C3602).
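
To illustrate the "prefix separator" reading, here is a minimal parsing sketch. Nothing in it is defined by the spec: the names `ParsedTerm` and `parse_term` are hypothetical, and the comparator set is assumed to match the one used for alphanumeric filters.

```python
import re
from typing import NamedTuple


class ParsedTerm(NamedTuple):
    scope: str       # the "prefix", e.g. "sex" or "NCIT"
    comparator: str  # "=" when none is written, as in a plain CURIE
    value: str


# Assumed comparator set; two-character comparators come first so that
# "<=" is not read as "<" followed by a value starting with "=".
_TERM = re.compile(r"^(?P<scope>[^:]+):(?P<cmp><=|>=|=|<|>|!)?(?P<value>.+)$")


def parse_term(term: str) -> ParsedTerm:
    """Split a stringified term at the first colon into scope, comparator, value."""
    match = _TERM.match(term)
    if match is None:
        raise ValueError(f"not a prefixed term: {term!r}")
    return ParsedTerm(match["scope"], match["cmp"] or "=", match["value"])
```

With this reading `sex:=male`, `ageAtDiagnosis:<=P17Y` and `NCIT:C3602` all go through the same code path; the CURIE simply has an implied `=`.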

Aggregation terms "report"

If I understand the description in aggregationTerms.yaml we still allow users to request a general sex overview without having to specify particular sexes in the request. But in beaconAggregationResults.yaml there is only one example of a "report" style response and it's quite complex. I understand that report style is an object and otherwise not specified, but if it's still possible to get something like {sex: {male: 123, female: 456}} I would suggest adding a simpler example.

Sure; you could add that in principle but this particular example would be better as a standard one:

- id: sex:=male
  count: 123
- id: sex:=female
  count: 456

A better record example would be

- id: sexPerDisease
  description: Sex distribution per cancer type
  record:
    - id: NCIT:C4872
      label: Breast Carcinoma
      counts:
        sex:=female: 836
        sex:=male: 3
    - id: NCIT:C3513
      label: Esophageal Carcinoma
      counts:
        sex:=female: 36
        sex:=male: 81

Misc issues

  • There are a few broken references to beaconSummaryResults.yaml, this file doesn't exist.

Thanks - removed.

  • The comment "aggregate bools for entire query" in beaconBooleanResponseSection.yaml no longer applies (my previous pr allowed for bool-only aggregate stats, which is admittedly a little weird)

Gone.

Finally, there were a couple things I didn't understand at all, although I suspect are not particularly relevant:

  • why is "measurements" changed to "measures"?

Zombie from some merging of not fully updated branch I guess.

  • I don't understand any of the changes to files in /bin... in fact I don't understand why these files aren't gitignored. Aren't most of these just artifacts of docs generation?

Please Ignore ...

additional aggregation "record" example
There was a zombie reversal of measurements => measures, probably due to some early branch forking. Fixed here to align w/ main & develop.
@gsfk
Collaborator

gsfk commented Jul 23, 2025

Thanks for the response, this is mostly clear, with a few comments:

  • We should update the docs and the description field for BeaconResultsetResponse to make clear that granularity does not necessarily tie you to a particular response shape (in a separate pr).

And a couple comments on the "report" examples

  • The new example uses "record" instead of "report", is this a mistake?

  • The new example seems to be "multidimensional" (shows breakdowns for disease & sex together) which is fine, but my understanding was that we were focusing on unidimensional summaries for the moment and leaving other issues for later.

  • Also the examples still give no guidance for implementers on how to represent an ordinary distribution, eg, they are left choosing between:

    {
      "sex": {
        "report": {
          "male": 123,
          "female": 456
        }
      }
    }
    {
      "sex": {
        "report": {
          "counts": {
            "male": 123,
            "female": 456
          }
        }
      }
    }
    {
      "sex": {
        "report": {
            "sex:=male": 123,
            "sex:=female": 456
        }
      }
    }

    .... and so on. Or is the idea that these should be individual entries?

      {
          "count": 13,
          "entity": "individual",
          "id": "sex:=female",
          "label": "female sex at birth"
      }

    ... and so on for male, etc. This is fine for sex but not for disease, where there can be hundreds of entries.
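
For ids of the `scope:=value` kind, the per-term entries and a grouped distribution carry the same information, so one can be folded into the other. A minimal sketch, assuming entries shaped like the example above; the helper name `fold_counts` is hypothetical and the grouped output shape is just one of the candidates listed, not something the schema prescribes.

```python
from collections import defaultdict


def fold_counts(entries):
    """Group `scope:=value` count entries into {scope: {value: count}}."""
    distributions = defaultdict(dict)
    for entry in entries:
        scope, sep, value = entry["id"].partition(":=")
        if not sep:
            continue  # not a `scope:=value` id; skipped in this sketch
        distributions[scope][value] = entry["count"]
    return dict(distributions)


entries = [
    {"id": "sex:=female", "count": 456, "entity": "individual"},
    {"id": "sex:=male", "count": 123, "entity": "individual"},
]
print(fold_counts(entries))
```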

@mbaudis
Member Author

mbaudis commented Jul 23, 2025

We should update the docs and the description field for BeaconResultsetResponse to make clear that granularity does not necessarily tie you to a particular response shape (in a separate pr).

Sure. This might need a real schema / workflow diagram ...

The new example uses "record" instead of "report", is this a mistake?

Well... Typing on the train etc ... Fix coming.

Also the examples still give no guidance for implementers on how to represent an ordinary distribution, eg, they are left choosing between:

No. Well. The example shows:

  • count objects per term
    - id: sex:=female
      label: female sex at birth
      entity: individual
      count: 13

... and a report (which so far does not impose a schema) where there is just a custom list of secondary values - and YMMV. (explicitly says "sex per NCIT cancer code (custom format)"). It might be better to use than arbitrary keys? But I don't want to document strange ones which implementers then might adopt ...

@gsfk
Collaborator

gsfk commented Sep 2, 2025

Sex is probably not the best example for me to use, since one of my concerns is how to give a distribution for a category with no clear ontology mappings.

The other obvious worry is that leaving the syntax of reports open will mean that everyone implements it differently, when presumably for clients or beacon networks we will want them as similar as possible.

Contributor

@jrambla jrambla left a comment

  1. I'm missing example documents in the examples-sections and examples-fulldocuments folders.
  2. Adding the measurements > measures change here is not appropriate and not necessary
  3. Adding "maturity" just in this PR has not been discussed
  4. The "report" option needs a deeper definition of the purpose and a review of the approach

Contributor

We shouldn't remove the part of the description that mentions details about the record granularity

Member Author

@mbaudis mbaudis Sep 17, 2025

Not sure what this refers to? Also, file is not part of the schema but generated.

Btw: last changes are 71f4045

Contributor

We shouldn't remove the part of the description that mentions details about the record granularity

Member Author

Disagree. Data management strategy does not fit here; it is part of the model to define which fields are required and general amelioration strategies (including pagination, limit of content...) belong to a general documentation.

BTW: Code change comments should be made against .../src, not .../json.

Contributor

  1. Is there a reason for adding "maturity" just in this PR?
  2. Where is the expression syntax explained? (e.g. "ageAtDiagnosis:<=P17Y")

Member Author

  1. Good point. While this is a very good first case for introducing (sub-) schema/feature maturity levels this should probably happen in a separate PR.
  2. Nowhere since it follows the stringified filter concept https://docs.genomebeacons.org/filters/#alphanumerical-value-queries ; pointer can be added.

Member Author

@mbaudis mbaudis Sep 17, 2025

Update:

  1. Removed in 71f4045
  2. Added pointer to the documentation here (future commit). However, we should add this definition (stringified, valued alphanumeric filter format) also to the filteringTerms schema since there is some ambiguous use in spite of the documentation on the website.

* removes the `maturity: draft` labels
* fixes the zombie appearances of old generated doc parts outside of the schema (e.g. `measures`...); not really relevant since those will be generated but to avoid confusion...
@mbaudis mbaudis force-pushed the add-aggregation-response branch from 93a374b to 71f4045 Compare September 17, 2025 09:23
@mbaudis
Member Author

mbaudis commented Sep 17, 2025

@gsfk @dbujold @jrambla Based on the discussions yesterday I've done a number of clean-up and content changes; see the comments above.

The main point (besides some naming and description fixes) is the re-introduction of distribution responses for "concept" style aggregations. Here, a "concept" is typically an attribute in the model which can have multiple values; so this would be similar to the "filter id" in an alphanumeric filter [1]; in contrast to "valued filters" (i.e. CURIEs or id+operator+value alphanumerics) which should respond w/ a count aggregation. So we have now:

          The aggregation term id, which could be a

          * valued filtering term (`ageAtDiagnosis:<=P17Y`, `NCIT:C8936`)
          * a concept indicated through its attribute's model path (`biosample.histologicalDiagnosis.id`)
            or corresponding to the `id` of an alphanumeric filter
          * a custom value

          In principle those different formats could be bound to specific
          types:
          * valued filtering term => `count`
          * concept => `distribution`
          * custom => `report`
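
The binding in the quoted description could be sketched as a lookup from the id format to the expected result type. The heuristics here are assumptions made for illustration only (a colon marks a valued term, a dot marks a model path) and the function name `expected_result_type` is hypothetical; a real beacon would more likely consult its own list of aggregation terms.

```python
from typing import Collection


def expected_result_type(term_id: str, known_concepts: Collection[str] = ()) -> str:
    """Guess which result type an aggregation term id would be answered with."""
    if term_id in known_concepts:
        # Concept matching the `id` of an alphanumeric filter, e.g. "sex".
        return "distribution"
    if ":" in term_id:
        # Valued filtering term: a CURIE or id + comparator + value.
        return "count"
    if "." in term_id:
        # Concept given through its model path.
        return "distribution"
    # Everything else is treated as a custom value.
    return "report"
```

The colon check runs before the dot check so that a valued term whose value happens to contain a dot is still classified as `count`.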

In principle we could ditch the valued term => count; but then would lose the simplest aggregation cases and e.g. one could not simply pick a CURIE to get counts on but would need the concept; and if we only use valued terms then we lose the simple wrapping of value distributions for a given concept (something that @gsfk and @jrambla emphasized).

Note: In previous iterations we had also different options for distributions, e.g. mean or range ... this might need to be added for numeric values (pretty easy as specific properties to DistributionInstance).

Note: I've squashed & force-pushed the distribution related commits so now this is 9682bed.

Another note: Reminder that the distribution part re-instates what @gsfk started with, with some changes ¯\_(ツ)_/¯

Footnotes

  1. The comparison to filters is for understanding but not as an explicit reference; while aggregation ids can (should mostly?) correspond to filter ids ... we probably should not bind the concepts.

This commit:

* adds distributions to aggregation results, as object with one or more key:value pairs (values limited to integer ATM)
* changes so far ill defined "entity" in aggregation response objects to `entryTypeId` (and defines this)
* introduces `concept` as aggregation type and documents it as model path (in simple dot notation)
* adds distribution examples

There are additional notes (with a TODO) about concept / distribution use.
@mbaudis mbaudis force-pushed the add-aggregation-response branch from cbd74f3 to 9682bed Compare September 17, 2025 14:47
This aligns the aggregationTerms definition for requests with the format of filteringTerms definition (object with id instead of string).
Comment on lines 33 to 38
},
"report": {
"$ref": "#/$defs/Report"
}
},
"required": [
Collaborator

pretty sure this should look like this:

  "required": ["id"],
  "oneOf": [
    { "required": [ "count" ] },
    { "required": [ "distribution" ] },
    { "required": [ "report" ] }
  ],
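
What this `required` / `oneOf` combination enforces can be spelled out as a plain check: `id` must be present, and exactly one of `count`, `distribution` or `report`. The sketch below is hand-written for illustration and is not a JSON Schema validator; the name `check_instance` is hypothetical.

```python
RESULT_KEYS = ("count", "distribution", "report")


def check_instance(instance: dict) -> list:
    """Return a list of problems; an empty list means the instance passes."""
    problems = []
    if "id" not in instance:
        problems.append("missing required property: id")
    # oneOf with single-property `required` branches means exactly one may be present.
    present = [key for key in RESULT_KEYS if key in instance]
    if len(present) != 1:
        problems.append(f"expected exactly one of {RESULT_KEYS}, found {present}")
    return problems
```

An instance carrying both `count` and `report` fails, which is the behaviour the misplaced `oneOf` under `required` could not express.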

Member Author

Thanks! Fixed in af8aae5

@gsfk
Collaborator

gsfk commented Oct 1, 2025

If we are doing aggregation through its own granularity, shouldn't we make something like "available granularities" discoverable?

@mbaudis
Member Author

mbaudis commented Oct 1, 2025

If we are doing aggregation through its own granularity, shouldn't we make something like "available granularities" discoverable?

@gsfk Good point (but not specifically related to this change). ATM only the default granularity is in securityAttributes.defaultGranularity. IMO availableGranularities could just be added to the info response (or configuration) - maybe create an issue or even PR?

This fixes required / oneOf in beaconAggregationResults (thanks to @gsfk !).
This represents the first pass at rewriting the aggregation response according to:

* "Proposal A" in the discussion #238 (comment)
* the following requirement checks, e.g. 1 and 2-dimensional data summaries used in the various dashboards (posted in the discussion)
* the live implementation in Progenetix, now both powering a general data dashboard as well as dynamically generated summaries after searches (e.g. try the "CNV Example" in https://progenetix.org/search/ and remove the Glioblastoma filter ... takes 1-2mins).

The examples have been moved to a separate document.

A separate documentation page will describe more details.
* collapse `AggregationConcept` `scope`+`property` to `property: scope.property` to avoid confusion with the scope of the reporting
This changes the aggregation terms response to have a `results/aggregationTerms` property - similar to `filteringTerms`; in contrast to the previous simple listing of terms in `results`.

6 participants