This repository was archived by the owner on Nov 7, 2018. It is now read-only.

Commit 47a38a4

Initial stats API documentation
1 parent 5b8db32 commit 47a38a4

1 file changed, +72 -1 lines changed

Diff for: API.md

@@ -5,6 +5,7 @@ This document explains:
* How to define and execute queries as URLs
* Refining query results using option parameters
* Extracting query results in JSON and CSV format
+* Generating aggregate data using statistics queries
* Detecting query errors

## Introduction to Queries
@@ -18,10 +19,12 @@ Each query is expressed as a URL, containing:
* The **API Version String**. Currently the only supported version string is: `v1`
* The **Endpoint** representing a particular dataset, e.g. `schools`. Endpoint
  names are usually plural.
+* An optional **Query Type**, added to the Endpoint's path. Currently the only
+  additional type is `stats`; see the section on [Statistics Queries](#statistics-queries) for more information.
* The **Format** for the result data. The default output format is JSON ([JavaScript Object Notation](http://json.org/)); CSV is
  also available.
* The **Query String** containing a set of named key-value pairs that
-  represent the query, which incude
+  represent the query, which include
    * **Field Parameters**, specifying a value (or set of values) to match
      against a particular field, and
    * **Option Parameters**, which affect the filtering and output of the
@@ -215,3 +218,71 @@ When the dataset includes a `location` at the root level (`location.lat` and
* By default, any number passed in the `_distance` parameter is treated as a number of miles, but you can specify miles or kilometers by appending `mi` or `km` respectively.
* Distances are calculated from the center of the given zip code, not the boundary.
* Only U.S. zip codes are supported.

## Statistics Queries

The queries discussed so far return only individual records and selected values from those records. Statistics Queries make it possible to generate aggregate data from a specified set of records.

### Statistics Query Example

Here's an example statistics query URL:

```
https://api.data.gov/ed/collegescorecard/v1/schools/stats?school.degrees_awarded.predominant=2,3&_fields=2013.student.size&_metrics=avg,sum,std_deviation,std_deviation_bounds
```

In this statistics query URL:

* `/stats` is appended to the Endpoint. This is the key indicator that
  statistics should be returned instead of individual records.
* `school.degrees_awarded.predominant=2,3` is a Field Parameter. In this case, it matches records that have a `school.degrees_awarded.predominant` value of either `2` or `3`. The aggregated statistics are generated from this subset of records.
* `_fields=2013.student.size` restricts the aggregation to the `2013.student.size` field. Multiple fields can be specified and aggregated in a single query, but only fields with numeric data can be used.
* `_metrics` is an Option Parameter available only to statistics queries; it limits the kinds of aggregations performed. See below for more information.
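
As a rough illustration, the example URL could be assembled programmatically. The Python sketch below uses only the standard library. `BASE_URL` and the helper name `build_stats_url` are illustrative choices, not part of the API. Commas are left unescaped so that the output matches the example URL character for character; whether the API also accepts percent-encoded commas is not stated in this document.

```python
from urllib.parse import urlencode

# Base path taken from the example URL; "/stats" selects a statistics query.
BASE_URL = "https://api.data.gov/ed/collegescorecard/v1/schools"


def build_stats_url(field_params, fields, metrics=None):
    """Assemble a statistics query URL (hypothetical helper, not part of the API)."""
    params = dict(field_params)
    params["_fields"] = ",".join(fields)
    if metrics:
        # Omitting _metrics leaves the server to return its default set.
        params["_metrics"] = ",".join(metrics)
    # safe="," keeps the comma separators literal instead of encoding them as %2C.
    return f"{BASE_URL}/stats?{urlencode(params, safe=',')}"


url = build_stats_url(
    {"school.degrees_awarded.predominant": "2,3"},
    fields=["2013.student.size"],
    metrics=["avg", "sum", "std_deviation", "std_deviation_bounds"],
)
print(url)
```

Calling the helper without `metrics` produces a URL with no `_metrics` parameter at all.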

This is the JSON document returned:

```json
{
  "metadata": {
    "total": 3667,
    "page": 0,
    "per_page": 20
  },
  "results": [],
  "aggregations": {
    "school.tuition_revenue_per_fte": {
      "avg": "0.1088815711947627E5",
      "sum": 73288234,
      "std_deviation": "0.75913587304684015E4",
      "std_deviation_bounds": {
        "upper": "0.26070874580413074E5",
        "lower": "-0.4294560341460534E4"
      }
    }
  }
}
```

Note that the top-level elements returned by a statistics query differ from those returned by other kinds of queries:

* **`metadata`** provides the same information as it does in other queries.
  * **`total`** provides the number of records matching the query (in this case, all those schools with a `school.degrees_awarded.predominant` of 2 or 3). This is the subset of records from which the statistics are calculated.
  * **`page`** and **`per_page`** are irrelevant in statistics queries, and will likely be removed in a future version of the API.
* **`results`** is always empty in statistics queries, and may be removed in a future version of the API.
* **`aggregations`** contains a JSON Object for every field specified in the `_fields` parameter. Within these Objects there's an entry for every type of aggregation performed. In this case, use of the `_metrics` parameter has limited the returned aggregations to `avg`, `sum`, `std_deviation` and `std_deviation_bounds`. See below for more information.
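
Several values in the sample response arrive as strings in exponent notation (for example `"0.1088815711947627E5"`), while others are plain JSON numbers. A client will probably want to normalize both. The following Python sketch is one way to do that; the helper names `to_numbers` and `read_statistics` are hypothetical, and the code assumes every metric value is either a number, a numeric string, or a nested Object of such values, as in the sample.

```python
import json

# Copy of the sample response shown in this section.
SAMPLE_RESPONSE = """
{
  "metadata": {"total": 3667, "page": 0, "per_page": 20},
  "results": [],
  "aggregations": {
    "school.tuition_revenue_per_fte": {
      "avg": "0.1088815711947627E5",
      "sum": 73288234,
      "std_deviation": "0.75913587304684015E4",
      "std_deviation_bounds": {
        "upper": "0.26070874580413074E5",
        "lower": "-0.4294560341460534E4"
      }
    }
  }
}
"""


def to_numbers(value):
    """Recursively convert metric values (numbers or numeric strings) to floats."""
    if isinstance(value, dict):
        return {key: to_numbers(item) for key, item in value.items()}
    return float(value)


def read_statistics(document):
    """Return (matching record count, {field: {metric: float or nested dict}})."""
    total = document["metadata"]["total"]
    aggregations = {
        field: to_numbers(metrics)
        for field, metrics in document.get("aggregations", {}).items()
    }
    return total, aggregations


total, stats = read_statistics(json.loads(SAMPLE_RESPONSE))
tuition = stats["school.tuition_revenue_per_fte"]
print(total, tuition["avg"], tuition["std_deviation_bounds"]["upper"])
```

The sketch ignores `results`, `page` and `per_page`, since they carry no information in a statistics query.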

### Specifying aggregations with `_metrics`

By default, the full set of available aggregations is calculated and returned for each field specified in the `_fields` parameter. These aggregations are calculated by Elasticsearch's [Extended Stats Aggregation](https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-aggregations-metrics-extendedstats-aggregation.html):

* `count`
* `min`
* `max`
* `avg`
* `sum`
* `sum_of_squares`
* `variance`
* `std_deviation`
* `std_deviation_bounds`

Each of these provides a single value, with the exception of `std_deviation_bounds`, which provides a JSON Object containing `upper` and `lower` bounds.
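
To make the relationship between these metrics concrete, here is a small Python sketch that computes them for a local list of numbers. It is not the server's implementation. It assumes population variance and bounds placed `sigma` standard deviations either side of the mean, with `sigma` defaulting to 2; that default matches the sample response in this document, where `upper` is approximately `avg + 2 * std_deviation`. The function name `extended_stats` is chosen for illustration only.

```python
import math


def extended_stats(values, sigma=2.0):
    """Compute the listed metrics for a non-empty list of numbers.

    Assumes population variance and bounds of avg +/- sigma * std_deviation.
    """
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance: mean of the squares minus the square of the mean.
    # max() guards against a tiny negative result caused by rounding.
    variance = max(sum_of_squares / count - avg * avg, 0.0)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


stats = extended_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(stats["avg"], stats["std_deviation"], stats["std_deviation_bounds"])
```

Note that the lower bound can be negative even when every input value is positive, as the sample response shows.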
