[Recipes] Bucket Aggregations - Terms Aggregation

Problem: 

Demonstrate bucket aggregations using the terms aggregation.

Solution Summary: 

Bucket aggregations create buckets of documents and assign documents to those buckets based on some criterion.

Prerequisites: 

Set up the accounts index from accounts.json as explained in the link.

Solution Steps: 

Case 1 - Terms Aggregation with Many Buckets

GET accounts/_search
{
  "aggs" : {
    "state_terms" : {
      "terms" : {
        "field":"state.keyword"
      }
    }
  },
  "size": 0
}

 

Response contains:

"aggregations": {
    "state_terms": {
      "doc_count_error_upper_bound": 20,
      "sum_other_doc_count": 770,
      "buckets": [
        {
          "key": "ID",
          "doc_count": 27
        },
        {
          "key": "TX",
          "doc_count": 27
        },
        {
          "key": "AL",
          "doc_count": 25
        },
        {
          "key": "MD",
          "doc_count": 25
        },
        {
          "key": "TN",
          "doc_count": 23
        },
        {
          "key": "MA",
          "doc_count": 21
        },
        {
          "key": "NC",
          "doc_count": 21
        },
        {
          "key": "ND",
          "doc_count": 21
        },
        {
          "key": "ME",
          "doc_count": 20
        },
        {
          "key": "MO",
          "doc_count": 20
        }
      ]
    }
  }

 

Note:

  1. Elasticsearch returns buckets only for the top unique keys (10 by default). The combined document count of all remaining buckets is returned as "sum_other_doc_count".

  2. The coordinating node gathers results from the shards of the index and merges them into the response for a query. Each shard sends only its own top n terms based on configuration, so document counts may be approximate, as explained here.  

    1. The value of "doc_count_error_upper_bound" is the maximum potential document count for a term that did not make it into the final list of terms. It is calculated as the sum of the document counts of the last term returned from each shard. 

    2. We can also enable a per-bucket document count error by setting the show_term_doc_count_error parameter to true. With this setting, every bucket will have its own doc_count_error_upper_bound.

    3. Accuracy can be improved by compromising on performance, for example by increasing the shard_size parameter so that each shard returns more candidate terms. 
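The settings mentioned in the notes above can be combined in a single request. The sketch below (parameter values are illustrative, not tuned) raises the number of returned buckets with size, asks each shard for more candidate terms with shard_size to improve accuracy, and enables the per-bucket error with show_term_doc_count_error:

GET accounts/_search
{
  "aggs" : {
    "state_terms" : {
      "terms" : {
        "field": "state.keyword",
        "size": 20,
        "shard_size": 100,
        "show_term_doc_count_error": true
      }
    }
  },
  "size": 0
}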

 

Case 2 - Terms Aggregation with Fewer Buckets

GET accounts/_search
{
  "aggs" : {
    "state_terms" : {
      "terms" : {
        "field":"opening_date"
      }
    }
  },
  "size": 0
}

Note: opening_date has only 5 distinct values and was added here to only a few documents.

Response contains:
"aggregations": {
    "state_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 1514764800000,
          "key_as_string": "2018/01/01 00:00:00",
          "doc_count": 2
        },
        {
          "key": 1517702400000,
          "key_as_string": "2018/02/04 00:00:00",
          "doc_count": 2
        },
        {
          "key": 1520553600000,
          "key_as_string": "2018/03/09 00:00:00",
          "doc_count": 2
        },
        {
          "key": 1523836800000,
          "key_as_string": "2018/04/16 00:00:00",
          "doc_count": 2
        },
        {
          "key": 1527206400000,
          "key_as_string": "2018/05/25 00:00:00",
          "doc_count": 2
        }
      ]
    }
  }

 

Case 3 - Put missing values into a bucket with a default key

GET accounts/_search
{
  "aggs" : {
    "state_terms" : {
      "terms" : {
        "field":"opening_date",
        "missing": "2017/12/31"
      }
    }
  },
  "size": 0
}

 

Response contains:

"aggregations": {
    "state_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 1514678400000,
          "key_as_string": "2017/12/31 00:00:00",
          "doc_count": 990
        },
        {
          "key": 1514764800000,
          "key_as_string": "2018/01/01 00:00:00",
          "doc_count": 2
        },
        {
          "key": 1517702400000,
          "key_as_string": "2018/02/04 00:00:00",
          "doc_count": 2
        },
        {
          "key": 1520553600000,
          "key_as_string": "2018/03/09 00:00:00",
          "doc_count": 2
        },
        {
          "key": 1523836800000,
          "key_as_string": "2018/04/16 00:00:00",
          "doc_count": 2
        },
        {
          "key": 1527206400000,
          "key_as_string": "2018/05/25 00:00:00",
          "doc_count": 2
        }
      ]
    }
  }

 

Case 4 - Set minimum count for buckets

GET accounts/_search
{
  "aggs" : {
    "state_terms" : {
      "terms" : {
        "field":"opening_date",
        "missing": "2017/12/31",
        "min_doc_count": 3
      }
    }
  },
  "size": 0
}

 

Response contains:

"aggregations": {
    "state_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 1514678400000,
          "key_as_string": "2017/12/31 00:00:00",
          "doc_count": 990
        }
      ]
    }
  }

Note: Buckets with fewer than min_doc_count documents are not returned. The default value of min_doc_count is 1, so by default, buckets with no documents are not returned (unless you set min_doc_count to 0).
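A related sketch: with min_doc_count set to 0, the terms aggregation can also return empty buckets for terms that match no documents in the current search, for example when a query filter excludes them:

GET accounts/_search
{
  "query": {
    "term": { "state.keyword": "TX" }
  },
  "aggs" : {
    "state_terms" : {
      "terms" : {
        "field": "state.keyword",
        "min_doc_count": 0
      }
    }
  },
  "size": 0
}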

 

TODO

  1. Add ordering to any one of the above bucket aggregations.

    1. order: { "_term" : "asc" }

  2. Try doing terms bucket aggregation across indexes.
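A sketch for the first TODO item (not yet verified against the accounts index): recent Elasticsearch versions order buckets by key using _key, which replaces the deprecated _term order key:

GET accounts/_search
{
  "aggs" : {
    "state_terms" : {
      "terms" : {
        "field": "state.keyword",
        "order": { "_key": "asc" }
      }
    }
  },
  "size": 0
}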
