Overview of Analysis Process and Analyzers in Elasticsearch

An inverted index is a data structure that maps every unique word (term) to the list of documents in which it appears. New documents are analyzed and then stored in inverted indexes, which allows very fast full-text searches.
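To make the idea concrete, here is a minimal sketch of an inverted index in plain Python. It is only an illustration: the function name `build_inverted_index`, the sample documents and the naive "lowercase and split on whitespace" analysis are our own assumptions, not how Elasticsearch implements it internally.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Deliberately naive analysis (assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "Quick thinking",
}

index = build_inverted_index(docs)
print(index["brown"])  # [1, 2]
print(index["quick"])  # [1, 3]
```

A search for a term is then a single dictionary lookup instead of a scan over every document, which is why this structure suits full-text search.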

Analysis is the process of converting text into tokens and normalizing those tokens before adding them to an inverted index. When we do a full-text search, we search the inverted index rather than the actual documents, so both the indexed text and the query string must be analyzed.

All fields that have the "text" type are analyzed (normalized both when storing and when searching), and hence you may not get exact matches. If you need exact matches, you can use the "keyword" data type instead, in which case no analysis is done and the value is matched exactly as stored.
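The difference between the two types can be sketched as follows. The mapping body uses the standard "text" and "keyword" type names, but the field names (`title`, `status`) are hypothetical, and the two match functions are a toy simulation under a naive analyzer, not Elasticsearch's real matching logic.

```python
# Hypothetical mapping body: "title" is analyzed, "status" is stored as-is.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "status": {"type": "keyword"},
        }
    }
}


def analyze(text):
    """Toy stand-in for an analyzer (assumption): lowercase, split on whitespace."""
    return text.lower().split()


def text_field_match(stored, query):
    """'text'-like behaviour: both sides are analyzed, any shared term matches."""
    stored_terms = set(analyze(stored))
    return any(term in stored_terms for term in analyze(query))


def keyword_field_match(stored, query):
    """'keyword'-like behaviour: no analysis, the whole value must be identical."""
    return stored == query


stored = "New York City"
print(text_field_match(stored, "york"))              # True
print(keyword_field_match(stored, "york"))           # False
print(keyword_field_match(stored, "New York City"))  # True
```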

 

Analyzers

Analysis is done by Analyzers. Analyzers consist of three components (or phases):

  1. Zero or more character filters

    1. The purpose of character filters is to manipulate the text before tokenization.

    2. Examples of character filters include html_strip, mapping and pattern_replace.

  2. Tokenizer

    1. The primary purpose of the tokenizer is to split text into terms.

    2. Tokenizers belong to one of the following categories: word-oriented tokenizers (e.g. letter tokenizer), partial word tokenizers (e.g. N-Gram tokenizer) and structured text tokenizers (e.g. path hierarchy tokenizer).

  3. Token filters

    1. The primary purpose of token filters is to manipulate terms before adding them to the inverted index.

    2. Examples of token filters include lowercase, uppercase, n-gram and stop.
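The three phases above can be sketched as a small pipeline in plain Python. This is a simplified imitation, not the real implementation: the function names, the regular expressions and the `STOP_WORDS` set are all our own assumptions, chosen to loosely mirror the html_strip character filter, the letter tokenizer, and the lowercase and stop token filters.

```python
import re

# Assumed stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "and"}


def html_strip(text):
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def letter_tokenizer(text):
    """Tokenizer: split the text into runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Token filter: drop tokens found in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then exactly one tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The Quick fox is FAST</p>",
    char_filters=[html_strip],
    tokenizer=letter_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(tokens)  # ['quick', 'fox', 'fast']
```

Note that the order of token filters matters: here the stop filter only recognizes "The" because the lowercase filter runs before it.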

 

Adding an analyzer to an existing index involves three steps: close the index, add the analyzer to the index settings, and reopen the index.
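The close, update and reopen sequence could look roughly like the sketch below. It only builds the three REST requests (method, path, body) and prints them; nothing is sent to a cluster. The index name `my-index`, the analyzer name `my_custom_analyzer` and the helper `add_analyzer_requests` are hypothetical, and the request bodies should be checked against the documentation for your Elasticsearch version.

```python
import json


def add_analyzer_requests(index_name, analysis):
    """Return the requests needed to add analysis settings to an existing index."""
    return [
        ("POST", f"/{index_name}/_close", None),                      # 1. close the index
        ("PUT", f"/{index_name}/_settings", {"analysis": analysis}),  # 2. add the analyzer
        ("POST", f"/{index_name}/_open", None),                       # 3. reopen the index
    ]


# Hypothetical custom analyzer built from built-in components.
analysis = {
    "analyzer": {
        "my_custom_analyzer": {
            "type": "custom",
            "char_filter": ["html_strip"],
            "tokenizer": "standard",
            "filter": ["lowercase", "stop"],
        }
    }
}

requests_to_send = add_analyzer_requests("my-index", analysis)
for method, path, body in requests_to_send:
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

Each tuple can then be sent with whichever HTTP client or Elasticsearch client library you already use.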

 

Built-in analyzers include standard, simple, stop, keyword, pattern, whitespace and the language analyzers. Each uses a different combination of character filters, tokenizer and token filters. We can also create custom analyzers.

 

Additional Notes on Analyzers

  1. An inverted index is added per text field.

  2. We can use the _analyze API to test how an analyzer breaks a piece of text into tokens; the analysis configuration itself is updated through the index settings.
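As a rough illustration, a request to the _analyze API names an analyzer and supplies some text, and the response lists the tokens that analyzer produces. The sketch below only builds such a request; the helper name `analyze_request`, the sample text and the index name are our own assumptions, and nothing is sent to a cluster.

```python
import json


def analyze_request(text, analyzer="standard", index_name=None):
    """Build an _analyze request, optionally scoped to one index's analyzers."""
    path = f"/{index_name}/_analyze" if index_name else "/_analyze"
    body = {"analyzer": analyzer, "text": text}
    return "POST", path, body


method, path, body = analyze_request("The QUICK brown fox", analyzer="whitespace")
print(method, path)
print(json.dumps(body, indent=2))

# Scoping the request to an index (hypothetical name) lets us test a custom
# analyzer that is defined in that index's settings.
scoped = analyze_request("some text", analyzer="my_custom_analyzer", index_name="my-index")
print(scoped[1])  # /my-index/_analyze
```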

  3. An explicit stop word analyzer may not be required in recent Elasticsearch versions, as the relevance algorithm takes care of very common words out of the box.
