Introduction to the Analysis Process in Elasticsearch

An inverted index is a data structure that lists every unique word together with the documents in which it appears. New documents are analyzed and then stored in inverted indexes, which allows very fast full-text searches.
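To make the idea concrete, here is a minimal sketch of an inverted index in plain Python. It is only an illustration and not how Elasticsearch (or Lucene) stores data internally; the function name build_inverted_index and the sample documents are made up for this example, and the "analysis" is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis (assumption): lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick thinking",
}

inverted = build_inverted_index(docs)
print(inverted["brown"])  # [1, 2]
print(inverted["quick"])  # [1, 3]
```

Looking up a term is a single dictionary access, which is why searching the index is much faster than scanning every document.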

Analysis is the process of converting text into tokens and normalizing those tokens (lowercasing, stemming, stop word removal, etc.) before adding them to an inverted index. When we do a full-text search, we search the inverted index rather than the actual documents, so both the indexed text and the query string must be analyzed.

All fields of the "text" type are analyzed (normalized both when storing and when searching), and hence you may not get exact matches. If you need exact matches, you can use the "keyword" data type instead, in which case no analysis is done and values are matched exactly.
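The difference can be sketched as follows. The mapping dictionary uses the real "text" and "keyword" field types, but the field names (title, city_code) are hypothetical. The two match functions are a rough simulation written for this example, not Elasticsearch code: text_match analyzes both sides with a toy analyzer and matches on any shared term, while keyword_match compares the raw strings.

```python
import re

# Hypothetical mapping: one analyzed field and one exact-match field.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "city_code": {"type": "keyword"},
        }
    }
}


def analyze(text):
    """Toy analyzer (assumption): lowercase and split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def text_match(query, value):
    """Simulate a "text" field: analyze both sides, match on any shared term."""
    value_terms = set(analyze(value))
    return any(term in value_terms for term in analyze(query))


def keyword_match(query, value):
    """Simulate a "keyword" field: no analysis, the whole value must be equal."""
    return query == value


stored = "New York City"
print(text_match("new york", stored))          # True
print(keyword_match("new york", stored))       # False
print(keyword_match("New York City", stored))  # True
```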

 

Introduction to Analyzers

Analysis is done by Analyzers.

Analyzers consist of three components (or phases):

  1. Zero or more character filters

    1. The purpose of character filters is to manipulate the text before tokenization.

  2. Exactly one tokenizer

    1. The primary purpose of the tokenizer is to split text into terms.

  3. Zero or more token filters

    1. The primary purpose of token filters is to manipulate terms before adding them to the inverted index.
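The three phases above can be sketched as a small pipeline in plain Python. This is a toy model under stated assumptions, not the Elasticsearch implementation: html_strip, whitespace_tokenizer, lowercase_filter and stop_filter are hypothetical stand-ins that loosely mimic built-in components with similar names, and the stop word list is invented for the example.

```python
import re


def html_strip(text):
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the text into terms on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every term."""
    return [t.lower() for t in tokens]


STOPWORDS = {"the", "a", "an", "is"}  # assumed, tiny stop word list


def stop_filter(tokens):
    """Token filter: drop stop words."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run the three phases in order: char filters, tokenizer, token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The Quick Brown Fox</p>",
    char_filters=[html_strip],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(tokens)  # ['quick', 'brown', 'fox']
```

The order of the token filters matters here: the stop filter only recognizes "The" because the lowercase filter has already run.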

 

Adding an analyzer to an existing index involves three steps: close the index, add the analyzer, and reopen the index.
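As a rough sketch, the three steps map to three REST calls: the close index API, the update index settings API, and the open index API. The snippet below only builds the requests as (method, path, body) tuples so it runs without a cluster. The index name my-index, the analyzer name my_lowercase_analyzer and the helper add_analyzer_requests are assumptions made for this example.

```python
import json


def add_analyzer_requests(index, analysis):
    """Return the close / update settings / open requests, in order."""
    return [
        ("POST", f"/{index}/_close", None),
        ("PUT", f"/{index}/_settings", {"analysis": analysis}),
        ("POST", f"/{index}/_open", None),
    ]


# Hypothetical custom analyzer: standard tokenizer plus lowercase token filter.
analysis = {
    "analyzer": {
        "my_lowercase_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase"],
        }
    }
}

requests_plan = add_analyzer_requests("my-index", analysis)
for method, path, body in requests_plan:
    print(method, path, json.dumps(body) if body else "")
```

Each tuple can then be sent with any HTTP client or pasted into the Kibana console. Remember that the index is unavailable for reads and writes while it is closed.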

 

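We can use the _analyze API to test how an analyzer breaks a piece of text into tokens. The analysis configuration itself is updated through the index settings (_settings) API.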
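A minimal sketch of a request to the _analyze API is shown below. It only builds the request and does not contact a cluster; the helper analyze_request and the sample text are assumptions for illustration, while "standard" is the name of the built-in default analyzer.

```python
import json


def analyze_request(text, analyzer="standard", index=None):
    """Build an _analyze request; pass an index to use analyzers defined on it."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    return "POST", path, {"analyzer": analyzer, "text": text}


method, path, body = analyze_request("The Quick Brown Foxes")
print(method, path)
print(json.dumps(body, indent=2))
```

The response contains a "tokens" array with one entry per emitted token, which makes it easy to check what an analyzer actually produces before relying on it.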

 

Note: We will discuss analysis in detail in the Beyond Basics book.
