Introduction to Analysis Process in Elasticsearch

Submitted by heartin on Sat, 05/12/2018 - 10:58

An inverted index is a data structure that consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears. New documents are analyzed and then added to the inverted index, which allows very fast full-text searches.
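To make the structure concrete, here is a minimal sketch of an inverted index in plain Python. It is only an illustration of the idea, not how Elasticsearch (or Lucene underneath it) stores its index; the function name and the sample documents are made up for this example, and the "analysis" here is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Very naive analysis: lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
}

index = build_inverted_index(docs)
print(index["brown"])  # [1, 2]
print(index["fox"])    # [1]
```

Looking up a term is a single dictionary access, which is why searching the index is much faster than scanning every document.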

Analysis is the process of converting text into tokens and normalizing those tokens (lowercasing, stemming, stopword removal, etc.) before adding them to an inverted index. A full-text search runs against the inverted index rather than the actual documents, so both the indexed text and the query string must be analyzed for their terms to match.

All fields of type "text" are analyzed (normalized both when stored and when searched), so you may not get exact matches. If you need exact matches, use the "keyword" data type instead; keyword fields are not analyzed and are matched exactly.
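As a rough illustration, the request body for creating an index with one field of each type could look like the sketch below, written as a Python dictionary. The field names "title" and "status" are hypothetical, and the layout assumes the typeless mappings of Elasticsearch 7 and later; in 6.x the "properties" block sits under a mapping type name.

```python
import json

# Hypothetical fields: "title" is analyzed, "status" is matched exactly.
mapping_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},      # full-text search, analyzed
            "status": {"type": "keyword"},  # exact matches, not analyzed
        }
    }
}

# The JSON you would send as the body of a create-index request.
print(json.dumps(mapping_body, indent=2))
```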

Introduction to Analyzers

Analysis is done by Analyzers.

Analyzers consist of three components (or phases):

1. Character filters (zero or more): manipulate the text before tokenization.
2. Tokenizer (exactly one): splits the text into terms.
3. Token filters (zero or more): manipulate the terms before they are added to the inverted index.
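The three phases can be sketched as a small pipeline in plain Python. This is a toy model under stated assumptions, not Elasticsearch's implementation: the function names, the stopword list, and the regular expressions are all invented for illustration.

```python
import re

# A tiny, hypothetical stopword list for the example.
STOPWORDS = {"a", "an", "and", "the", "of", "to"}


def strip_html(text):
    """Character filter: replace HTML tags with spaces before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the text into terms on non-word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every term."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drop terms that are in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text,
            char_filters=(strip_html,),
            tokenizer=tokenize,
            token_filters=(lowercase, remove_stopwords)):
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>The Quick Brown Foxes</p>"))  # ['quick', 'brown', 'foxes']
print(analyze("The FOX"))                       # ['fox']
```

Running the same `analyze` function over both the indexed text and the query string is what lets a query such as "The FOX" line up with the term "fox" in the index.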

Adding an analyzer to an existing index involves three steps: close the index, add the analyzer, and reopen the index.
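The sequence can be sketched as three REST requests. The snippet below only builds and prints them, so it runs without a cluster; the index name "articles" and the analyzer name "my_analyzer" are hypothetical, and the paths assume the close index, update index settings, and open index endpoints.

```python
import json


def analyzer_update_requests(index, analysis):
    """Return the (method, path, body) requests in the order they should be sent."""
    return [
        ("POST", f"/{index}/_close", None),                      # 1. close the index
        ("PUT", f"/{index}/_settings", {"analysis": analysis}),  # 2. add the analyzer
        ("POST", f"/{index}/_open", None),                       # 3. reopen the index
    ]


# Hypothetical custom analyzer assembled from built-in components.
analysis = {
    "analyzer": {
        "my_analyzer": {
            "type": "custom",
            "char_filter": ["html_strip"],
            "tokenizer": "standard",
            "filter": ["lowercase", "stop"],
        }
    }
}

requests_to_send = analyzer_update_requests("articles", analysis)
for method, path, body in requests_to_send:
    print(method, path, json.dumps(body) if body is not None else "")
```

Sending the requests is left out on purpose; any HTTP client or the official Elasticsearch client could be used for that step.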

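We can use the _analyze API to test how an analyzer processes a piece of text. Changes to the analysis configuration itself are applied through the index settings (_settings) API.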
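A request to the _analyze API names an analyzer and supplies some text; the response lists the tokens that analyzer produces. The sketch below only builds the request so that it runs without a cluster. The helper name and the sample text are made up for this example, and "standard" refers to the built-in standard analyzer.

```python
import json


def analyze_request(text, analyzer="standard"):
    """Build the (method, path, body) for a cluster-level _analyze call."""
    return ("POST", "/_analyze", {"analyzer": analyzer, "text": text})


method, path, body = analyze_request("The Quick Brown Foxes")
print(method, path)
print(json.dumps(body, indent=2))
```

The response contains a "tokens" list in which each entry carries the token text along with its position and character offsets, which makes it a convenient way to check an analyzer before relying on it.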

Note: We will discuss analysis in detail in the Beyond Basics book.

References:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis...

https://en.wikipedia.org/wiki/Inverted_index

https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-ind...


