An inverted index is a data structure that maps each unique word to the list of documents in which it appears. New documents are analyzed and then stored in an inverted index, which allows very fast full-text searches.
Analysis is the process of converting text into tokens and normalizing those tokens (lowercasing, stemming, stopword removal, etc.) before adding them to the inverted index. When we do a full-text search, we search the inverted index rather than the actual documents, so both the indexed text and the query string must be analyzed.
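To make the idea concrete, here is a minimal, self-contained Python sketch (not Elasticsearch code) of a toy analyzer that lowercases, splits text into tokens, and drops a few stopwords, then builds an inverted index from the resulting terms. All names and the stopword list are illustrative only.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "and", "or", "of"}

def analyze(text):
    """Toy analysis: lowercase, split into tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {
    1: "The quick brown fox",
    2: "The lazy dog and the quick cat",
}
index = build_inverted_index(docs)
print(index["quick"])            # {1, 2}
print(analyze("The Quick FOX"))  # ['quick', 'fox'] -- the query string is analyzed the same way
```

Note how a search for "Quick" only works because the query string goes through the same analysis as the indexed text.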
All fields of the "text" type are analyzed (normalized while storing and searching), and hence you may not get exact matches. If you need exact matches, you can use the "keyword" data type instead, in which case no analysis is done and exact matches are made.
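As a rough sketch (assuming a local Elasticsearch node at http://localhost:9200 and a placeholder index named articles), a mapping that mixes the two types might look like this:

```python
import requests

# Placeholder index name and local node URL; adjust for your cluster.
url = "http://localhost:9200/articles"

mapping = {
    "mappings": {
        "properties": {
            "title":  {"type": "text"},     # analyzed: good for full-text search
            "status": {"type": "keyword"},  # not analyzed: good for exact matches
        }
    }
}

response = requests.put(url, json=mapping)
print(response.json())
```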
Introduction to Analyzers
Analysis is done by Analyzers.
Analyzers consist of three components (or phases), combined as shown in the sketch after this list:
- Zero or more character filters
  - The purpose of character filters is to manipulate the text before tokenization.
- Exactly one tokenizer
  - The primary purpose of the tokenizer is to split the text into terms.
- Zero or more token filters
  - The primary purpose of token filters is to manipulate terms before adding them to the inverted index.
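For illustration, here is a rough sketch of defining a custom analyzer that combines the three phases, using the built-in html_strip character filter, the standard tokenizer, and the lowercase and stop token filters. The index name, analyzer name, and node URL are placeholders.

```python
import requests

# Placeholder index name and local node URL.
url = "http://localhost:9200/blog_posts"

body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],    # phase 1: strip HTML tags
                    "tokenizer": "standard",          # phase 2: split text into terms
                    "filter": ["lowercase", "stop"],  # phase 3: normalize terms
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "my_custom_analyzer"}
        }
    }
}

response = requests.put(url, json=body)
print(response.json())
```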
Adding an analyzer to an existing index involves three steps: close the index, add the analyzer, and reopen the index.
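A rough sketch of that close, update, reopen sequence against an existing index (again assuming a local node and a placeholder index name) might be:

```python
import requests

base = "http://localhost:9200/blog_posts"  # placeholder index

# 1. Close the index so its analysis settings can be changed.
requests.post(f"{base}/_close")

# 2. Add a new analyzer to the index settings.
new_settings = {
    "analysis": {
        "analyzer": {
            "my_new_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        }
    }
}
requests.put(f"{base}/_settings", json=new_settings)

# 3. Reopen the index to make it searchable again.
requests.post(f"{base}/_open")
```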
We can use the _analyze API to test how a given piece of text will be analyzed.
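For example, a quick sketch of testing the built-in standard analyzer with the _analyze API (the node URL and sample text are illustrative):

```python
import requests

# See how the standard analyzer breaks a piece of text into terms.
response = requests.post(
    "http://localhost:9200/_analyze",
    json={"analyzer": "standard", "text": "The Quick Brown Fox!"},
)
for token in response.json()["tokens"]:
    print(token["token"])  # the, quick, brown, fox
```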
Note: We will discuss analysis in detail in the Beyond Basics book.