An inverted index is a data structure that consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears. New documents are analyzed and then stored in inverted indexes to allow very fast full-text searches.
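As a rough illustration, an inverted index can be sketched in a few lines of Python. This is a toy model and not how Elasticsearch (Lucene) actually stores its index; the function name `build_inverted_index` and the sample documents are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique lowercase word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample documents keyed by document id.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog",
}
index = build_inverted_index(docs)
print(index["quick"])  # [1, 3]
print(index["dog"])    # [2, 3]
```

Looking up a word is a single dictionary access, which is why searching the index is much faster than scanning every document.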
Analysis is the process of converting text into tokens and normalizing those tokens before adding them to an inverted index. When we do a full-text search, we search the inverted index rather than the actual documents, so both the indexed text and the query string must be analyzed.
All fields that have the "text" type are analyzed (normalized while storing and searching), and hence you may not get exact matches. If you need exact matches, you may use the "keyword" data type instead, in which case no analysis is done and only exact matches are returned.
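The difference between the two field types can be imitated in plain Python. The sketch below is a simplification: `analyze` only lowercases and splits on non-alphanumeric characters, whereas a real analyzer does more, and all three function names are hypothetical.

```python
import re


def analyze(text):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def text_match(stored, query):
    """'text'-like behavior: both sides are analyzed, and any shared term matches."""
    return bool(set(analyze(stored)) & set(analyze(query)))


def keyword_match(stored, query):
    """'keyword'-like behavior: no analysis, the whole value must be identical."""
    return stored == query


stored = "Quick Brown-Fox"
print(text_match(stored, "brown"))               # True
print(keyword_match(stored, "brown"))            # False
print(keyword_match(stored, "Quick Brown-Fox"))  # True
```

Because the query string goes through the same `analyze` step as the stored value, a search for "BROWN" or "brown" finds the document, while the keyword comparison succeeds only on the exact original string.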
Analyzers
Analysis is done by Analyzers. Analyzers consist of three components (or phases):
- Zero or more character filters
  - The purpose of character filters is to manipulate the text before tokenization.
  - Examples of character filters include html_strip, mapping and pattern_replace.
- Tokenizer
  - The primary purpose of the tokenizer is to split text into terms.
  - Tokenizers belong to one of the following categories: word-oriented tokenizers (e.g. letter tokenizer), partial-word tokenizers (e.g. N-Gram tokenizer) and structured-text tokenizers (e.g. path tokenizer).
- Token filters
  - The primary purpose of token filters is to manipulate terms before adding them to the inverted index.
  - Examples of token filters include lowercase, uppercase, n-gram and stop.
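The three phases can be sketched as a small pipeline in Python. This only imitates the idea: `strip_html`, `tokenize`, `lowercase`, `remove_stop_words` and `run_analyzer` are hypothetical helpers written for this example, and the stop-word list is a short illustrative one, not the list Elasticsearch uses.

```python
import html
import re

STOP_WORDS = {"a", "an", "the", "is", "of"}  # illustrative only


def strip_html(text):
    """Character filter: drop tags and decode entities (loosely imitates html_strip)."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text):
    """Tokenizer: split the text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    """Token filter: lowercase every term."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Token filter: drop common words found in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def run_analyzer(text, char_filters, tokenizer, token_filters):
    """Apply character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


terms = run_analyzer(
    "<p>The Quick &amp; Brown Fox</p>",
    char_filters=[strip_html],
    tokenizer=tokenize,
    token_filters=[lowercase, remove_stop_words],
)
print(terms)  # ['quick', 'brown', 'fox']
```

The order of the token filters matters here: lowercasing runs first so that "The" is reduced to "the" before the stop-word check.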
Adding an analyzer to an existing index involves three steps: close the index, add the analyzer, and reopen the index.
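That sequence maps onto three REST calls. The sketch below only builds the requests as (method, path, body) tuples rather than sending them; the index name `my-index`, the analyzer name `my_analyzer` and the helper `add_analyzer_requests` are placeholders, and the endpoints (`_close`, `_settings`, `_open`) should be checked against the documentation for your Elasticsearch version.

```python
import json


def add_analyzer_requests(index, analyzer_name, analyzer_def):
    """Return the three requests, in order, for adding an analyzer to an existing index."""
    settings = {"analysis": {"analyzer": {analyzer_name: analyzer_def}}}
    return [
        ("POST", f"/{index}/_close", None),        # 1. close the index
        ("PUT", f"/{index}/_settings", settings),  # 2. add the analyzer
        ("POST", f"/{index}/_open", None),         # 3. reopen the index
    ]


requests_to_send = add_analyzer_requests(
    "my-index",
    "my_analyzer",
    {"type": "custom", "tokenizer": "standard", "filter": ["lowercase"]},
)
for method, path, body in requests_to_send:
    print(method, path, json.dumps(body) if body is not None else "")
```

Keep in mind that a closed index cannot serve searches or accept new documents until it is opened again.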
Built-in analyzers include standard, simple, stop, language, keyword, pattern and whitespace. These have different combinations of character filters, tokenizers and token filters. We can also create custom analyzers.
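A custom analyzer is declared in the index settings by naming its character filters, tokenizer and token filters. The following is a minimal sketch of a create-index request body written as a Python dictionary; the analyzer name `html_text` and the field names `body` and `tag` are invented for the example, and the typeless `mappings` layout assumes Elasticsearch 7 or later.

```python
import json

# Body for a create-index request (for example PUT /blog-posts, a hypothetical index).
create_index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "html_text": {  # hypothetical analyzer name
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Analyzed with the custom analyzer at index time and search time.
            "body": {"type": "text", "analyzer": "html_text"},
            # Not analyzed: matched only on the exact value.
            "tag": {"type": "keyword"},
        }
    },
}
print(json.dumps(create_index_body, indent=2))
```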
Additional Notes on Analyzers
- An inverted index is created for each text field.
- We can use the _analyze API to test how an analyzer processes a piece of text.
- An explicit stop-word analyzer may not be required in recent ES versions, as the relevance algorithms take care of stop words out of the box.
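A request to the _analyze API names an analyzer and supplies some sample text, and the response lists the tokens that the analyzer produces. The sketch below only builds such a request; the helper name `analyze_request` is made up, and nothing is sent to a cluster.

```python
import json


def analyze_request(text, analyzer="standard", index=None):
    """Build (method, path, body) for an _analyze call, optionally scoped to an index."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    return "POST", path, {"analyzer": analyzer, "text": text}


method, path, body = analyze_request("The Quick Brown Fox")
print(method, path, json.dumps(body))
```

Scoping the call to an index, as in `analyze_request("some text", analyzer="my_analyzer", index="my-index")`, is how a custom analyzer defined on that index would be tried out.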
- heartin's blog