Introduction to Elasticsearch

Submitted by heartin on Sat, 05/19/2018 - 21:10

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine for applications that have complex search features and requirements. (source=elastic.co).

Elasticsearch is based on Apache Lucene, which is a free and open-source information retrieval software library. It exposes an HTTP web interface and works with schema-free JSON documents. Elasticsearch is developed in Java and is released as open source. Official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy, Ruby and many other languages. (source=wikipedia).

According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine followed by Apache Solr, also based on Lucene. (source=wikipedia).

Important Concepts - Nodes, Cluster, Document, Index, Shards

A node is a single server that stores your data. Nodes belong to a cluster.
A cluster contains one or more nodes and provides indexing and search capabilities across all nodes.
A document is the basic unit of information in Elasticsearch.
Documents are grouped into indexes, i.e. an index is a collection of documents.
Each index can be split into multiple shards. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.

Additional Points:

Node
1. A node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want.
2. A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch.
3. In a single cluster, you can have as many nodes as you want. If there are no other Elasticsearch nodes currently running on your network, starting a single node will by default form a new single-node cluster named elasticsearch.
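As a rough illustration of the node name and cluster name described above, the following Python sketch reads both from the JSON that a node returns on its root endpoint. The address http://localhost:9200 is an assumption (a typical local install), and describe_node and fetch_node_info are hypothetical helper names, not part of any official client.

```python
import json
import urllib.request


def describe_node(info):
    """Pick the node name and cluster name out of a root-endpoint response."""
    return "node {!r} in cluster {!r}".format(info["name"], info["cluster_name"])


def fetch_node_info(base_url="http://localhost:9200"):
    """Send GET / to a node. Needs a running Elasticsearch at base_url (assumed)."""
    with urllib.request.urlopen(base_url + "/") as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling describe_node(fetch_node_info()) against a freshly started local node with default settings would be expected to report the cluster name elasticsearch.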
Cluster
1. A cluster is identified by a unique name which by default is "elasticsearch".
2. A node can only be part of a cluster if the node is set up to join the cluster by its name.
3. It is valid and perfectly fine to have a cluster with only a single node in it. You may also have multiple independent clusters each with its own unique cluster name.
4. Avoid reusing the same cluster name across environments; give each environment its own name instead (e.g. logging-dev, logging-stage, and logging-prod).
5. Each node in an Elasticsearch cluster can take on one or more of these roles: master, data, ingest. Roles are configured in the config/elasticsearch.yml file of the Elasticsearch distribution.
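To make the naming and role settings above more concrete, here is a small Python sketch that renders the corresponding lines of config/elasticsearch.yml. It assumes the 6.x-style boolean settings node.master, node.data and node.ingest (newer releases configure roles differently), and render_node_settings is a hypothetical helper written for this example only.

```python
def render_node_settings(cluster_name, node_name, master=True, data=True, ingest=True):
    """Return elasticsearch.yml lines for one node (assumes 6.x-style role flags)."""
    def flag(value):
        # YAML booleans are lowercase.
        return "true" if value else "false"

    return "\n".join([
        "cluster.name: " + cluster_name,
        "node.name: " + node_name,
        "node.master: " + flag(master),
        "node.data: " + flag(data),
        "node.ingest: " + flag(ingest),
    ])


# A dedicated data node for a production logging cluster (names are made up).
print(render_node_settings("logging-prod", "data-node-1", master=False, ingest=False))
```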
Index
1. An index is a collection of documents, split into shards to distribute data evenly among cluster nodes.
2. An index is identified by a name (all lowercase) and this name is used for performing indexing, search, update, and delete operations against the documents in it.
3. In a single cluster, you can define as many indexes as you want.
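The sketch below shows how creating an index might look over the HTTP interface, using only the Python standard library. It builds the request without sending it. The base URL http://localhost:9200, the index name customer and the helper build_create_index_request are assumptions made for illustration; the lowercase check mirrors the naming rule above.

```python
import json
import urllib.request


def build_create_index_request(index_name, shards=5, replicas=1,
                               base_url="http://localhost:9200"):
    """Build (but do not send) a PUT request that creates an index."""
    if index_name != index_name.lower():
        raise ValueError("index names must be lowercase: " + index_name)
    body = {"settings": {"number_of_shards": shards,
                         "number_of_replicas": replicas}}
    return urllib.request.Request(
        url="{}/{}".format(base_url, index_name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request("customer")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it to a running node.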
Document
1. A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order.
2. Documents are expressed as JSON: a set of key-value pairs (fields) whose values can be strings, numbers, booleans, arrays or nested objects.
3. Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.
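A document for a single customer could be indexed roughly as follows. This is a sketch under assumptions: the field names are invented, the base URL is a typical local address, _doc is used as the type name (a common convention in 6.x), and build_index_document_request is a hypothetical helper that only builds the request.

```python
import json
import urllib.request


def build_index_document_request(index_name, doc_id, document,
                                 doc_type="_doc",
                                 base_url="http://localhost:9200"):
    """Build (but do not send) a PUT request that indexes one JSON document."""
    return urllib.request.Request(
        url="{}/{}/{}/{}".format(base_url, index_name, doc_type, doc_id),
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Illustrative customer document; the fields are made up for this example.
customer = {"name": "John Doe", "age": 42, "active": True}
request = build_index_document_request("customer", 1, customer)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```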
Shards & Replicas
1. Once an index is replicated, it has primary shards (the originals that are replicated from) and replica shards (the copies of the primaries).
2. The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically at any time, but you cannot change the number of primary shards later. By default (as of the Elasticsearch 6.x releases current at the time of writing), each index is allocated 5 primary shards and 1 replica.
3. The mechanics of sharding are completely managed by Elasticsearch and are transparent to you as the user.
4. Replication provides high availability in case a shard/node fails. It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.
5. Behind the scenes, each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents.
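One way to see why the number of primary shards is fixed at index creation: a document's shard is derived from a hash of its routing value (by default the document ID) taken modulo the number of primary shards. The Python sketch below illustrates the idea only. zlib.crc32 is a stand-in hash chosen for this example, not the hash Elasticsearch uses, and pick_shard is a hypothetical helper.

```python
import zlib

# Lucene's per-index (and therefore per-shard) document limit quoted above.
MAX_DOCS_PER_SHARD = (2**31 - 1) - 128


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value to a shard number.

    zlib.crc32 is a stand-in hash for illustration only; Elasticsearch
    uses its own hash function internally.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


doc_ids = ["customer-{}".format(i) for i in range(10)]
print([pick_shard(doc_id, 5) for doc_id in doc_ids])  # placement with 5 shards
print([pick_shard(doc_id, 6) for doc_id in doc_ids])  # placement with 6 shards
print(MAX_DOCS_PER_SHARD)
```

Because the modulus changes with the shard count, the same document IDs generally map to different shards under 5 and 6 shards, so existing documents could no longer be found where lookups expect them.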
Near Realtime (NRT)
1. Elasticsearch is a near real time search platform: there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.
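The delay comes from the periodic refresh that makes newly indexed documents visible to search. When a caller cannot wait, it can ask for a refresh explicitly. The sketch below only builds the relevant URLs: the refresh query parameter on an indexing request, and the _refresh endpoint (called with POST). The base URL, index name and helper names are assumptions for illustration.

```python
import urllib.parse


def index_url(index_name, doc_id, refresh=None, doc_type="_doc",
              base_url="http://localhost:9200"):
    """URL for indexing a document, optionally with a refresh policy."""
    url = "{}/{}/{}/{}".format(base_url, index_name, doc_type, doc_id)
    if refresh is not None:
        url += "?" + urllib.parse.urlencode({"refresh": refresh})
    return url


def refresh_url(index_name, base_url="http://localhost:9200"):
    """URL of the refresh API (to be called with POST) for one index."""
    return "{}/{}/_refresh".format(base_url, index_name)


# Respond only once the document has become visible to search.
print(index_url("customer", 1, refresh="wait_for"))
# Force a refresh of the whole index.
print(refresh_url("customer"))
```

Forcing refreshes has a cost, so this is usually reserved for tests or for cases where a write must be searchable immediately.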
Query DSL
1. Elasticsearch provides a full Query DSL, based on JSON, to define queries.
2. Query clauses behave differently depending on whether they are used in query context or filter context.
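A bool query is a common place to see both contexts side by side. In the Python sketch below the request body is built as a dictionary and serialized to JSON. The clause under must runs in query context and contributes to the relevance score, while the clauses under filter run in filter context and only include or exclude documents. The field names title, status and publish_date are invented for this example.

```python
import json

query = {
    "query": {
        "bool": {
            # Query context: how well does the document match? Affects _score.
            "must": [
                {"match": {"title": "search"}},
            ],
            # Filter context: does the document match, yes or no? No scoring.
            "filter": [
                {"term": {"status": "published"}},
                {"range": {"publish_date": {"gte": "2015-01-01"}}},
            ],
        }
    }
}

# This JSON would be sent as the body of a search request.
print(json.dumps(query, indent=2))
```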
Type (Deprecated)
1. Within an index, you could define one or more types. A type is a logical category/partition of your index. Each type within an index could have different
2. Type is deprecated in ES 6. It is no longer possible to create multiple types in an index, and the whole concept of types will be removed in a later version. See Removal of mapping types.