Architecture of Elasticsearch

Learn about Elasticsearch’s core concepts and how it differs from relational databases.

Elasticsearch architecture

Elasticsearch is a distributed, open-source search and analytics engine designed for handling large volumes of data. Its unique architecture allows for real-time search and analytics, making it a popular choice for applications such as logging, full-text search, and business intelligence. One of the key features of Elasticsearch’s architecture is its use of a distributed data model, which allows for horizontal scalability and high availability.

Node

A node is a single running instance of Elasticsearch, typically hosted on its own physical or virtual server. Each node has a unique ID and a name, and it belongs to exactly one cluster.

Elasticsearch node

Cluster

A cluster is a collection of one or more nodes, usually distributed across separate machines, that work together to store data and serve requests. Each node in a cluster is assigned one or more roles, e.g., holding data.
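
The relationship between nodes, roles, and a cluster can be sketched as a small in-memory model. This is only an illustration, not how Elasticsearch is implemented: the `Node` and `Cluster` classes, the node names, and the cluster name are made up for this example, while `master`, `data`, and `ingest` are real Elasticsearch role names.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One running Elasticsearch instance (illustrative model only)."""
    node_id: str
    name: str
    roles: set[str]  # one or more roles, e.g. {"master", "data"}


@dataclass
class Cluster:
    """A named group of nodes that work together."""
    name: str
    nodes: dict[str, Node] = field(default_factory=dict)

    def join(self, node: Node) -> None:
        # Node IDs must be unique within the cluster.
        if node.node_id in self.nodes:
            raise ValueError(f"duplicate node id: {node.node_id}")
        self.nodes[node.node_id] = node

    def with_role(self, role: str) -> list[str]:
        """Return the sorted names of the nodes that carry the given role."""
        return sorted(n.name for n in self.nodes.values() if role in n.roles)


cluster = Cluster(name="demo-cluster")  # hypothetical cluster name
cluster.join(Node("n1", "node-1", {"master", "data"}))
cluster.join(Node("n2", "node-2", {"data"}))
cluster.join(Node("n3", "node-3", {"data", "ingest"}))

print(cluster.with_role("data"))    # ['node-1', 'node-2', 'node-3']
print(cluster.with_role("master"))  # ['node-1']
```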

Cluster visualization

Document

A document is a JSON object stored in an Elasticsearch index under a unique ID. Here is an example:

{
  "_id": "1",
  "name": "Pizza",
  "price": 12.99,
  "category": "Entree",
  "ingredients": [
    "dough",
    "sauce",
    "cheese",
    "toppings"
  ]
}
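
As a rough sketch of how such a document could be sent to Elasticsearch, the snippet below builds the pieces of an index request for the REST endpoint `PUT /<index>/_doc/<_id>`. It uses a trimmed version of the document above and does not contact a server. The index name `dish` and the helper `build_index_request` are assumptions made for illustration. Note that the ID travels in the request path rather than in the JSON body.

```python
from __future__ import annotations

import json

# Trimmed version of the example document; "_id" is kept here only so the
# helper can move it into the request path.
document = {"_id": "1", "name": "Pizza", "price": 12.99}


def build_index_request(index: str, doc: dict) -> tuple[str, str, str]:
    """Return (HTTP method, path, JSON body) for indexing one document."""
    # The body carries the document fields; the ID goes into the URL.
    source = {key: value for key, value in doc.items() if key != "_id"}
    path = f"/{index}/_doc/{doc['_id']}"
    return "PUT", path, json.dumps(source)


method, path, body = build_index_request("dish", document)
print(method, path)  # PUT /dish/_doc/1
print(body)
```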

Index

An index is a collection of documents that have similar characteristics. For example, the dish and actor indices below hold two documents each.


Dish index
{
  "_id": "1",
  "name": "Pizza",
  "price": 12.99,
  "category": "Entree",
  "ingredients": [
    "dough",
    "sauce",
    "cheese"
  ]
}
{
  "_id": "2",
  "name": "Spaghetti",
  "price": 10.99,
  "category": "Entree",
  "ingredients": [
    "eggs",
    "bacon",
    "parmesan cheese",
    "pepper"
  ]
}
Actor index
{
    "_id": "1",
    "name": "Leonardo DiCaprio",
    "age": 48,
    "nationality": "American",
    "occupations": [
        "actor",
        "film producer"
    ],
    "height": 1.83
}
{
    "_id": "2",
    "name": "John Smith",
    "age": 54,
    "nationality": "American",
    "occupations": [
        "actor",
        "rapper",
        "film producer"
    ],
    "height": 1.88
}
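
To show how documents end up in different indices, here is a minimal sketch that assembles a request body for the Elasticsearch bulk API (`POST /_bulk`). The body is newline-delimited JSON: an action line naming the target index and ID, followed by a line with the document fields. The documents are shortened versions of the ones above, the helper names are hypothetical, and nothing is sent over the network.

```python
from __future__ import annotations

import json

# Shortened documents, grouped by the index they belong to.
groups = {
    "dish": [
        {"_id": "1", "name": "Pizza", "price": 12.99},
        {"_id": "2", "name": "Spaghetti", "price": 10.99},
    ],
    "actor": [
        {"_id": "1", "name": "Leonardo DiCaprio", "age": 48},
        {"_id": "2", "name": "John Smith", "age": 54},
    ],
}


def bulk_lines(index: str, docs: list[dict]):
    """Yield an action line and a source line for each document."""
    for doc in docs:
        source = {key: value for key, value in doc.items() if key != "_id"}
        yield json.dumps({"index": {"_index": index, "_id": doc["_id"]}})
        yield json.dumps(source)


def build_bulk_body(groups: dict[str, list[dict]]) -> str:
    """Build the NDJSON body; the bulk API expects a trailing newline."""
    lines: list[str] = []
    for index, docs in groups.items():
        lines.extend(bulk_lines(index, docs))
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body(groups)
print(bulk_body)
```

Both indices use the ID `"1"` without conflict, because a document ID only has to be unique within its own index.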

Shard

A shard represents a partition of an index’s data. Each shard contains a subset of the index’s documents and is stored on a single node in the cluster. When we create an index, recent versions of Elasticsearch (7.0 and later) give it one primary shard by default, but we can configure it to have multiple shards that are distributed across different nodes.
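
The shard count is chosen through the index settings when the index is created. The sketch below only builds the request for the create-index endpoint (`PUT /<index>`) with the `number_of_shards` setting; the index name `dish` and the helper name are assumptions for illustration.

```python
from __future__ import annotations

import json


def build_create_index_request(index: str, number_of_shards: int) -> tuple[str, str, str]:
    """Return (HTTP method, path, JSON body) for creating an index."""
    if number_of_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    settings = {"settings": {"index": {"number_of_shards": number_of_shards}}}
    return "PUT", f"/{index}", json.dumps(settings)


create_method, create_path, create_body = build_create_index_request("dish", 3)
print(create_method, create_path)  # PUT /dish
print(create_body)
```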

Sharding is the process of dividing an index into smaller parts (shards) and distributing them across nodes in the cluster. This enables Elasticsearch to scale horizontally by adding more nodes to the cluster and allows for parallel processing of search requests across multiple shards.
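
One way to picture how a document is assigned to a shard is to hash a routing value (by default the document ID) and take it modulo the number of primary shards. The sketch below follows that simplified idea. It is an assumption-laden illustration: Elasticsearch uses a Murmur3 hash and a slightly more involved formula, whereas this example substitutes `zlib.crc32` from the Python standard library, so the shard numbers it produces will not match a real cluster.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Map a routing value to a shard number in [0, number_of_primary_shards)."""
    # crc32 is a stand-in hash for illustration, not the hash Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [str(i) for i in range(1, 13)]  # hypothetical document IDs
placement: dict[int, list[str]] = {}
for doc_id in doc_ids:
    placement.setdefault(pick_shard(doc_id, 3), []).append(doc_id)

for shard_number in sorted(placement):
    print(f"shard {shard_number}: {placement[shard_number]}")
```

Because the shard is derived from the hash and the shard count, the same ID always lands on the same shard, which is what lets any node locate a document without searching every shard.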

The visualization below demonstrates the distribution of the dish and actor indices across three shards that are spread across three different nodes.

Sharding visualization

The following are the benefits of using sharding in Elasticsearch:

  • Improved performance: By distributing data across multiple shards, search and indexing operations can be parallelized, resulting in improved performance.

  • Increased capacity: Sharding lets Elasticsearch scale horizontally so that it can handle larger volumes of data.

  • Improved flexibility: Sharding allows Elasticsearch to distribute data across different hardware or infrastructure, providing flexibility in terms of resource allocation.
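
The performance benefit comes from a scatter-gather pattern: the query is sent to every shard at the same time, and the partial results are merged. The following toy sketch imitates that with in-memory lists and a thread pool. The shard contents (including the made-up "Pizza Bianca" document) and the function names are hypothetical, and the substring match is far simpler than Elasticsearch's inverted index and relevance scoring.

```python
from concurrent.futures import ThreadPoolExecutor

# Three toy "shards", each holding a subset of the dish documents.
shards = [
    [{"_id": "1", "name": "Pizza", "price": 12.99}],
    [{"_id": "2", "name": "Spaghetti", "price": 10.99}],
    [{"_id": "3", "name": "Pizza Bianca", "price": 11.50}],
]


def search_shard(shard: list, term: str) -> list:
    """Return the documents in one shard whose name contains the term."""
    term = term.lower()
    return [doc for doc in shard if term in doc["name"].lower()]


def search(shards: list, term: str) -> list:
    """Query all shards in parallel (scatter) and merge the hits (gather)."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = list(pool.map(search_shard, shards, [term] * len(shards)))
    merged = [doc for hits in partial_results for doc in hits]
    return sorted(merged, key=lambda doc: doc["_id"])


hits = search(shards, "pizza")
print([doc["_id"] for doc in hits])  # ['1', '3']
```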

Replicas

A replica is a copy of a shard. Replicas are used to provide high availability and improve fault tolerance. When a shard has one or more replicas, if the primary shard fails or becomes unavailable, one of the replicas can be promoted to primary to ensure that the index remains available for search and indexing operations.

Replicas can also be used to scale search performance. A search request can be served by either the primary or a replica copy of each shard, so adding replicas spreads the search load across more nodes and improves search throughput.
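
Replica placement and promotion can be sketched with a small simulation. It assumes a simple round-robin placement that keeps a replica on a different node from its primary (Elasticsearch also never puts a replica on the same node as its primary, but its real allocation logic is more sophisticated). The function names and node names are hypothetical, and the sketch does not model the cluster creating fresh replicas after a failure.

```python
from __future__ import annotations


def allocate(num_shards: int, num_replicas: int, nodes: list[str]) -> dict[int, dict]:
    """Place copy c of shard s on node (s + c) % len(nodes)."""
    if num_replicas >= len(nodes):
        raise ValueError("need more nodes than replicas to keep all copies apart")
    table = {}
    for shard in range(num_shards):
        copies = [nodes[(shard + c) % len(nodes)] for c in range(num_replicas + 1)]
        table[shard] = {"primary": copies[0], "replicas": copies[1:]}
    return table


def fail_node(table: dict[int, dict], dead: str) -> dict[int, dict]:
    """Return a new table with the dead node removed and replicas promoted."""
    new_table = {}
    for shard, copies in table.items():
        replicas = [node for node in copies["replicas"] if node != dead]
        primary = copies["primary"]
        if primary == dead:
            if not replicas:
                raise RuntimeError(f"shard {shard} has lost all of its copies")
            # Promote the first surviving replica to primary.
            primary, replicas = replicas[0], replicas[1:]
        new_table[shard] = {"primary": primary, "replicas": replicas}
    return new_table


nodes = ["node-1", "node-2", "node-3"]
before = allocate(num_shards=3, num_replicas=1, nodes=nodes)
after = fail_node(before, "node-1")

print(before[0])  # {'primary': 'node-1', 'replicas': ['node-2']}
print(after[0])   # {'primary': 'node-2', 'replicas': []}
```

After `node-1` fails, every shard still has a primary on a surviving node, so the index stays available, although some shards are left without a replica until new copies are created.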

Replica visualization

Elasticsearch vs. RDBMS

There are some similarities between Elasticsearch and traditional relational databases (RDBMS). The following points compare Elasticsearch terms with their closest RDBMS counterparts:

  • Cluster vs. database: In an RDBMS, a database groups a set of schemas and their tables. In Elasticsearch, a cluster groups the set of available indices.

  • Index vs. table: A table is a collection of rows, whereas an index is a collection of documents.

  • Document vs. row: Documents and rows represent data stored in the index and table, respectively. Rows must conform to the table’s fixed set of columns, whereas documents are more flexible and can differ in the fields they contain.

  • Field vs. column: Both fields and columns represent individual data attributes within a structured data model.

  • Shard: The concept is the same in both Elasticsearch and RDBMS: a partition of the data that can be stored on a separate machine.

  • Replica: The concept is the same in both Elasticsearch and RDBMS: a copy of the data kept for availability and fault tolerance.
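
The difference between rows and documents can be illustrated with a short, self-contained comparison. It uses an in-memory SQLite table (from the Python standard library) as a stand-in for an RDBMS and a plain list of dictionaries as a stand-in for an index. The "Curry" record and its `spicy` field are made up for this example. With its default dynamic mapping, Elasticsearch would typically accept a new field like this, but the snippet does not talk to Elasticsearch.

```python
import sqlite3

# RDBMS side: the table fixes the set of columns up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dish (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.execute("INSERT INTO dish (id, name, price) VALUES (?, ?, ?)", (1, "Pizza", 12.99))

try:
    # "spicy" is not a column of the table, so this row is rejected.
    conn.execute("INSERT INTO dish (id, name, spicy) VALUES (?, ?, ?)", (2, "Curry", True))
    row_rejected = False
except sqlite3.OperationalError:
    row_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM dish").fetchone()[0]
conn.close()

# Document side: each document carries its own fields.
dish_index = [
    {"name": "Pizza", "price": 12.99},
    {"name": "Curry", "spicy": True},
]
fields_per_document = [sorted(doc) for doc in dish_index]

print(row_rejected, row_count)
print(fields_per_document)
```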

Elasticsearch vs. RDBMS