A Brief Look at the Definition of Elasticsearch

[Excerpted from Elasticsearch Server, 2nd Edition, page 12]

The basics of Elasticsearch
Elasticsearch is an open source search server project started by Shay Banon and published in February 2010. Since then, the project has grown into a major player in the field of search and data analysis solutions and is widely used in many well-known and lesser-known search applications. In addition, due to its distributed nature and real-time capabilities, many people use it as a document store.

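Put simply, Elasticsearch is an open source project started by Shay Banon and made public in February 2010. It has since become an indispensable player among search and data analysis solutions and is used in a great many applications. Because it is distributed and processes data in real time, many users also rely on it as a document store.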

Index
An index is the logical place where Elasticsearch stores logical data, so that it can be divided into smaller pieces. If you come from the relational database world, you can think of an index like a table. However, the index structure is prepared for fast and efficient full-text searching, and in particular, does not store original values. If you know MongoDB, you can think of the Elasticsearch index as a collection in MongoDB. If you are familiar with CouchDB, you can think of an index as you would a CouchDB database. Elasticsearch can hold many indices located on one machine or spread over many servers. Every index is built of one or more shards, and each shard can have many replicas.

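To restate: an index can be seen as a logical space in which Elasticsearch stores logical data, and that data can be split into smaller pieces. If you have worked with relational databases, you can think of an index as a table. The index structure, however, is designed for fast and efficient full-text search and does not keep the original values. If you know MongoDB, you can treat an Elasticsearch index as a MongoDB collection; if you know CouchDB, you can treat it as a CouchDB database. Elasticsearch can manage many indices on a single machine or spread them across many machines. Every index is made up of one or more shards, and each shard can have many replicas.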

Document
The main entity stored in Elasticsearch is a document. Using the analogy to relational databases, a document is a row of data in a database table. When you compare an Elasticsearch document to a MongoDB document, you will see that both can have different structures, but the document in Elasticsearch needs to have the same type for all the common fields. This means that all the documents with a field called title need to have the same data type for it, for example, string.
Documents consist of fields, and each field may occur several times in a single document (such a field is called multivalued). Each field has a type (text, number, date, and so on). Field types can also be complex: a field can contain other subdocuments or arrays. The field type is important for Elasticsearch because it tells Elasticsearch how operations such as analysis or sorting should be performed. Fortunately, the type can be determined automatically (however, we still suggest using mappings). Unlike relational databases, documents don’t need to have a fixed structure: every document may have a different set of fields, and the fields don’t have to be known during application development. Of course, one can force a document structure with the use of a schema. From the client’s point of view, a document is a JSON object (see more about the JSON format at http://en.wikipedia.org/wiki/JSON). Each document is stored in one index and has its own unique identifier (which can be generated automatically by Elasticsearch) and document type. The identifier only needs to be unique within a document type, which means that in a single index, two documents can have the same identifier if they are not of the same type.
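
As a rough illustration of the description above, the following Python sketch builds a document as a JSON object and shows identifiers being scoped by document type. The field names, the sample values, and the dictionary used to stand in for an index are all hypothetical; nothing here talks to a real Elasticsearch server.

```python
import json

# A hypothetical blog article; every field name here is illustrative.
article = {
    "title": "Elasticsearch basics",       # text field
    "tags": ["search", "elasticsearch"],   # multivalued field
    "views": 42,                           # numeric field
    "published": "2014-02-01",             # date, sent as a string
    "author": {"name": "Jane Doe"},        # subdocument
}

# From the client's point of view the document is just JSON text.
payload = json.dumps(article, sort_keys=True)

# A plain dict standing in for one index. Keys are (document type, ID),
# so the same ID can be reused by documents of different types.
index_contents = {
    ("article", "1"): article,
    ("comment", "1"): {"text": "Nice post", "article_id": "1"},
}

print(payload)
print(len(index_contents))
```

The tuple key is only a convenient way to model "unique within a document type" in a few lines; it is not how Elasticsearch stores identifiers internally.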

Document type
In Elasticsearch, one index can store many objects with different purposes. For example, a blog application can store articles and comments. The document type lets us easily differentiate between the objects in a single index. Every document can have a different structure, but in real-world deployments, dividing documents into types significantly helps in data manipulation. Of course, one needs to keep the limitations in mind; that is, different document types can’t set different types for the same property. For example, a field called title must have the same type across all document types in the same index.
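
The limitation described above can be pictured with a small check written in Python. This is only a sketch: the `find_type_conflicts` helper and the simplified `{document_type: {field: type}}` layout are assumptions made for this example, not part of Elasticsearch.

```python
def find_type_conflicts(mappings):
    """Return {field: set_of_types} for fields declared with more than one type.

    `mappings` is a hypothetical dict: {document_type: {field_name: field_type}}.
    """
    seen = {}
    for doc_type, fields in mappings.items():
        for field, field_type in fields.items():
            seen.setdefault(field, set()).add(field_type)
    return {field: types for field, types in seen.items() if len(types) > 1}


# Both document types agree that "title" is a string, so there is no conflict.
blog_index = {
    "article": {"title": "string", "published": "date"},
    "comment": {"title": "string", "text": "string"},
}

# Here "title" is declared with two different types in the same index.
broken_index = {
    "article": {"title": "string"},
    "comment": {"title": "integer"},
}

print(find_type_conflicts(blog_index))
print(find_type_conflicts(broken_index))
```

The second call reports `title`, which is the situation the text says a single index cannot accept.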

Mapping
In the section about the basics of full-text searching (the Full-text searching section), we wrote about the process of analysis—the preparation of input text for indexing and searching. Every field of the document must be properly analyzed depending on its type. For example, a different analysis chain is required for the numeric fields (numbers shouldn’t be sorted alphabetically) and for the text fetched from web pages (for example, the first step would require you to omit the HTML tags as it is useless information—noise). Elasticsearch stores information about the fields in the mapping. Every document type has its own mapping, even if we don’t explicitly define it.
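
To see why the field type matters, consider the two cases mentioned above: numbers that should not be sorted alphabetically, and web page text whose HTML tags are noise. The Python sketch below uses a deliberately simplified, hypothetical `mapping` dictionary and a crude regular-expression tag stripper. Neither reflects Elasticsearch's actual mapping syntax or analyzers.

```python
import re

# Hypothetical, simplified mapping: field name -> kind of processing.
# This is not Elasticsearch's mapping syntax.
mapping = {"views": "number", "body": "html_text"}


def analyze_html(text):
    """Crude stand-in for an analysis chain: drop tags, lowercase, split into words."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"\w+", without_tags.lower())


def prepare(field, value):
    """Process a raw value according to the field's declared kind."""
    kind = mapping.get(field)
    if kind == "number":
        return int(value)
    if kind == "html_text":
        return analyze_html(value)
    return value  # fields without a mapping entry are passed through untouched


raw_views = ["10", "9", "100"]
print(sorted(raw_views))                               # ['10', '100', '9']
print(sorted(prepare("views", v) for v in raw_views))  # [9, 10, 100]
print(prepare("body", "<p>Hello <b>World</b></p>"))    # ['hello', 'world']
```

Sorting the raw strings puts "100" before "9", while the values treated as numbers come out in the expected order.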

Key concepts of Elasticsearch
Now, we already know that Elasticsearch stores data in one or more indices. Every index can contain documents of various types. We also know that each document has many fields and how Elasticsearch treats these fields is defined by mappings. But there is more. From the beginning, Elasticsearch was created as a distributed solution that can handle billions of documents and hundreds of search requests per second. This is due to several important concepts that we are going to describe in more detail now.

Node and cluster
Elasticsearch can work as a standalone, single-search server. Nevertheless, to be able to process large sets of data and to achieve fault tolerance and high availability, Elasticsearch can be run on many cooperating servers. Collectively, these servers are called a cluster, and each server forming it is called a node.

Shard
When we have a large number of documents, we may come to a point where a single node is not enough, for example because of RAM limitations, hard disk capacity, insufficient processing power, or an inability to respond to client requests fast enough. In such a case, data can be divided into smaller parts called shards (where each shard is a separate Apache Lucene index). Each shard can be placed on a different server, and thus, your data can be spread among the cluster nodes. When you query an index that is built from multiple shards, Elasticsearch sends the query to each relevant shard and merges the results in such a way that your application doesn’t know about the shards. In addition to this, having multiple shards can speed up the indexing.

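In other words, when we have a large number of documents, a single node may not be able to answer user requests quickly enough, whether because of memory limits, disk space limits, or insufficient processing power. The data can therefore be divided into smaller pieces called shards, each of which is a separate Apache Lucene index. Each shard can be placed on a different server, so your data can be spread across the nodes of the cluster. When you query an index made up of multiple shards, Elasticsearch sends the query to the nodes holding those shards and merges the results, and your application never sees what Elasticsearch does behind the scenes. Using multiple shards can also speed up indexing.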
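
The query flow described above (send the query to each shard, then merge the results) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: shards are plain dictionaries, documents are assigned to shards with a CRC32 hash of their ID, and the score is simply how often the term occurs. Elasticsearch's real routing and scoring work differently.

```python
import heapq
import zlib

NUM_SHARDS = 3  # hypothetical shard count


def shard_for(doc_id):
    """Pick a shard with a stable hash. Using CRC32 is an assumption of this sketch."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def index_documents(docs):
    """Spread documents across shards; each shard is a plain dict here."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id)][doc_id] = text
    return shards


def search_shard(shard, term):
    """Return (score, doc_id) pairs for one shard; the score is the term count."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return hits


def search(shards, term, size=10):
    """Query every shard, then merge the partial results into one ranked list."""
    partial = [hit for shard in shards for hit in search_shard(shard, term)]
    return [doc_id for score, doc_id in heapq.nlargest(size, partial)]


docs = {
    "1": "search server for search applications",
    "2": "a distributed document store",
    "3": "real time search",
}
shards = index_documents(docs)
print(search(shards, "search"))  # ['1', '3']
```

The caller of `search` only sees one ranked list, which mirrors the point that the application does not need to know about the shards.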

Replica
In order to increase query throughput or achieve high availability, shard replicas can be used. A replica is just an exact copy of the shard, and each shard can have zero or more replicas. In other words, Elasticsearch can have many identical shards, and one of them is automatically chosen as the place where operations that change the index are directed. This special shard is called a primary shard, and the others are called replica shards. When the primary shard is lost (for example, a server holding the shard data is unavailable), the cluster will promote a replica to be the new primary shard.
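
The promotion behaviour described above might be modelled like this in Python. The `ShardGroup` class, the node names, and the rule of promoting the first replica in the list are all assumptions made for this sketch; the text does not say how Elasticsearch chooses which replica to promote.

```python
class ShardGroup:
    """One logical shard: a primary copy plus zero or more replica copies."""

    def __init__(self, primary_node, replica_nodes):
        self.primary = primary_node
        self.replicas = list(replica_nodes)

    def write_target(self):
        """Operations that change the index are directed to the primary."""
        return self.primary

    def node_lost(self, node):
        """Drop the copy held by `node`; promote a replica if the primary was lost."""
        if node == self.primary:
            if not self.replicas:
                raise RuntimeError("no replica left to promote")
            # Assumption: promote the first remaining replica.
            self.primary = self.replicas.pop(0)
        elif node in self.replicas:
            self.replicas.remove(node)


group = ShardGroup("node-a", ["node-b", "node-c"])
group.node_lost("node-a")      # the server holding the primary goes away
print(group.write_target())    # node-b
print(group.replicas)          # ['node-c']
```

With zero replicas, losing the primary leaves nothing to promote, which is why the sketch raises an error in that case.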

Gateway
Elasticsearch handles many nodes. The cluster state is held by the gateway. By default, every node has this information stored locally, which is synchronized among nodes. We will discuss the gateway module in The gateway and recovery modules section of Chapter 7, Elasticsearch Cluster in Detail.
