Hello! Welcome to part 2 of the “How search engines work” series. The previous part was all about the basic concepts behind a search engine at a low level; if you want to refer back to it, feel free to read it here. This part is focused on how things work in Elasticsearch and how to get started with it.
What is Elasticsearch? (version 5.6)
It is a highly scalable, open-source, full-text search and analytics engine. Let’s understand what these fancy bold words actually mean:
- Highly scalable : In today’s world there are lots and lots of data sources emitting data at breakneck speed, and to add to the complexity, the data comes in many different types. Serving search requests over such a huge data set requires a highly scalable implementation of search algorithms. This means that as the data grows, your search system can easily grow with it (by adding nodes) and keep the time taken to serve search requests in check.
- Open source : The code is open for all. Open-source projects have their own advantages: anyone can contribute features, improvements and bug fixes, and there is usually a huge community around them, which helps developers get started quickly and resolve any issues they face while developing.
- Search and analytics : You can search a collection of full-text documents for keywords and also run analytics on the data. Elasticsearch is capable of providing near-real-time search and analytics results on huge volumes of data. More details later in the post.
Use cases where Elasticsearch can be a fit:
- Catalog search for retail websites.
- Autocomplete features.
- Analysing, aggregating and parsing large amounts of log and event data in production systems that process huge volumes and generate logs at a very high rate (the ELK stack).
- Large applications like Stack Overflow or Quora, where you want to search the answers provided by users on specific topics.
The basics:
- Node : A server that is part of a group known as a cluster. It stores data and takes part in cluster activities like indexing and searching. Nodes join a cluster by its name; by default they join a cluster named “elasticsearch”.
- Cluster : A group of nodes that together hold all your data and provide indexing and search capabilities over it. A cluster has a name, which nodes use as an identifier to join it. Giving clusters distinct names is important to keep your environments (for example, development and production) separate.
- Index : A collection of documents that are logically related or belong to a similar category. For example, data about your customers can be grouped into one index, catalog data into another, and so on. You can think of it as something similar to a schema in an RDBMS. An index has a name as its identifier (all lowercase).
- Type : A type can be imagined as something like a table in an RDBMS: it stores data that share common fields, and it is used to partition an index based on the structure of the data. For example, for a retail website, “catalog” can be the index storing catalog data, within which you can have a type “products” for product data, “price” for pricing-related data, and so on.
- Documents : The smallest unit of data that can be indexed, like a row in a table. A document is represented in JSON, a well-known standard for data exchange on the internet.
- Shards : As mentioned earlier, Elasticsearch is primarily used with huge data sets. When you create an index for a subset of your data, it is quite possible that the index grows beyond the disk space available on a single node — it can run into terabytes. What do you do in such cases? Sharding is the answer. Shards are parts of your index that can be stored on separate nodes while still belonging to one index. When you define an index you can set the number of shards. Each shard acts as a separate index but is logically connected to its parent index. This enables horizontal scaling of your content and lets you parallelise tasks across shards.
- Replicas : Now, since we are storing subsets of the data across nodes, there is a risk that if a node dies you lose its data, and that’s the last thing you want. To avoid that you can use replication. Replicating data provides fault tolerance and, in turn, high availability. A replica is never stored on the same node as its primary shard (obviously), but on some other node in the cluster. So when one node dies, you still have your data: for each data set you now have a primary shard and replica shards.
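The shard idea becomes concrete once you see how a document is routed to a shard. In this version Elasticsearch computes shard_num = hash(_routing) % number_of_primary_shards, where _routing defaults to the document id. A minimal sketch (the real hash is murmur3; the toy byte-sum hash below is just a stand-in for illustration):

```python
# Sketch of Elasticsearch's document-to-shard routing:
#   shard_num = hash(_routing) % number_of_primary_shards
# _routing defaults to the document id. Elasticsearch really uses a
# murmur3 hash; a simple byte-sum hash stands in here for illustration.

def route(doc_id: str, num_primary_shards: int) -> int:
    h = sum(doc_id.encode("utf-8"))  # stand-in for murmur3
    return h % num_primary_shards

# Spread some example documents over an index with 5 primary shards.
for doc_id in ["product-1", "product-2", "product-3"]:
    print(f"{doc_id} -> shard {route(doc_id, 5)}")
```

Because the formula depends on the shard count, the number of primary shards is fixed when the index is created: changing it later would route existing document ids to the wrong shards. That is one reason you scale reads with replicas rather than by adding primary shards.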
Elasticsearch requires at least Java 8 (the latest update is recommended), so make sure you have that. You can install Elasticsearch as follows:
-> Download the tar: curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.3.tar.gz
-> Extract it: tar -xvf elasticsearch-5.6.3.tar.gz
-> Change to the binary directory: cd elasticsearch-5.6.3/bin
-> Start a node in a single-node cluster: ./elasticsearch
-> On a Mac it is even simpler if you use Homebrew: brew install elasticsearch
This will create a cluster named “elasticsearch” (the default) and a node named with a randomly generated UUID. If you want to override these, you can simply provide additional arguments while starting the node:
./elasticsearch -Ecluster.name=<<your cluster name>> -Enode.name=<<Your node name>>
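In the same spirit, the shard and replica counts discussed above are set per index at creation time. As a sketch, a settings body like the one below could be sent with curl -XPUT 'localhost:9200/catalog' (the index name “catalog” is just an example; in 5.x an index defaults to 5 primary shards and 1 replica if you don’t specify anything):

```json
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```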
Access your cluster:
Now that you have your cluster up and running with a node, you will want to access it. Elasticsearch provides a rich set of RESTful APIs for that; by default it listens on port 9200, so a plain curl 'localhost:9200' returns basic information about your node.
For hitting REST APIs, Postman is a great UI tool. Download Postman, or use the plain curl command if that works for you.
Using the _cat API we can check the health of our cluster: curl -XGET 'localhost:9200/_cat/health?v'. The response lists your cluster with its status, number of nodes, shards and so on.
So with this you have your Elasticsearch cluster up and running. In part 3 of the series we will store our own data in our ES cluster, index it and perform some more complex queries.
I hope you liked the article. Please share your feedback: corrections, suggestions and improvements are all welcome. Stay tuned for part 3.