So lets start with the steps involved in building fast search queries :
1) Lexical analysis for indexing
2) Removing stop words
3) Stemming
4) Synonyms
5) Persistence
6) Ranking
Lexical Analysis :
In this step the huge text is converted into group of words (also known as tokens).Which means that your text will be intelligentlyconverted into a subset which contains unique set of words on the basis of which the search will be performed.What next ?
Do you remember the index page at the back of a book that you love ? Just think for a while what it contains. Yes, it has a list of words followed by page numbers.Take a look at the below image if you don’t remember.
In search engine world this is known as an inverted index. So basically you have a reverse mapping of words to the page numbers. To understand this clearly, let’s take an example.
Suppose you have to search for the count of a particular word in a huge number of documents.How would you do that ? The easiest approach is to read every page of every document and increment your count when you find the word.After reading every document you would have your answer.
After some time you again have to refer to that word, but wait… Would you scan the documents again ? Lets say you do it, but what happens when the query comes again ? You can’t always read all the documents. Phew, this is very tiring and slow process. After thinking for a while, you come up with an idea to note down the page numbers for the words you encounter.Such that for the next query you just have to refer to the page numbers and jump directly on to that.
When you do this for all the words (that you really care about) you will have a document known as the Inverted Index.
So what is the idea – Create an index of all the important words and store it somewhere.
General definition – Inverted index is a mapping between “terms” (the actual content/words/tokens) and the “postings” (documents in which the word appear).
Removing stop words:
Generally in huge text documents there are lot of words which are not useful for your search queries.These words are known as stop words. For example a, an , is , the , to etc.These are not worth indexing, and if you do that you will end up growing your index size unnecessarily.So a list of stop words needs to be identified and maintained which is used while indexing . Any words which is a stop word will not be indexed.
Common stop-words include articles, prepositions and one-letter words.
Stemming :
Stemming generally refers to stripping off letters from words and bring them to their root form.For example fishing,fisherman,fisher,fishy all come from the word fish. This important step takes care of removing such redundancy.The idea here is that we cannot and should not store all the grammatical forms of a single word.The user who is searching might mean a different grammatical form of a word just by querying the stem word.More examples : “run” -> “running”,”runner” , “swim” -> “swimming”,”swims” etc.
Synonym Database (Lemmatization) :
A synonym database needs to be constructed to search on words with equivalent meanings.The implementation is exactly the same as stemming , the only difference is that keywords are mapped to the words with similar meaning as opposed to different grammatical forms of the keyword.Example : car -> vehicle, Automobile -> vehicle , plane -> vehicle. This gives user a capability to search actual words by querying the synonyms.
Persistence:
Persistent indices allow for quick retrieval of previously indexed information.
All the important information above a word is pre-processed as described in the previous steps and stored , which makes the information retrieval fast.
Relevance Ranking :
Keywords are assigned a relevance ranking, which is the calculation of a keyword’s frequency relative to the total number of words in the document.
This ensures that the search results are more relevant to what the user needs.
The relevance rank aids the end-user in locating the desired information by indicating which results (especially in a large result set) are more likely yield pertinent information.
After all the above steps, a search engine becomes capable of solving search queries in huge amount of data with improved user experience.
Next topic is to know about how Elastic search does it ? Leaving you with a small introduction, details will be covered in part-2 of this article.
Elastic search is a flexible, powerful , distributed real time search and analytics engine. It is easy to setup and provides following features :
- Real time analytics
- Distributed
- High availability
- Full text search
- Document oriented (Json in and Json out)
- Schema free (Json with free schema)
- Provide RESTful api
Cheers,
Kaivalya Apte
[…] go behind a search engine at a low level. If you want to refer the previous part, feel free and read it here. This part will be focussed on how things work in elastic search and how to get started with […]