Some background on how ElasticSearch indexes documents
For those who already have a back ground on Elastic Search, it is just a special purpose full text search document based data store with real time indexing abilities. How does Elastic Search does this? If you are already aware of the Search engines and the underlying data structure that’s exactly what Elastic Search does as well. Elastic Search creates an inverted index data structure. For some idea on inverted index refer this link. Basically what Elastic Search does when you insert a document is it creates or updates an existing inverted index of all the fields in the document. So potentially you could search the entire document using any of the fields with different querying features. Elastic Search take do an index as you go approach, it means as you insert or update documents Elastic Search can work on the inverted index in real time.
# Non-Hadoop Elastic Search Cluster (AWS EC2)
Lets now try and understand how ES handles scalability using horizontal shards. Here is a good resource for installing elastic search on AWS EC2. There is nothing different about ES automatic sharding of an index when compared to a Hash Table. Each and every document id automatically assigned or manually assigned is hashed to determine on which bucket or shard the document has to go. By default elastic search uses 5 shards with 5 replicas and can be overridden in the. yaml file as desired.
Lets walkthrough a simple example. I am assuming we have started n (here n=5) EC2 instances and used the above tutorial to create a simple ES cluster. Say we have created an index I1 with a type and started indexing documents. Based on the document id and hash function used by ES the documents are sharded across the 5 shards. The figure below shows a sample distribution of documents for index I1 into 3 shards. If you see S1 maps to document id ranges D1 to D10. S1 is shard1 (nothing but EC2 instance 1) and D1, D2 are document ids for simplicity.
Elastic Search router knows how to which shards the query should be routed for better performance.
# Hadoop Elastic Search Cluster (AWS EMR)
Lets walkthrough the same example with ES on Hadoop. ES comes with elasticsearch-hadoop Hadoop plugin for high performance queries (order of milli-seconds). Here are some of the promises of this plugin.
Shards play a critical role when reading information from ES. Since it acts as a source, elasticsearch-hadoop will create one Hadoop InputSplit per ES shard; that is given a query that works against index I1, elasticsearch-hadoop will dynamically discover the number of shards backing I1 in this case S1, S2 and S3 and then for each shard will create an input split (which will determine the number of Hadoop tasks to be executed). In our case 3 M/R tasks
if you have a requirement for a high performant distributed search capability elastic search on Hadoop is the way to go.