When to MapReduce?

This is a 2-minute read.

Someone asked me which problems MapReduce is not good for. I am flipping the question and answering which problems it works best for. Compare this with being asked when to use recursive logic versus iterative logic: there is a bit of a grey area there. Some problems, like graph traversal, clearly lend themselves to recursion, but other factors such as readability and performance cost also come into play. With MapReduce there is NO grey area: look at the use cases and you will see a clear pattern emerge.

Any simple, repeatable problem that need not be solved sequentially (like counting words in a large corpus of text files) lends itself nicely to MapReduce. You could go through one file at a time and keep updating a set of <K,V> pairs (word, count), but a smarter way is to parallelize the counting, say with one counting unit per file, and then merge the results to get the total counts. MapReduce is nothing but a fancy term for this “parallelize and combine” technique. The keywords are “simple, repeatable and large”. So automatically assuming that MapReduce is for solving hard problems is a major misconception.
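To make the word count idea concrete, here is a minimal sketch in plain Python. It is not Hadoop or any real MapReduce framework: the function names (map_count, reduce_counts, word_count) and the sample documents are made up for illustration, in-memory strings stand in for files, and a local thread pool stands in for the cluster of counting units.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_count(text: str) -> Counter:
    """Map step: count the words of ONE document, independently of the others."""
    return Counter(text.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial <word, count> results into one."""
    return left + right


def word_count(documents: list[str]) -> Counter:
    # Assumption: threads stand in for the workers of a real cluster.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_count, documents))
    # Combine the per-document counts into the total counts.
    return reduce(reduce_counts, partials, Counter())


# Hypothetical sample corpus; each string plays the role of one file.
documents = [
    "the quick brown fox",
    "the lazy dog",
    "the fox jumps over the dog",
]
totals = word_count(documents)
print(totals.most_common(3))
```

Because each map_count call touches only its own document, the calls can run in any order or on any machine; only the merge step needs to see more than one result. That independence is what makes the problem “simple and repeatable” in the sense above.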

Use cases

In fact, search engine companies invented it to parallelize the simple task of crawling through webpages to build a search index. Additional use cases can be found in this presentation. One of them is the New York Times use case: they converted 4TB of scanned articles to 1.5TB of PDF documents using 100 AWS EC2 instances in 24 hours. Again, a simple yet large problem that can be parallelized and combined.

The conversation actually moved on to when to choose raw M/R, special-purpose data flow processors like Storm or Spark, or MPP query engines like Impala or Drill, and which use cases each of them works best for.

That will be the subject of the next post. Stay tuned!