Inception of MapReduce in large-scale web computing


In this article, I highlight how MapReduce came into play, why it is useful, and how it improves the performance and speed of web-scale computing.

MapReduce is a programming model, with an associated implementation, for generating and processing enormous volumes of data in the form of data sets. The user specifies a map function and a reduce function, and many real-world jobs can be expressed in this model. The map function processes input records and emits intermediate key/value pairs, with this work spread across the nodes of a distributed cluster; the reduce function then merges all intermediate values that share the same key into the final results. The framework has been designed to be fault tolerant: every node in the cluster reports back at regular intervals with the status of the work assigned to it, and work on a node that falls silent is rescheduled elsewhere.
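
The canonical illustrative example is counting word occurrences across many documents. Below is a minimal sketch of the two user-supplied functions in Python (the names map_fn and reduce_fn are illustrative, not part of any particular framework):

    def map_fn(doc_id, text):
        # Map: emit an intermediate (key, value) pair for every word seen.
        for word in text.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Reduce: merge all values that share the same key into one result.
        yield (word, sum(counts))

The framework, not the user, is responsible for grouping every (word, 1) pair by its key and feeding each group to reduce_fn; the user only writes the two functions above.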


Why did MapReduce gain momentum?


Enumerated below are the reasons that made MapReduce popular:

1. Parallel programming is hard. Through MapReduce, programmers can express their computations succinctly, efficiently and with ease, and the model is expressive enough to cover a wide range of real-world tasks.

2. It is designed to work hand in hand with a distributed file system on the same parallel hardware, which yields performance benefits through load balancing and by exploiting locality between data and computation (scheduling map tasks close to the data they read).

3. Extreme-scale web infrastructure involves a vast number of components and massive volumes of stored data. A system of that size must recover from failures automatically and tolerate server and software faults of all kinds, and MapReduce builds this recovery into the framework rather than leaving it to each application.

How does MapReduce work?


MapReduce is a model that allows processing composed of map and reduce operations to be carried out on distributed data. The prerequisite is that each map operation must be independent of the others, so that the map calls can run in parallel. In practice, this parallelism is limited by the number of independent data sources and by the number of processing units near each source. Similarly, the reduction is carried out by a set of reducers, provided the reduction operation is associative; the one condition is that all outputs sharing the same key must be presented to the same reducer at the same time.
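
As a rough sketch of these conditions, the fragment below (plain Python, standard library only) runs independent map calls in parallel, then groups pairs by key so each reducer sees all of a key's values at once:

    from multiprocessing import Pool
    from collections import defaultdict
    from itertools import chain

    def map_fn(line):
        # Independent of every other call, so the calls can run in parallel.
        return [(word, 1) for word in line.split()]

    if __name__ == "__main__":
        lines = ["a b a", "b c", "a c c"]
        with Pool() as pool:
            mapped = pool.map(map_fn, lines)   # parallel map phase
        # Shuffle: gather all values sharing a key for a single reducer.
        groups = defaultdict(list)
        for key, value in chain.from_iterable(mapped):
            groups[key].append(value)
        # The reduction (addition here) is associative, so values may be
        # combined in any order once grouped.
        print({key: sum(values) for key, values in groups.items()})
        # {'a': 3, 'b': 2, 'c': 3}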

Following are the hotspots (the points where the application plugs in its own logic) that play the key role in MapReduce:

Input Reader: It fragments the consolidated input into splits of appropriate size (typically 16-128 MB), and each map function is assigned a single split. The reader pulls the data from stable storage, which usually resides on a distributed file system, and generates output in the form of key/value pairs.
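
A simplified reader might split a file into fixed-size byte ranges, as sketched below; note that a real reader also aligns splits on record boundaries, which this sketch ignores:

    def input_reader(path, split_size=64 * 1024 * 1024):
        # Cut the input into 64 MB splits (within the 16-128 MB range
        # above) and emit (offset, chunk) key/value pairs for the maps.
        with open(path, "rb") as f:
            offset = 0
            while chunk := f.read(split_size):
                yield (offset, chunk)
                offset += len(chunk)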

Map function: The key/value pairs generated by the input reader are fed into the map function, which processes each pair and produces zero or more key/value pairs as output. The types of the input and output pairs often differ from each other.
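
For instance, a hypothetical mapper that builds an inverted index takes (document id, text) pairs in and emits (word, document id) pairs out, so the input and output types differ:

    def map_fn(doc_id, text):
        # In: (int doc id, str contents). Out: (str word, int doc id).
        for word in set(text.split()):
            yield (word, doc_id)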

Partition function: The map function's output is allotted to a specific reducer by the partition function, for the purpose of sharding. The partition function is given the key and the number of reducers, and it returns the index of the desired reducer.
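
The usual choice is a hash of the key modulo the number of reducers; a stable hash such as CRC32 is sketched here because Python's built-in hash() is randomized between runs:

    import zlib

    def partition_fn(key, num_reducers):
        # Every pair with the same key hashes to the same shard index,
        # so one reducer receives all of that key's values.
        return zlib.crc32(key.encode()) % num_reducers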

Comparison function: Each reduce function's input is pulled from the machines where the map tasks ran, then sorted using the application's comparison function so that all values belonging to one key become contiguous.

Reduce function: It is called once for each unique key, in sorted order, and is given the key together with every value associated with it; it merges these values and emits zero or more outputs.
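
A sketch of how these two steps fit together on one reducer's in-memory shard, continuing the hypothetical Python sketches above:

    from itertools import groupby
    from operator import itemgetter

    def sort_and_reduce(pairs, reduce_fn):
        # Comparison step: sort by key so equal keys become contiguous.
        pairs.sort(key=itemgetter(0))
        # Reduce step: one call per unique key, with all of its values.
        for key, group in groupby(pairs, key=itemgetter(0)):
            yield from reduce_fn(key, [value for _, value in group])

    list(sort_and_reduce([("b", 1), ("a", 2), ("b", 3)],
                         lambda key, values: [(key, sum(values))]))
    # [('a', 2), ('b', 4)]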

Output writer: It writes the output of the reduce function to stable storage.
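
To tie the hotspots together, here is a self-contained, sequential, in-memory driver; a real framework distributes every one of these steps across machines, and all names here are illustrative:

    import zlib
    from itertools import groupby
    from operator import itemgetter

    def run_job(inputs, map_fn, reduce_fn, partition_fn, num_reducers, writer):
        # Map + partition: route each intermediate pair to its shard.
        shards = [[] for _ in range(num_reducers)]
        for key, value in inputs:
            for k, v in map_fn(key, value):
                shards[partition_fn(k, num_reducers)].append((k, v))
        # Comparison + reduce + output, one shard at a time.
        for shard in shards:
            shard.sort(key=itemgetter(0))
            for k, group in groupby(shard, key=itemgetter(0)):
                for result in reduce_fn(k, [v for _, v in group]):
                    writer(result)

    # Word count over two in-memory "documents".
    run_job(
        inputs=[(0, "the quick fox"), (1, "the lazy dog")],
        map_fn=lambda _, text: [(w, 1) for w in text.split()],
        reduce_fn=lambda w, counts: [(w, sum(counts))],
        partition_fn=lambda k, n: zlib.crc32(k.encode()) % n,
        num_reducers=2,
        writer=print,
    )

Each keyword argument above corresponds to one of the hotspots, which is the sense in which MapReduce is a framework: the skeleton stays fixed, and only these plugged-in functions vary from job to job.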

