Practical - 1
To understand the overall programming architecture using the MapReduce API.
Hadoop MapReduce is the heart of the Hadoop system. It provides all the capabilities you need to break big data into manageable chunks, process the data in parallel on your distributed cluster, and then make the data available for user consumption or additional processing.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
· The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
· The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples. The Reduce task is always performed after the Map task.
Figure 1: MapReduce Architecture
The MapReduce architecture consists of two main processing stages: the map stage and the reduce stage. The actual MR processing happens in the task trackers. Between the map and reduce stages, an intermediate process takes place that shuffles and sorts the mapper output data. This intermediate data is stored on the local file system.
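To make this flow concrete, here is a minimal, framework-free sketch in plain Java that mimics the map, shuffle/sort, and reduce steps on a few hard-coded lines of text. The sample lines and the class name are invented for illustration; a real job uses the Hadoop classes described in the rest of this practical.

// A framework-free sketch of the map -> shuffle/sort -> reduce flow.
// The input lines are hard-coded; a real job would read them from HDFS.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniWordCount {
    public static void main(String[] args) {
        List<String> lines = List.of("deer bear river", "car car river", "deer car bear");

        // Map stage: emit a (word, 1) tuple for every word in every line
        List<Map.Entry<String, Integer>> tuples = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                tuples.add(Map.entry(word, 1));

        // Intermediate stage + reduce: group tuples by key (TreeMap keeps the
        // keys sorted, mirroring the sort of the mapper output) and sum
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> t : tuples)
            counts.merge(t.getKey(), t.getValue(), Integer::sum);

        System.out.println(counts); // {bear=2, car=3, deer=2, river=2}
    }
}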
Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
· During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
· The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
· Most of the computing takes place on nodes with data on local disks, which reduces network traffic.
· After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
MapReduce API
Here are the classes and their methods that are involved in the operations of MapReduce programming. We will primarily focus on the following −
- JobContext Interface
- Job Class
- Mapper Class
- Reducer Class
JobContext Interface
· The JobContext interface is the super-interface for all the classes that define different jobs in MapReduce. It gives you a read-only view of the job that is provided to the tasks while they are running.
· The following are the sub-interfaces of the JobContext interface.
1. MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
   Defines the context that is given to the Mapper.
2. ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
   Defines the context that is passed to the Reducer.
· The Job class is the main class that implements the JobContext interface.
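As a quick illustration of that read-only view, the sketch below shows a Mapper reading the job name and a user-defined flag from its context during setup(). The class name and the property name practical.case.sensitive are invented for this example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ContextAwareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private boolean caseSensitive;

    @Override
    protected void setup(Context context) {
        // Mapper.Context implements MapContext, a sub-interface of JobContext,
        // so the task can read (but not modify) the job's settings here.
        System.out.println("Running job: " + context.getJobName());
        caseSensitive = context.getConfiguration()
                               .getBoolean("practical.case.sensitive", false);
    }
}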
Job Class
The Job class is the most important class in the MapReduce API. It allows the user to configure the job, submit it, control its execution, and query the state. The set methods only work until the job is submitted; afterwards they will throw an IllegalStateException.
Normally, the user creates the application, describes the various facets of the job, and then submits the job and monitors its progress.
Constructors
The following is a summary of the Job class constructors.
1. Job()
2. Job(Configuration conf)
3. Job(Configuration conf, String jobName)
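Note that in recent Hadoop releases these constructors are deprecated; the usual way to obtain a Job is the static factory Job.getInstance, which mirrors the three forms above. A minimal sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobCreation {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job plain = Job.getInstance();               // corresponds to Job()
        Job withConf = Job.getInstance(conf);        // corresponds to Job(Configuration)
        Job named = Job.getInstance(conf, "my job"); // corresponds to Job(Configuration, String)

        System.out.println(named.getJobName());      // prints "my job"
    }
}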
Methods
Some of the important methods of the Job class are as follows −
1. getJobName() - Returns the user-specified job name.
2. getJobState() - Returns the current state of the Job.
3. isComplete() - Checks whether the job is finished or not.
4. setInputFormatClass() - Sets the InputFormat for the job.
5. setJobName(String name) - Sets the user-specified job name.
6. setOutputFormatClass() - Sets the OutputFormat for the job.
7. setMapperClass(Class) - Sets the Mapper for the job.
8. setReducerClass(Class) - Sets the Reducer for the job.
9. setPartitionerClass(Class) - Sets the Partitioner for the job.
10. setCombinerClass(Class) - Sets the Combiner for the job.
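The sketch below wires several of these methods together into a minimal word-count driver. TokenizerMapper and IntSumReducer are the example classes sketched in the Mapper and Reducer sections that follow, and the input and output paths are assumed to come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count"); // also sets the job name
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);     // setMapperClass(Class)
        job.setCombinerClass(IntSumReducer.class);     // setCombinerClass(Class)
        job.setReducerClass(IntSumReducer.class);      // setReducerClass(Class)

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job, waits until it is complete, and reports progress
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}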
Mapper Class
The Mapper class defines the Map job. It maps input key-value pairs to a set of intermediate key-value pairs. Maps are the individual tasks that transform the input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
Method
map is the most prominent method of the Mapper class. The syntax is defined below −
map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)
This method is called once for each key-value pair in the input split.
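Here is a sketch of a word-count Mapper that overrides map(). With the default TextInputFormat, the key is the byte offset of the line and the value is the line itself; the class name is our own.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once per input line; one input pair may emit many output pairs
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emits the intermediate (word, 1) pair
        }
    }
}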
Reducer Class
The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values that share a key to a smaller set of values. Reducer implementations can access the Configuration for a job via the JobContext.getConfiguration() method. A Reducer has three primary phases − Shuffle, Sort, and Reduce.
· Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the network.
· Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e., while outputs are being fetched, they are merged.
· Reduce − In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.
Method
reduce is the most prominent method of the Reducer class. The syntax is defined below −
reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)
This method is called once for each key, with the collection of values that share that key.
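The matching word-count Reducer might look like the sketch below: reduce() receives each word together with all the 1s emitted for it and writes out the sum. As above, the class name is our own.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with every value that shares that key
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // e.g. ("river", 2)
    }
}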