Study of HDFS

Practical - 10
Study of HDFS.





Introduction
The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for managing pools of big data and supporting big data analytics applications. HDFS holds very large amounts of data and provides easy access to it. To store such huge data, files are split and stored across multiple machines. The pieces are stored redundantly so that the system can recover from possible data loss in case of failure. HDFS also makes data available to applications for parallel processing.


Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the namenode and datanode help users easily check the status of the cluster (see the example after this list).
It provides streaming access to file system data.
HDFS provides file permissions and authentication.
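
As an illustration of checking cluster status, the command below prints a report of the namenode and the datanodes it knows about (in Hadoop 2.x the same report is available as hdfs dfsadmin -report); the namenode also exposes a web UI, typically on port 50070 in Hadoop 1.x/2.x. This is a sketch assuming a running cluster:

$ $HADOOP_HOME/bin/hadoop dfsadmin -report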


Goals of HDFS
Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.


Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications that have huge datasets.


Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.


HDFS Architecture
The architecture of the Hadoop File System is outlined below. HDFS follows a master-slave architecture and has the following elements.




Namenode
The namenode is commodity hardware that runs the GNU/Linux operating system and the namenode software; the software itself can run on ordinary commodity hardware. The system hosting the namenode acts as the master server and performs the following tasks:
1. Manages the file system namespace.
2. Regulates clients' access to files.
3. Executes file system operations such as renaming, closing, and opening files and directories.


Datanode
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.
1. Datanodes perform read-write operations on the file system, as per client requests.
2. They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.


Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later), but it can be changed as needed in the HDFS configuration.
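
To see how a particular file has been split into blocks and where its replicas are placed, one option is the fsck tool (invoked as hdfs fsck on Hadoop 2.x). The path below is just the example file used later in this practical:

$ $HADOOP_HOME/bin/hadoop fsck /user/input/file.txt -files -blocks -locations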


Working of the Hadoop MapReduce Architecture


The execution of a MapReduce job begins when the client submits the job configuration to the Job Tracker, specifying the map, combine and reduce functions along with the locations of the input and output data. On receiving the job configuration, the Job Tracker determines the number of splits from the input path and selects Task Trackers based on their network proximity to the data sources. The Job Tracker then sends a request to the selected Task Trackers.


The Map phase begins when the Task Tracker extracts the input data from its split. The map function is invoked for each record parsed by the InputFormat and produces key-value pairs in a memory buffer. The buffered pairs are sorted and partitioned by reducer, and the combine function, if one is specified, is invoked to pre-aggregate them. On completion of the map task, the Task Tracker notifies the Job Tracker. When all map tasks are done, the Job Tracker notifies the selected Task Trackers to begin the reduce phase. Each Task Tracker reads the region files and sorts the key-value pairs for each key. The reduce function is then invoked, and it writes the aggregated values to the output file.
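
Hadoop ships with example MapReduce jobs that let you observe this flow end to end. The sketch below runs the bundled word-count example over the /user/input directory created later in this practical; the exact name and location of the examples jar varies between Hadoop versions, so treat the path as an assumption:

$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /user/input /user/output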


Starting HDFS
Initially, you have to format the configured HDFS file system. Open the namenode (HDFS server) and execute the following command.


$ hadoop namenode -format

After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the datanodes as a cluster.


$ start-dfs.sh
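
You can confirm that the daemons are running with the JDK's jps tool; on a single-node setup it should list at least the NameNode, DataNode and SecondaryNameNode processes.

$ jps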


Listing Files in HDFS


$ $HADOOP_HOME/bin/hadoop fs -ls <args>
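
For example, to list the contents of the HDFS root directory:

$ $HADOOP_HOME/bin/hadoop fs -ls /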


Inserting Data into HDFS
Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.


Step 1
You have to create an input directory.


$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input


Step 2
Transfer and store a data file from the local system to the Hadoop file system using the put command.


$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
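
The -copyFromLocal option can be used as an alternative to -put for copying from the local file system:

$ $HADOOP_HOME/bin/hadoop fs -copyFromLocal /home/file.txt /user/input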


Step 3
You can verify the file using the ls command.


$ $HADOOP_HOME/bin/hadoop fs -ls /user/input


Retrieving Data from HDFS


Step 1
Initially, view the data from HDFS using the cat command.


$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile


Step 2
Get the file from HDFS to the local file system using the get command.


$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
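
Similarly, -copyToLocal can be used in place of -get:

$ $HADOOP_HOME/bin/hadoop fs -copyToLocal /user/output/ /home/hadoop_tp/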


Shutting Down the HDFS
You can shut down HDFS by using the following command.


$ stop-dfs.sh
