What is Apache Hadoop? How to install and set up Hadoop on CentOS?

Hadoop History

Apache Hadoop is an open-source software framework for distributed storage and distributed, parallel processing of large datasets. It runs on commodity hardware, with a group of machines forming a cluster. The Hadoop journey started in the early 2000s, when Doug Cutting created a search library called Lucene and then, with Mike Cafarella, built a scalable search engine called Nutch. In 2004, the Nutch Distributed File System (NDFS) was built, along with an implementation of the MapReduce framework. In February 2006, Cutting pulled NDFS and MapReduce out of the Nutch code base and created a new incubating project under the Lucene umbrella, which he named Hadoop. Apache Hadoop consists of two logical layers: a storage layer and a processing layer.

Features of Hadoop

  • Scalable – grows from a single machine to thousands of machines
  • Fault tolerant – data is replicated across the cluster, so work survives machine failures
  • Open source – a community effort governed under the licensing of the Apache Software Foundation
  • Distributed storage and large-scale processing – large datasets are automatically split into blocks and distributed across the cluster machines
  • Flexible – stores data in a wide variety of formats and types

Apache Hadoop Base framework

  • Hadoop Common – the libraries and utilities needed by the other Hadoop modules
  • Hadoop Distributed File System (HDFS) – a distributed file system that stores the data
  • Hadoop MapReduce – a programming model for large-scale, distributed processing
  • Hadoop YARN – a generic processing framework with a ResourceManager to manage cluster resources and schedule jobs (user applications) in a distributed environment

Setting up Hadoop 3 on Single Node in CentOS

  • Create a new ‘hadoop’ user in CentOS.
  • Download the ‘hadoop-3.0.3.tar.gz’ file from a Hadoop mirror and copy it to the hadoop user’s home directory (‘/home/hadoop/’).
wget http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz
  • Extract the tar file in the home folder, which creates a hadoop-3.0.3 folder.
tar -xvzf hadoop-3.0.3.tar.gz
  • Create a ‘hadoopdata’ folder inside the Hadoop home folder: ‘/home/hadoop/hadoop-3.0.3/hadoopdata’
mkdir -p /home/hadoop/hadoop-3.0.3/hadoopdata
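HDFS needs local directories for its metadata (NameNode) and block storage (DataNode). The namenode/datanode subfolder split below is a common convention, not something the guide specifies; whatever layout you pick must match the directory properties later configured in ‘hdfs-site.xml’:

```shell
# create local directories for HDFS metadata and block storage.
# the guide assumes the hadoop user's home (/home/hadoop); $HOME is used
# here so the snippet works for whichever user runs it.
HADOOP_DATA="$HOME/hadoop-3.0.3/hadoopdata"
mkdir -p "$HADOOP_DATA/namenode" "$HADOOP_DATA/datanode"
ls "$HADOOP_DATA"
```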
  • Install Java (OpenJDK) on CentOS using the yum package manager
sudo yum install java-1.8.0-openjdk-devel
  • Open the ‘~/.bashrc’ file, add the following lines at the end, save, and reload the file so the changes take effect in the current shell
vi ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export HADOOP_HOME=/home/hadoop/hadoop-3.0.3
export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
source ~/.bashrc
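A quick sanity check that the new variables resolve. The exports are repeated here so the snippet runs standalone, and the paths are the ones this guide assumes; adjust them if your install differs:

```shell
# same exports as in ~/.bashrc, repeated so this snippet is self-contained
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export HADOOP_HOME=/home/hadoop/hadoop-3.0.3
export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH

# HADOOP_HOME/bin should now be on PATH; once Java and Hadoop are in
# place, 'which java' and 'hadoop version' are the usual follow-up checks
echo "$PATH" | grep -q "$HADOOP_HOME/bin" && echo "HADOOP_HOME on PATH"
# prints: HADOOP_HOME on PATH
```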
  • Enter the commands below in the terminal to set up a passwordless SSH connection to localhost (recent OpenSSH releases disable DSA keys by default, so an RSA key is used here)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost

  • Update the following files in the ‘$HADOOP_HOME/etc/hadoop’ folder with the configuration appropriate for a single-node setup.

'core-site.xml'
'hdfs-site.xml'
'mapred-site.xml'
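The guide does not reproduce the file contents, so below is a minimal single-node (pseudo-distributed) configuration commonly used with Hadoop 3. The port number and directory paths are assumptions matching the layout used earlier in this guide; adjust them to your install.

‘core-site.xml’:

```xml
<configuration>
  <!-- default filesystem URI; clients and daemons connect to HDFS here -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

‘hdfs-site.xml’:

```xml
<configuration>
  <!-- single node: no other machines to replicate blocks to -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- local directories for NameNode metadata and DataNode blocks -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoop-3.0.3/hadoopdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoop-3.0.3/hadoopdata/datanode</value>
  </property>
</configuration>
```

‘mapred-site.xml’:

```xml
<configuration>
  <!-- run MapReduce jobs on YARN rather than in local mode -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

If MapReduce jobs will run on YARN, ‘yarn-site.xml’ typically also needs ‘yarn.nodemanager.aux-services’ set to ‘mapreduce_shuffle’, and ‘hadoop-env.sh’ usually needs JAVA_HOME set explicitly.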
  • Format the ‘namenode’ using the command below (in Hadoop 3, ‘hdfs namenode -format’ replaces the deprecated ‘hadoop namenode -format’)
hdfs namenode -format
  • Start the Hadoop daemons using the following command (the start/stop scripts live in ‘$HADOOP_HOME/sbin’; note that in Hadoop 3 ‘start-all.sh’ is deprecated in favor of running ‘start-dfs.sh’ and ‘start-yarn.sh’ separately)
start-all.sh
  • Stop Hadoop using the following command
stop-all.sh