What is Apache Hadoop? How to install and set up Hadoop on CentOS?

Hadoop History

Apache Hadoop is an open-source software framework for distributed storage and distributed, parallel processing of large datasets. It runs on commodity hardware, with a group of machines forming a cluster. The Hadoop journey started in the early 2000s, when Doug Cutting created a search library called Lucene and then, with Mike Cafarella, built a scalable search engine called Nutch. In 2004, the Nutch Distributed File System (NDFS) was built, along with an implementation of the MapReduce framework. In February 2006, Cutting pulled NDFS and MapReduce out of the Nutch code base and created a new incubating project under the Lucene umbrella, which he named Hadoop. Apache Hadoop consists of two logical layers: a storage layer and a processing layer.

Features of Hadoop

  • Scalable – grows from a single machine to thousands of machines
  • Fault tolerant – data is replicated across the cluster, so work survives machine failures
  • Open source – a community effort governed under the licensing of the Apache Software Foundation
  • Distributed storage and large-scale processing – large datasets are automatically split into blocks and distributed across the cluster machines
  • Flexible – stores data in a wide variety of formats and types

Apache Hadoop Base framework

  • Hadoop Common – the libraries and utilities needed by the other Hadoop modules
  • Hadoop Distributed File System (HDFS) – a distributed file system that stores the data
  • Hadoop MapReduce – a programming model for large-scale, distributed processing
  • Hadoop YARN – a generic processing framework with a ResourceManager to manage cluster resources and schedule jobs (user applications) in a distributed environment

Setting up Hadoop 3 on Single Node in CentOS

  • Create a new ‘hadoop’ user in CentOS.
  • Download the ‘hadoop-3.0.3.tar.gz’ file from a Hadoop mirror and copy it to the hadoop user’s home directory (‘/home/hadoop/’).
wget http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz
  • Extract the tar file in the home folder, which creates a hadoop-3.0.3 folder.
tar -xvzf hadoop-3.0.3.tar.gz
  • Create a ‘hadoopdata’ folder inside the Hadoop home folder: ‘/home/hadoop/hadoop-3.0.3/hadoopdata’
mkdir -p /home/hadoop/hadoop-3.0.3/hadoopdata
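HDFS needs local directories for its metadata (NameNode) and block storage (DataNode). The namenode/datanode subfolder split below is a common convention, not something the guide specifies; whatever layout you pick must match the directory properties later configured in ‘hdfs-site.xml’:

```shell
# create local directories for HDFS metadata and block storage.
# the guide assumes the hadoop user's home (/home/hadoop); $HOME is used
# here so the snippet works for whichever user runs it.
HADOOP_DATA="$HOME/hadoop-3.0.3/hadoopdata"
mkdir -p "$HADOOP_DATA/namenode" "$HADOOP_DATA/datanode"
ls "$HADOOP_DATA"
```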
  • Install Java (OpenJDK) on CentOS using the yum package manager
sudo yum install java-1.8.0-openjdk-devel
  • Open the ‘~/.bashrc’ file, add the following lines at the end, save, and reload the file so the changes take effect in the current shell
vi ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export HADOOP_HOME=/home/hadoop/hadoop-3.0.3
export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
source ~/.bashrc
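A quick sanity check that the new variables resolve. The exports are repeated here so the snippet runs standalone, and the paths are the ones this guide assumes; adjust them if your install differs:

```shell
# same exports as in ~/.bashrc, repeated so this snippet is self-contained
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export HADOOP_HOME=/home/hadoop/hadoop-3.0.3
export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH

# HADOOP_HOME/bin should now be on PATH; once Java and Hadoop are in
# place, 'which java' and 'hadoop version' are the usual follow-up checks
echo "$PATH" | grep -q "$HADOOP_HOME/bin" && echo "HADOOP_HOME on PATH"
# prints: HADOOP_HOME on PATH
```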
  • Enter the commands below in the terminal to set up a passwordless SSH connection to localhost (recent OpenSSH releases disable DSA keys by default, so an RSA key is used here)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost

  • Update the following files in the ‘$HADOOP_HOME/etc/hadoop’ folder with the configuration appropriate for a single-node setup.

'core-site.xml'
'hdfs-site.xml'
'mapred-site.xml'
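The guide does not reproduce the file contents, so below is a minimal single-node (pseudo-distributed) configuration commonly used with Hadoop 3. The port number and directory paths are assumptions matching the layout used earlier in this guide; adjust them to your install.

‘core-site.xml’:

```xml
<configuration>
  <!-- default filesystem URI; clients and daemons connect to HDFS here -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

‘hdfs-site.xml’:

```xml
<configuration>
  <!-- single node: no other machines to replicate blocks to -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- local directories for NameNode metadata and DataNode blocks -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoop-3.0.3/hadoopdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoop-3.0.3/hadoopdata/datanode</value>
  </property>
</configuration>
```

‘mapred-site.xml’:

```xml
<configuration>
  <!-- run MapReduce jobs on YARN rather than in local mode -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

If MapReduce jobs will run on YARN, ‘yarn-site.xml’ typically also needs ‘yarn.nodemanager.aux-services’ set to ‘mapreduce_shuffle’, and ‘hadoop-env.sh’ usually needs JAVA_HOME set explicitly.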
  • Format the ‘namenode’ using the command below (in Hadoop 3, ‘hdfs namenode -format’ replaces the deprecated ‘hadoop namenode -format’)
hdfs namenode -format
  • Start the Hadoop daemons using the following command (the start/stop scripts live in ‘$HADOOP_HOME/sbin’; note that in Hadoop 3 ‘start-all.sh’ is deprecated in favor of running ‘start-dfs.sh’ and ‘start-yarn.sh’ separately)
start-all.sh
  • Stop Hadoop using the following command
stop-all.sh