Hadoop 2.2 Single Node Installation on CentOS 6.5


Introduction

This HOWTO covers installing Hadoop 2.2 on CentOS 6.5. My series of tutorials is meant as just that: tutorials. The intent is to allow the user to gain familiarity with the application; this document should not be construed as any type of best-practices guide for a production environment, and as such, performance, reliability and security considerations are compromised. The tutorials are freely available and may be distributed with the proper acknowledgements. Actual screenshots of the commands are used to eliminate any possibility of typographical errors; in addition, long sequences of text are placed in front of the screenshots to facilitate copy and paste. Command text is printed in Courier font. In general the document covers only the bare minimum needed to get a single node cluster up and running, with the emphasis on HOW rather than WHY. For more in-depth information the reader should consult the many excellent publications on Hadoop, such as Tom White's Hadoop: The Definitive Guide, 3rd edition and Eric Sammer's Hadoop Operations, along with the Apache Hadoop website.

Please consult www.alan-johnson.net for an online version of this document.

Prerequisites

  • CentOS 6.5 installed

Machine configuration

In this HOWTO a physical machine was used, but for educational purposes VMware Workstation or VirtualBox (https://www.virtualbox.org/) would work just as well. The screenshot below shows acceptable VM settings for VMware Workstation.

Note that an additional network adapter and physical drive have been added. The memory allocation is 2 GB, which is sufficient for this tutorial.

User configuration

If installing CentOS from scratch, create a user <hadoopuser> at installation time; otherwise the user can be added with the commands below. In addition, create a group called <hadoopgroup>.

Note the initial configuration is done as user root.
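The exact commands appear as a screenshot in the original; a minimal sketch, assuming the user and group names used throughout this tutorial, is:

useradd hadoopuser

passwd hadoopuser

groupadd hadoopgroup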

Now make hadoopuser a member of hadoopgroup.

usermod -g hadoopgroup hadoopuser

Verify by issuing the id command.

id hadoopuser

The next step is to give hadoopuser access to sudo commands. Do this by executing the visudo command and adding the highlighted line shown below.
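The highlighted line itself is in a screenshot; a typical sudoers entry granting hadoopuser full sudo rights (this exact form is an assumption, mirroring the default root entry) is:

hadoopuser      ALL=(ALL)       ALL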

Reboot and now log in as user hadoopuser.

Setting up ssh

Set up ssh for password-less authentication using keys.

ssh-keygen -t rsa -P ''

Next change file ownership and mode.

sudo chown hadoopuser ~/.ssh

sudo chmod 700 ~/.ssh

sudo chmod 600 ~/.ssh/id_rsa

Then append the public key to the authorized_keys file.

sudo cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Change permissions.

sudo chmod 600 ~/.ssh/authorized_keys

Edit /etc/ssh/sshd_config

Set PasswordAuthentication to no and allow empty passwords, as shown in the extract of the file below.
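The file extract is a screenshot in the original; assuming those are the two settings referred to, the relevant lines of /etc/ssh/sshd_config would read:

PasswordAuthentication no
PermitEmptyPasswords yes

Restart the sshd service (service sshd restart) so the changes take effect.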

Verify that login can be accomplished without requiring a password.
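For example, the following should now log in without prompting for a password:

ssh localhost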

Installing and configuring java

It is recommended to install the full openJDK package to take advantage of some of the Java tools.

Installing openJDK

yum install java-1.7.0-openjdk*

After the installation, verify the Java version.

java -version

The folder /etc/alternatives contains a link to the Java installation; perform a long listing of the file to show the link and use it as the location for JAVA_HOME.

Set the JAVA_HOME environment variable by editing ~/.bashrc.
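A sketch of these two steps (the exact link target depends on the installed OpenJDK build; the path below matches the one used later in hadoop-env.sh):

ls -l /etc/alternatives/java

echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/' >> ~/.bashrc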

Installing Hadoop

Downloading Hadoop

From the Hadoop releases page http://hadoop.apache.org/releases.html, download hadoop-2.2.0.tar.gz from one of the mirror sites.

Next untar the file

tar xzvf hadoop-2.2.0.tar.gz

Move the untarred folder

sudo mv hadoop-2.2.0 /usr/local/hadoop

Change the ownership.

sudo chown -R hadoopuser:hadoopgroup /usr/local/hadoop

Next create namenode and datanode folders

mkdir -p ~/hadoopspace/hdfs/namenode

mkdir -p ~/hadoopspace/hdfs/datanode

Configuring Hadoop

Next edit ~/.bashrc to set up the environment variables for Hadoop:

# User specific aliases and functions

export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export PATH=$PATH:$HADOOP_INSTALL/sbin
export PATH=$PATH:$HADOOP_INSTALL/bin

Now apply the variables.
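For example, re-source the profile so the new variables take effect in the current shell:

source ~/.bashrc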

There are a number of configuration files within the Hadoop folder that require editing:

  • mapred-site.xml
  • yarn-site.xml
  • core-site.xml
  • hdfs-site.xml
  • hadoop-env.sh

The files can be found in /usr/local/hadoop/etc/hadoop/. First copy the mapred-site template file over and then edit it.
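A sketch of the copy step, run from the configuration folder:

cd /usr/local/hadoop/etc/hadoop

cp mapred-site.xml.template mapred-site.xml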

mapred-site.xml

Add the following text between the configuration tags.

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

yarn-site.xml

Add the following text between the configuration tags.

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

core-site.xml

Add the following text between the configuration tags.

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

hdfs-site.xml

Add the following text between the configuration tags.

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>file:///home/hadoopuser/hadoopspace/hdfs/namenode</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>file:///home/hadoopuser/hadoopspace/hdfs/datanode</value>
</property>

Note that other locations can be used in HDFS by separating the values with commas, e.g.

file:///home/hadoopuser/hadoopspace/hdfs/datanode, file:///disk2/Hadoop/datanode, . . .

hadoop-env.sh

Add an entry for JAVA_HOME

export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/

Next format the namenode.
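The command itself appears as a screenshot in the original; it is the standard HDFS format command:

hdfs namenode -format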

. . .

Issue the following commands.

start-dfs.sh
start-yarn.sh

Issue the jps command and verify that the following processes are running:
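The screenshot is not reproduced here; on a single-node setup jps should list something like the following (process IDs omitted, and ordering will differ):

jps

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps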

At this point Hadoop has been installed and configured.

Testing the installation

A number of test programs are bundled with Hadoop that can be used to benchmark it. Entering the command below without any arguments will list the available tests.
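The command referred to is the bundled jobclient test jar invoked with no arguments:

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar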

The TestDFSIO test below can be used to measure read and write performance: first write (create) the files, then read them back:

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 100

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 100

The results are logged in TestDFSIO_results.log, which shows the throughput rates:

During the test run a message will be printed with a tracking URL such as the one shown below:

The link can be selected or the address can be pasted into a browser.

Another test is mrbench, which is a MapReduce benchmark.

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar mrbench -maps 100

Finally the test below is used to calculate pi. The first parameter refers to the number of maps and the second is the number of samples for each map.

hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 10 20

. . .

Note accuracy can be improved by increasing the value of the second parameter.

Working from the command line

Invoking a command without any parameters, or with insufficient parameters, will generally print out help text.

hdfs commands

hdfs dfsadmin -help

. . .

hadoop commands

hadoop version

Web Access

The NameNode status can be checked at http://localhost:50070/. This web page contains status information relating to the cluster.

There are also links for browsing the filesystem.

Logs can also be examined from the NameNode Logs link.

. . .

The secondary namenode can be accessed on port 50090.

Online documentation

Comprehensive documentation can be found at the Apache website, or locally by pointing a browser at $HADOOP_INSTALL/share/doc/Hadoop/index.html.

Feedback, corrections and suggestions are welcome, as are suggestions for further HOWTOs.

72 thoughts on “Hadoop 2.2 Single Node Installation on CentOS 6.5”

  1. Got one issue (0 datanode is been selected) while doing the setup with the above step….

    As per the above steps, the namenode and datanode are defined in separate folders, whereas the datanode should be inside the namenode folder.

    I performed below steps to fix that issue

    1) mkdir -p ~/hadoopspace/hdfs/namenode/datanode

    2) Changed the dfs.data.dir property in hdfs-site.xml to:

    <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoopuser/hadoopspace/hdfs/namenode/datanode</value>
    </property>

    • HI alanxelsys,
      Thanks for this tutorial.
      Can you please provide hadoop administration examples / exercise to learn hadoop administration.

      -Zaheer

      • Hi Zaheer,
        In general – I can’t really answer questions after publishing this as I no longer have access to the original implementation. I am not a Hadoop admin or expert, my hope though, is that other users will contribute to this forum and answer questions. Typically my writings are a log of my experiences which I put on line and share for free with other interested parties – Thx

  2. Hi .. please explain the procedure of MapR installation in centos 6.5.
    hadoop installed as per the above procedure .

    Thank you

  3. Thanks, u saved me many hours and made the installation smooth.
    I really happy that people like u give from their time to help others.

  4. When I try to run the example in the tutorial:

    hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.2.0-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 100

    I get this:

    14/04/20 21:07:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1398001762075_0006
    14/04/20 21:08:02 INFO impl.YarnClientImpl: Submitted application application_1398001762075_0006 to ResourceManager at /127.0.0.1:8032
    14/04/20 21:08:02 INFO mapreduce.Job: The url to track the job: http://Yasin:8088/proxy/application_1398001762075_0006/
    14/04/20 21:08:02 INFO mapreduce.Job: Running job: job_1398001762075_0006

    Then nothing happens, and the link for tracking the job doesn’t load.
    Could you please help me? I have been searching in vain for a solution the whole day.

      • But the terminal shows no completion messages, or any messages at all, after this one:
        14/04/20 21:08:02 INFO mapreduce.Job: Running job: job_1398001762075_0006

        And when I kill the job with: hadoop job -kill [job_Id], I get the following info:
        INFO mapreduce.Job: map 0% reduce 0%

        So I’m guessing it never even started, even though I sometimes leave it for many minutes.

        P.S Thanks for replying so quickly, you are faster than SO 🙂

          • Hi Yasin, can you try a top command or a ps -eaf to see if it is running? It does look like it is stuck though. Also verify with jps that everything is running. I can try to respond quickly on the weekend, but unfortunately I have to go back to my day job during the week, so I may not be as fast (since this is really only a hobby for me).
            Also, a lot of people are using this site so they may have suggestions as well – Good luck!

          • The top command showed some java processes running, but when I pressed i, to hide all zombie processes they disappear. This time there was something new though. The job failed with this:
            http://pastebin.com/KHPWPNbQ

            The jps command doesn’t work, command not found, but using ps aux | grep java instead, the output is this:(Which I can see in that the processes are running, but I’m not sure exactly)
            http://pastebin.com/8zQ1KEug

            Also, when attempting to run start-dfs.sh and start-yarn.sh, it tells me that the processes are running and I should stop them first.

              • Try recreating the namenode and datanode dirs and reformatting the namenode. Also, jps should work; this might point to a Java configuration issue. I would try to get the jps listing working first to show the running processes.

              • Final update. I did as suggested, but there was no change. JPS not working was due to my having a different Java installation than the one you suggested, but I think that was irrelevant, because it still didn’t work after I installed it and JPS was working.
                Since I had an assignment to be delivered, and the VM crashed as well, I installed 1.2, and got it working. Thanks for your help, it was much appreciated.

  5. Very nice directions!!! First one I found for Hadoop and CentOS that is so complete. I have one problem: when I run jps I have no Datanodes that are started. Everything else I have done in the exact order shown and all of it has “worked”. Any ideas of where I should check?

    Thanks again.

    Dennis

    • Hi Dennis, can you try re-creating namenode and datanode and reformatting to see if that helps; unfortunately when things go wrong there are often no or misleading error messages. Assume you tried rebooting etc. Also have a look at the env variables to see if this helps.

  6. AJ. Got it. Nothing to do with your directions; they are solid. I was getting a java.net.UnknownHostException when formatting the name node. It did not recognize my machine’s name “vostro400”. I had to edit the /etc/hosts file with the following line:
    127.0.0.1 locahost.localdomain localhost vostro400 vostro400.localdomain
    Then everything came up as expected.
    Thanks again for the good work.
    Dennis

  7. I followed the same steps as given, but when i am trying to format the namenode its saying hdfs command not found. please advice.

  8. Stuck on the hdfs namenode -format
    hdfs command not found

    Any solution that anybody might have for this?

  9. I see where I need to make an adjustment, in your tutorial you have mention.
    export HADOOP_INSTALL=/usr/local/hadoop

    But I have a sub folder in the hadoop folder called hadoop-2.4.0, which I need to have mention as
    export HADOOP_INSTALL=/usr/local/hadoop/hadoop-2.4.0

    Is this right, please let me know.

    • Yes, you need to change any reference that I make to my folder to be myfolder/yoursubfolder where myfolder is hadoop and your subfolder is hadoop-2.4.0
      So as you say /usr/local/hadoop will become /usr/local/hadoop/hadoop-2.4.0. You will need to check for any other references to this as well. One final point I used V2.2 rather than 2.4. Hopefully there should be no issues but I have not tried it with V2.4.
      Thx AJ

  10. I am getting this error what to do

    [hadoop@prasad namenode]$ start-dfs.sh
    14/07/04 17:44:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    14/07/04 17:44:28 WARN conf.Configuration: bad conf file: element not
    14/07/04 17:44:28 WARN conf.Configuration: bad conf file: element not
    14/07/04 17:44:28 WARN conf.Configuration: bad conf file: element not
    14/07/04 17:44:28 WARN conf.Configuration: bad conf file: element not
    Starting namenodes on [prasad]
    prasad: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-namenode-prasad.out
    localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hadoop-datanode-prasad.out
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-secondarynamenode-prasad.out
    14/07/04 17:45:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    14/07/04 17:45:32 WARN conf.Configuration: bad conf file: element not
    14/07/04 17:45:32 WARN conf.Configuration: bad conf file: element not
    14/07/04 17:45:32 WARN conf.Configuration: bad conf file: element not
    14/07/04 17:45:32 WARN conf.Configuration: bad conf file: element not

    • I would check all the configuration files above for errors. The message “WARN util.NativeCodeLoader: Unable to load native-hadoop library” is OK (you can see it in my screenshots above), but double check the xml files and also look at any log files for further information.

  11. Hi AJ and others, I am getting “could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode”. The only difference of my installation is that I used root instead of hadoopuser. Could someone please help?

    • I have not tried it using root, sometimes there are issues with this, however double check environment variables such as paths, since there are so many manual steps to configuration I have found that typos do not always give very meaningful error message. Not sure if anyone else has tried this using a root account – if so please weigh in?

  12. Hi,

    This tutorial was very helpful. I have installed hadoop and single node cluster is working just fine. Now I am interested in connecting 2 nodes and make it multi node cluster. Do you have any tutorials for it ?

    Thanks.

    • I do intend to do this but it will not be for a while; unfortunately I am now looking at Ceph and OpenStack first as I have been getting a lot of requests for those – thanks for your comments.

  13. I am new to hadoop and linux , after following steps i installed java ,but i am not able to find java -version . java installation was successful. where i can be wrong.

      • Where is the path environment defined in instructions , i tried to remove all java folders where it is been installed and again i tried to install java , it says Nothing to do , as if it is already installed . but if i do java -version , it says bash: java: command not found.help me to resolve this issue

        • This is really a UNIX/Linux thing and it would be hard for me to cover the basics here – but above it discusses the /etc/alternatives folder, which should show where Java has been installed; the arrow will be the path to the file. Have a look in this location. Normally the installation should set this up, but you can also edit the ~/.bashrc file to update the path. Type in env as well or enter echo $PATH. Good luck!

  14. Hi,

    I have installed the hadoop single node cluster on VM with Ubuntu OS. I am able to run hadoop on command line. But I am not able to run hadoop web interface. I have my VM ip address as private_network: 192.168.56.133. Do I have to do anything with this? But If I give 192.168.56.133 on web browser I am able to see file not found. Please help me.

    Thanks,
    Kowsalya

    • Try pinging your IP from another node to make sure it is accessible. Also, when you run the command, does it run long enough? It may have finished before the browser has a chance to display it. Do you get the URL tracking message when you run the job?

  15. How to ping inside the node.Sorry I am new to Hadoop.Below is the URL tracking message
    14/09/09 21:44:48 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
    14/09/09 21:44:48 INFO mapreduce.Job: Running job: job_local219030198_0001

    Below is my netstat command output
    nethduser@packer-virtualbox-iso:/etc/apache2/sites-available$ sudo netstat -plunt
    Active Internet connections (only servers)
    Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
    tcp 0 0 0.0.0.0:50020 0.0.0.0:* LISTEN 31886/java
    tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 31769/java
    tcp 0 0 127.0.0.1:9000 0.0.0.0:* LISTEN 26550/php-fpm.conf)
    tcp 0 0 0.0.0.0:50090 0.0.0.0:* LISTEN 32040/java
    tcp 0 0 127.0.0.1:3306 0.0.0.0:* LISTEN 13788/mysqld
    tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 26631/apache2
    tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 31769/java
    tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1948/sshd
    tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 31886/java
    tcp 0 0 0.0.0.0:50075 0.0.0.0:* LISTEN 31886/java
    tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN 26631/apache2
    tcp6 0 0 :::8040 :::* LISTEN 32270/java
    tcp6 0 0 :::8042 :::* LISTEN 32270/java
    tcp6 0 0 :::54986 :::* LISTEN 32270/java
    tcp6 0 0 :::22 :::* LISTEN 1948/sshd
    tcp6 0 0 :::8088 :::* LISTEN 32174/java
    tcp6 0 0 :::8030 :::* LISTEN 32174/java
    tcp6 0 0 :::8031 :::* LISTEN 32174/java
    tcp6 0 0 :::8032 :::* LISTEN 32174/java
    tcp6 0 0 :::8033 :::* LISTEN 32174/java
    udp 0 0 0.0.0.0:68 0.0.0.0:* 686/dhclient3
    udp 0 0 192.168.56.133:123 0.0.0.0:* 14005/ntpd
    udp 0 0 10.0.2.15:123 0.0.0.0:* 14005/ntpd
    udp 0 0 127.0.0.1:123 0.0.0.0:* 14005/ntpd
    udp 0 0 0.0.0.0:123 0.0.0.0:* 14005/ntpd
    udp6 0 0 fe80::a00:27ff:fe50:123 :::* 14005/ntpd
    udp6 0 0 fe80::a00:27ff:fe23:123 :::* 14005/ntpd
    udp6 0 0 ::1:123 :::* 14005/ntpd
    udp6 0 0 :::123 :::* 14005/ntpd

    • Hi, I probably will not be able to offer too much more help here (due to my extensive travel and my day job constraints), however make sure the job runs long enough as the link will no longer be valid when it finishes. Also, what happens when you type localhost as the URL rather than the actual IP address? Pinging inside the node should work (just enter ping 192.168.56.133 at the command line); I was more curious to see if it could be reached from another node. If the web server is not running, check for any error messages when you did the install. Also see if the web server is running.

      You could also try some of the hadoop forums as I set this up quite a while ago and no longer have access to a running system

      Thanks

  16. Hi AJ, I must say this is one of the best tutorial on Hadoop installation, very precise and accurate. Thanks again for helping us all.

    Regards,
    Farrukh

  17. Thanks for the excellent guide. These instructions work with Hadoopv2.5.1 & CentOS6.5 (your steps modified to match different version numbers of course).
    This is also the best guide I have found. Thank-you very much.
    A few thoughts and questions:
    (1) I can’t access the web-interface from my workstation, but I can from within a shell running on the ‘master’ server. (i.e. elinks/links/lynx http://192.168.0.xxx/… works but only locally, when I try to access it from my workstation I can’t connect [Error: problem loading page])
    (2) I need to make a multinode cluster (6 servers total) do you have any advice on how I should proceed and/or plans on adding to this guide?
    (3) I think ‘jps’ is missing from some java installations.
    $java -version
    java version “1.7.0_65”
    OpenJDK Runtime Environment (rhel-2.5.1.2.el6_5-x86_64 u65-b17)
    OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

    but other tests/installs do have it.

    • Hi Matthew, regarding access from the workstation verify that you can ping the IP and I would also disable any firewalls to see if that helps. Regarding Java, I found that one of the trickiest areas was here. I ran into the jps issue a number of times and eventually found a process that worked. I have not got around to doing anything more yet on the multiple node cluster but hopefully there will some good resources out there – Good Luck – AJ

  18. I have one question, why are you adding the user to the sudoers file? It breaks the security for the hadoop installation. Is better to keep the user as a “normal” user without any rights to the system.

    Otherwise a good tutorial-

    • Thanks Patrick – as you see security is probably not my strong point – working in a test environment I have bad habits such as disabling firewalls turning off selinux etc. I take your point but I do not recommend that any of my tutorials are used in production environments. I typically turn off security as it is hard enough to get many of these applications working without security blocking them. So my approach is to turn off security and then when the system is up and running “harden” the system according to needs but yes I agree there is no excuse really for sloppiness!

      Thanks – AJ

  19. [sudo] password for hadoopuser:
    hadoopuser is not in the sudoers file. This incident will be reported.
    [hadoopuser@osboxes ~]$ sudo chown hadoopuser ~/.ssh
    [sudo] password for hadoopuser:
    hadoopuser is not in the sudoers file. This incident will be reported.
    [hadoopuser@osboxes ~]$ sudo chmod 700 ~/.ssh
    [sudo] password for hadoopuser:
    hadoopuser is not in the sudoers file. This incident will be reported.
    [hadoopuser@osboxes ~]$ sudo chmod 600 ~/.ssh/id_rsa
    [sudo] password for hadoopuser:
    hadoopuser is not in the sudoers file. This incident will be reported.
    [hadoopuser@osboxes ~]$ sudo cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    bash: /home/hadoopuser/.ssh/authorized_keys: No such file or directory

    its shows this in Vmware workstation
    what should i do now?

    • Did you miss this step?
      “The next step is to give hadoopuser access to sudo commands. Do this by executing the visudo command and adding the highlighted line shown below.”

  20. Hi AJ

    I installed Hadoop 2.2 and Redhat 6.5, everything seems OK, except Datanode not starting, would you please advise?

Comments and suggestions for future articles welcome!
