Hi folks. This post walks you through a step-by-step installation of Apache Hadoop 2.6. You can install it on Ubuntu on your laptop or on an Ubuntu AMI provided by AWS. If you are installing on Ubuntu Linux on your laptop, you can jump straight to step 2 of Part 2.
PART 1: Creating an EC2 Instance on AWS
1. From services, select “EC2”.
2. Set the region to US West (N. California) (don’t forget to change the region to US West every time you log in)
3. Go to the Dashboard and click on EC2. Under Network & Security, click on Key Pairs and create a new Key Pair. Save this key; it will be used in later steps.
4. To create a new Instance, click on “Launch Instance”.
5. To choose an Amazon Machine Image (AMI), Select “Ubuntu Server 14.04 LTS (HVM), SSD Volume Type – ami-9a562df2”
6. To choose an Instance Type, select “m3.large”. (Cost: roughly $3.36/day.)
7. Click “Next: Configure Instance Details”.
8. From the IAM role drop-down box, select “admin”. Select the “Protect against accidental termination” check box, then hit “Next: Add Storage”.
9. If you don’t have an admin role, go to the Dashboard and click IAM. Create a new role; under AWS Service Roles, select Amazon EC2. It will show different policy templates. Choose Administrator Access and save.
10. On the Storage Device settings page, keep the defaults and click “Next: Tag Instance”.
11. Select the “Create a new security group” option and set the Security Group name to “open ports”.
12. To enable ping, select “All ICMP” in the “Create a new rule” drop-down and click “Add Rule.” Do the same to enable HTTP (ports 80 and 8000) access, then click “Continue.”
13. To allow Hadoop to communicate and expose its various web interfaces, we need to open a number of ports: 22, 9000, 9001, 50070, 50030, 50075, 50060. Again click “Add Rule” and enable each of these ports (a command-line alternative is sketched at the end of this part). Optionally you can allow all traffic instead, but be careful: never share your PEM key or AWS credentials with anyone or on websites like GitHub.
14. Review: Click “Launch” and click “Close” to close the wizard.
15. Now, to access your EC2 instances, click on “Instances” in the left pane.
16. Select the instance check box and hit “Launch Instance”. (It will take a while to start the virtual instance; go ahead once it shows “running”.)
17. Now click on “Connect” for instructions on how to SSH into your instance.
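If you prefer the command line, the ingress rules from step 13 can also be added with the AWS CLI. This is only a sketch, assuming the AWS CLI is installed and configured with your credentials and that the security group is named “open ports” (in a non-default VPC you would reference the group with --group-id instead):
# open one of the Hadoop ports; repeat for 22, 9000, 9001, 50070, 50030, 50075, 50060
aws ec2 authorize-security-group-ingress --group-name "open ports" --protocol tcp --port 50070 --cidr 0.0.0.0/0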
PART 2: Installing Apache Hadoop
1. Log in to the new EC2 instance using ssh (replace the IP address below with your instance’s public IP):
ssh -i aws-key.pem ubuntu@54.183.30.236
2. Log in as the root user to install the base packages (Java 8):
sudo su
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
Note: If you have a different version of Java, that is fine as long as you keep the directory paths consistent in the steps below.
3. Check the Java version:
java -version
4. Download the latest stable Hadoop release using wget from one of the Apache mirrors and extract it:
wget http://www.trieuvan.com/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
tar xzf hadoop-2.6.0.tar.gz
5. Create a directory where Hadoop will store its data. We will set this directory path in hdfs-site.xml.
mkdir hadoopdata
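Optionally, you can also create the NameNode and DataNode subdirectories now; the paths below match the values we will put into hdfs-site.xml in step 8 (adjust them if your home directory differs):
mkdir -p /home/ubuntu/hadoopdata/hdfs/namenode
mkdir -p /home/ubuntu/hadoopdata/hdfs/datanode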
6. Add the Hadoop-related environment variables to your bash file.
vi ~/.bashrc
Copy and paste these environment variables.
export HADOOP_HOME=/home/ubuntu/hadoop-2.6.0
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Save and exit and use this command to refresh the bash settings.
source ~/.bashrc
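To confirm the variables are picked up, a quick check like the one below should print your Hadoop home path and the Hadoop version (the exact output depends on your setup):
echo $HADOOP_HOME
hadoop version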
7. Set up the Hadoop environment for passwordless SSH access. Passwordless SSH configuration is a mandatory installation requirement; it becomes even more useful in a distributed environment.
ssh-keygen -t rsa -P ''
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
## check passwordless SSH access to localhost
ssh localhost
#exit from inner localhost shell
exit
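If ssh localhost still asks for a password, the usual cause is overly permissive .ssh permissions; tightening them as below generally fixes it:
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys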
8. Set up the Hadoop config files. We need to edit the files below for Hadoop to function properly.
- core-site.xml
- hadoop-env.sh
- yarn-site.xml
- hdfs-site.xml
- mapred-site.xml
# go to the directory where all the config files are present (cd /home/ubuntu/hadoop-2.6.0/etc/hadoop)
- Copy and paste the below configurations in core-site.xml
##Add the following text between the <configuration> tags.
<property>
<name>hadoop.tmp.dir</name>
<value>/home/ubuntu/hadooptmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
- Copy and paste the below configurations in hadoop-env.sh
# get the java home directory using:
readlink -f `which java`
Example output: /usr/lib/jvm/java-8-oracle/jre/bin/java (NOTE THE JAVA_HOME PATH. JUST GIVE THE BASE DIRECTORY PATH)
##Need to set JAVA_HOME in hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
- Copy and paste the below configurations in mapred-site.xml
#copy mapred-site.xml from mapred-site.xml.template
cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
#Add the following text between the <configuration> tags.
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
- Copy and paste the below configurations in yarn-site.xml
##Add the following text between the <configuration> tags.
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
- Copy and paste the below configurations in hdfs-site.xml
##Add the following text between the <configuration> tags.
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/ubuntu/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/ubuntu/hadoopdata/hdfs/datanode</value>
</property>
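After saving the files, you can sanity-check that Hadoop actually reads your configuration; for example, hdfs getconf should echo back the values set above:
hdfs getconf -confKey fs.default.name
hdfs getconf -confKey dfs.replication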
9. Format the HDFS file system via the NameNode (after installing Hadoop, we have to format the HDFS file system once before first use):
hdfs namenode -format
10. Issue the following commands to start Hadoop:
cd $HADOOP_HOME/sbin
./start-dfs.sh
./start-yarn.sh
#If you have properly completed step 6 (the ~/.bashrc PATH settings), you can start Hadoop from any directory. (Run these commands as the user that installed Hadoop.)
start-all.sh
OR you can separately start required services as below:
Name node:
hadoop-daemon.sh start namenode
Data node:
hadoop-daemon.sh start datanode
Resource Manager:
yarn-daemon.sh start resourcemanager
Node Manager:
yarn-daemon.sh start nodemanager
Job History Server:
mr-jobhistory-daemon.sh start historyserver
Once Hadoop has started, point your browser to http://localhost:50070/ (on EC2, replace localhost with your instance’s public IP or DNS name).
11. Check the Hadoop processes/daemons running on the machine with the Java Virtual Machine Process Status Tool:
jps
OR you can check TCP and port details by using
sudo netstat -plten | grep java
The complete set of commands used in the Hadoop CLI is provided here.
PART 3: Running a Sample Word Count Program
The MapReduce examples that come with the Hadoop package are located in hadoop-[VERSION]/share/hadoop/mapreduce. You can run those jars to see whether the Hadoop single-node cluster is set up properly.
1. An example data set is provided here. Save it under the name Holmes.txt
## Go to the directory where Holmes.txt is located and transfer it from your local machine to the remote EC2 instance using:
scp -i ~/aws/aws-key.pem Holmes.txt ubuntu@54.193.37.165:/home/ubuntu
Make sure to change the key path, file name, and EC2 machine address to your own values (for the Ubuntu AMI used above, the SSH user is ubuntu, matching the /home/ubuntu destination).
2. Now create a directory in Hadoop’s Distributed File System using:
hdfs dfs -ls /
hdfs dfs -mkdir /input
Go to the folder where Holmes.txt was copied, and from that folder run the command:
hdfs dfs -copyFromLocal Holmes.txt /input
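You can verify the upload with a quick listing; Holmes.txt should appear under /input:
hdfs dfs -ls /input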
3. Run hadoop-mapreduce-examples-2.6.0.jar as follows:
hadoop jar /home/ubuntu/hadoop-2.6.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /input /out1
(Note: adjust the path and jar name to match the version you installed; I had Hadoop 2.4 at the time of originally running this jar.)
‘/input’ indicates the HDFS directory where the Holmes.txt file is stored, and ‘/out1’ is the directory where the results are written (it must not already exist).
4. To see the results file,
hdfs dfs -cat /out1/part-r-00000
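If you would rather inspect the output on the local file system, you can also pull the result directory out of HDFS; a small sketch (the local path ./out1 is just an example):
hdfs dfs -get /out1 ./out1
cat ./out1/part-r-00000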
PART 4: Running a Hadoop Program on Eclipse IDE
The jar files required to run a Hadoop job depend on the program you are writing.
You can include all the jars that come with hadoop-2.6.0.tar.gz.
- Make a new folder in your project and copy all the required jar files into it.
- Select the jar files and right-click.
- Select Build Path > Add to Build Path.
Now all the errors will disappear and you can run the Hadoop Java code that you wrote.
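If you are unsure which jars to copy, one option is to collect everything under share/hadoop from the extracted archive. This is only a sketch; the workspace path and the project name MyHadoopProject are assumptions, so substitute your own:
# gather the Hadoop jars into a lib/ folder inside the Eclipse project
mkdir -p ~/workspace/MyHadoopProject/lib
find hadoop-2.6.0/share/hadoop -name "*.jar" -exec cp {} ~/workspace/MyHadoopProject/lib/ \;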
PART 5: Running a Program exported from your Eclipse IDE in Hadoop
1. Right click the project and select export.
2. Select Jar File option and check the option to include sources.
3. Click Finish.
4. You can use scp to transfer the jar file to your EC2 instance, or you can run it locally.
To run your job on Hadoop:
hadoop jar /path/to/your/jar/my_project.jar package_name.main_class /hdfs_input /hdfs_output
Note: Make sure you have the required input files on HDFS (use hdfs dfs -copyFromLocal ...).
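For example, copying the jar to the instance and staging the input might look like the following (the jar name my_project.jar, the data file mydata.txt, and the IP address are placeholders for your own values):
scp -i aws-key.pem my_project.jar ubuntu@54.183.30.236:/home/ubuntu
hdfs dfs -mkdir /hdfs_input
hdfs dfs -copyFromLocal mydata.txt /hdfs_input
hadoop jar /home/ubuntu/my_project.jar package_name.main_class /hdfs_input /hdfs_output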
TURN OFF Hadoop
Stop Hadoop by running the following command:
stop-all.sh
OR
Stop individual services with the following commands:
stop-dfs.sh
stop-yarn.sh
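Afterwards, jps should no longer list the Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager):
jps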