Keeping track of updates in the software world

In my experience, keeping track of tech-related changes is a difficult task, especially in the ever-updating world of the software industry. By keeping track, I don't mean mastering a specific tool or language, but simply having a basic idea of what new changes went into the latest release, or of some cool hack someone has done.

Here are some things I do to keep myself updated with the industry.

  1. Install Feedly on your PC, or the Feedly Chrome extension.
  2. Have Feedly or gReader on your cellphone.
  3. Dedicate an email address, or create rules that file such emails into a separate folder, as they can be annoying sometimes.

The feeds that I follow:

Tech & Startups:
Reddit programming
TechCrunch
Startups-TechCrunch
Fundings & Exits – TechCrunch
Buzzfeed
Buzzfeed-Tech
Hackernews
Martin Fowler
VentureBeat (VB>>Bigdata, VB>>Cloud, VB>>Dev)
Wired
PCWorld
The internet of things
Mozilla hacks

Programming focused:
Recent questions – Stack Exchange
GeeksforGeeks
CareerCup
Stack Overflow Blog

Company feeds:
Dropbox Tech Blog
Github Engineering
Netflix TechBlog
Facebook code
Yelp engineering and product blog
Yahoo engineering

Podcasts:
Podcast.__init__ (Python)
Talk Python To Me

Apart from the feeds, other methods I use to keep myself updated are email subscriptions, a few websites related to cloud computing, video channels, and meetups. Below are a few of them.

Email Subscriptions:
Javascript weekly
Informationweek
Infoworld
NoSQL weekly
DarkReading.com
hackernewsletter.com
javaworld.com
networkcomputing.com
frontendfoc.us
opensource.com

Websites:
High Scalability
SDTimes

Videos:
Brighttalk.com
Youtube Channel> Masters of scale
Youtube Channel> GeeksForGeeks
Youtube Channel> NetflixOpenSource
Youtube Channel> TechCrunch
Youtube Channel> FOSDEM

Meetups:
Local & online meetups are a great source of knowledge.
Some of the active meetup groups are:
GOOGLE CLOUD PLATFORM (GCP) ONLINE MEETUP
SF BAY AREA OPENSTACK
SILICON VALLEY NOSQL, BIG DATA, AND HIGH TECH
BAY AREA KUBERNETES MEETUP
BIG DATA MEETUP @ LINKEDIN
SF BAY ACM CHAPTER


Useful Linux commands: part 2: Process-related commands

1. PS

Process status (ps) gives information about the processes running on the system. If you want a repetitive update of this status, use top. The most commonly used options of ps are given below.

-e    select all processes

-f    full-format listing

-a    select all processes with a tty, except session leaders

-d    select all processes, but omit session leaders

a     select all processes with a terminal, including those of other users

x     select processes without controlling ttys
For a full list of ps options, see the ps man page (man ps).

Here are some examples of ps:

  • Display all processes

# ps -ef

  • Display process by user

# ps -f -u sagar

  • Show process by name or process id

# ps -C apache2

# ps -f -p 3255,9218,7542

  • Search process by partial name

# ps -ef | grep apache

  • Sort process by cpu or memory usage

# ps aux --sort=-pcpu,+pmem

(+ = ascending order, - = descending order)

  • Display process hierarchy in a tree style

# ps -f --forest -C apache2

  • Display a parent process and its child processes

# ps -o pid,uname,comm -C apache2

PID   USER    COMMAND
2359  root      apache2
4524  www-data apache2
4525  www-data apache2
4526  www-data apache2
4527  www-data apache2
4528  www-data apache2
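The -C option above selects by command name, so the parent and its children are listed together. To select only the direct children of a known parent PID, procps ps also supports --ppid; a small sketch, using the parent PID 2359 from the output above:

# ps -o pid,uname,comm --ppid 2359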

  • Display threads of a process

# ps -p 3150 -L

  • Change the columns to display

# ps -e -o pid,uname,pcpu,pmem,comm

It is possible to rename the column labels

# ps -e -o pid,uname=USERNAME,pcpu=CPU_USAGE,pmem,comm

  • Display elapsed time of processes

# ps -e -o pid,comm,etime

  • Turn ps into a realtime process viewer

# watch -n 1 'ps -e -o pid,uname,cmd,pmem,pcpu --sort=-pmem,-pcpu | head -15'

2. KILL

kill sends a signal to a process specified by its PID. The default signal requests an orderly termination; other signals can force a kill or trigger other behavior.

# kill [-s sigspec] [-n signum] [-sigspec] jobspec or pid

# kill -l [exit_status]

# kill -l [sigspec]

-l   List the signal names

-s   Send a specific signal

-n   Send a specific signal number

Send a signal specified by sigspec or signum to the process named by job specification jobspec or process ID pid.

sigspec is either a case-insensitive signal name such as SIGINT (with or without the SIG prefix) or a signal number; signum is a signal number.

If sigspec is not present, SIGTERM is used (Terminate).

If any arguments are supplied when -l is given, the names of the signals corresponding to the arguments are listed, and the return status is zero. exit_status is a number specifying either a signal number or the exit status of a process terminated by a signal.

The return status is true if at least one signal was successfully sent, or false if an error occurs or an invalid option is encountered.

Common kill signals

Signal name   Signal value   Effect
SIGHUP        1              Hangup
SIGINT        2              Interrupt from keyboard
SIGQUIT       3              Quit
SIGABRT       6              Abort
SIGKILL       9              Kill signal (cannot be caught or ignored)
SIGTERM       15             Termination signal, allows an orderly shutdown
SIGSTOP       17,19,23       Stop the process

Here are some examples of kill:

  • List the running process: # ps

PID   TTY    TIME      CMD
1293  pts/5  00:00:00  MyProgram

  • Then kill it: # kill 1293

[2]+ Terminated MyProgram

  • To run a command and then kill it after 5 seconds:

# my_command & sleep 5
# kill -0 $! && kill $!

kill is also a bash builtin command: # help kill
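A common pattern is to try a polite SIGTERM first and escalate to SIGKILL only if the process survives. A minimal sketch, using the example PID from above:

pid=1293                                           # example PID
kill -TERM "$pid"                                  # ask the process to shut down cleanly
sleep 5                                            # give it time to clean up
kill -0 "$pid" 2>/dev/null && kill -KILL "$pid"    # force-kill only if it is still alive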

3. TOP

The top command displays your system's processor activity and the tasks managed by the kernel in real time, showing processor and memory usage. Use top with the -u option to display a specific user's processes, as shown below.

# top -u sagar

# top

Press 'O' (the uppercase letter, i.e. Shift+O) to choose a sort field via its field letter; for example, press 'a' to sort processes by PID (process ID). Press 'q' to quit the top screen.

Press 'z' while top is running to display the running processes in color, which may help you identify them more easily.

Press 'c' while top is running to display the absolute path of each running process.

By default the screen refresh interval is 3.0 seconds; you can change it by pressing 'd' while top is running and entering the desired value.

You can kill a process without exiting top: find its PID, then press 'k' and enter the PID.

Press 'r' to change the priority of a process, also called renice.

Press Shift+W to save the running top configuration to ~/.toprc (under /root/.toprc when running as root).

Press 'h' to see top's built-in help.

Top keeps refreshing its output until you press 'q'. With the command below, top automatically exits after 10 iterations.

# top -n 10
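top can also run non-interactively in batch mode, which is handy for logging; the output file name here is my own choice:

# top -b -d 5 -n 3 > top_snapshots.log     (-b batch mode, -d delay in seconds, -n iterations)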

4. FREE

The free command shows total, used, and free physical and swap memory. By default the values are reported in kilobytes, not bytes.

# free

free with the -t option adds a total line that combines physical and swap memory.

# free -t
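For friendlier units, free accepts size flags; -m and -g work almost everywhere, while -h (human-readable) exists in newer procps-ng versions:

# free -m
# free -h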

5. SERVICE

The service command runs the init script located in the /etc/init.d/ directory. There are two equivalent ways to start a service.

# service httpd start

OR

# /etc/init.d/httpd start

We can also stop or restart the service accordingly.
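For example (httpd here stands for whatever service script exists on your system), and on newer systemd-based distributions systemctl replaces these scripts:

# service httpd status
# service httpd restart
# systemctl restart httpd      (systemd equivalent)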

6. KILLALL

killall kills processes by name.

# killall [option(s)] [--] name ...

-g, --process-group    Kill the process group to which the process belongs. The kill signal is only sent once per group, even if multiple processes belonging to the same process group were found.

-l, --list             List all known signal names.

-r, --regexp           Interpret the process name pattern as an extended regular expression.

-s signal, --signal signal    Send this signal instead of the default SIGTERM (e.g. -9 = SIGKILL).

-u user, --user user   Kill only those processes the specified user owns. Command names are optional.

name                   The command/process to be killed.

Example:

Kill Firefox: # killall -9 firefox

7. HTOP

htop is an interactive process viewer, useful for finding the CPU-intensive programs currently running. You may need to install htop, since it doesn't come preinstalled on most Linux distributions.

# htop

htop has many commands available on its console output; see its man page for further details. It is similar to top, but better in the sense that it provides more functionality and ease of use.
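Installing it is usually a one-liner from your distribution's package manager (on RHEL/CentOS the EPEL repository may need to be enabled first):

# apt-get install htop     (Debian/Ubuntu)
# yum install htop         (RHEL/CentOS)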

8. FUSER

fuser identifies processes using files or sockets, and can optionally kill the processes accessing a given file.

# fuser [-a|-s|-c] [-4|-6] [-n space] [-k [-i] [-signal]] [-muvf] name

  • To view processes using a directory (here, the current directory)

# fuser -v .

  • To view if a process is using your tcp or udp socket

# fuser -v -n tcp 80

  • To kill a process

# fuser -k 123/tcp

  • To kill a process with confirmation

# fuser -i -k 123/tcp

  • Display all processes accessing filesystem on which ‘example.txt’ resides.

# fuser -v -m example.txt

  • You can send a specific signal to a process. To view the complete list of signals available in fuser:

# fuser -l

There are many more options available in fuser; it is one of the most popular tools for troubleshooting and managing processes. See its man page for details.

9. PGREP/ PKILL

pgrep and pkill look up or signal processes by a full or partial name.

pgrep searches the process table on the running system and prints the process IDs of all processes that match the criteria given on the command line.

pkill searches the process table on the running system and signals all processes that match the criteria given on the command line.

All the criteria have to match.

Example:

# pgrep -u root sshd

will only list the processes called sshd and owned by root.

# pgrep -u root,daemon

will list the processes owned by root OR daemon.

  • Find the process ID of the named daemon

# pgrep -u root named

  • Make syslog reread its configuration file

# pkill -HUP syslogd

  • Give detailed information on all xterm processes

# ps -fp $(pgrep -d, -x xterm)
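pgrep and pkill match against the process name by default; with -f they match against the full command line, which helps when many processes share an interpreter name (the script name below is hypothetical):

# pkill -f "python myscript.py"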

 

For more details, see the pgrep and pkill man pages.

10. PMAP

This command is used to show the memory map of a process.

  • To get the pid

# ps aux | grep <process_name>

Then run

# pmap -x <pid>

  • Display process map in extended format

# pmap -x 3401

3401: man pmap
Address   Kbytes   RSS   Anon   Locked   Mode    Mapping
00110000    1272     -      -        -   r-x--   libc-2.10.1.so
0024e000       4     -      -        -   -----   libc-2.10.1.so
0024f000       8     -      -        -   r----   libc-2.10.1.so

b785e000       8     -      -        -   rw---   [ anon ]
bf7eb000      84     -      -        -   rw---   [ stack ]
--------  ------  ----  -----   ------
total kB    2192     -      -        -

  • Display in Device Format

# pmap -d 18282
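A handy trick is combining pmap with pgrep to watch the total mapped memory of a named process; a sketch assuming the name matches exactly one process (myapp is a hypothetical name):

# pmap -x $(pgrep -x myapp) | tail -1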

 

Useful Linux commands: part 1: Networking

1. IFCONFIG

ifconfig (interface configurator) is used to initialize an interface, assign an IP address to an interface, and enable or disable an interface on demand.

With this command you can view the IP address and hardware/MAC address assigned to an interface, as well as its MTU (maximum transmission unit) size.

Check all the details provided by the command by typing in:

# ifconfig

OR

# ifconfig eth0

  • Assign an IP Address and Gateway to interface on the fly:

# ifconfig eth0 192.168.45.34 netmask 255.255.255.0

  • To enable or disable specific Interface.

Enable eth0:

# ifup eth0

Disable eth0:

# ifdown eth0

  • Setting MTU Size. (By default MTU size is 1500.)

# ifconfig eth0 mtu XXXX

  • Set an interface in promiscuous mode.

A network interface normally receives only the packets addressed to that particular NIC. If you put the interface in promiscuous mode, it will receive all packets on the wire. This is very useful for capturing packets to analyze later, and may require superuser access.

# ifconfig eth0 promisc      (disable again with: ifconfig eth0 -promisc)
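ifconfig comes from the legacy net-tools package; on modern distributions the iproute2 ip command covers the same ground:

# ip addr show eth0
# ip addr add 192.168.45.34/24 dev eth0
# ip link set eth0 up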

2. PING

ping (Packet INternet Groper) is the simplest way to test connectivity between two nodes, be it on a Local Area Network (LAN) or a Wide Area Network (WAN). Ping uses ICMP (Internet Control Message Protocol) to communicate with other devices.

  • Ping using IP address.

# ping 8.8.8.8

  • Ping using hostname

# ping www.google.com

  • On Linux, ping keeps executing until you interrupt it. Use the -c option to exit after N requests.

# ping -c 4 www.google.com
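Two more flags worth knowing: -i sets the interval between probes and -W the per-reply timeout in seconds:

# ping -c 4 -i 0.5 -W 2 www.google.com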

3. TRACEROUTE

Traceroute is a network troubleshooting utility which shows the number of hops taken to reach a destination and the path the packets travel.

Below we trace the route to a global DNS server IP address; the output shows every hop on the packet's path to the destination.

# traceroute 8.8.8.8

traceroute to 8.8.8.8 (8.8.8.8), 64 hops max, 52 byte packets
1  connect.onboard.info (10.0.0.1)  34.836 ms  7.279 ms  5.456 ms
2  172.26.96.161 (172.26.96.161)  34.796 ms  52.152 ms  33.806 ms
3  172.16.157.164 (172.16.157.164)  47.693 ms  42.050 ms  51.715 ms
4  12.249.2.49 (12.249.2.49)  53.378 ms  80.842 ms  59.054 ms
5  12.83.180.82 (12.83.180.82)  80.668 ms  97.070 ms *
6  12.122.137.181 (12.122.137.181)  49.444 ms  54.285 ms  61.081 ms
7  12.250.31.10 (12.250.31.10)  40.703 ms  51.372 ms  53.838 ms
8  209.85.244.23 (209.85.244.23)  59.456 ms
    209.85.241.171 (209.85.241.171)  48.018 ms
    209.85.244.23 (209.85.244.23)  43.476 ms
9  216.239.49.103 (216.239.49.103)  44.689 ms
    216.239.58.195 (216.239.58.195)  44.356 ms
    216.239.56.137 (216.239.56.137)  47.963 ms
10  google-public-dns-a.google.com (8.8.8.8)  45.216 ms  69.014 ms  54.033 ms

4. NETSTAT

netstat (network statistics) displays connection information, routing table information, and many more statistics related to TCP, UDP, and active ports.

Some useful options to combine are listed below.

-r    display the routing table

-a    all ports

-t    TCP ports

-u    UDP ports

-l    all active listening connections

-x    unix sockets

-s    display statistics by protocol

-p    display the PID and program name each socket belongs to

-c    display information continuously, refreshing every second

-i    network interface packet transactions, both transmitted and received, with MTU size

-g    display multicast group membership info for IPv4 & IPv6

--statistics --raw    display raw network statistics
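A combination you will see everywhere lists all listening TCP/UDP sockets numerically along with their owning programs:

# netstat -tulpn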

5. DIG

dig (domain information groper) queries DNS information such as A records, CNAMEs, MX records, etc. This command is mainly used to troubleshoot DNS queries.

# dig www.google.com

Dig command reads the /etc/resolv.conf file and queries the DNS servers listed there. The response from the DNS server is what dig displays.

  • By default dig is quite verbose. One way to cut down the output is the +short option, which drastically trims it, as shown below.

# dig www.google.com +short

  • Query mail exchange server

# dig google.com MX

  • Query SOA record of the domain

# dig google.com SOA

  • Query the domain and inspect the TTL values of its records

# dig google.com +noall +answer

  • Query only ANSWER SECTION

# dig google.com +nocomments +noquestion +noauthority +noadditional +nostats

  • Querying ALL DNS Records Types

# dig google.com ANY +noall +answer

  • Reverse Look-up the DNS (IP to hostname)

# dig -x 8.8.8.8 +short

  • Querying Multiple DNS Records

# dig google.com mx +noall +answer redhat.com ns +noall +answer
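You can also direct the query at a specific DNS server with the @ syntax, bypassing the resolvers in /etc/resolv.conf:

# dig @8.8.8.8 google.com +short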

6. NSLOOKUP

The nslookup command is used to query DNS records.

# nslookup www.google.com

  • Reverse Domain Lookup

# nslookup 8.8.8.8

  • Specific Domain Lookup

# nslookup google-public-dns-a.google.com

  • Query MX (Mail Exchange) record

# nslookup -query=mx google.com

  • Query NS (Name Server) record

# nslookup -query=ns google.com

  • query SOA (Start of Authority) record.

# nslookup -type=soa google.com

  • query all Available DNS records.

# nslookup -query=any google.com

  • To enable Debug Mode

# nslookup -debug google.com

7. ROUTE

The route command shows and manipulates the routing table. To see the default routing table in Linux, type the following command.

# route

  • Route Adding

# route add -net 10.10.10.0/24 gw 192.168.0.1

  • Route Deleting

# route del -net 10.10.10.0/24 gw 192.168.0.1

  • Adding default Gateway

# route add default gw 192.168.0.1
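Like ifconfig, route is part of the legacy net-tools package; the iproute2 equivalents are:

# ip route show
# ip route add 10.10.10.0/24 via 192.168.0.1
# ip route add default via 192.168.0.1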

8. HOST

The host command performs name-to-IP and IP-to-name lookups, for IPv4 or IPv6, and can also query DNS records.

# host www.google.com

Using the -t option we can look up specific DNS resource records like CNAME, NS, MX, SOA, etc.

# host -t CNAME www.google.com

9. ARP

arp (Address Resolution Protocol) is used to view or add entries in the kernel's ARP table.

To see default table:

# arp -e

10. ETHTOOL

ethtool is used to view or set speed and duplex of your Network Interface Card (NIC).

You can set duplex permanently in /etc/sysconfig/network-scripts/ifcfg-eth0 with ETHTOOL_OPTS variable.

# ethtool eth0

11. IWCONFIG

iwconfig is used to configure a wireless network interface. You can view and set basic Wi-Fi details like the SSID, channel, and encryption. Refer to the iwconfig man page to know more.

# iwconfig [interface]

12. HOSTNAME

hostname shows the name that identifies the machine on a network.

# hostname

You can set the hostname permanently in /etc/sysconfig/network. You need to reboot the box after setting a new hostname.

13. TELNET

It connects to a destination host via the telnet protocol. If a telnet connection can be established on a port, connectivity between the two hosts is working fine.

# telnet hostname port

This telnets to the host on the specified port. Normally it is used to check whether a host is alive and the network connection is fine.

For more information regarding the different flags of telnet, see its man page.

Telnet is a client-server protocol, based on a reliable connection-oriented transport. Typically, the protocol is used to establish a connection to Transmission Control Protocol (TCP) port number 23, where a Telnet server application (telnetd) is listening. Telnet is both a network protocol and an application that uses that protocol. Most often, telnet is used to connect to remote computers and issue commands on those computers. It's like a remote control for the internet!

14. SSH

SSH (Secure Shell) is used to log into remote machines. Below are few different ways to log into remote servers.

# ssh -l username remote_server.example.com

OR

# ssh username@remote_server

  • Running remote commands from local host.

# ssh user@remote_server "cat /etc/hosts"

  • SSH into machine in debug mode

# ssh -v username@remote_server

  • If you have generated a .pem key for passwordless ssh, to log in:

# ssh -i ~/path/to/pem_key user@remote_server
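Typing the key path every time gets old; an entry in ~/.ssh/config saves it (all names here are placeholders):

Host myserver
    HostName remote_server.example.com
    User username
    IdentityFile ~/path/to/pem_key

After that, # ssh myserver is enough.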

15. SCP

SCP is used to transfer files from localhost to server or from server to localhost.

  • Copy file from the remote server to the localhost:

# scp username@remote_server:/home/username/abc.txt abc.txt

  • Copy file from the localhost to the remote server:

# scp abc.txt username@remote_server:/home/username/abc.txt

  • If you are using pem key, append -i ~/path/to/pem_key, just like we did it in ssh command.
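scp also copies whole directories with -r, and takes the same -i key flag as ssh (my_dir is a placeholder):

# scp -r -i ~/path/to/pem_key my_dir/ username@remote_server:/home/username/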

16. WHO

The who command returns the user name, date, time, and host information. who is similar to the w command, but unlike w it doesn't print what users are doing.

# who

17. WHOIS

whois queries the registration databases for domain names and IP address allocations. It is very useful when you are trying to trace an IP address back to its owner, or to find out who registered a hostname.

# whois example.com

# whois 8.8.8.8

By default the client picks an appropriate whois server for the query; with -h you can query a specific server instead:

# whois -h whois.arin.net 8.8.8.8

18. TRACEPATH

It traces the path to a network host, discovering the MTU along the path. It uses a UDP port, or some random port. It is similar to traceroute, but does not require superuser privileges and has no fancy options.

# tracepath6 3ffe:2400:0:109::2

1?: [LOCALHOST] pmtu 1500
1: dust.inr.ac.ru 0.411ms
2: dust.inr.ac.ru asymm 1 0.390ms pmtu 1480
2: 3ffe:2400:0:109::2 463.514ms reached
Resume: pmtu 1480 hops 2 back 2

The first column shows the TTL of the probe, followed by a colon. Usually the TTL value is obtained from the network's reply, but sometimes the reply does not contain the necessary information and we have to guess it; in that case the number is followed by ?.

The second column shows the network hop which replied to the probe: either the address of a router, or the word [LOCALHOST] if the probe was not sent to the network.

The rest of the line shows miscellaneous information about the path to the corresponding network hop.

19. UPTIME

uptime displays how long the system has been running, the number of users currently logged in, and the load averages over the last 1, 5, and 15 minutes.

# uptime

20. W

w displays the users currently logged in and their processes, along with the load averages. It shows the login name, tty name, remote host, login time, idle time, JCPU, PCPU, and the command of each session.

# w

The related who command accepts these options:

-b    displays the last system boot date and time

-r    shows the current runlevel

-a, --all    displays all of the above cumulatively

21. USERS

The users command displays the currently logged-in users. It has no parameters other than --help and --version.

# users

22. FTP & SFTP

The ftp (file transfer protocol) or sftp (secure file transfer protocol) command is used to connect to a remote ftp host.

# ftp 192.168.50.2

# sftp 192.168.50.2

Inside the session, we can upload multiple files to the remote host with mput, and similarly download multiple files with mget.

ftp> mput *.txt

ftp> mget *.txt

23. DHCLIENT

The dhclient command can release your computer’s IP address and get a new one from your DHCP server. This requires root permissions, so use sudo on Ubuntu. Run dhclient with no options to get a new IP address or use the -r switch to release your current IP address.

# sudo dhclient -r
# sudo dhclient

24. SS (NETSTAT)

ss command is a replacement for netstat.

Using ss, you can get more information than with the netstat command. ss is fast because it gets its information directly from the kernel, instead of parsing /proc files the way netstat does.

  • Listing all connections:

# ss

  • Filtering TCP, UDP, and Unix sockets

ss accepts much the same flags as netstat for this; check the netstat flag list above.
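For example, the listening-socket overview looks just like its netstat counterpart, and ss adds its own state filters:

# ss -tulpn
# ss -t state established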

25. SNMPWALK

snmpwalk retrieves a subtree of management values using SNMP GETNEXT requests, querying a network entity for a tree of information.


# snmpwalk -v 2c -c demopublic test.net-snmp.org system

SNMPv2-MIB::sysDescr.0 = HP-UX net-snmp B.10.20 A 9000/715
SNMPv2-MIB::sysObjectID.0 = OID: enterprises.ucdavis.ucdSnmpAgent.hpux10
SNMPv2-MIB::sysUpTime.0 = Timeticks: (586998396) 67 days, 22:33:03.96
SNMPv2-MIB::sysContact.0 = Wes Hardaker wjhardaker@ucdavis.edu
SNMPv2-MIB::sysName.0 = net-snmp
SNMPv2-MIB::sysLocation.0 = UCDavis
SNMPv2-MIB::sysORLastChange.0 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORID.1 = OID: SNMPv2-MIB::snmpMIB
SNMPv2-MIB::sysORID.2 = OID: IF-MIB::ifMIB
SNMPv2-MIB::sysORID.4 = OID: IP-MIB::ip
SNMPv2-MIB::sysORID.5 = OID: UDP-MIB::udpMIB
SNMPv2-MIB::sysORDescr.1 = The Mib module for SNMPv2 entities.
SNMPv2-MIB::sysORDescr.2 = The MIB module to describe generic objects for network interface sub-layers
SNMPv2-MIB::sysORDescr.4 = The MIB module for managing IP and ICMP implementations
SNMPv2-MIB::sysORDescr.5 = The MIB module for managing UDP implementations
SNMPv2-MIB::sysORUpTime.1 = Timeticks: (82) 0:00:00.82
SNMPv2-MIB::sysORUpTime.2 = Timeticks: (81) 0:00:00.81
SNMPv2-MIB::sysORUpTime.4 = Timeticks: (83) 0:00:00.83
SNMPv2-MIB::sysORUpTime.5 = Timeticks: (82) 0:00:00.82

  • To get info of a single MIB (scalar) object, or an instance OID

# snmpwalk -v 2c -c demopublic test.net-snmp.org sysDescr

SNMPv2-MIB::sysDescr.0 = HP-UX net-snmp B.10.20 A 9000/715

# snmpwalk -v 2c -c demopublic test.net-snmp.org sysDescr.0

SNMPv2-MIB::sysDescr.0 = HP-UX net-snmp B.10.20 A 9000/715

 

Hadoop: Setting up Hadoop 2.6.0 (single node) on AWS EC2 Ubuntu AMI

Hi folks. This post will guide you through a step-by-step installation of Apache Hadoop 2.6. You can install it on Ubuntu on your laptop, or on an Ubuntu AMI provided by AWS. If you are installing on Ubuntu on your laptop, you can jump to point 2 of part 2.

PART 1: Creating an EC2 Instance on AWS

1. From services, select “EC2”.
2. Set the region to US West (N. California) (don’t forget to change the region to US West every time you log in)
3. Go to Dashboard. Click on EC2. Under Network & Security click on Key Pairs and create new Key Pair. Save this key, it will be used in later steps.
4. To create a new Instance, click on “Launch Instance”.
5. To choose an Amazon Machine Image (AMI), Select “Ubuntu Server 14.04 LTS (HVM), SSD Volume Type – ami-9a562df2”
6. For the instance type, select "m3.large". (Costs about $3.36/day.)
7. Click “Next: Configure Instance Details”.
8. From the IAM role drop-down box, select "admin". Check the "Protect against accidental termination" box. Then hit "Next: Add Storage".
9. If you don’t have admin role. Go to Dashboard and click IAM. Create a new role. Under AWS service role select Amazon EC2. It will show different policy templates. Choose administrator access and save.
10. Click "Next: Tag Instance" again in the storage device settings (keep the default settings).
11. Select the "Create a new security group" option and set the security group name to "open ports".
12. To enable ping, select “All ICMP” in the Create a new rule drop-down and click on “Add Rule.”
Do the same to enable HTTP (port 80 & 8000) accesses, then click “Continue.”
13. To allow Hadoop to communicate and expose various web interfaces, we need to open a number of
ports: 22, 9000, 9001, 50070, 50030, 50075, 50060. Again click on “Add Rule” and enable these ports.
Optionally you can enable all traffic. But be careful and don’t share your PEM key or aws credentials
with anyone or on websites like Github.
14. Review: Click “Launch” and click “Close” to close the wizard.
15. Now, to access your EC2 instances, click on "Instances" in the left pane.
16. Select the instance's check box and hit "Launch Instance". (It will take a while to start the virtual
instance; go ahead once it shows "running".)
17. Now click on “connect” for how to SSH in your instance.

PART 2: Installing Apache Hadoop

1. Login to new EC2 instance using ssh
ssh -i aws-key.pem ubuntu@54.183.30.236

2. Login as root user to install base packages (java 8)
sudo su
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Note: If you have any other version of Java, it is fine as long as you keep the directory paths proper in the below steps.

3. Check the java version
java -version

4. Download latest stable Hadoop using wget from one of the Apache mirrors.
wget http://www.trieuvan.com/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
tar xzf hadoop-2.6.0.tar.gz

5. Create a directory where Hadoop will store its data. We will set this directory path in hdfs-site.xml.

mkdir hadoopdata

6. Add the Hadoop-related environment variables to your bash file.

vi ~/.bashrc

Copy and paste these environment variables.

export HADOOP_HOME=/home/ubuntu/hadoop-2.6.0
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Save and exit and use this command to refresh the bash settings.

source ~/.bashrc

7. Set up the Hadoop environment for passwordless ssh access. Passwordless SSH configuration is a mandatory installation requirement; it is even more useful in a distributed environment.

ssh-keygen -t rsa -P ''
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

## check password less ssh access to localhost
ssh localhost

#exit from inner localhost shell
exit
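If ssh localhost still prompts for a password, file permissions are the usual culprit; tightening them (an extra step, not in the original guide) normally fixes it:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys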

8. Set the hadoop config files. We need to set the below files in order for hadoop to function properly.

  • core-site.xml
  • hadoop-env.sh
  • yarn-site.xml
  • hdfs-site.xml
  • mapred-site.xml

# go to the directory where all the config files are present (cd /home/ubuntu/hadoop-2.6.0/etc/hadoop)

  • Copy and paste the below configurations in core-site.xml

##Add the following text between the <configuration> tags.

<property>
<name>hadoop.tmp.dir</name>
<value>/home/ubuntu/hadooptmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

  • Copy and paste the below configurations in hadoop-env.sh

# get the java home directory using:

readlink -f `which java`

Example output: /usr/lib/jvm/java-8-oracle/jre/bin/java (note the JAVA_HOME path; use just the base directory, /usr/lib/jvm/java-8-oracle)

##Need to set JAVA_HOME in hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

  • Copy and paste the below configurations in mapred-site.xml

#copy mapred-site.xml from mapred-site.xml.template
cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml

#Add the following text between the <configuration> tags.
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>

  • Copy and paste the below configurations in yarn-site.xml

##Add the following text between the <configuration> tags.
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

  • Copy and paste the below configurations in hdfs-site.xml

##Add the following text between the <configuration> tags.
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property><name>dfs.name.dir</name>
<value>file:///home/ubuntu/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/ubuntu/hadoopdata/hdfs/datanode</value>
</property>

9. Format the HDFS file system via the NameNode. (After installing Hadoop, we have to format the HDFS file system once before it can be used.)
hdfs namenode -format

10. Issue the following commands to start hadoop
cd sbin/
./start-dfs.sh
./start-yarn.sh

#If you have properly done step 6 (the environment variables), you can start Hadoop from any directory. (Note: the user should be the one under which you installed Hadoop.)

start-all.sh

OR you can separately start required services as below:

Name node:
hadoop-daemon.sh start namenode

Data node:
hadoop-daemon.sh start datanode

Resource Manager:
yarn-daemon.sh start resourcemanager

Node Manager:
yarn-daemon.sh start nodemanager

Job History Server:
mr-jobhistory-daemon.sh start historyserver

Once Hadoop has started, point your browser to http://localhost:50070/ (or your EC2 instance's public DNS on port 50070).

hadoop web console

11. Check for the Hadoop processes/daemons running with jps, the Java Virtual Machine Process Status Tool.
jps
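If everything started cleanly, jps should list roughly these daemons (the PIDs here are made up):

12081 NameNode
12236 DataNode
12437 SecondaryNameNode
12594 ResourceManager
12702 NodeManager
12875 Jps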

hadoop start

OR you can check TCP and port details by using

sudo netstat -plten | grep java

hadoop netstat output

The complete set of commands used in the Hadoop CLI is available in the Apache Hadoop documentation.

PART 3: Running a Sample Word Count Program

The MapReduce examples which come along with the hadoop package are located in
hadoop-[VERSION]/share/hadoop/mapreduce. You can run those jars to check whether the Hadoop single-node cluster is set up properly.

1. Get an example data set (any large text file will do) and save it under the name Holmes.txt.

## Go to the directory where Holmes.txt is located and transfer from local machine to remote EC2 instance using:

scp -i ~/aws/aws-key.pem Holmes.txt ubuntu@54.193.37.165:/home/ubuntu

Make sure to change the key path, the IP address, and the target directory to match your own EC2 machine.

2. Now create a directory in Hadoop’s Distributed File System using:

hdfs dfs -ls /
hdfs dfs -mkdir /input

Go to the folder where Holmes.txt is copied and from that folder run the command

hdfs dfs -copyFromLocal Holmes.txt /input

3. Run hadoop-mapreduce-examples-2.4.1.jar as follows:

hadoop jar /home/ec2-user/hadoop-2.4.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount /input /out1

(Note: The installation version is 2.6.0, so the command may vary accordingly. I had Hadoop 2.4 at the time of running this jar)

'/input' is the input directory where the Holmes.txt file is stored, and '/out1' is the directory where the results are stored.

4. To see the results file,

hdfs dfs -cat /out1/part-r-00000
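The output is plain word-and-count text, one pair per line, so ordinary shell tools can rank it; for example, the ten most frequent words:

hdfs dfs -cat /out1/part-r-00000 | sort -k2 -nr | head -10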

PART 4: Running a Hadoop Program on Eclipse IDE

The jar files required to run a Hadoop job depend on the program you are writing.
You can include all the jar files that come with hadoop-2.6.0.tar.gz.

  • Make a new folder in your project and copy paste all the required files into it.
  • Select the Jar files and right click.
  • Select Build Path > add to build path.

Now all the errors will disappear and you can run the hadoop java code that you wrote.

PART 5: Running a Program exported from your Eclipse IDE in Hadoop

1. Right click the project and select export.
2. Select Jar File option and check the option to include sources.
3. Click Finish.
4. You can use scp to transfer file to EC2 or you can use it locally.

To run your job on Hadoop:

hadoop jar /path/to/your/jar/my_project.jar package_name.main_class /hdfs_input /hdfs_output

Note: Make sure you have the required input files on HDFS. (Use hdfs dfs -copyFromLocal ...)

TURN OFF Hadoop

Stop Hadoop by running the following command:

stop-all.sh

OR
Stop individual services by following commands:

stop-dfs.sh
stop-yarn.sh

hadoop stop

Hadoop: Setting up Microsoft Azure HDinsight

Just like AWS has Elastic MapReduce with a pre-installed Hadoop ecosystem, and Google offers Apache Hadoop deployments through its bdutil tooling, Azure has recently launched HDInsight, its own Hadoop provisioning service on the Azure cloud. It is built using the Hortonworks Data Platform (HDP) distribution of Hadoop.

Let’s get started and explore the capabilities it offers.

1. Log in to the manage.windowsazure.com portal, click the HDInsight service, and click on CREATE AN HDINSIGHT CLUSTER.

HD1

2. Select Hadoop and provide details like cluster name, password and the number of Hadoop nodes you want.

HD2

3. If you don't have Azure PowerShell, you can download it from Microsoft's site.

If you are using PowerShell for the first time, you will need to provide your Azure account credentials to PowerShell using this command:

Add-AzureAccount

Next, tell PowerShell the name of the HDInsight cluster you just created, using this command:

Use-AzureHDInsightCluster sagarshadoop

Azure's HDInsight ships some sample data, and the PowerShell CLI has its own way of executing Hadoop-related commands. Here is an example you can try.

Invoke-Hive "select country, state, count(*) as records from hivesampletable group by country, state order by records desc limit 5;"

HD3

4. On the HDInsight screen in manage.windowsazure.com, you will see a button named "Query console". Click on it and provide your HDInsight username and password.

HD4

There are some pre-built solutions, like a Twitter trend analysis on sample data provided by HDInsight. They are pretty simple; just follow the instructions provided, and you can quickly learn to use HDInsight for your application's needs.

The HDInsight service launched only recently and still needs a lot of improvement, but it is very easy to use and has taken away my pain of setting up a Hadoop cluster on my laptop and a multi-node cluster on virtual machines. I am confident that HDInsight has a bright future. Good job, Microsoft.

Getting started with Microsoft Azure Cloud

I have been using Amazon Web Services for over a year now. Mostly I use EC2 to run my apps on a cloud server so that I can share my application URLs with my friends. AWS is great, undoubtedly, but I couldn't help advertising Windows Azure.

The Virtual Machines service provided by Azure makes creating and using virtual cloud servers a piece of cake. Without further ado, let's see how quickly we can spin up a VM.

1. Go to manage.windowsazure.com and login.

Click on Virtual Machines tab and then click on “CREATE A VIRTUAL MACHINE”.

VM1

2. Write the name of the server in "DNS NAME". In the IMAGE section, select the operating system you would like to use. Keep the SIZE low, as you will be charged accordingly. Set the username and password; these will be used to log in to the machine through a terminal. Set the region according to your geographical location to avoid latency. Usually there isn't any major lag and you can choose any region, but I would suggest keeping the VM in a nearby region.

VM2

3. Give it a few minutes and the VM status will change to running. Click on the VM and you will see its details, as shown in the picture. Note the PUBLIC VIRTUAL IP ADDRESS.

VM4

4. Accessing the VM:

If you are using macOS or any Linux distro, you can log in to the Azure VM with this command:

ssh azureuser@104.209.33.218

OR

ssh azureuser@sagar12345.cloudapp.net

It will prompt you for a password. Provide the one which you used while creating the VM.

If you are using Windows, you can download Azure PowerShell from Microsoft's site.

For first-time PowerShell users, you will have to add your Azure account details into PowerShell using this command:

Add-AzureAccount

The next step is to connect to your VM. Use the commands mentioned above for macOS and Linux to securely log in to the machine.

VM3

That's it, simple enough. I hope you have fun and launch a couple of applications on the Azure VM.

Note that there are a ton of commands provided by Azure PowerShell to make cloud infrastructure management easier. You can get the details of those commands from the Azure documentation.

OpenStack installation on Virtual Machines

For those of you who are new to OpenStack, I recommend reading its wiki page and checking out its website. This blog describes the easiest way (according to me) to install OpenStack. Like many of you checking this blog, I too wanted to install OpenStack just for fun and learning purposes. I have tried various ways, like RDO and Devstack, and on cloud providers like AWS, Google Cloud, and Microsoft Azure. I had some luck with Devstack, but it doesn't provide all the services by default, and I found it very difficult to install OpenStack the other ways. Luckily I found an easy way to install OpenStack: the Mirantis way! Mirantis has tools and scripts for setting up OpenStack. In this blog we will be installing OpenStack on VirtualBox VMs using Mirantis OpenStack. Note that this method of installation requires a laptop with a minimum of 8GB RAM and a good processor configuration. I am pretty confident that following these simple steps will save you a ton of time and headache.

1. Download the latest VirtualBox, along with its extension pack, from the VirtualBox site.

2. Download the Mirantis scripts (look for the link named "Virtualbox scripts" below the Download button on the Mirantis site).

3. Download the Mirantis OpenStack ISO. Extract the VirtualBox scripts and put the Mirantis ISO file in the ISO folder inside the scripts directory.

4. Check the configurations in config.sh. They have a 3-node cluster setup for laptops with 8GB RAM and a 5-node cluster for laptops with 16GB RAM. You can edit these configurations according to your needs, but I would suggest not modifying them if you are just trying to learn OpenStack. Then run the launch script:

./launch.sh

5. Select the USB install option when the VM named fuel-master starts. It takes around 15 minutes to set up the Docker containers inside the VM and install the necessary packages. You will repeatedly see the console message: EXT4-fs (dm-8): Mounted filesystem with ordered data mode. Opts:

6. A Fuel setup screen will appear after some time, as shown in the picture below.

mid

If you don't have any idea of your network configuration, leave everything as it is and click Quit Setup > Save and Exit. For those who want to experiment with the networking settings, these shell commands will come in handy.

  • hostname (the part before the first dot is the host, the rest is the domain)
  • cat /etc/resolv.conf (will give you a list of nameservers; pick any one for the 'default gateway' part)
  • route -n get default (will show the default gateway used for internet access; this is the macOS/BSD syntax, on Linux use ip route show default)

After this, the installation of Fuel will begin and continue for about 15 minutes. Once the installation is complete, you should see the Fuel master VM up and running as shown below.

OS1

Within a minute or two, our launch.sh script will create three more VirtualBox VMs (the Fuel slaves). The Fuel master will PXE-boot those VMs to install CentOS on them. Below is a picture of this activity in progress.

OS2

7. When the Fuel server is ready, point your browser to http://10.20.0.2:8000/ and provide the username and password set during installation, or the default username and password shown by Fuel on the console. Now create a new environment. The pictures below show the configurations for my environment.

OS3 OS4 OS5 OS6 OS7

8. Configuring the nodes: click the add nodes option and assign the roles as follows.

node 1: Controller
node 2: Compute
node 3: Cinder

Although you can assign any services to any nodes, I would recommend the above configuration for simple usage and fast installation. After adding the nodes, your screen might look like this. DO NOT DEPLOY YET!!!!

OS8

9. Settings tab: go to the Settings tab. Here we need to fill in the SSH key and the NTP server (if you did not set it during installation). Copy the SSH public key from:

cat ~/.ssh/id_rsa.pub

If you do not have an SSH key, check out how to generate one. Add any one of the NTP servers from the NTP pool project at http://www.pool.ntp.org. Now click on Deploy Changes.

OS9

10. The whole deployment process may take anywhere from 45 minutes to 2 hours. Some warning messages might appear. Here are a few of the error messages I faced and how I dealt with them.

  • Node status went to 'OFFLINE': wait for some time. The node is probably rebooting; it will come back to installation status after a while.
  • Too many nodes failed to provision: wait for some time, then select that node and click on Deploy Changes. It should continue with its installation.

OS10

  • Timeout error: in this case, my guess is that the node needs to be redeployed. Just select the node and click on Deploy Changes. It will start the installation process again and you should be good to go.

11. Once the installation is complete, the Fuel UI will show the URI at which Horizon is hosted. It should look like this.

OS12

Note: The status will show 100% and stay like that for a while. Wait till the success message arrives and the status above changes to Operational.

OS13

Congrats, you have set up OpenStack with all of its services. Go play!