building a virtualized 5-node HDP 2.0 cluster (all within a mac)
This blog posting's content was originally on the Build a Virtualized 5-Node Hadoop 2.0 Cluster wiki page, but it just made sense to refactor it into a blog posting based on the short shelf-life it has due to changes in the ever-evolving HDP stack.
This write-up is designed to capture the steps required to stand up a 5-node HDP2 (Hortonworks Data Platform) Hadoop 2.0/YARN cluster (with 2 master nodes & 3 worker nodes) running on CentOS 6.4 – all executing within VirtualBox. Of course, you'll need a beefy host machine to run all of these 5 guest machines within. To help me out, the good folks at Hortonworks outfitted me with a MacBook Pro that has a 2.3 GHz i7 processor, 16GB of ram, and a 500GB SSD. Yes... I know... I'm lucky!
HDP 2.0.9.0 was utilized for this tutorial. Documentation for the most current version of HDP is available at http://docs.hortonworks.com/.
When we are done, we will have a cluster like the visualization below.
- 1 First Master Node
- 1.1 Minimal Installation using DHCP
- 1.2 Configure Static IP
- 1.3 Hadoop Installation Prep
- 1.3.1 SSH Setup
- 1.3.2 Disable Key Security Options
- 1.3.3 Setup NTP Service
- 1.3.4 Flush Networking Rules
- 1.4 Pre-Configure Hosts
- 1.4.1 Update Hosts File
- 1.4.2 Set the Hostname
- 1.5 Create Node Appliance
- 2 Remaining Nodes
- 3 Install Cluster via Ambari
First Master Node
After you get VirtualBox setup, then we need to get a base install of CentOS going that we can use as a baseline. We can then tweak it for the other nodes. As we'll be doing a "Minimal Install", we only need the Disk1 ISO which you can get here.
Minimal Installation using DHCP
As mentioned above, some of this is an "explore as you go" exercise and I surely don't want to give you screenshots of every single step of setting up CentOS to run within a VirtualBox VM. Fortunately, there are good examples out there in the wild and some quick google searching yields some good links. For this exercise, I suggest taking a look at the write-up at http://webees.me/installing-centos-6-4-in-virtualbox-and-setting-up-host-only-networking/. It is not perfect (and the author, like me, leaves some things out), but it should get you going. The page at http://extr3metech.wordpress.com/2012/10/25/centos-6-3-installation-in-virtual-box-with-screenshots/ could also be reviewed prior to starting this step.
Here are some specific items that need to be addressed during the setup (again, above-n-beyond the basics identified in the earlier tutorials) that are focused on our need to build a multi-node Hadoop cluster.
In the VirtualBox UI:
Name the host 5N-HDP2-M1. More on the naming convention later.
Set the memory to 2GB and create a 20GB (dynamically allocated) hard drive.
For the Network options, set Adapter 1 to use NAT and set Adapter 2 to use the Host-only Adapter identified as vboxnet0 which will allow the host OS to access the CentOS VM. We'll move this second adapter from DHCP to static later.
During the CentOS installation setup:
When it is time to select the Hostname, use "m1.hdp2" (again, more on naming convention later).
On that same screen, click the Configure Network button to edit each of the two listed "System ethX" network connections and select the Connect automatically checkbox for both.
For simplicity's sake (i.e. were going to end up with a lot of passwords!) just use "hadoop" for the root password.
Stick with the default "Minimal installation" type.
If all goes well, then you'll end up with a CLI-only Linux box that you can now log into.
CentOS release 6.4 (Final)
Kernel 2.6.32-358.el6.x86_64 on an x86_64
m1 login: root
Password:
[root@m1 ~]# _At this point, we're using DHCP so run ifconfig to see what IP address is being used (my eth1 connection reported 192.168.56.102, but yours may vary). Then use a terminal program from your host to make sure you can log into the newly created virtual machine. And once there, ping an external server to verify it can reach back out.
hw10653:~ lmartin$ ssh root@192.168.56.102
root@192.168.56.102's password:
Last login: Thu Jan 16 15:34:32 2014 from 192.168.56.1
[root@m1 ~]#
[root@m1 ~]# ping www.google.com
PING www.google.com (74.125.227.113) 56(84) bytes of data.
64 bytes from dfw06s16-in-f17.1e100.net (74.125.227.113): icmp_seq=1 ttl=63 time=46.0 ms
64 bytes from dfw06s16-in-f17.1e100.net (74.125.227.113): icmp_seq=2 ttl=63 time=43.2 ms
^C
--- www.google.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1531ms
rtt min/avg/max/mdev = 43.230/44.615/46.001/1.401 ms
[root@m1 ~]# Configure Static IP
Edit /etc/sysconfig/network-scripts/ifcfg-eth1 to make the following changes:
Make sure
ONBOOTis set toyes(will be if you selected the Connect automatically checkbox during installation and mentioned above)Change
NM_CONTROLLEDfromyestonoChange
BOOTPROTOfromdhcptononeAdd the following lines
IPADDR=192.168.56.41NETMASK=255.255.255.0
Remove the lines that start with the following
UUIDHWADDR
Edit /etc/sysconfig/network-scripts/ifcfg-eth0 and remove the line starting with UUID. Run rm /etc/udev/rules.d/70-persistent-net.rules to flush out the existing network settings and then issue a shutdown -r now to reboot the machine.
After the box is restarted try to ping 192.168.56.41 (the virtual machine "m1" that we just built), ssh to it, and once there ping back to the guest host (the vboxnet0 identifies it as 192.168.56.1) to verify it is routable in & out. Also, verify that you can still ping addresses in the wild.
hw10653:~ lmartin$ ping 192.168.56.41
PING 192.168.56.41 (192.168.56.41): 56 data bytes
64 bytes from 192.168.56.41: icmp_seq=0 ttl=64 time=0.297 ms
64 bytes from 192.168.56.41: icmp_seq=1 ttl=64 time=0.345 ms
^C
--- 192.168.56.41 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.297/0.321/0.345/0.024 ms
hw10653:~ lmartin$
hw10653:~ lmartin$ ssh root@192.168.56.41
root@192.168.56.41's password:
Last login: Sat Jan 18 03:28:27 2014 from 192.168.56.1
[root@m1 ~]#
[root@m1 ~]# ping 192.168.56.1
PING 192.168.56.1 (192.168.56.1) 56(84) bytes of data.
64 bytes from 192.168.56.1: icmp_seq=1 ttl=64 time=0.201 ms
64 bytes from 192.168.56.1: icmp_seq=2 ttl=64 time=0.230 ms
^C
--- 192.168.56.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1428ms
rtt min/avg/max/mdev = 0.201/0.215/0.230/0.020 ms
[root@m1 ~]#
[root@m1 ~]# ping www.google.com
PING www.google.com (74.125.227.115) 56(84) bytes of data.
64 bytes from dfw06s16-in-f19.1e100.net (74.125.227.115): icmp_seq=1 ttl=63 time=48.0 ms
64 bytes from dfw06s16-in-f19.1e100.net (74.125.227.115): icmp_seq=2 ttl=63 time=105 ms
^C
--- www.google.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1481ms
rtt min/avg/max/mdev = 48.053/76.825/105.598/28.773 ms
[root@m1 ~]# Hadoop Installation Prep
Before we create the additional nodes using this one as the base to start from, there are some additional steps we should take so we won't have to do them for each VM. Later in this document we will utilize the Ambari documentation to setup our cluster, but there are some items within that documentation that are useful at this time.
SSH Setup
First, run yum install openssh-clients and then use ssh localhost to verify it is working.
Next, follow the instructions at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_using_Ambari_book/content/ambari-chap1-5-2.html to allow password-less SSH connections.
[root@m1 ~]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
23:4a:e9:9f:c9:ea:91:1f:9b:bf:98:e2:16:af:0c:24 root@m1.hdp2
The key's randomart image is:
+--[ RSA 2048]----+
| |
| |
| |
| . |
|E . o . S |
| o o.o . . |
| . =o. |
| oo=.O |
| +*+@.o. |
+-----------------+
[root@m1 ~]# cd .ssh
[root@m1 .ssh]# ls -l
total 12
-rw-------. 1 root root 1675 Jan 18 04:15 id_rsa
-rw-r--r--. 1 root root 394 Jan 18 04:15 id_rsa.pub
-rw-r--r--. 1 root root 391 Jan 18 04:09 known_hosts
[root@m1 .ssh]#
[root@m1 .ssh]# cat id_rsa.pub >> authorized_keys
[root@m1 .ssh]# ls -l
total 16
-rw-r--r--. 1 root root 394 Jan 18 04:17 authorized_keys
-rw-------. 1 root root 1675 Jan 18 04:15 id_rsa
-rw-r--r--. 1 root root 394 Jan 18 04:15 id_rsa.pub
-rw-r--r--. 1 root root 391 Jan 18 04:09 known_hosts
[root@m1 .ssh]#
[root@m1 .ssh]# chmod 600 authorized_keys
[root@m1 .ssh]# ls -l
total 16
-rw-------. 1 root root 394 Jan 18 04:17 authorized_keys
-rw-------. 1 root root 1675 Jan 18 04:15 id_rsa
-rw-r--r--. 1 root root 394 Jan 18 04:15 id_rsa.pub
-rw-r--r--. 1 root root 391 Jan 18 04:09 known_hosts
[root@m1 .ssh]#
[root@m1 .ssh]# ssh localhost
Last login: Sat Jan 18 04:09:59 2014 from localhost
[root@m1 ~]# Lastly, create a file named config in /root/.ssh that contains StrictHostKeyChecking no as the only line.
[root@m1 ~]# cd /root/.ssh
[root@m1 .ssh]# echo 'StrictHostKeyChecking no' >> config
[root@m1 .ssh]# cat config
StrictHostKeyChecking no
[root@m1 .ssh]# Disable Key Security Options
Security-Enhanced Linux (SELinux) is a "good thing" and I'm not recommending you not use it in your "real" environments, but for our purposes let's just disabled it. To do this simply edit /etc/selinux/config and change the SELINUX value from enforcing to disabled.
The Linux firewall, iptables, is also another "good thing", but let's make our environment easier to install to by disabling it. Run the following commands.
[root@m1 ~]# chkconfig iptables off
[root@m1 ~]# chkconfig ip6tables offReboot and make sure the following output is seen to ensure iptables are not running at system startup.
[root@m1 ~]# service iptables status
iptables: Firewall is not running.
[root@m1 ~]# Setup NTP Service
The clocks on all nodes in the cluster will need to be in-sync and the NTP service will take care of this for us. Execute yum install ntp to enable this service. Then type the following commands.
[root@m1 ~]# chkconfig ntpd on
[root@m1 ~]# service ntpd start
Starting ntpd: [ OK ]
[root@m1 ~]# Flush Networking Rules
Perform the following as the final step (as a final flush of networking rules) in our preparation of a base VM to help create the other nodes.
[root@m1 ~]# cd /etc/udev/rules.d
[root@m1 rules.d]# rm 70-persistent-net.rules
rm: remove regular file `70-persistent-net.rules'? y
[root@m1 rules.d]# Pre-Configure Hosts
When finished we will have the following 5 hosts in our cluster.
VirtualBox Name | OS Hostname | IP Address |
|---|---|---|
5N-HDP2-M1 | m1.hdp2 | 192.168.56.41 |
5N-HDP2-M2 | m2.hdp2 | 192.168.56.42 |
5N-HDP2-W1 | w1.hdp2 | 192.168.56.51 |
5N-HDP2-W2 | w2.hdp2 | 192.168.56.52 |
5N-HDP2-W3 | w3.hdp2 | 192.168.56.53 |