Thursday, February 7, 2013

My struggles with the Beast

The Intro

A while back, a good friend of mine gave me a book on Hadoop: Hadoop in Action. We were working together on a project in Malaysia that involved collecting, storing and processing big data in real time, and I became very interested in the concepts.

When I returned to Sydney I bought a beefy machine to use as my Hadoop test bed: an Intel Core i7-3930K processor with 32 GB of RAM, an ASRock Extreme6 motherboard, and a heatsink the size of a small child.

The night before I bought the machine I googled processors that would handle a bunch of virtual machines well. After stumbling around online I found that the 3930K was a great CPU, and with the heatsink I could potentially overclock this beast to around 4.5 GHz (its stock clock speed is 3.2 GHz). Apparently people have even reached 5.0 GHz with my setup.


Anyway, I digress ... The purpose of this blog is to document the great fun I had setting up the environment. My goal so far is a simple 6-machine cluster: 1 name node, 1 job tracker, 1 secondary name node, and 3 nodes that each run a data node and a task tracker.

My fun with the VMware Hypervisor

I decided to use the free VMware ESXi hypervisor for my VM needs; I figured if I'm going to use a hypervisor I might as well use the best one, right? As I started mucking around with it, I realised how difficult it is to set up. Each step of the way was painful.

Installation was smooth. No problems at all. I started the baby up expecting it to whiz into action. And it just sat there ... Turns out my ethernet card is not supported. Following this blog I managed to take the ISO I had, add an updated ethernet driver TO THE ISO, reinstall ESXi and hey presto it works. Why there is no USB support, CD support, or any support on the ESXi box itself is beyond me :/ (Apparently USB is supported via this link but I never got it working ... EVER).

So finally, with ESXi installed and the network ... well, working, I sat down in the comfy chair I'd set up in front of my awesome monitor, in front of the beefy computer I'd just managed to start ... only to find out that you can't actually use the host directly. You have to connect to the VMs running on it remotely; the host itself is a bare-bones Linux box that doesn't even have a web browser! My bad really; I mean, this is data-centre grade enterprise software, you don't really expect people to sit in front of such a beast...

Having started it up and poked around a bit, I realised ... I have no idea how to use this baby. So after mucking around with it and Google for a while, I finally figured out that you have to:
1. Have Windows so you can access the web console to actually do anything useful (huh?)
2. Realise that the web console is actually just a page containing a bunch of links to yet more software to download before you can actually use the thing. That software, by the way, is Windows-only ... which meant my glorious little XP box became my most important gateway to the beefy machine, and its 9 GB of hard drive space was nowhere near enough to do anything ... SIGH

Anyway, I continued, I battled, and finally installed it. It took FOREVER to transfer anything to that machine (this was the fault of my network really ... but come on, USB 3.0 support maybe??). I had a few VMs that I wanted to try out on this brand new hypervisor; each one took around 30 minutes to transfer. When I started them up .... BAM, Invalid Disk Type 7 error. Turns out disks from VMware Workstation, VirtualBox and VMware Player, even when converted to the VMDK format, are not supported directly by ESXi. You have to use the steps described here to fix it.
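From what I can tell, the fix boils down to re-importing the disk with vmkfstools so it ends up in a format ESXi actually understands. I didn't keep the exact steps from that link, so treat this as a sketch (the paths are examples only):

# Run on the ESXi host over SSH; converts the imported disk to a thin-provisioned ESXi disk
vmkfstools -i /vmfs/volumes/datastore1/imported/myvm.vmdk /vmfs/volumes/datastore1/myvm/myvm.vmdk -d thin

After the conversion you point the VM at the new vmdk, and the Invalid Disk Type error should go away.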

So finally, with a working Hadoop VM, I decided to start cloning the machines. SSH in, a simple cp command, and ... the 25 GB file took 2 hours to clone. I had another 20 to do, so .... something was seriously wrong!!! Apparently the cp command in ESXi is HORRIBLE for copying VM disks; this is a known issue, and the solution is to use vmkfstools (which is an incredible swiss army knife of awesome).

For example, if I wanted to copy a bunch of virtual machines, I would use the following command:

vmkfstools -i /vmfs/volumes/esxpublic/testvm2.vmdk /vmfs/volumes/production/testvmnew2.vmdk

This clones the first vmdk to the location of the second. I ran this on the ESXi host itself, over SSH.

For those interested, ESXi keeps its datastores under the directory /vmfs/volumes/<datastore name>.

So I finally got the VMs up and running, and things started to look good. I had 20 nice little virtual machines, and a script I could use to create more if I wanted. The next thing should be easy, right? Configure Hadoop as a cluster. That's when things got really bad :/

The Joys of Configuring Hadoop

Firstly, I decided to set up hostnames for all the machines. This involves configuring the /etc/hosts file and changing the hostname on every single machine. Since I didn't have access to a proper DNS server, I had to hard-code each host. I made sure that my router would always hand the same IP address to the same host (by reserving each VM's MAC address), but it was still annoying.
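To make that concrete, the /etc/hosts entries look something like this (the same block goes on every machine; the addresses and the data node names here are just examples, only hadoop-namenode and hadoop-jobtracker actually appear in my configs below):

# /etc/hosts - example addresses
192.168.1.10    hadoop-namenode
192.168.1.11    hadoop-secondarynamenode
192.168.1.14    hadoop-jobtracker
192.168.1.21    hadoop-datanode1
192.168.1.22    hadoop-datanode2
192.168.1.23    hadoop-datanode3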

So then I followed the guide in the Hadoop in Action book (chapter 2) to deploy a cluster. I wanted 1 name node, 1 job tracker, 1 secondary name node, and 3 data nodes (which also run the task trackers). I made sure that the name node could SSH to any other node without needing to supply a password (just follow this guide). The job tracker also needs to be able to SSH to the task trackers without passwords.
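The password-less SSH part is just the usual public-key setup; a minimal sketch as the hadoop user (the data node hostnames are the example ones from my /etc/hosts above):

# On the name node (repeat on the job tracker):
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
for host in hadoop-datanode1 hadoop-datanode2 hadoop-datanode3; do
    ssh-copy-id "$host"
done
ssh hadoop-datanode1 hostname    # should not prompt for a password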

Before I go further, I suggest that you create a couple of scripts (a rough sketch follows):
1. A script that runs a command on multiple machines, one at a time (using ssh <hostname> 'command', e.g. ssh 192.168.1.1 'ls -lrt')
2. A script that copies a file to multiple machines (using scp, e.g. scp file <host>:<path>)
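Mine were nothing fancier than this sort of thing (the hostnames are the example ones above, and there's no error handling):

#!/bin/bash
# cluster-run.sh - run a command on every machine, one at a time
# usage: ./cluster-run.sh 'ls -lrt'
HOSTS="hadoop-namenode hadoop-secondarynamenode hadoop-jobtracker hadoop-datanode1 hadoop-datanode2 hadoop-datanode3"
for host in $HOSTS; do
    echo "== $host =="
    ssh "$host" "$1"
done

#!/bin/bash
# cluster-copy.sh - copy a file to the same path on every machine
# usage: ./cluster-copy.sh /path/to/file
HOSTS="hadoop-namenode hadoop-secondarynamenode hadoop-jobtracker hadoop-datanode1 hadoop-datanode2 hadoop-datanode3"
for host in $HOSTS; do
    scp "$1" "$host:$1"
done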

I would also suggest creating a symbolic link on each machine to the configuration files. That way, when you change a file once, you can use these scripts to quickly deploy it to all the other machines. I found that Hadoop needs all the configurations to be the same, and having those scripts saved my sanity.

These configuration files are found in $HADOOP_HOME/conf. The files I changed were:
1. core-site.xml: I simply added the property fs.default.name, and set it to hdfs://<namenode>:9000

<property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-namenode:9000</value>
</property>

2. mapred-site.xml: I added the job tracker location to this file. The property is mapred.job.tracker:


<property>
    <name>mapred.job.tracker</name>
    <value>hadoop-jobtracker:9007</value>
</property> 

I saw a lot of sites suggest putting hdfs:// in front; I think it works fine if you just supply the host (or IP address) and port.


3. hdfs-site.xml: dfs.replication was changed to 3, so the system replicates each block of every file 3 times across the cluster.


<property>
    <name>dfs.replication</name>
    <value>3</value>
</property> 

After all that was configured and deployed to all the machines, I ran hadoop namenode -format on the name node machine and then ./start-all.sh in $HADOOP_HOME/bin. 

Something important to note: hadoop namenode -format will ask if you are sure you want to format. Enter a capital Y, not a lowercase y, otherwise it will just say the format was aborted and you'll think (like me) that it was because of some error in the system.
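Assuming $HADOOP_HOME is set and you're logged in as your hadoop user, the whole sequence on the name node is just:

$HADOOP_HOME/bin/hadoop namenode -format    # answer the prompt with a capital Y
$HADOOP_HOME/bin/start-all.sh               # starts the HDFS and MapReduce daemons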

Worked?? Not at all! I got a whole bunch of Permission denied errors:

mkdir: cannot create directory `/var/log/hadoop/spuri2': Permission denied
chown: cannot access `/var/log/hadoop/spuri2': No such file or directory
/home/spuri2/spring_2012/Hadoop/hadoop/hadoop-1.0.2/bin/hadoop-daemon.sh: line 136: /var/run/hadoop/hadoop-spuri2-namenode.pid: Permission denied
head: cannot open `/var/log/hadoop/spuri2/hadoop-spuri2-namenode-gpu02.cluster.out' for reading: No such file or directory
localhost: /home/spuri2/.bashrc: line 10: /act/Modules/3.2.6/init/bash: No such file or directory
localhost: mkdir: cannot create directory `/var/log/hadoop/spuri2': Permission denied
localhost: chown: cannot access `/var/log/hadoop/spuri2': No such file or directory

(The output above is representative; I'd deleted my own logs, so this example is borrowed from Stack Overflow.)

Well, it turns out that in the conf directory there is a file called hadoop-env.sh. I had to set HADOOP_LOG_DIR to a directory my hadoop user actually had permission to write to. Originally I thought that if I just used sudo I could start Hadoop. Boy was I wrong. What happened was that it tried to start all the other components of the cluster as root, and pushing on as root just went from bad to worse! I ended up having to connect to each machine and manually kill all the Java processes...

So after I changed the log directory, I restarted the cluster and got errors complaining about being unable to write the PID file under /var/run/hadoop:


starting namenode, logging to /var/log/hadoop/hduser/hadoop-hduser-namenode-sepdau.com.out
/usr/sbin/hadoop-daemon.sh: line 136: /var/run/hadoop/hadoop-hduser-namenode.pid: No such file or directory
localhost: mkdir: cannot create directory `/var/run/hadoop': Permission denied


Turns out I had to point HADOOP_PID_DIR at a directory my hadoop user could write to as well ...
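The two overrides in conf/hadoop-env.sh end up looking something like this (pick any directories your hadoop user owns; the paths here are just an illustration):

# conf/hadoop-env.sh (excerpt)
export HADOOP_LOG_DIR=/home/hadoop-user/hadoop/logs
export HADOOP_PID_DIR=/home/hadoop-user/hadoop/pids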

After changing that and deploying the changes to the rest of the cluster, I finally started it using ./start-all.sh. Everything started except my job tracker ... I kept getting this error:


2013-02-07 09:37:20,302 FATAL org.apache.hadoop.mapred.JobTracker: java.net.BindException: Problem binding to hadoop-jobtracker/192.168.1.14:9007 : Cannot assign requested address
at org.apache.hadoop.ipc.Server.bind(Server.java:225)
at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:296)
at org.apache.hadoop.ipc.Server.<init>(Server.java:1393)
at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:505)
at org.apache.hadoop.ipc.RPC.getServer(RPC.java:466)

No matter what I did, the error persisted. I scoured the web trying to find the solution, and finally found information on this website explaining that I should start the cluster in two phases. (The gist, as far as I understand it, is that start-all.sh launches the job tracker on whichever machine you run it from, and that machine can't bind to the hadoop-jobtracker address.)
- Run start-dfs.sh on the name node to start the name node, secondary name node, and data nodes.
- Then SSH to the job tracker machine and run start-mapred.sh, which starts the job tracker and task trackers. The response on that website explains it quite well.
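In command form, the two-phase start looks roughly like this:

# Phase 1 - run on the name node: starts the name node, secondary name node and data nodes
$HADOOP_HOME/bin/start-dfs.sh

# Phase 2 - run on the job tracker machine: starts the job tracker and task trackers
$HADOOP_HOME/bin/start-mapred.sh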


So it all worked?! Not at all ... for some reason I got a bunch of errors from the data nodes complaining about mismatched namespace IDs:


2013-02-07 10:04:39,181 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-hadoop-user/dfs/data: namenode namespaceID = 398961412; datanode namespaceID = 1497962745
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:341)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:258)

Fortunately I found the solution on this guy's website. After that, I started the name node and the job tracker, then visited the pages of each of the data nodes, and after a while it finally worked!!!
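I didn't keep the exact steps, but the usual fix (and, I believe, what that page describes; this is my reconstruction) is to either make the data node's namespaceID match the name node's, or wipe the data node's data directory and let it start fresh:

# On each affected data node; the data directory is whatever dfs.data.dir points at
# (/tmp/hadoop-hadoop-user/dfs/data is the default shown in my error above)
# Option 1: edit current/VERSION and set namespaceID to the name node's value (398961412 for me)
vi /tmp/hadoop-hadoop-user/dfs/data/current/VERSION
# Option 2: if the node holds nothing you care about, just wipe it
rm -rf /tmp/hadoop-hadoop-user/dfs/data/*
# ... then restart the data node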

I uploaded a bunch of files into the Hadoop file system (the listing below is from hadoop fs -lsr /):
-rw-r--r--   3 hadoop-user supergroup     674566 2013-02-07 10:20 /experiments/book1-outlineofscience.txt
-rw-r--r--   3 hadoop-user supergroup    1423801 2013-02-07 10:20 /experiments/book2-leodavinci.txt
-rw-r--r--   3 hadoop-user supergroup    1573150 2013-02-07 10:20 /experiments/book3-gutenberg.txt

The 3 in the second column of that listing is the replication factor: each block of each file is stored 3 times across the cluster. Pretty sweet eh?
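For completeness, getting the files in there in the first place is just a couple of fs commands, something like this (assuming the text files are sitting in your working directory):

hadoop fs -mkdir /experiments
hadoop fs -put book1-outlineofscience.txt /experiments/
hadoop fs -put book2-leodavinci.txt /experiments/
hadoop fs -put book3-gutenberg.txt /experiments/
hadoop fs -lsr /experiments    # verify the upload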

Important Stuff

BTW, before this, I was getting a lot of messages in the log files saying:

java.io.IOException: File /tmp/hadoop-hadoop-user/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1376)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:539)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)

This means you've messed up your configuration and no data nodes have registered with the name node. If you visit your name node's status page you might see that you have no data nodes; before I got the whole thing working flawlessly, the page was showing 0 data nodes.

Use the following URLs to visit the nodes (borrowed from this website); replace namenode, jobtracker and datanode with the IP or host of the respective servers.
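These are the stock Hadoop 1.x web UI ports, in case you haven't changed the defaults:

http://namenode:50070/     - HDFS name node status (live/dead data nodes, capacity)
http://jobtracker:50030/   - job tracker status and running jobs
http://datanode:50075/     - individual data node status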

So I finally have a working Hadoop cluster! The next steps I want to try are:
- Do something useful with it
- Add some more data nodes and task trackers to it
- Try out some more advanced configuration options (like Kerberos security).

I'd also like to try other tools that work with Hadoop. But for now, I'm just admiring my little cluster and its 3 data nodes. I think I'll start mucking around with some live data now :)

