Wednesday, February 20, 2013

Moving on ...

After a few weeks of working with Hadoop, setting up my little cluster, and running a bunch of MapReduce jobs on it, I finally got too frustrated with the whole speed thing and gave up.

The goal I had was to implement a data warehouse solution on top of barebones Hadoop. I guess that use case just doesn't map to what Hadoop offers, which is a giant distributed file system.

I tried putting stuff on top of the cluster, such as Hive, HBase, and all that fancy stuff, but it just wasn't giving me the performance I wanted: the ability to process data in real time as it arrived.


When I say I'm giving up on it, what I really mean is that I'm giving up on using Hadoop and MapReduce to process my data. I've been experimenting with the other libraries on offer, and Shark and Spark look quite promising (1 TB in 7 seconds, anyone?). It's still not up to the speed I want, but it's definitely faster than what I had originally.

I'll keep this blog going on how I'm doing with the Shark/Spark combo. I've heard some good things about it, though it's very bleeding edge. I guess only time will tell whether it matures into a competitive product.

I've also been looking at the solution from Cloudera (Impala), but it's not ready yet. Apparently they are planning a GA release in early April 2013. That's also rather promising; they too gave up on MapReduce and instead opted for a much more efficient model.

Here are some slides on the topic:
(Cloudera) http://www.slideshare.net/cloudera/cloudera-impala-a-modern-sql-engine-for-apache-hadoop

(Spark + Shark) http://www.slideshare.net/Hadoop_Summit/spark-and-shark


Friday, February 8, 2013

Hadoop and FQDN

I was mucking around with Hadoop today, trying to get a few more nodes up and running. The tricky bit about deploying new nodes is setting up the node names, the IP addresses, and the sharing of SSH keys.

I wrote a nice script that deploys all the information I need onto each server in one go. This allows me to quickly send commands, copy files, etc. without using passwords. It's great for managing servers.

Anyway, the trouble I had today was trying to figure out why every single one of my task trackers was called localhost. Every task tracker my job tracker was managing, and every data node my name node was managing, was called localhost.

This drove me crazy; try as I might, I couldn't figure it out. I searched the web for answers and finally found it. Apparently Hadoop uses the Fully Qualified Domain Name (FQDN) of the host. If you run hostname -f (or hostname --fqdn for the pedantic) you get the name of the host, and this is what the Hadoop task tracker uses when it starts up.

The tricky thing is that in most cases this resolved to localhost because of the way the /etc/hosts file was set up. 127.0.0.1 mapped to localhost first (in my config anyway), so that's the name the daemons picked up when they started. The simple fix was to move the localhost entry to the end of /etc/hosts, so the host's real name and address come first, and it worked!
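To make that concrete, here's the sort of /etc/hosts layout that bites you, and the reordered version that fixed it for me (the hostname and IP are just examples):

# Problem: looking up this machine's name hits the loopback line first,
# so hostname -f (and therefore the task tracker) reports "localhost"
127.0.0.1      localhost hadoop-datanode1
192.168.1.21   hadoop-datanode1

# Fix: the real address and name come first, localhost goes at the end on its own
192.168.1.21   hadoop-datanode1
127.0.0.1      localhost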

Here is a reference to the post I found online that explained it.

Thursday, February 7, 2013

Common Newbie Pitfalls in Hadoop Java Applications

Just putting down some common issues I ran into while writing Java applications for Hadoop, and how I resolved them.

Class Not Found Exceptions


java.lang.RuntimeException: java.lang.ClassNotFoundException: net.victa.hadoop.example.WordCount$MapClass
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:195)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:593)

It turns out that in your code you have to add the line
job.setJarByClass(WordCount.class);
Otherwise Hadoop and its nodes have no idea which jar contains your classes. I found this on this site (just remove the commented-out line and it will load).

This log message appeared just before the Class Not Found exceptions in my hadoop log:
 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Pretty clear that the job jar wasn't being set, and the solution was pretty simple as well (there's a full driver sketch a little further down).

Mismatch Key from Map

java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
This was a bit more annoying. It turns out that the Hadoop in Action book is a bit outdated, so the code examples in it didn't quite work; a lot of the API it uses is deprecated. This site explained it better, and afterwards I took a copy of the files from that site to make it work.
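Since both of the above come down to getting the driver and the mapper/reducer right under the new org.apache.hadoop.mapreduce API, here is a minimal WordCount sketch of the kind I ended up with. The package and MapClass names match my example; the reducer name and the tokenising logic are just illustrative, so treat this as a sketch rather than gospel:

package net.victa.hadoop.example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The map output types (Text, IntWritable) must line up with what the driver
    // declares below, otherwise you get the "Type mismatch in key from map" error.
    public static class MapClass extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    public static class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");

        // Without this the task trackers have no idea which jar holds your
        // classes (the ClassNotFoundException above).
        job.setJarByClass(WordCount.class);

        job.setMapperClass(MapClass.class);
        job.setReducerClass(ReduceClass.class);

        // These must agree with the Mapper/Reducer generics above.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into Hadoop-WordCount.jar, this is run the same way as the command shown further down (bin/hadoop jar Hadoop-WordCount.jar net.victa.hadoop.example.WordCount /experiments /experiments-output).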

File Does Not Exist and Error Reading Task, Connection Refused

java.io.FileNotFoundException: File does not exist: /experiments/bob
WARN mapred.JobClient: Error reading task outputConnection refused
These issues were just me being silly. I supplied the path /experiments as the word count input directory, and inside it there was another directory called bob. WordCount treated everything it saw in the input directory as a file, and failed when it tried to read bob. So that's another gotcha to be careful of.

Working with Hadoop on a Client Machine

After getting the VMs configured and running, I started mucking around with actually running my very own piece of code, which was simply a copy of the WordCount example. My goal was to run it from a machine external to the name node server.

So how the heck do I do that??

Well, the first thing I did was take a copy of the name node server's installation of Hadoop. That way I was sure I had the same version. Then I modified the contents of the hadoop-env.sh file (in conf) so that it reflected my local machine's settings.

I'm using a MacBook Air, and I have Java installed in the directory /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/
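In practical terms that meant one line in conf/hadoop-env.sh on the laptop; adjust the path to wherever your JDK actually lives:

export JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home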


After carefully configuring everything, I verified it installed correctly by typing:
bin/hadoop fs -ls / 
This returned my file system, which was pretty sweet. I had a minor hiccup: I had used /etc/hosts on each cluster machine to hard-code the IP addresses, but my Mac didn't know how to resolve those names. Updating my Mac's /etc/hosts file was good enough; so far I've just added the name node to the list.


I tried starting my Java application to count words (a simple start, much like a hello world app) and the result was a lot of exceptions:
bin/hadoop jar Hadoop-WordCount.jar net.victa.hadoop.example.WordCount /experiments /experiments-output
Exception in thread "main" org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=<user>, access=WRITE, inode="mapred":hadoop-user:supergroup:rwxr-xr-x
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
The easiest solution I found is to disable permission checking altogether (see this post). Then you can upload whatever files you want and access whatever you want.

I modified the file hdfs-site.xml on the name node server, adding the property dfs.permissions and setting it to false:

hdfs-site.xml:

<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
Obviously this is a bad thing to do in production, but since this is a test and I really wanted to run my little program, I took the path of least resistance :)


My struggles with the Beast

The Intro

A while back, a good friend of mine gave me a book on Hadoop: Hadoop in Action. We were working on a project together in Malaysia that involved collecting big data and storing and processing it in real time, and I became very interested in the concepts.

When I returned to Sydney I bought a beefy machine to use as my Hadoop test bed: an Intel Core i7-3930K processor with 32 GB of RAM, an ASRock Extreme6 motherboard, and a heatsink the size of a small child.

The night before I bought the machine I googled all about processors that would be good for running a bunch of virtual machines. After stumbling around online I found that the 3930K was a great CPU, and that with the right heatsink I could potentially overclock this beast to around 4.5 GHz (its stock clock speed is 3.2 GHz). Apparently people have even reached 5.0 GHz with my setup.


Anyway, I digress ... The purpose of this blog is to document the great fun I had setting up the environment. My goal so far is a simple 6 machines: 1 name node, 1 job tracker, 1 secondary name node, and 3 combined data node / task tracker nodes.

My Fun with the VMware Hypervisor

I decided to use the free VMware ESXi hypervisor for my VM needs; I figured if I'm going to use a hypervisor I might as well use the best one, right? As I started mucking around with it, I realised how difficult it is to set up. Each step of the way was painful.

Installation was smooth, no problems at all. I started the baby up expecting it to whiz into action. And it just sat there ... Turns out my ethernet card is not supported. Following this blog I managed to take the ISO I had, add an updated ethernet driver to the ISO itself, reinstall ESXi, and hey presto, it works. Why there is no USB support, CD support, or any support on the ESXi box itself is beyond me :/ (Apparently USB is supported via this link, but I never got it working ... EVER).

So finally, with ESXi installed and the network ... well, working, I sat down in the comfy chair I'd set up in front of my awesome monitor and the beefy computer I'd just managed to start ... only to find out that you can't really do anything on the host itself. You must connect to the VMs running on it; the host is a barebones box that doesn't even have a web browser! My bad really; this is data-centre-grade enterprise software; you don't really expect people to sit in front of such a beast...

Having started it up and poked around a bit, I realised ... I had no idea how to use this baby. So after mucking around with it and Google for a while, I finally figured out that you have to:
1. Have Windows so you can access the web console and actually do anything useful (huh?)
2. Realise that the web console is actually just a page containing links to yet more software to download (the vSphere Client) before you can actually use the thing. That client is Windows-only ... which meant my glorious little XP box became my most important gateway to my beefy machine, and its 9 GB of hard drive space was nowhere near enough to do anything ... SIGH

Anyway, I continued, I battled, and finally installed it. It took FOREVER to transfer anything to that machine (this was the fault of my network really ... but come on, USB 3.0 support maybe??). I had a few VMs that I wanted to try out on this brand new hypervisor; each took around 30 minutes to transfer. When I started them up ... BAM, an Invalid Disk Type 7 error. Turns out VMware Workstation, VMware Player, and VirtualBox disks, even when converted to the VMware VMDK format, are not supported as-is. You have to use the steps described here to fix it.

So finally, having a working Hadoop installation, I decided to start cloning the machines. SSH, a simple cp command, and the 25 GB file took 2 hours to clone ... I had another 20 to do, so .... something was seriously wrong!!! Apparently the cp command in ESXi is HORRIBLE for this; it's a known issue, and the solution is to use vmkfstools (which is an incredible Swiss Army knife of awesome).

For example, to copy a virtual machine's disk from one datastore location to another, I would use the following command:

vmkfstools -i /vmfs/volumes/esxpublic/testvm2.vmdk /vmfs/volumes/production/testvmnew2.vmdk

This clones the first vmdk to the location of the second vmdk. I ran this on the ESXi server itself.

For those interested, ESXi stores its datastores under the directory /vmfs/volumes/<datastore name>.

So I finally got the VMs up and running, and things started to look good. I had 20 nice little virtual machines, and a script I could use to create more if I wanted. The next thing should be easy, right? Configure Hadoop as a cluster. That's when things got really bad :/

The Joys of Configuring Hadoop

Firstly, I decided to set up hostnames for all the machines. This involves configuring the /etc/hosts file and changing the hostname on every single machine. Since I didn't have access to a proper DNS server, I had to hard-code each host. I made sure that my router would give the same IP address to the same host every time using each VM's MAC address, but it was still annoying.

So then I followed the guide in the Hadoop in Action book (chapter 2) to deploy a cluster. I wanted 1 name node, 1 job tracker, 1 secondary name node, and 3 data nodes (which also run the task trackers). I made sure that the name node could SSH to any other node without needing to supply a password (just use this guide); the job tracker also needs to be able to SSH to the task trackers without passwords.
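In case that guide ever disappears, the gist of passwordless SSH is something like this, run as the hadoop user (the target hostname is just an example):

ssh-keygen -t rsa -P ""                       # generate a key pair with an empty passphrase
ssh-copy-id hadoop-user@hadoop-datanode1      # append the public key to the remote authorized_keys
ssh hadoop-user@hadoop-datanode1 'hostname'   # should now log in without prompting for a password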

Before I go further, I suggest that you create a few scripts (there's a rough sketch just below):
1. A script that sends a command to multiple machines, one at a time (using ssh <hostname> '<command>', e.g. ssh 192.168.1.1 'ls -lrt')
2. A script that copies a file to multiple machines (using scp, e.g. scp <file> <host>:<path>)
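Something along these lines is all it takes; the host names are just my examples, and it assumes the passwordless SSH from above is already in place:

#!/bin/bash
# cluster-run.sh -- run the same command on every node, one at a time
HOSTS="hadoop-namenode hadoop-jobtracker hadoop-secondary hadoop-datanode1 hadoop-datanode2 hadoop-datanode3"
for host in $HOSTS; do
    echo "=== $host ==="
    ssh "$host" "$@"
done

Usage is ./cluster-run.sh 'ls -lrt'; the copy script is the same loop with scp "$1" "$host:$1" in place of the ssh line.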

I would also suggest creating a symbolic link on each machine to the configuration files. That way, when you change them once, you can use these scripts to quickly deploy them to all the other machines. I found that Hadoop needs the configuration to be the same everywhere, and having those scripts saved my sanity.

These configuration files are found in $HADOOP_HOME/conf. The files I changed were:
1. core-site.xml: I simply added the property fs.default.name, and set it to hdfs://<namenode>:9000

<property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-namenode:9000</value>
</property>

2. mapred-site.xml: I added the Job Tracker location to this file. The property is mapred.job.tracker.


<property>
    <name>mapred.job.tracker</name>
    <value>hadoop-jobtracker:9007</value>
</property> 

I saw a lot of sites suggest putting hdfs:// in front, but it works if you just supply the host (or IP address) and port.


3. hdfs-site.xml: dfs.replication was changed to 3, so the system replicates each file's blocks 3 times across the cluster.


<property>
    <name>dfs.replication</name>
    <value>3</value>
</property> 

After all that was configured and deployed to all the machines, I ran hadoop namenode -format on the name node machine and then ./start-all.sh from $HADOOP_HOME/bin.

Something important to note: hadoop namenode -format will ask if you are sure you want to format. Enter the capital letter Y, not lowercase y, otherwise it will just say the format was aborted and you'll think (like me) that it was because of some error in the system.

Worked?? Not at all! I got a whole bunch of Permission denied errors:

mkdir: cannot create directory `/var/log/hadoop/spuri2': Permission denied
chown: cannot access `/var/log/hadoop/spuri2': No such file or directory
/home/spuri2/spring_2012/Hadoop/hadoop/hadoop-1.0.2/bin/hadoop-daemon.sh: line 136: /var/run/hadoop/hadoop-spuri2-namenode.pid: Permission denied
head: cannot open `/var/log/hadoop/spuri2/hadoop-spuri2-namenode-gpu02.cluster.out' for reading: No such file or directory
localhost: /home/spuri2/.bashrc: line 10: /act/Modules/3.2.6/init/bash: No such file or directory
localhost: mkdir: cannot create directory `/var/log/hadoop/spuri2': Permission denied
localhost: chown: cannot access `/var/log/hadoop/spuri2': No such file or directory

(It was sort of like the above; I deleted the logs I got, so these are from a similar report on Stack Overflow.)

Well, it turns out that in the conf directory there is a file called hadoop-env.sh, and I had to point the directories configured in there at somewhere my user had permission to write. Originally I thought that if I just used sudo I could start Hadoop. Boy was I wrong. What happened was that it tried to start all the other components of the cluster as root. I supplied the root credentials and that just went from bad to worse! I ended up having to connect to each machine and manually kill all the Java processes...

So after I changed that configuration, I restarted the cluster and got errors complaining about being unable to write to the pid directory under /var/run:


starting namenode, logging to /var/log/hadoop/hduser/hadoop-hduser-namenode-sepdau.com.out
/usr/sbin/hadoop-daemon.sh: line 136: /var/run/hadoop/hadoop-hduser-namenode.pid: No such file or directory
localhost: mkdir: cannot create directory `/var/run/hadoop': Permission denied


Turns out I had to point HADOOP_PID_DIR at somewhere my hadoop user could write to as well ...
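For what it's worth, HADOOP_PID_DIR lives in conf/hadoop-env.sh, and the log location can be moved the same way with HADOOP_LOG_DIR if you hit the /var/log/hadoop permission errors above. The paths below are just examples of somewhere the hadoop user can write:

# in conf/hadoop-env.sh -- both directories must be writable by the user running the daemons
export HADOOP_LOG_DIR=/home/hadoop-user/hadoop/logs
export HADOOP_PID_DIR=/home/hadoop-user/hadoop/pids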

So, after changing that and deploying the changes to the rest of my cluster, I finally started it using ./start-all.sh. Everything started except my job tracker ... I kept getting this error:


2013-02-07 09:37:20,302 FATAL org.apache.hadoop.mapred.JobTracker: java.net.BindException: Problem binding to hadoop-jobtracker/192.168.1.14:9007 : Cannot assign requested address
at org.apache.hadoop.ipc.Server.bind(Server.java:225)
at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:296)
at org.apache.hadoop.ipc.Server.<init>(Server.java:1393)
at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:505)
at org.apache.hadoop.ipc.RPC.getServer(RPC.java:466)

Regardless of what I did, the error stayed. I scoured the web trying to find the solution and finally found information on this website, which explained that I should start things in two phases:
- Run start-dfs.sh to start the name node, secondary name node, and the data nodes.
- Then SSH to the job tracker and run start-mapred.sh, which starts the job tracker and task trackers. That response on the website explains it quite well.
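So the start sequence ended up being roughly this (run from $HADOOP_HOME, with hadoop-jobtracker being my job tracker's hostname):

bin/start-dfs.sh          # on the name node: starts the name node, secondary name node and data nodes
ssh hadoop-jobtracker     # hop over to the job tracker box
bin/start-mapred.sh       # on the job tracker: starts the job tracker and task trackers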


So it all worked?! Not at all ... for some reason I got a bunch of errors complaining about mismatched namespace IDs from the data nodes:


2013-02-07 10:04:39,181 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-hadoop-user/dfs/data: namenode namespaceID = 398961412; datanode namespaceID = 1497962745
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:341)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:258)

Fortunately I found the solution on this guy's website. After that, I started the name node and the job tracker, then visited the status pages of each of the data nodes, and after a while it finally worked!!!
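I won't reproduce his post here, but the commonly cited fix boils down to wiping the data directory on each complaining data node so it re-registers with the name node's new namespaceID. Only do this on a throwaway cluster with no data you care about; the path is the one from the error above:

# stop the cluster (or just the affected data node), then on that node:
rm -rf /tmp/hadoop-hadoop-user/dfs/data
# on restart, the data node recreates the directory with the matching namespaceID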

I uploaded a bunch of files into the Hadoop File System:
(the command I ran was: hadoop fs -lsr /)
-rw-r--r--   3 hadoop-user supergroup     674566 2013-02-07 10:20 /experiments/book1-outlineofscience.txt
-rw-r--r--   3 hadoop-user supergroup    1423801 2013-02-07 10:20 /experiments/book2-leodavinci.txt
-rw-r--r--   3 hadoop-user supergroup    1573150 2013-02-07 10:20 /experiments/book3-gutenberg.txt

The 3 in the directory listing is the number of times each file's blocks are replicated across the system. Pretty sweet, eh?

Important Stuff

BTW, before this, I was getting a lot of messages in the log files saying:

java.io.IOException: File /tmp/hadoop-hadoop-user/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1376)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:539)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)

This means that you messed up your configuration. If you visit your name node's status page you might see that you have no data nodes; before I got the whole thing working, the page was showing 0 data nodes.
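A quick way to check the same thing from the command line (this just reports cluster status):

bin/hadoop dfsadmin -report     # shows configured capacity plus the list of live and dead data nodes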

Use the following URLs to visit the nodes (borrowed from this website); replace namenode, jobtracker, and datanode with the IP or host of the respective servers.
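For the record, the default web UI addresses in Hadoop 1.x are:

http://namenode:50070/        (HDFS / name node status)
http://jobtracker:50030/      (MapReduce / job tracker status)
http://datanode:50075/        (individual data node status)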

So I finally have a working Hadoop Cluster! The next steps I want to try are:
- Do something useful with it
- Add some more data nodes and task trackers to it
- Try out some more advanced configuration options (like Kerberos security).

I'd also like to try to use other tools that work with Hadoop. But for now, I'm just admiring my 3 node cluster. I think I'll start mucking around with some live data now :)