Wednesday, February 20, 2013

Moving on ...

After a few weeks of working with Hadoop, setting up my little cluster, and running a bunch of MapReduce jobs on it, I finally got too frustrated with how slow everything was and gave up.

My goal was to implement a data warehouse solution on bare-bones Hadoop. I guess that use case just doesn't map to what Hadoop offered, which was essentially a giant distributed file system.

I tried putting tools such as Hive and HBase on top of the cluster, but none of it gave me the performance I wanted: the ability to process data in real time as it arrived.


When I say I'm giving up on it, what I really mean is that I'm giving up on using Hadoop and MapReduce to process my data. I've been experimenting with the other libraries on offer, and Spark and Shark look quite promising (1 TB in 7 seconds, anyone?). It's still not up to the speed I want, but it's definitely faster than what I had originally.

I'll keep this blog going with updates on how I'm doing with the Spark/Shark combo. I've heard some good things about it, though it's very bleeding edge. I guess only time will tell whether it matures into a competitive product.

I've also been looking at Cloudera's solution (Impala), but it's not ready yet. Apparently they are planning a GA release in early April 2013. That's also rather promising; they too gave up on MapReduce and opted for a much more efficient model.

Here are some slides on the topic:
(Cloudera) http://www.slideshare.net/cloudera/cloudera-impala-a-modern-sql-engine-for-apache-hadoop

(Spark + Shark) http://www.slideshare.net/Hadoop_Summit/spark-and-shark

