The problems in Hadoop – When does it fail to deliver?

Hadoop is a great piece of software. It is not original, but that certainly does not take away its glory. It builds on parallel processing, a concept that’s been around for decades. Although conceptually unoriginal, Hadoop shows the power of being free and open (as in beer!) and, most of all, shows what usability is all about. It succeeded where most other parallel processing frameworks failed. So, now you know that I’m not a hater. On the contrary, I think Hadoop is amazing. But that does not excuse some blatant failures on Hadoop’s part, be they architectural, conceptual or even documentation-wise. Hadoop’s popularity should not shield it from the need to re-engineer and rework the problems in its implementation. The points below are based on months of exploring and hacking around Hadoop. Do dig in.

  1. Did I hear someone say “Data Locality”?

    Hadoop harps on and on about data locality. In some workshops conducted by Hadoop milkers, they just went on and on about this. They say that whenever possible, Hadoop will attempt to start a task on a block of data that is stored locally on that node via HDFS. This sounds like a super feature, doesn’t it? It saves so much bandwidth by not having to transfer TBs of data, right?

    Hellll, no. It does not. What this means is that first you have to figure out a way of getting data into HDFS, the Hadoop Distributed File System. This is non-trivial, unless you live in the last decade and all your data exists as files. Assuming that you do, let’s transfer the TBs of data over to HDFS. Now, it will start doing its whole “data locality” thing.

    Ermm, OK. Am I hit by a wave of brilliance, or isn’t that what it’s supposed to do anyway? Let’s get our facts straight. To use Hadoop, our problem should be able to execute in parallel. If the problem, or at least a sub-problem, can’t be parallelized, it won’t gain much from Hadoop. This means the task algorithm is independent of any specific part of the data it processes. Simplifying further: any task can process any section of the data. So, doesn’t that make “data locality” the obvious thing to do? Why would the Hadoop developers even write code that makes a task process data on another node, unless something goes horribly wrong? It would be a feature if it did otherwise! If a task that has finished operating on the node’s local data would then transfer data from another node and process that too, that would be a feature worthy of the conundrum. At least that would be worthy of the noise.
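    To see why, here is a minimal word-count mapper sketch using the standard Hadoop MapReduce API (the class is an assumed, illustrative example, not taken from the post). Nothing in the task logic refers to where its input split physically lives; placing the task on the node that already holds the block is entirely the framework’s business, which is why doing so is the natural default rather than a headline feature.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count mapper: the logic is identical no matter which
// HDFS block (on whichever node) the framework hands it as input.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1) regardless of data placement
            }
        }
    }
}
```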

  2. Would you please put everything back into files?

    Do you have nicely structured data in databases? Maybe you got a bit fancy and used the latest and greatest NoSQL data store? Now let me write down what you are thinking: “OK, let’s get some Hadoop jobs to run on this, because I want to find all those hidden gold mines in my data that will get me on the front page of Forbes.” I hear you. Let’s get some Hadoop jobs rolling. But wait! What the…? Why are all the samples in text files? A plethora of examples using CSV files, tab-delimited files, space-delimited files, and all other kinds of neat files. Why is everyone going back a few decades and using files again? Haven’t all these guys heard of DBs and all that fancy stuff? It seems you were too early an adopter of data stores.

    Files are the heroes of the Hadoop world. If you want to use Hadoop quickly and easily, the best path for you right now is to export your data neatly into files and run all those snazzy word count samples (pun intended!). Because without files, Hadoop can’t do all that cool “data locality” shit. Everything has to be in HDFS first.

    So, what would you do to analyze your data in the hypothetical FUHadoopDB? First of all, implement the 10+ classes necessary to split and transfer data into the Hadoop nodes and run your tasks. Hadoop needs to know how to get data from FUHadoopDB, so let’s assume this is acceptable. Now, if you don’t store the data in HDFS, you won’t get the data locality shit. In that case, when the tasks run, they themselves will have to pull data from FUHadoopDB to process it. But if you want the snazzy data locality shit, you need to pull data from FUHadoopDB and store it in HDFS first. You will not incur the penalty of pulling data while the tasks are running, but you pay it at the preparation stage of the job, in the form of transferring the data into HDFS. Oh, and did I mention the additional disk space you would need to store the same data in HDFS? I wanted to save that disk space, so I chose to make my tasks pull data while they ran. The choice is yours.
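    For a rough sense of the plumbing involved, here is a sketch of the kind of classes Hadoop expects before it can read from a custom store. FUHadoopDB is hypothetical, so the method bodies are placeholders; only the Hadoop base classes (InputFormat, InputSplit, RecordReader) are real, and a full implementation would also need its own InputSplit type, a value Writable, and usually an OutputFormat for writing results back.

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Skeleton of a custom InputFormat for the hypothetical FUHadoopDB.
// Hadoop calls getSplits() on the client to partition the job, then
// createRecordReader() on each task to stream records out of one split.
public class FUHadoopDBInputFormat extends InputFormat<LongWritable, Text> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // TODO: ask FUHadoopDB how its data is partitioned and wrap each
        // partition in an InputSplit subclass (which you also have to write).
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        return new RecordReader<LongWritable, Text>() {
            // TODO: open a FUHadoopDB connection for this split, iterate its
            // rows, and surface them via nextKeyValue()/getCurrentValue().
            @Override public void initialize(InputSplit s, TaskAttemptContext c) { }
            @Override public boolean nextKeyValue() { return false; }
            @Override public LongWritable getCurrentKey() { return new LongWritable(0); }
            @Override public Text getCurrentValue() { return new Text(); }
            @Override public float getProgress() { return 0f; }
            @Override public void close() { }
        };
    }
}
```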

  3. Java is OS independent, isn’t it?

    Java has its flaws, but for the most part it runs smoothly on most OSs. Even if there are some OS issues, they can be ironed out easily. The Hadoop folks have written documentation based mostly on Linux environments. They say Windows is supported, but then ignore Windows users by not providing adequate documentation. Windows didn’t even make it into the recommended production environments. It can be used as a development platform, but then you will have to deploy on Linux.

    I’m certainly not a Windows fan. But if I wrote a Java program, I’d bother to make it run on Windows. If not, why the hell are you using Java? Why go through the trouble of coming up with freaking bytecode? Oh, the sleepless nights of all those good people who came up with bytecode and JVMs and what not have gone to waste.

  4. CS 201: Object Oriented Programming

    If you are trying to integrate Hadoop into your platform, think again. Let me take the liberty of typing your thoughts: “Let’s just extend a few interfaces and plug in my authentication mechanism. It should be easy enough. I mean, these guys designed the world’s greatest software that will end world hunger.” I hear you again. If you are planning to do this, don’t. It’s like OOP anti-patterns 101 in there. There are so many places that say “if (kerberos)” and then execute some security-specific function. One of my colleagues went through this pain, and finally decided that it was easier to write Kerberos-based authentication for his own software and then make it work with Hadoop. With great power comes great responsibility. Hadoop fails to fulfil this responsibility.
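    To make the complaint concrete, here is a purely illustrative sketch (not actual Hadoop code; every class and interface name below is invented) of the difference between hard-wiring one authentication mechanism into the call sites and exposing a seam that an integrator can plug their own mechanism into.

```java
// Hard-wired style: every call site branches on one blessed mechanism.
// (Illustrative only; the names are invented for this example.)
class HardWiredServer {
    boolean kerberosEnabled;

    void handleRequest(Request request) {
        if (kerberosEnabled) {
            // Kerberos-specific ticket validation in-line...
        } else {
            // ...and a second code path for "simple" auth.
        }
    }
}

// Pluggable style: the server depends on an abstraction, so a platform
// integrating it can supply its own mechanism without patching the core.
interface Authenticator {
    boolean authenticate(Request request);
}

class PluggableServer {
    private final Authenticator authenticator;   // injected by the integrator

    PluggableServer(Authenticator authenticator) {
        this.authenticator = authenticator;
    }

    void handleRequest(Request request) {
        if (!authenticator.authenticate(request)) {
            throw new SecurityException("authentication failed");
        }
        // ...proceed with the request...
    }
}

class Request { /* placeholder for the example */ }
```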

Even with these issues, Hadoop keeps attracting significant attention, and rightfully so. Its ability to commoditize big data analytics should be exalted. But it’s my opinion that it got way too popular way too fast. The Hadoop community needs to have another go at revamping this great piece of software.

9 Comments

  1. Hadoop is open source, and so everybody can have OPEN opinions, but here are some facts that need to be clarified before accusing Hadoop of these failures:
    1. Did I hear someone say “Data Locality”? – This means that the processing/computing code is literally pushed to the data node, rather than the traditional approach of pulling data from an RDBMS to compute nodes. “Data locality” has nothing to do with the statelessness of data mentioned in the critique.
    2. Would you please put everything back into files? – Hadoop falls into the category of a data warehouse, where data is moved from OLTP sources to an OLAP data warehouse. This process usually involves a lot of de-normalization, and the end state of the data is usually flat, like in a file system. Hadoop of course has tools like Sqoop, which will ensure that RDBMS tables are recreated in HDFS and will even create Hive tables for you, so that you never really have to care whether you are dealing with FILEs.
    3. Java is OS independent, isn’t it? – Hadoop already runs on Linux and Solaris, works on Mac, and is being supported by Microsoft for Windows, so not sure what the concern is! Hadoop’s Java-based implementation is what makes this possible.
    4. CS 201: Object Oriented Programming? – Hadoop offers multiple ways of processing data: Hive as an SQL-like interface, Pig as a scripting interface, Streaming for all major languages, along with the OOP-based Java MapReduce model. Not sure how the security concern impacts the OOP support in Hadoop MapReduce. In fact, the Java MapReduce model just suggests implementing Map and Reduce methods, and everything else relies on the OOP skill of the developer – a good developer would use Java’s OOP strengths, while others can write purely functional code even in Java.

    BTW, Hadoop certainly does have limitations, and that should be expected from a technology that is yet to have a 1.0 release :)


    1. Hi Indoos,

      I should have replied to your comment sooner, but I guess it’s better late than never. Let me address your replies based on the numbered list.

      1. There are two parts to your critique of the first point. Let me break the answer in two.
      i. Pulling data from an RDBMS (or other source) – The data needs to be pulled from the RDBMS either at processing time or before it. So, the traditional approach you mention is exactly what the Hadoop folks have come up with a whole new project, Sqoop, for. The problem still exists; it is just assumed to have been solved beforehand.
      ii. Statelessness not connected to data locality – The statelessness of the data is what allows for the phenomenon advertised as ‘data locality’. It is what lets a task started on a Hadoop data node process the data on that node.
      So, you still have to pull the data from an RDBMS, and you still have to make sure the processing is stateless, so the data can be processed on whichever node it sits on.

      2. Sqoop is a data importer/exporter. Why can’t Hadoop be designed to pull directly from databases? Sqoop is just a name given to a collection of MapReduce jobs that connect to different types of databases. I really don’t see the point of Sqoop being a separate project. But that’s beyond my control 😉

      3. Yes, Hadoop should run on any platform. It should, because it uses Java, and there is another layer called the JVM that shields it from OS specifics. But no one dreams of running Hadoop on a Windows box. It’s just discouraged! I don’t care about Microsoft, but why not write it in C++ then and get better performance? I’m positive you could still achieve decent portability.

      4. Hive and Pig are not an excuse for Hadoop to follow bad coding practices. That code base should allow anything from security to the scheduling algorithm to be plugged in, and it should be designed to do so. It takes a while for someone to even understand the build structure. It is acting like a closed-source project where only companies involved with Hadoop can extend it and plug in code, by hacking the code itself. The code is open source; I really want to see the Hadoop code acting like it’s open source.


    2. Hadoop is so complex that for every simple task there are ten other things to learn, configure and get right. I might as well build my own grid, for example with Quartz/RabbitMQ. The key question to ask is, “Do you need grid computing? Or distributed storage/persistence?”

      For most folks, if you just want to run some computation over a few hundred GBs or TBs of data, you’d be much better off with other options, open source or commercial, or just building your own –

      https://appliedalgo.com/appliedalgoweb/Doc/CompetitiveAnalysis/AppliedAlgo.Scheduler_LoadBalancer.FeatureComparison.htm


  2. First, I appreciate that you’re willing to put yourself out there and question Hadoop despite its popularity – albeit in a slightly rant-ish way. 🙂

    1&2. What Indoos said.

    3. Development is supported on Windows; furthermore, Microsoft supports Hadoop on Windows Azure, which is nice for spinning up a compute cluster and having some fun. Other than that, why would you want full production-ready support for Hadoop on Windows? First, you’re wasting precious node resources on running Windows, and secondly… who wants to pay licensing fees for multi-hundred-node compute clusters? I live on the Microsoft stack, so no hate here – from an engineering standpoint alone, this doesn’t make sense to support in production.

    4. If you’re looking to apply OOP to this kind of problem – or, put another way, you need to solve this kind of problem inside your application – then what you need is MapReduce, not Hadoop. Implementing MapReduce in an application isn’t hard as a parallelization technique.
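    For what it’s worth, here is a minimal sketch of that idea: an in-process “map then reduce” word count using plain Java streams, with no Hadoop involved. It only illustrates the point that the pattern itself is easy to reproduce inside an application.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A tiny in-process "map then reduce": count words in parallel with plain
// Java streams. Illustrative only; it has nothing to do with Hadoop itself.
public class InMemoryWordCount {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "to be or not to be",
                "that is the question");

        Map<String, Long> counts = lines.parallelStream()
                // map phase: split each line into individual word records
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // reduce phase: group identical words and sum their counts
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        counts.forEach((word, count) -> System.out.println(word + " = " + count));
    }
}
```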

    I’d remember what problem Hadoop is solving. It’s not a typical application, in which case many of your points might otherwise be stronger. When it comes to petabytes (or many terabytes) of data and hundreds (or thousands) of commodity machines working in parallel, cross-platform compatibility and OOP principles are not the things I’d weigh when solving this engineering challenge.


    1. Hi b2berry,

      I have been told subtlety is not my forte, so I hope the post is taken on its merits rather than as a rant. Let me try to answer in question order.

      For 1 & 2, please refer to my reply to Indoos.

      3. It doesn’t matter whether you or I have the cash to run on Windows; as a Java project it has to run properly on Windows. I’m glad Microsoft has done it, but has the Apache Hadoop project done it? From an operations standpoint (not just a development standpoint – I consider both to be engineering), all of us can give a million reasons why Windows sucks! But it does not make sense to say you need a Linux box to run a Hadoop node (or to use Azure, as you say).

      4. Your critique makes no sense here. Hadoop is written using OOP, so it had better be good OOP that is extensible in every way. And I completely and vehemently disagree with your point that OOP principles are things Hadoop should not consider. It is an open source project under the Apache brand, and especially because of that it has to be as extensible as possible, thereby following abstractions and design patterns that reflect good software architecture. Take a look at good libraries such as HttpCore (also under Apache), or libssl, or any other good open source project. I haven’t heard of people touching that code unless it’s for a rare bug. And these serve enormous amounts of the network traffic going around the world, at a scale much larger than Hadoop’s.

