How to be Big Data-native?

Big data has spawned a set of tools that deliver results beyond the buzz. It has started delivering real insights for companies, which result in more effective decisions.

When middleware natively supports big data, big data becomes more than just another option. It becomes the default. Let’s examine this idea:

  1. Big Data Storage

When it comes to storage, almost everyone thinks of an RDBMS: MySQL, PostgreSQL, MariaDB, Oracle, etc. If you move to native Big Data support, you turn to NoSQL options that sacrifice SQL for scale. You start storing everything in Cassandra, HBase, MongoDB, Redis, etc. You no longer have to dread the day your data volume becomes too big to handle. When it does, all you do is configure a new node, maybe tune the cluster a bit, and you are done.
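To see why "configure a new node" can be that cheap, here is a minimal sketch of consistent hashing, the kind of key placement that stores such as Cassandra are built around. The `HashRing` class, the node names and the `vnodes` parameter are made up for illustration; real stores add replication, rebalancing and failure handling on top.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring (hypothetical, for illustration only)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node gets several virtual positions to spread keys evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first position at or after the key's hash.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # the "configure a new node" step
after = {k: ring.node_for(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved to a different node")
```

Only the keys that now belong to the new node change owner; everything else stays where it was, which is what keeps growing the cluster from turning into a full data migration.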

  2. Big Data Analytics

If you are not looking at analytics, you should be (Google “innovate or die”). All analytics should support big data. If you were thinking about running some SQL or spreadsheet macros, it’s time to move on. Start thinking Hadoop, BigQuery, Drill, etc. Natively supporting Big Data analytics allows anyone who sets up the middleware to instantly enable their departments and teams to harvest their data silos, no matter how big they are.

  3. Big Data Speed

There is no point in Big Data storage and Big Data analytics if you can’t collect big data fast enough. At web scale it is not unusual to see millions of transactions a second. Before middleware vendors start advocating Big Data, they need to make sure the middleware itself is up to par. Computers are definitely fast enough for even a single node to handle around 10,000 TPS.
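A quick back-of-the-envelope calculation puts those two numbers side by side. This is only a sketch: it assumes throughput scales linearly with node count, which real clusters rarely achieve, and the `headroom` parameter is a hypothetical safety margin, not something any particular product defines.

```python
import math


def nodes_needed(target_tps: int, per_node_tps: int, headroom: float = 0.0) -> int:
    """Nodes required to absorb target_tps, with optional spare capacity."""
    if per_node_tps <= 0:
        raise ValueError("per_node_tps must be positive")
    return math.ceil(target_tps * (1 + headroom) / per_node_tps)


print(nodes_needed(1_000_000, 10_000))        # 100
print(nodes_needed(1_000_000, 10_000, 0.25))  # 125
```

So a million transactions a second at around 10,000 TPS per node means on the order of a hundred nodes, which is why the middleware has to scale out natively rather than as an afterthought.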

The biggest concern I’ve seen is the many companies that claim to be able to adapt to Big Data without actually supporting it natively. It is a nightmare to scale for Big Data after choosing incompatible technologies. The concepts differ, the trade-offs are different, and the effectiveness is very low. Big data has become important enough for middleware evaluators to add another section to their RFPs. And yes, that section is whether the middleware is “Big Data native”.

BAM, SOA & Big Data

Leveraging Big Data has become a commodity for most IT departments. It’s like the mobile phone. You can’t remember the times when you couldn’t just call someone from your mobile, no matter where you were in the world, can you? Similarly, IT folks can’t remember the days when files were too big to summarize, or grep, or even just store. Set up a Hadoop cluster and everything can be stored, analyzed and made sense of. But then I asked myself: what if the data is not stored in a file? What if it is all flying around in my system?

[Figure: Deployment]

Shown above is a fairly common production SOA deployment. Let’s briefly summarize what each server does:

  • An ESB cluster fronts all the traffic and does some content-based routing (CBR).
  • Internal and external app server clusters host apps that serve different audiences.
  • A Data Services Server (DSS) cluster exposes database operations as a service.
  • A BPS cluster coordinates a number of processes between the ESB, one app server cluster and the DSS cluster.
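Content-based routing is the least self-explanatory item in that list, so here is a minimal sketch of the idea: the router looks inside the message to decide where it goes. The message fields, the routing rules and the backend names are all hypothetical; a real ESB expresses this in its own configuration language rather than in Python.

```python
import json


# Hypothetical routing table: (predicate on the parsed message, backend name).
ROUTES = [
    (lambda m: m.get("audience") == "internal", "internal-app-cluster"),
    (lambda m: m.get("type") == "data-query", "dss-cluster"),
    (lambda m: m.get("type") == "workflow", "bps-cluster"),
]
DEFAULT_BACKEND = "external-app-cluster"


def route(raw_message: str) -> str:
    """Pick a backend by inspecting the message body, not just its URL."""
    message = json.loads(raw_message)
    for predicate, backend in ROUTES:
        if predicate(message):
            return backend
    return DEFAULT_BACKEND


print(route('{"type": "data-query", "table": "orders"}'))    # dss-cluster
print(route('{"type": "page-view", "audience": "public"}'))  # external-app-cluster
```

The first matching rule wins, so rule order matters; anything that matches no rule falls through to the default backend.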

Hard to digest? Fear not. It’s a complicated system that would serve a lot of complex requirements while enhancing re-use, interoperability and all other good things SOA brings.

Now, in this kind of system, whether it’s SOA enabled or not, there lies a tremendous amount of data. And no, it is not stored as files. It is transferred between your servers and systems. Tons and tons of valuable data go through your system every day. What if you could excavate this treasure of data and make use of all the hidden gems to derive business intelligence?

The answer is Business Activity Monitoring (BAM-ing), which involves aggregating, analyzing and presenting data. SOA and BAM were always a love story. As system functions were exposed as services, monitoring these services meant you were able to monitor the whole system. Most of the time, if the system architects were smart, they used open standards, which made plugging in and monitoring systems even easier.
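The aggregation step can be sketched as follows: events are folded into per-service counters as they fly by, one time window at a time. The `TumblingWindowStats` class, the event fields and the sample values are made up for illustration and are not taken from any particular BAM product.

```python
from collections import defaultdict


class TumblingWindowStats:
    """Per-service request count and mean latency in fixed time windows."""

    def __init__(self, window_seconds: int = 60):
        self.window_seconds = window_seconds
        # (window_start, service) -> [count, total_latency_ms]
        self._buckets = defaultdict(lambda: [0, 0.0])

    def record(self, timestamp: float, service: str, latency_ms: float) -> None:
        # Fold the event into its window; the raw message is not retained.
        window_start = int(timestamp // self.window_seconds) * self.window_seconds
        bucket = self._buckets[(window_start, service)]
        bucket[0] += 1
        bucket[1] += latency_ms

    def summary(self):
        """Return {(window_start, service): (count, mean_latency_ms)}."""
        return {
            key: (count, total / count)
            for key, (count, total) in self._buckets.items()
        }


stats = TumblingWindowStats(window_seconds=60)
events = [  # (timestamp in seconds, service, latency in ms), sample data
    (0.5, "esb", 12.0),
    (10.0, "esb", 8.0),
    (59.9, "dss", 40.0),
    (61.0, "esb", 10.0),
]
for ts, service, latency in events:
    stats.record(ts, service, latency)

for key, value in sorted(stats.summary().items()):
    print(key, value)
```

Note the trade-off: this keeps only aggregates and throws the raw messages away, which is cheap but loses detail.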

But even with BAM, it was impossible to capture every message and every request that passed through the server. The data growth alone would be tremendous for a fairly active deployment. So here we have a Big Data problem, but not a typical one: it concerns live data. To make full sense of the data that passes through modern systems and derive intelligence from it, we need a Business Activity Monitor that is Big Data ready.

Now, a system architect has to worry about BAM, SOA and Big Data together, as they are essentially intertwined. A solution that delivers anything less falls well short of visionary.

Do you trust Google BigQuery with your Big Data?

Google has come up with a fantastic service to analyze large amounts of data. It’s called BigQuery and it allows you to run analysis on big data in the cloud. As expected, the tool has a superb, intuitive web UI. The data analysis language uses SQL-like queries (Hive, anyone 😉). Have a look at the BigQuery tutorial; it looks pretty neat. So, now all you need to do to run queries is to upload your data to Google using the form shown below. It allows you to upload a file or point to one in Google’s cloud storage.

Now, the interesting question is this: to analyze data using BigQuery, how much of that data are you willing to give Google? And how long will that take? The answer won’t be “Let me quickly upload a 500 GB file and run some queries”. That amount of data would definitely take some time to upload. So, effectively, this SaaS becomes pretty useless as the volume of data that has to be uploaded for analysis grows.
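To put a rough number on “some time”, here is a small calculation. The uplink speeds are assumptions picked for illustration, and the sketch ignores protocol overhead, retries and any throttling, so real uploads would be slower.

```python
def upload_hours(size_gb: float, uplink_mbps: float) -> float:
    """Hours to upload size_gb (decimal gigabytes) at uplink_mbps (megabits/s)."""
    if uplink_mbps <= 0:
        raise ValueError("uplink_mbps must be positive")
    megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return megabits / uplink_mbps / 3600


for mbps in (10, 100, 1000):  # assumed uplink speeds
    print(f"{mbps:>5} Mbit/s: {upload_hours(500, mbps):.1f} h")
```

Under these assumptions the 500 GB file needs roughly 111 hours at 10 Mbit/s and still about 11 hours at 100 Mbit/s, before a single query has run.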

Everyone trusts Google (🙂), so this concern might be easily ignored. But another potential problem I see is that privacy policies may be violated. Usually, the data you want to analyze can contain sensitive information such as user behavior patterns. How comfortable will your customers be if you hand that data over to Google? Even anonymizing the data might not save you from a potential legal breach.

I still believe setting up your own data analysis and monitoring platform is the best way to go. Thoughts? I’d love to hear them.