Defining Big Data

That’s great, except that it is self-contradictory!  5 out of the 7 things the author considers “big data” are not only susceptible to manipulation by SQL (in a well-designed database of course), but have representation on Vertica’s “paying customers” list.  Twitter is a customer (and I thank them for their ability to simultaneously give us props and jabs in venues like ACM SIGMOD).  We hardly ever lose in the CDR space (or any xDR, really).  Zynga has petabytes of what could be loosely described as “web logs” and “social data” stored in Vertica.  The evidence list becomes long and boring quite quickly, so I won’t get into how the other 2 out of 7 “Big Data” areas are, as presented, too nebulous to decide one way or the other.

I can’t claim to have a definitive definition of “Big Data”.   But I can tell you that for each meaningful result (such as a sale of a product), there are many website clicks made, and for each “click” there are many impressions (things that are presented to the user but not clicked).  If you want to analyze these things, and maybe run some tests where you try experiments on your customers and prospects to see what does the best job of reeling them in, you will strain the the abilities of single-machine processing, “traditional” RDBMSs, and many statistics packages and algorithms (yes, including your trusty Excel spreadsheet).  Then there is machine-generated data.  The handful of self-aware devices you own (your desktop PC, laptop, GPS-enabled smart phone, smart meter, car, refrigerator, etc.) have much more free time with which to generate “Big Data” than you do.  (For example, a fair-sized cable MSO has a “sensor network” with 20 million networked devices that never rest, producing 10+ billion rows a day.)

So now that the definition of “Big Data” is as clear as mud, let us next consider structured vs. unstructured data.  I have heard people say that “web logs are unstructured data”.  This is outright silly.  The average web log is entirely machine-generated one line at a time, and to do any serious analysis on it you are going to have to parse it into a format with some amount of structure (at least the date and time, session, page, etc.).  Sure, it can be stored as “unstructured data” in your favorite file system, but that’s a simple matter of procrastination on the issue of writing (or finding that someone has already written a parser.  On the other hand, Twitter data, with its rigid 140-character limit is quite “structured”, but figuring out what it “means” is nonetheless a respectable challenge.

So instead, I would implore you to consider “machine information” versus “human information”.  It is probably true that, byte for byte, there is 10x more “human” information.  The slide deck our sales guys use takes up 10x more space on disk than the spreadsheet that contains the funnel and prospect list.  Viral videos probably (I am not privy to the hard data) consume 10x more disk space than the IPDRs and web logs recording their accesses.

But while disk size is a fun, quotable metric, it says little about the nature of the “Big Data” problem you have to solve.  Instead, consider whether you have “machine” or “human” information.  You should be able to tell at a glance, and this will help you choose the right tools.  If it’s machine-generated financial trades, a scale-out SQL database with some time series analytics extensions will do nicely.  If it’s the tweets of Twitter twits, you can use a structured data tool, but you’re going to need some (in a very loose sense of the word) natural language sentiment analysis and graph processing packages.  If it is a bunch of PDFs, Word documents, HTML pages, PowerPoint presentations, and so on scattered across eleventeen different web servers, CMS systems, and file systems in your enterprise, you may need a high-powered “human information” system like Autonomy (and if you are an intelligence agency that needs to watch every video broadcast worldwide to gauge international sentiment, I think they can help you with that too…).

There is a point to all of this.  I can’t claim to have invented this tag line, but I wish I had.  You should “Know SQL” before you assume you should “NoSQL”.  While “Big Data” couldn’t have lived in an SQL database as they existed 10 years ago, we have different kinds of SQL databases now, that are “web scale”, high performance, designed for analytic workloads, cost effective, and so on.  It’s no longer, as a tech visionary in our own back yard recently said, “just a box that can’t keep up”.  If you have “Big Data” upon which structure can be imposed, analytic databases are very efficient, flexible, easy to use, and come with all the features people have come to expect from 30+ years of development.  (Try one.  We have a free, community download!)

Let’s start with an easy question.  What’s “Big Data”?  Fortunately, I read the answer to that in one of my favorite tech news sources just the other day:  The answer, for those who won’t bother with the link, is “Big data is any data that  doesn’t fit well into tables and that generally responds poorly to manipulation by SQL” … “A Twitter feed is Big Data; the census isn’t. Images, graphical traces, Call Detail Records (CDRs) from telecoms companies, web logs, social data, RFID output can all be Big Data. Lists of your employees, customers, products are not.”