Defining Big Data

Let’s start with an easy question.  What’s “Big Data”?  Fortunately, I read the answer to that in one of my favorite tech news sources just the other day: http://www.theregister.co.uk/2012/08/27/how_did_big_data_get_so_big/.  The answer, for those who won’t bother with the link, is “Big data is any data that doesn’t fit well into tables and that generally responds poorly to manipulation by SQL” … “A Twitter feed is Big Data; the census isn’t. Images, graphical traces, Call Detail Records (CDRs) from telecoms companies, web logs, social data, RFID output can all be Big Data. Lists of your employees, customers, products are not.”


That’s great, except that it is self-contradictory!  Five of the seven things the author considers “Big Data” are not only susceptible to manipulation by SQL (in a well-designed database, of course), but have representation on Vertica’s “paying customers” list.  Twitter is a customer (and I thank them for their ability to simultaneously give us props and jabs in venues like ACM SIGMOD).  We hardly ever lose in the CDR space (or any xDR, really).  Zynga has petabytes of what could be loosely described as “web logs” and “social data” stored in Vertica.  The evidence list becomes long and boring quite quickly, so I won’t get into how the other two of the seven “Big Data” areas are, as presented, too nebulous to call one way or the other.


I can’t claim to have a definitive definition of “Big Data”.  But I can tell you that for each meaningful result (such as the sale of a product), there are many website clicks, and for each “click” there are many impressions (things presented to the user but not clicked).  If you want to analyze these things, and maybe run some experiments on your customers and prospects to see what does the best job of reeling them in, you will strain the abilities of single-machine processing, “traditional” RDBMSs, and many statistics packages and algorithms (yes, including your trusty Excel spreadsheet).  Then there is machine-generated data.  The handful of self-aware devices you own (your desktop PC, laptop, GPS-enabled smart phone, smart meter, car, refrigerator, etc.) have much more free time with which to generate “Big Data” than you do.  (For example, a fair-sized cable MSO has a “sensor network” of 20 million networked devices that never rest, producing 10+ billion rows a day.)
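To put rough numbers on that clicks-and-impressions funnel, here is a back-of-the-envelope sketch in Python.  Every rate and volume in it is an assumption I made up purely for illustration; the point is only how fast the row counts multiply:

```python
# Back-of-the-envelope funnel arithmetic.  All rates below are made-up
# assumptions for illustration, not measured numbers.
sales_per_day = 100_000          # "meaningful results"
conversion_rate = 0.02           # assume 2% of clicks end in a sale
click_through_rate = 0.005       # assume 0.5% of impressions get clicked

clicks_per_day = sales_per_day / conversion_rate            # 5,000,000
impressions_per_day = clicks_per_day / click_through_rate   # 1,000,000,000

print(f"clicks/day:      {clicks_per_day:,.0f}")
print(f"impressions/day: {impressions_per_day:,.0f}")
print(f"rows/year:       {impressions_per_day * 365:,.0f}")  # ~365 billion rows
```

A year of impressions at a few hundred bytes a row is tens of terabytes before you have stored a single experiment variant, which is exactly where the single machine and the spreadsheet tap out.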


So now that the definition of “Big Data” is as clear as mud, let us next consider structured vs. unstructured data.  I have heard people say that “web logs are unstructured data”.  This is outright silly.  The average web log is entirely machine-generated, one line at a time, and to do any serious analysis on it you are going to have to parse it into a format with some amount of structure (at least the date and time, session, page, etc.).  Sure, it can be stored as “unstructured data” in your favorite file system, but that’s a simple matter of procrastination on the issue of writing (or finding that someone has already written) a parser.  On the other hand, Twitter data, with its rigid 140-character limit, is quite “structured”, but figuring out what it “means” is nonetheless a respectable challenge.
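To show how little “unstructured” there really is in a web log, here is a minimal parsing sketch in Python.  It is not any particular product’s parser; the log line and field names are my own examples, and real logs need far more defensive handling:

```python
import re
from datetime import datetime

# One machine-generated line in the Apache "combined" log format (example data).
LINE = ('203.0.113.7 - - [27/Aug/2012:10:15:32 -0400] "GET /pricing.html HTTP/1.1" '
        '200 5123 "http://example.com/home" "Mozilla/5.0"')

# A rough pattern for the combined format.
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<page>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse(line):
    """Turn one 'unstructured' log line into a structured record (a dict)."""
    m = PATTERN.match(line)
    if m is None:
        return None                          # malformed line; real code would log it
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    return rec

print(parse(LINE))   # date/time, page, status, agent... the structure was there all along
```

The date and time, the page, the status, the user agent: they were all there the whole time, waiting for a couple of dozen lines of code (or a parser someone else already wrote).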


So instead, I would implore you to consider “machine information” versus “human information”.  It is probably true that, byte for byte, there is 10x more “human” information.  The slide deck our sales guys use takes up 10x more space on disk than the spreadsheet that contains the funnel and prospect list.  Viral videos probably (I am not privy to the hard data) consume 10x more disk space than the IPDRs and web logs recording their accesses.


But while disk size is a fun, quotable metric, it says little about the nature of the “Big Data” problem you have to solve.  Instead, consider whether you have “machine” or “human” information.  You should be able to tell at a glance, and this will help you choose the right tools.  If it’s machine-generated financial trades, a scale-out SQL database with some time series analytics extensions will do nicely.  If it’s the tweets of Twitter twits, you can use a structured data tool, but you’re going to need some (in a very loose sense of the word) natural language sentiment analysis and graph processing packages.  If it is a bunch of PDFs, Word documents, HTML pages, PowerPoint presentations, and so on scattered across eleventeen different web servers, CMS systems, and file systems in your enterprise, you may need a high-powered “human information” system like Autonomy (and if you are an intelligence agency that needs to watch every video broadcast worldwide to gauge international sentiment, I think they can help you with that too…).
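For the curious, here is a deliberately tiny illustration of the tweet case in plain Python.  The tweets, word lists, and counts are invented, and a real system would use proper NLP and graph packages rather than a Counter and a dict; this is only meant to show the flavor of the two problems:

```python
import re
from collections import Counter, defaultdict

# Made-up tweets: (author, text).
tweets = [
    ("@alice", "Loving the new analytics dashboard from @acme! #happy"),
    ("@bob",   "@acme your service was down all morning. Terrible."),
    ("@carol", "RT @alice: Loving the new analytics dashboard from @acme!"),
]

POSITIVE = {"loving", "love", "great", "happy"}   # crude, invented word lists
NEGATIVE = {"terrible", "down", "awful", "hate"}

sentiment = Counter()
mentions = defaultdict(Counter)   # a crude "graph": who mentions whom, how often

for author, text in tweets:
    text = text.lower()
    words = re.findall(r"[a-z']+", text)
    sentiment["pos"] += sum(w in POSITIVE for w in words)
    sentiment["neg"] += sum(w in NEGATIVE for w in words)
    for handle in re.findall(r"@\w+", text):
        if handle != author:
            mentions[author][handle] += 1

print(sentiment)                   # Counter({'pos': 3, 'neg': 2})
print(dict(mentions["@carol"]))    # {'@alice': 1, '@acme': 1}
```

Storing the tweets is the easy part; deciding that “down all morning” is bad news and that @carol amplifies @alice is where the real work (and the loose sense of “natural language”) lives.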


There is a point to all of this.  I can’t claim to have invented this tag line, but I wish I had.  You should “Know SQL” before you assume you should “NoSQL”.  While “Big Data” couldn’t have lived in the SQL databases of 10 years ago, we have different kinds of SQL databases now that are “web scale”, high-performance, designed for analytic workloads, cost-effective, and so on.  It’s no longer, as a tech visionary in our own back yard recently said, “just a box that can’t keep up”.  If you have “Big Data” upon which structure can be imposed, analytic databases are very efficient, flexible, and easy to use, and they come with all the features people have come to expect from 30+ years of development.  (Try one.  We have a free community download!)
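Since “impose structure, then ask questions in SQL” is the whole pitch, here is the smallest possible sketch of that workflow.  It uses SQLite from the Python standard library purely as a stand-in (it is neither Vertica nor remotely “web scale”), and the table, columns, and rows are examples I invented:

```python
import sqlite3

# Impose structure on what was once "just a web log"...
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE page_views (
        ts      TEXT,     -- event time, parsed out of the log
        session TEXT,     -- session id
        page    TEXT,     -- requested page
        status  INTEGER   -- HTTP status
    )
""")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?, ?)",
    [
        ("2012-08-27 10:15:32", "s1", "/pricing.html", 200),
        ("2012-08-27 10:16:01", "s1", "/signup.html",  200),
        ("2012-08-27 10:16:05", "s2", "/pricing.html", 404),
    ],
)

# ...and the questions become easy to ask.
for page, views in conn.execute("""
        SELECT page, COUNT(*) AS views
        FROM page_views
        WHERE status = 200
        GROUP BY page
        ORDER BY views DESC
    """):
    print(page, views)
```

Swap the toy engine for an analytic database built to scale out, and the shape of the work stays the same: structure first, SQL after.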
