What is 'Big Data,' anyway? Authors of a new book try to explain

Rob Walker

“Big data” has become a really big buzz-phrase — tossed around in conversations about everything from business to surveillance; cited as a tool to improve driving, hiring, understanding dogs, and everything else; and, inevitably, dismissed as a bunch of hype.

But what exactly is big data, anyway? Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schönberger and Kenneth Cukier, offers an answer. Their book is a wide-ranging assessment of “the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value.” And while they acknowledge that the term itself has become amorphous, they frame their subject pretty clearly: “Big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more.”

That (not to mention the book’s subtitle) might sound a little hype-y, but Big Data is fairly even-handed: Early chapters explore the hope and potential around the way massive information sets are being created and mined, but later ones are clear about risks, pitfalls, and dangers. Mayer-Schönberger is Professor of Internet Governance and Regulation at the Oxford Internet Institute / Oxford University; Cukier is “data editor” for The Economist. Their book raised a few questions for me — so I asked the authors. Here’s what they said.

I'd like to start toward the end: One of your later chapters examines "the dark side of big data," and among other things you note concerns about privacy and the possibility of using "big-data predictions" to in effect penalize people for behavior they seem likely to engage in, but haven't. You even mention the NSA at one point. So I wonder what you've made of the debate about more recent surveillance revelations related to the agency: There's a lot of focus on the collection of the data, for instance, but should we be talking about how it's analyzed?

Kenneth Cukier: The question draws an excellent distinction — one that's sadly missing from the debate. The disclosures have been mostly about the collection and not the use of the data. And when intelligence agencies explain how they work with the data, the method seems oddly old-school: targeted surveillance, not too different from the days of alligator clips atop copper wires. Of course, we're probably not being told the whole story; they may actually be running massive statistical regressions across all the data, hunting for patterns they didn't know to look for in advance. That's what Facebook and LinkedIn data scientists would do with it. But we haven't yet seen evidence that this is what the NSA is doing.

That said, the collection alone is troubling because it is happening with insufficient oversight. And the goal of intelligence is to prevent bad things from happening — it's about prediction. As we lay out in the book, this may be troubling when people are penalized for what they only have a propensity to do, not for what they've done. So we have to be very careful in using this ability, especially as it improves and becomes more established.

You make a compelling case about the limitations of sampling (as opposed to more comprehensive big data approaches) and how we've come to accept it perhaps more than we should. But among the examples you mention is voter intent. It's not like there's a comprehensive database of who everyone intends to vote for, is there? How does big data actually provide an alternative here? Isn't there a distinction between what we want to measure and what we can measure?

Cukier: Actually, there is a database of every voter and their intentions. Both major parties contract with different data providers, loosely affiliated with the parties, to tap databases of all Americans. The first variables are whether the person is registered to vote and whether he or she actually cast a ballot in the most recent election. The Democrats in 2012 had an internal database of every voter in America and asked three questions of it: Do you support Obama; are you likely to vote; and if you are undecided, are you persuadable? By ranking people based on that last measure, the Dems could know where to best spend their advertising budget for maximum impact.

Big data was critical: sampling works well for basic questions like what candidate a person supports. But it's less useful when you want to drill down into the granular — like what candidate Asian-American women with college degrees support. To do that, you may need to give up your sample and go for it all.

Yet the broader point is correct: there is a difference between what we want to measure and what we can measure. And we need to be on guard that we don't confuse the two. For example, in the Vietnam War, the Pentagon used the metric of the body count as a way to measure progress, when that data wasn't really meaningful to what they wanted to depict. Sadly, I fret this fallibility is something that we'll just have to learn to live with, as we have in so many other domains.

Many of your examples involve scrutinizing data that already exists (including instances where it's mined for reasons that have nothing to do with why it was gathered), but I was very interested to learn about "datafication" that involves setting out to collect new information in new ways: For instance, UPS "datafying" its vehicle fleet by gathering mechanical information that predicts and minimizes breakdowns. This almost seems like a distinct category to me. Do you think of it as a fundamentally different form of big data?

Viktor Mayer-Schönberger: It is tempting to be dazzled by the many new types of data that are being collected — from engine sensors in UPS vehicles, to heart rates in premature babies, to human posture. But that is how datafication works in practice: at first we think it is impossible to render something in data form, then somebody comes up with a nifty and cost-efficient idea to do so, and we are amazed by the applications that this will enable, and then we come to accept it as the new normal. A few years ago, this happened with geo-location data, and before that it was web-browsing data (gleaned through cookies). It is a sign of the continuing progress of datafication.

You're right that datafication is fundamentally different from big data. For example, the 19th-century American navigator Commodore Maury, who invented tidal maps, datafied the logbooks of past sea voyages by extracting information about the wind and waves at a given location. But we can get the most out of big data today because so many new elements of our lives are being rendered into data form, which was extremely hard to do in the past.

You emphasize that making the most of big data means we have to "shed some of [our] obsession for causality in exchange for simple correlations: not knowing why but only what." This means breaking from the tradition of coming up with a hypothesis and testing it: It doesn't matter whether we can explain a correlation that big data reveals, we should just act on it. That's a big shift! I'm curious if when you're out talking about the book whether you get a lot of resistance to that idea, because it seems crucial to what you call the "big data mindset."

Mayer-Schönberger: Yes, we do encounter resistance on this point, but intriguingly, it's rarely from the real experts in their field. They often know how tentative their causal conclusions are, or how much those conclusions are actually based on correlations rather than a true comprehension of the exact causality of things. We also often get mischaracterized as suggesting either that theories don't matter or that causality is not important. We argue neither. In fact, theories will continue to matter very much, but the concrete hypotheses derived from a theory less so.

Take Google Flu Trends. The theory that what people search for could correlate with human health in a given location was crucial for Google Flu Trends to happen. But none of Google's engineers could ever have guessed the exact hypothesis to test — that is, the exact search terms that best predict the spread of the flu. After all, the company handles around 3 billion searches every day. So big data analysis did that for them.

Causal connections are really valuable where and if one can find them. But looking for them at great cost and coming up empty is less useful, we suggest, than looking for correlations — not least because such correlations can help identify what potential connections between two phenomena should be investigated for a possible causal link. In that very sense, big data analysis actually helps causal investigations as well.

Finally, I was struck by how many examples in the book involved businesses that have amassed incredible data sets and learned to use them to boost sales or improve marketing. You have the story of how Wal-Mart mined its past data and figured out that people preparing for a hurricane by purchasing flashlights and the like also tended to buy Pop-Tarts — so it put Pop-Tarts at the front of the store during hurricane season, and sales increased. Is there any concern about how much big data is in effect owned by business, and deployed largely in the service of the profit motive? I think one thing that makes people nervous about the big-data idea is that it's so often opaque. But do the benefits outweigh those concerns? Should we stop worrying and just be thankful for the conveniently placed Pop-Tarts?

Mayer-Schönberger: There is value in having conveniently placed Pop-Tarts, and it isn't just that Wal-Mart is making more money; it is also that shoppers find what they are likely looking for faster. Sometimes big data gets badly mischaracterized as just a tool to create more targeted advertising online. But UPS uses big data to save millions of gallons of fuel — and thus improve both its bottom line and the environment. Google's aiding public health agencies in predicting the spread of the flu, or Decide.com's helping consumers save a bundle, has nothing to do with targeted advertising, and each creates positive effects beyond a single company's quarterly profit. We need to cast our gaze wider when we want to understand big data's upside (and, incidentally, its "dark sides").

My thanks to Mayer-Schönberger and Cukier for taking the time to answer these questions. Their book is: Big Data: A Revolution That Will Transform How We Live, Work, and Think.