In this interview, Kenneth Cukier, senior editor at The Economist and co-author of the brilliant book Big Data, details how companies are making use of Big Data, what it actually refers to and the risks it presents.
Could you give some interesting examples of how companies are making use of Big Data?
There are zillions of examples. A classic one is from Charles Duhigg’s excellent book The Power of Habit, in which the US retailer Target could tell if a customer was pregnant based on their shopping patterns. It’s basic data-mining. You let shoppers with a loyalty card sign up for a baby-shower registry. You now have an idea of some people who are pregnant, and when they’re due. Then, you watch how their shopping changes over time.
In the first trimester, they’re buying things like vitamins. In the second trimester, as the body changes, they’re buying lotions (typically unscented). In the third trimester, they’re buying clothes and things for the new arrival. When you start to see a similar pattern form among other customers, you can predict whether they’re pregnant. This matters because people’s spending is higher and new brand loyalties form as people become parents.
Lastly, I should add that this example gets totally misunderstood: it’s NOT that the company knows that a person is pregnant before the woman herself knows. It’s that without any explicit disclosure, a company can infer a woman’s status. Strikingly, this technique can be used for lots of things -- notably, predicting medical conditions from search queries before people know they have a disease, which has been shown by Eric Horvitz of Microsoft Research for pancreatic cancer.
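The inference described in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Target's actual method: the customers, product categories, smoothing constant and scoring rule (a sum of log-ratios of purchase rates) are all invented for the example. The idea is to learn which categories the registry customers buy more often than everyone else, then rank other customers by how closely their baskets match that pattern.

```python
import math
from collections import Counter

# Hypothetical toy data: customer id -> set of product categories bought.
# Registry customers are the ones who disclosed a pregnancy by signing up.
registry_customers = {
    "r1": {"vitamins", "unscented lotion", "baby clothes"},
    "r2": {"vitamins", "unscented lotion", "bread"},
    "r3": {"unscented lotion", "baby clothes", "milk"},
}
other_customers = {
    "a": {"vitamins", "unscented lotion", "milk"},
    "b": {"bread", "milk"},
    "c": {"beer", "bread"},
    "d": {"milk", "vitamins"},
}


def category_rates(baskets):
    """Fraction of customers in `baskets` who bought each category."""
    counts = Counter(cat for basket in baskets.values() for cat in basket)
    return {cat: n / len(baskets) for cat, n in counts.items()}


def learn_weights(known, background, smoothing=0.1):
    """Log-ratio of purchase rates: positive means 'more typical of the registry group'."""
    known_rates = category_rates(known)
    background_rates = category_rates(background)
    categories = set(known_rates) | set(background_rates)
    return {
        cat: math.log(
            (known_rates.get(cat, 0.0) + smoothing)
            / (background_rates.get(cat, 0.0) + smoothing)
        )
        for cat in categories
    }


def score(basket, weights):
    """Higher score = basket looks more like the registry pattern."""
    return sum(weights.get(cat, 0.0) for cat in basket)


weights = learn_weights(registry_customers, other_customers)
ranked = sorted(
    other_customers,
    key=lambda cid: score(other_customers[cid], weights),
    reverse=True,
)
print(ranked)  # customers ordered from most to least registry-like
```

Note that nothing here explains why the pattern exists; the score is an inference from shopping behaviour alone, which is the point made in the answer above.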
In your opinion, what is the most significant problem that has been solved using Big Data?
Well, considering that I treat big data as simply a user-friendly way to refer to machine learning, and a branch of machine learning is deep learning (which of course works far better now than in the past in large part because of more data), then I can claim all of the modern AI revival as a win for big data: from Alexa voice recognition to self-driving cars, to Google image search, to the recent advances in computational pathology. From this, how could I possibly choose the most significant?
But if I had to choose, I’d say using image-recognition to diagnose disease is the most important. It will change healthcare for the better and lead to better lives for many people.
What are the greatest risks presented by Big Data and AI?
The biggest danger is the loss of human freedom from a surveillance society. I don’t fear that AI will take over the world -- at least, we’re a very long way from this being probable. But it’s far more likely that people will use AI to destructive ends that harm others. I am very worried about that -- and looking at the governance we have in established democracies in the West, the fears are real. As for developing countries and China, it’s clear that the danger of this happening is even closer there. Yes, companies are powerful. But we’d rather have a tyranny of Facebook likes than a real tyranny.
What sorts of problems can Big Data not solve?
It’s a method like any other, and has its uses and non-uses. The question is like asking “where do stats or empirical evidence not apply?” The answer is: a whole host of areas. Matters of the heart, of risk-taking and judgement and instinct and emotion. Or, areas where just a small amount of data will do, say, grading a student exam -- where all you care about is how individuals performed and their distribution as a group. But we’ll find more and more uses for this data.
For example, if all the student exam records were digital, nationwide, and went back a decade, we might find relationships between exam performance and weather, breakfast, distance from school, etc., that we never knew before. I should add that big data can help run simulations, to predict future possibilities, but all simulations are prone to error because of the models, not the data. So you could use big data to predict the potential consequences of Brexit -- but it might not be right.
Could you suggest a novel trading strategy which makes use of Big Data?
There is so much incentive for people to get a leg up on the markets that, almost axiomatically, I won’t come up with anything original. That said, one area of interesting study, as algorithmic trading increases its share of transactions, would be the degree to which markets respond to “human events” or “model-fed events” -- that is to say, a local sports team win (French market if France wins the World Cup; the NY Yankees win the World Series, etc.) versus changes where it’s data fed into an algo. Then, if we can find triggers to human emotions that might affect traders, we’d have an insight on how markets may shift. But of course I hate this very idea, since I’d prefer that markets reflect fundamentals rather than the vagaries of human or algorithmic behaviour!
What are the differences between Big Data, Machine Learning and AI?
Everyone will have a different answer to this, and it’s a tiresome matter of semantics. In my mind, the term “big data” is basically a popularisation of machine learning. AI refers to the broad field, and machine learning is just one general strand of it: using data and stats to infer answers that aren’t programmed in at the outset.
There’s a lot more to these terms, but this simple breakdown should do the trick. I’d point out, though, that big data need not just mean ML. Big data is really a catch-all for the new era we’re in, where we have more data than ever, and we can do more with it than ever before. For example, we never had a way to easily measure the number and location of people on public transport in real-time, and now we can. There’s no ML involved with this -- it’s just “counting” -- but we can now do these things, and derive new insights from it.
Who benefits and who loses from Big Data?
Society benefits. No one loses per se. But there are risks and dangers. Let’s first look at the optimistic case. Saying “big data” today sounds novel and, to some people, perhaps a bit frightening. But a good way to think about it is to transport ourselves in our imagination to the mid-1500s in Europe, as algebra was undergoing a revival among scholars. Someone might ask: “who benefits and who loses”…?!
How could we answer that? Society benefits: we’ll build taller buildings, sturdier ships, unlock mysteries of science to improve health. Who loses? We’ll mathematise aspects of society and overlook more spiritual or humanist dimensions of life; only a handful of people will excel at harnessing the methods; we’ll use it to build weapons etc. Big data is similar: it’s a huge advancement in how humankind interacts with information around us, and will improve our lives. No one “loses” even if there are concerns on how it’s used.
Can causal relationships, as opposed to correlations, be identified using machine learning techniques? Is it a problem if they can’t?
Perhaps. On a purely statistical level, no -- you only have a correlation. But if you establish a method to find causality, there’s no inherent reason why you couldn’t find causal relationships just because you’re using more data. But then you’re doing a causal experiment, and sort of ruining the benefit of relying on big data -- tapping the vast amount of data in the wild to give you a rough-and-ready answer.
For lots of things, you don’t care about the causal relationship as long as you have the correct answer. For example, why do people who buy Murakami novels like Hemingway but not Margaret Atwood? (This is an invented example.) We could do interviews and psychological studies to possibly learn this. But why bother, if just by knowing it you can make better book recommendations and increase sales?
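A recommendation of this kind needs only co-purchase counts, with no causal model at all. The sketch below is a minimal illustration, assuming invented purchase baskets and hypothetical helper names (`co_purchase_counts`, `recommend`); it is not how any particular bookseller works. It counts how often two authors appear in the same basket and recommends the most frequent companions.

```python
from collections import Counter
from itertools import permutations

# Invented toy data: each set is the authors one customer has bought.
purchases = [
    {"Murakami", "Hemingway"},
    {"Murakami", "Hemingway", "Carver"},
    {"Murakami", "Carver"},
    {"Murakami", "Hemingway"},
    {"Atwood", "Le Guin"},
    {"Atwood", "Hemingway"},
]


def co_purchase_counts(baskets):
    """For each author, count how often every other author shares a basket."""
    counts = {}
    for basket in baskets:
        # permutations yields both (a, b) and (b, a), so counts are symmetric.
        for a, b in permutations(sorted(basket), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def recommend(author, counts, k=1):
    """Return up to k authors most often bought alongside `author`."""
    return [other for other, _ in counts.get(author, Counter()).most_common(k)]


counts = co_purchase_counts(purchases)
print(recommend("Murakami", counts))
```

The code never asks why Murakami readers pick up Hemingway rather than Atwood; the correlation in the purchase data is enough to drive the recommendation.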