Big data: Mind the gaps
These days, we can measure a lot about the world – but only the fraction we know about.
This story is from BostonGlobe.com, the only place for complete digital access to the Globe.
Whatever the reasons for our bias, what we like, see, and measure dictates what we know the most about. When it comes to planets outside the solar system, for example, the bigger exoplanets are easier to discover. Therefore, they are the ones we know the most about.
This can sometimes contribute to misleading results. One well-known phenomenon in psychology is that experimental subjects are generally WEIRD. What does this acronym mean? When it comes to performing studies, psychologists overwhelmingly end up relying on Western, Educated, Industrialized, Rich, and Democratic subjects. More specifically, since many psychologists work at universities full of students looking to pick up a few dollars, they rely on WEIRD undergraduates. Research has found that such traits as visual perception and fairness are far from universal, and that there are numerous differences between WEIRD and non-WEIRD populations. As a result, decades’ worth of experimental psychology results may not turn out to be nearly as universal as once thought.
Of course, many clever experiments and studies can be done even with a relative paucity of data. And when we want to learn more, we push to collect more data. We create moon shots, launch space probes, build massive particle accelerators, conduct global marine surveys, and much more. We conduct large efforts that are designed to stretch what we know, and to avoid our biases. However, we’re inevitably hampered by the unknown unknowns.
Which brings us to Big Data. The huge pools of data being generated today aren’t evenly distributed. Rather, Big Data is a series of deep wells, each one plumbing the depths of certain topics. We have a lot of mobile phone data, and Facebook is throwing off huge amounts of information. We can even mine credit card purchase data. But that doesn’t mean that we know everything.
Just because we know how people with iPhones interact with their phones doesn’t mean we know how everyone interacts, in all situations. Similarly, knowing how information spreads on Facebook doesn’t necessarily tell us how ideas are adopted in general. These insights can be useful, and they’re certainly much better than simply trying an experiment on a small group of Ivy League undergraduates. But we need to be cautious about how we generalize from them and what kinds of conclusions we draw.
As personal data accumulate, there are big blind spots that researchers already know about. Datasets about how companies grow and develop are spotty and rare. Data about elusive topics like creativity (for example, how new ideas are formed) are far from robust, despite all the business books written about the subject. For all the published data on successful science, there is almost a complete lack of data from unsuccessful scientific experiments, which could be just as useful in the aggregate, if not more so. We lack large datasets that detail how infectious disease works its way from person to person, a problem that would be enormously beneficial to tackle.
Big Data might be deep, in other words, but it’s not wide. We’re certainly getting there. But until we have lots more, or distribute it more evenly, we always have to be aware that we might be dealing with some informational bias. As you hear the latest claim about big data and its promise, it’s OK to be excited—but keep in mind that we may be finding the triceratops, and thinking we understand everything there is to know about dinosaurs.
Samuel Arbesman is a senior scholar at the Kauffman Foundation and a fellow at the Institute for Quantitative Social Science at Harvard University. His first book, “The Half-Life of Facts” (Current/Penguin), from which this article is adapted, was published last week.