When physicists look at language, what do they see?
For the five experts in physics who authored a recent paper in the journal Scientific Reports, language looks very much like a gas, with words bouncing around like particles.
The gas metaphor explains their paper title: “Languages cool as they expand: Allometric scaling and the decreasing need for new words.” By analyzing historical fluctuations of word use in the enormous collection of texts on Google Books, the authors claim that they can observe the lexicons of English and other languages “cooling”—changing more slowly, that is—as they grow, just as gases cool during expansion.
Among those who actually study language for a living, however, the reception to the paper has been mixed. A language’s vocabulary may indeed become less variable as it grows, an observation that could illuminate long-range patterns of how words rise and fall. But in this paper, some linguists discern basic misunderstandings about the way languages work. The tension gets at a question that’s becoming increasingly central as the social sciences embrace new quantitative tools: Can number-crunchers outside these fields use the data to make bold, useful contributions? Or do they need more specialized knowledge to be able to ask the right questions in the first place?
About two years ago, Google made major chunks of its book digitization project publicly available through the “Ngram Viewer” Web interface. Ever since, the unprecedented scale of the Google dataset, encompassing millions of books, has enticed scholars with the promise of a new, quantitative approach to language and culture—“culturomics,” as a pivotal Science paper dubbed it.
The “cooling” paper is the latest in a series of language-related studies from scientists who find the huge new pot of data too tempting to pass up. The lead author of the Scientific Reports paper, Alexander Petersen, told me via e-mail that the Google Books dataset offers an “opportunity to study the nanoscale properties of language” that would have been “unimaginable” even 10 years ago. “We saw an incredible opportunity to measure the dynamic properties of language at the microscopic scale of individual particles, and to observe how the system coevolves in response to external socio-technological forces,” he said.
Petersen, an assistant professor at IMT Lucca in Tuscany, Italy, earned his PhD in physics at Boston University, where he became interested in applying the methods of statistical physics to socially driven complex systems—ranging from the stock market to baseball. (For the latter, he credits Red Sox fans: “looking out onto Commonwealth Avenue from my desk during game days, seeing all the red jerseys emerge, was quite inspiring.”)
Tracking baseball stats is one thing, but why would physicists feel confident treading on the terrain of linguists? For the first time, the Google Books language corpus allows them to track long-term, macro-scale patterns of use. As one of Petersen’s colleagues, the Slovenian physicist Matjaz Perc, wrote in a paper last year, “it is precisely this detachment from detail and the sheer scale of the analysis that enables the observation of universal laws that govern the large-scale organization of the written word.”
But some linguists have noted problems with the physicists’ “detachment from detail.” University of Pennsylvania linguistics graduate student Josef Fruehwald, in a recent blog post on the “cooling” paper, notes that the researchers equate “language” with “the set of words which have been published.” A language isn’t simply a set of words; it encompasses structures all the way from individual sounds to the combination of words and phrases into syntactic patterns. “This ‘language is words’ axiom is part of most people’s folk linguistics that we have to train people out of when they take Intro to Linguistics,” Fruehwald said. “That’s why it’s a little hard to take the work of these physicists seriously at first glance.”
Even the identification of “words” in the Google dataset may be suspect. As he reported on the blog Language Log (to which I also contribute), University of Pennsylvania linguist Mark Liberman dug through the data and determined that the strings of characters that Google calls “words” include all manner of typographical oddities. Along with errors of optical-character recognition from the scanning process, the data is “noisy” thanks to variations in spelling, capitalization, and inflection. “The paper’s quantitative results clearly will not hold for anything that a linguist, lexicographer, or psychologist would want to call ‘words,’” Liberman observed.