Mining the complete text of 4 percent of the world's books, Harvard University and Google scientists used a powerful new research tool unveiled today to glean surprising insights into language, culture, and history.
Books already tell stories, but when their words are combined and analyzed with computational tools, they tell bigger stories. By studying billions of words that appeared in books published over the last 200 years, the researchers found that references to God have been dropping off since about 1850. People are becoming celebrities earlier in life now than in the past, but their fifteen minutes of fame are passing more quickly as their names drop out of the lexicon. References to past years are fading from books more quickly than they once did. Censorship leaves a discernible pattern that may be useful for identifying propaganda or the suppression of victims.
The findings, the fruits of the ambitious Google project to digitize every book in existence, were published today in the journal Science. They are a tantalizing first glimpse at what researchers think may become a transformative new tool for humanities researchers.
Google is publicly launching the new tool, Google Books Ngram Viewer, to allow scholars or the simply curious to ask questions – such as when the term “The Great War,” whose use peaked between 1915 and 1941, was replaced by “World War I.” The tool currently allows people to look up words or phrases that range from one to five words, and see their frequency over time -- the number of times a word or phrase is mentioned in a given year divided by the total number of words written that year.
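For readers who want to see the arithmetic, the frequency measure can be sketched in a few lines of Python. This is only an illustration and not Google's code: the function names, the whitespace tokenizer, and the tiny made-up corpus are all assumptions, and phrase counts are divided by the total word count for the year, as the description above suggests.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(texts_by_year, phrase):
    """Map each year to (occurrences of phrase) / (total words that year).

    texts_by_year is a hypothetical structure: {year: [text, text, ...]}.
    Tokenization is a naive lowercase split on whitespace (an assumption).
    """
    target = " ".join(phrase.lower().split())
    n = len(target.split())
    frequencies = {}
    for year, texts in texts_by_year.items():
        hits = 0
        total_words = 0
        for text in texts:
            tokens = text.lower().split()
            total_words += len(tokens)
            hits += ngrams(tokens, n).count(target)
        frequencies[year] = hits / total_words if total_words else 0.0
    return frequencies


# Made-up sample data, for illustration only.
toy_corpus = {
    1916: ["the great war began", "the great war"],
    1950: ["world war i ended long ago"],
}
print(relative_frequency(toy_corpus, "great war"))
```

Dividing by the year's total word count, rather than reporting raw counts, keeps years in which more books were published from looking artificially prominent.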
“This is really the largest data release in the history of the humanities -- a fantastic wealth of data,” said Jean-Baptiste Michel, a postdoctoral researcher in the program for evolutionary dynamics at Harvard. “In our paper we present our initial investigation -- we explore this new terrain, we dig a little bit. It is a very cool feeling to have, but what people will be able to do will far exceed everything we have done.”
The study, led by Michel and senior author Erez Lieberman Aiden, who runs the multidisciplinary Laboratory-at-Large at Harvard’s engineering school, drew on a wide and unusual array of collaborators -- not only from Harvard and Google, but also from Encyclopaedia Britannica and the American Heritage Dictionary.
Michel and Lieberman Aiden had previously worked together on a 2007 study in the journal Nature that tracked the evolution of language through a much more painstaking process – hunting down obscure old books and reading them to discover the linguistic heritage of modern verbs. As they were wrapping up their study, they began to notice obscure books becoming available through Google Books, the initiative that has now scanned 15 million books, or more than 10 percent of published books, according to Jon Orwant, engineering manager of Google Books, which has a large presence in Cambridge.
Michel and Lieberman Aiden, realizing that their research techniques would soon be antiquated, approached Google and began a collaboration with the goal of creating a tool that could be broadly useful to humanities researchers.
“As we’ve amassed more and more information that isn’t available elsewhere, I started to realize we’re sitting on these troves of data that are very useful,” Orwant said -- not just for web users searching for the answers to specific questions, but to the scholarly community, too.
The efforts are part of a much broader push to bring the power of analyzing large datasets to the increasingly digitized world of humanities research.
“If you look at what humanities scholars have studied for hundreds of years, they tend to study things like books, music. The difference today is those are digital and you have the potential of searching and ‘reading’ much larger amounts of this information than you ever could before,” said Brett Bobley, director of the office of digital humanities at the National Endowment for the Humanities.
Such research would not supplant humanities researchers’ current methods, Bobley said. But it could supplement their work and broaden the scope of research questions that might otherwise be limited by how much individuals can read and remember.
Researchers calculated, for example, that just reading the books from the year 2000 in the two-century dataset used in the Science paper would take 80 years -- without interruption for meals or sleep.
In their first analysis, researchers used the dataset to look at changes in grammar and English, finding that about half the words that appear in books are “dark matter” that do not appear in reference books -- words that may be compound words, proper nouns, or are simply undocumented, like “aridification” or “slenthem.” English, they found, is growing by about 8,500 words a year. They have also looked at collective memory -- and forgetting. Authors are letting the past go more quickly. The year “1880” had dropped to half its maximum number of references 32 years later, in 1912. But it took only a decade for “1973” to decline to half its prominence.
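The “half its maximum” comparison can be written as a small calculation. The sketch below is one plausible reading of that measure, not the researchers’ published method: the `half_life` function is hypothetical, it assumes the peak is the single highest value in the series, and the sample numbers are invented so that the result matches the 32-year figure reported for “1880.”

```python
def half_life(frequency_by_year):
    """Years from the peak until frequency first falls to half the peak.

    frequency_by_year is a hypothetical {year: frequency} mapping.
    Returns None if the series never drops to half its maximum.
    """
    years = sorted(frequency_by_year)
    # If several years tie for the maximum, the earliest one is used.
    peak_year = max(years, key=lambda y: frequency_by_year[y])
    half_of_peak = frequency_by_year[peak_year] / 2
    for year in years:
        if year > peak_year and frequency_by_year[year] <= half_of_peak:
            return year - peak_year
    return None


# Invented frequencies for the string "1880", for illustration only.
mentions_of_1880 = {1880: 10.0, 1890: 8.0, 1900: 6.5, 1912: 5.0, 1920: 4.0}
print(half_life(mentions_of_1880))
```

A shorter half-life for a later year, such as the decade reported for “1973,” is what the researchers describe as authors letting the past go more quickly.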
Already, Google engineers who began beta-testing the website this week have come up with some fun searches (“supercalifragilisticexpialidocious” or “pirates” vs. “ninjas”). But the real test of the tool to push knowledge forward will come when historians, sociologists, and others find ways to use the tool to ask complex questions.
Franco Moretti, codirector at the Stanford Literary Lab, praised the methods and the findings of the study. Going forward, digital humanities researchers have increasingly powerful tools, but the challenge will be interpretation -- finding links between quantity and meaning.
“Just as it makes an enormous difference [for paleontologists] whether a bone fragment belongs to a creature's tail or neck, so it makes a great difference whether the word ‘God’… occurs as a self-explaining given, in a discussion of principle, or as a banal interjection; whether, in a play, it is used more often in soliloquies, love duets, or public scenes; and so on,” Moretti wrote in an e-mail.