Millions of people are deciphering vintage texts without knowing it - and forging a new path for computing.
It happens all the time: you're registering a free e-mail account or making a purchase online, when up pops a wavy, multicolored word. The system asks you to retype the word - and you roll your eyes, squint a little, and transcribe. This little test is one of the most successful techniques for making sure the person trying to log on is really a human, and not a digital "bot" prying into the site.
But now, when you type that word, something else may be happening as well: You may be deciphering a word from a decaying old book, helping to transform a historic text into a new digital file.
In May of last year, computer scientists started using those cryptic-looking words to solve a frustrating problem. Digital cameras at libraries worldwide are scanning millions of pages of old books, automatically "reading" the texts and turning them into computer files. But as books age, their typography smudges and flakes away. While human readers have little trouble comprehending even the most mangled words, sophisticated computer software still hangs up on them. Somewhere on the page, the dot of an i has disappeared, the smile of an e has gone gappy, the belly of a capital D has detached itself from its backbone. The computer thinks it's seeing an 'l,' a 'c,' and a capital I followed by a parenthesis.
In a paper published last Friday in the journal Science, computer-science professor Luis von Ahn describes a new system to solve this problem. Taking advantage of humans' natural ability to decipher messy text, von Ahn's system places tiny bits of those unreadable lines as mystery words on websites around the world. As people solve the usual logon puzzle, they also decode a real word; the results are then collected and used to correct the text and produce clean copies of scanned books.
The system is the latest incarnation of what von Ahn calls "human computation," the idea that you can network human brains to solve problems computers still can't handle. Similar systems are being used to identify images, describe music, and gather common-sense facts about the world to build a more convincing computer intelligence. In this idea, von Ahn and other specialists see a powerful set of tools for processing information and solving problems that range from improving search results to translating difficult documents.
Von Ahn, an assistant professor of computer science at Carnegie Mellon University, helped develop the original twisted-word security technique, known as CAPTCHA - a slightly fractured acronym for Completely Automated Public Turing test to tell Computers and Humans Apart. (The "Turing test" refers to mathematician Alan Turing, who in 1950 proposed a simple way to measure the success of artificial intelligence in computers.) Since appearing on the Alta Vista search engine in 1997, the technique has become nearly ubiquitous on the Web; according to von Ahn's Science paper, people solve about 100 million captchas per day.
"We felt good about this," von Ahn said in a phone interview last week. "But at the same time we felt bad, because we realized that people were wasting something like 500,000 hours per day" interpreting deliberately garbled words.
Interested in showing that people's efforts could be usefully harnessed in tiny increments, von Ahn turned to the problem of scanning texts. Even the best automated text readers - called optical character recognition, or OCR, software - fail to recognize up to 20 percent of the words in old printed matter. Thus big book-scanning projects like Google Books and the Internet Archive need to employ human reviewers to decode these troublesome words. Von Ahn realized that this job was precisely what people solving captchas were already doing.
He and his team devised an elegant system for collecting troublesome words, turning them into captchas, and getting them solved. Books are scanned twice and the two text streams are compared; any mismatched words become captchas. The mystery words are paired with known words on normal website security checks, and the user is asked to solve both words. If the user is right about the known word, his or her answer for the mystery word is kept and compared to solutions offered by others. Von Ahn finds that the system correctly decodes mystery words more than 99 percent of the time - results nearly identical to those of the scanning projects' human reviewers.
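For readers who think in code, the pairing scheme described above can be sketched roughly as follows. This is an illustration only, not von Ahn's implementation: the names `MysteryWord`, `mismatched_words`, `record_response`, and `resolved_text` are invented here, the two text streams are assumed to line up word for word, and the three-vote threshold is an arbitrary assumption rather than a figure from the Science paper.

```python
from collections import Counter
from dataclasses import dataclass, field


def normalize(text):
    """Ignore case and stray spaces when comparing typed answers."""
    return text.strip().lower()


def mismatched_words(stream_a, stream_b):
    """Positions where the two machine readings disagree.

    Assumes (for illustration) both streams contain the same number of words.
    """
    return [i for i, (a, b) in enumerate(zip(stream_a, stream_b)) if a != b]


@dataclass
class MysteryWord:
    image_id: str  # hypothetical identifier for the scanned word image
    votes: Counter = field(default_factory=Counter)


def record_response(mystery, control_answer, typed_control, typed_mystery):
    """Keep the mystery-word answer only if the known word was typed correctly."""
    if normalize(typed_control) != normalize(control_answer):
        return False  # user failed the known word, so discard the guess
    mystery.votes[normalize(typed_mystery)] += 1
    return True


def resolved_text(mystery, min_votes=3):
    """Return the most common answer once enough users agree, else None.

    min_votes is an assumed threshold, not a documented value.
    """
    if not mystery.votes:
        return None
    word, count = mystery.votes.most_common(1)[0]
    return word if count >= min_votes else None
```

In this sketch, a guess from someone who mistypes the known word never counts, and a mystery word is treated as solved only after several independent users have typed the same thing.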
According to the Science article, this system, dubbed "reCAPTCHA," is now used on some 40,000 websites, where it has solved some 44 million words in one year of operation - the equivalent of about 17,600 books in von Ahn's estimation.
It may be a telling fact about the Internet that von Ahn was not the first to hit on this idea: online pornographers trying to unlock captchas (and gather up millions of e-mail addresses) realized that they could solve them by the thousands through a neat trick. Whenever their bots run into a puzzle, they take a snapshot of the captcha and shoot it back to the porn site, where viewers have to solve it to move on to the next picture.
Like pornographers, von Ahn tries to appeal to our desires to get us to do what he wants. But while pornographers want our money, von Ahn's system just piggybacks onto something we're already doing - logging in - and puts it to socially redeeming use. It's elegant and efficient: I get my Gmail account, and a rare and irreplaceable text begins a new life in the electronic age.
Von Ahn has developed other human-computing projects as well, in which the motivation tends to take the form of games. In one game, originally called ESP, players paired anonymously over the Internet were shown an image pulled from the Web, and then tried to guess the words their partner would type in response. The more quickly partners hit upon the same word, the more points they earned. The words they chose, meanwhile, could be harvested to provide an accurate caption for the image. The system has been licensed by Google; now called Image Labeler, it is used to improve results in Google Image Search.
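The matching idea behind the game can be sketched in a few lines. This is a rough illustration under stated assumptions, not the actual ESP or Image Labeler code: the function names are invented, the two players are assumed to type one guess per turn, and the scoring formula is made up simply to show that faster agreement earns more points.

```python
from itertools import zip_longest


def first_agreement(guesses_a, guesses_b):
    """Return (word, turn) for the first word both players have typed, or None.

    Each list holds one player's guesses in the order typed; turns are
    assumed to advance in lockstep, which is a simplification.
    """
    seen_a, seen_b = set(), set()
    for turn, (a, b) in enumerate(zip_longest(guesses_a, guesses_b), start=1):
        if a is not None:
            seen_a.add(a.strip().lower())
        if b is not None:
            seen_b.add(b.strip().lower())
        common = seen_a & seen_b
        if common:
            # The agreed word doubles as a candidate label for the image.
            return sorted(common)[0], turn
    return None


def points(turn, start=100, penalty=10):
    """Hypothetical scoring: fewer turns to agreement means more points."""
    return max(0, start - penalty * (turn - 1))
```

Here two strangers who both type "dog" for the same picture have, without intending to, supplied a label that a search engine can use.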
Ethan Zuckerman, a fellow at Harvard's Berkman Center for Internet & Society, shares von Ahn's conviction that human computation offers a useful approach to problems that bedevil computers. "Computer scientists - understandably - are more interested in solving problems with algorithms than by figuring out clever ways to slice them into small pieces and let humans solve them," he wrote in an e-mail last week.
Among the things we do better than computers, he said, is human language itself, and translation in particular. Zuckerman's own research focuses on access to Internet technology in Africa, an area in which effective translation is paramount. Human computing, he said, is "a much more likely path towards solving the problem of the polyglot Internet than improved machine translation, in my opinion."
But human computation has its doubters as well. David Weinberger, a colleague of Zuckerman's at Berkman, argued in a recent e-mail that human brain computing, in a way, sets a dangerous precedent about the meaning of free time and effort. "The discussion goes down the wrong path if we let ourselves think that the brain is like a timeshare computer that, if not used optimally, is a wasted resource," he wrote.
There's also a potential for misuse: While von Ahn is employing people to crack old books, after all, pornographers and spammers are doing the same thing to breach security walls. Von Ahn acknowledges that problem with such systems, but says "it's no reason not to use them for good as well."
Jessamyn West, a library technologist, points out that von Ahn's tools and games - which people can sign up for voluntarily - give people a rare chance to do small nice things without leaving their desks. "I think people like feeling like they're helping," she said. "They recycle, they pick up litter, they'll pick a penny or leave a penny."
In a time when the Net seems overrun with spammers and trolls preying on our ignorance, desires, and fears, she finds it cheering to know that people like von Ahn are looking for ways to reward cleverness and generosity - a micro-ethics of clicks, games, and tag clouds.
Matthew Battles is a freelance writer in Jamaica Plain.


