reCAPTCHA! What it is, and what it does
In the beginning, there were books.
Lots & lots of books.
Before the internet, a book was the best way to share information. Permanent, portable, printed pages. (Full disclosure: I am a printer. Who loves books. Try relaxing in a hot bath with a Kindle.) Important and interesting information has been rolling off the presses world-wide for hundreds of years.
Our generation has a different medium. While some of us do carry library cards, many more of us are carrying WiFi cards. How do we get that inconceivable amount of printed data transferred to the web?
If you are picturing a warehouse full of monkeys typing away, well... actually that would be kinda cool. But wrong. Nope, the magic happens with scanners.
I concede that one can indeed relax in a hot bath with a Kindle.
Books are scanned or photographed one page at a time. Once a page has been converted to a digital image, it is analyzed with OCR. Optical Character Recognition software identifies letters based on their shape.
If the quality of the scan is poor, the software may misinterpret a letter. Is that an i or an l? Add into the mix faded ink, yellowed paper and funky fonts, and you’ll get a fair amount of OCR failure.
Webmasters, bloggers, and retailers protect themselves and their users from spam and fraud by using a security system called CAPTCHA. The term, coined in 2000, is a contrived acronym: Completely Automated Public Turing test to tell Computers and Humans Apart.
A Captcha image requires visual perception, not simply recognition, to translate.
People can read it.
Bots can’t.
Luis von Ahn, an assistant professor of computer science at Carnegie Mellon University, was involved in the original development of Captcha. Mr. von Ahn was seeing excellent results from the security system. But, as he stated, “It takes about 10 seconds to type each Captcha. I realized that humanity as a whole is wasting 500,000 hours every day typing Captchas.”
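As a quick back-of-the-envelope check on that quote (my own arithmetic, not a figure from von Ahn), 500,000 hours a day at 10 seconds apiece works out to roughly 180 million Captchas typed every day. The function name below is just something I made up for illustration:

```python
# Back-of-the-envelope check of the quoted figures.
# This is a sketch of the arithmetic, not von Ahn's own calculation.
SECONDS_PER_CAPTCHA = 10        # from the quote
HOURS_WASTED_PER_DAY = 500_000  # from the quote


def captchas_per_day(hours_per_day: int, seconds_each: int) -> int:
    """Number of Captchas implied by a daily total of hours and a per-Captcha time."""
    return hours_per_day * 3600 // seconds_each


total = captchas_per_day(HOURS_WASTED_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{total:,} Captchas per day")  # prints "180,000,000 Captchas per day"
```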
OCR systems are accurate at most 80% of the time when scanning older books and newspapers. Some pages have degraded. Some words are blurry. Over time, paper yellows and ink bleeds. Even the most sophisticated software is still not capable of perceiving difficult images.
Combining the successful Captcha security system with indecipherable words from scanned books, von Ahn developed reCaptcha in 2007.
reCaptcha serves up OCR failures to the best translators around: us. When you type in those two security words, you are doing more than proving yourself human. You are translating a blurry, distorted image into a word that the best OCR software could not decipher. It’s estimated that the dual-purpose reCaptchas are correcting more than 10 million words each day.
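How can two words both test you and collect a transcription? Here is a minimal sketch of the scheme as it is commonly described: one word is a "control" whose answer is already known, and the other is a word the OCR software failed on. If you get the control right, your reading of the unknown word is counted as a vote. The class name, the threshold of three matching votes, and the sample words are all my assumptions for illustration, not reCaptcha's actual code or settings.

```python
from collections import Counter, defaultdict


class TwoWordChallenge:
    """Hypothetical two-word Captcha: one known control word, one unknown word."""

    def __init__(self, votes_needed: int = 3):
        self.votes_needed = votes_needed   # assumed threshold, not the real setting
        self.votes = defaultdict(Counter)  # unknown-word id -> tally of typed guesses

    def submit(self, control_answer: str, typed_control: str,
               unknown_id: str, typed_unknown: str) -> bool:
        """Return True if the user passes; record their guess for the unknown word if so."""
        if typed_control.strip().lower() != control_answer.lower():
            return False  # missed the control word: treat as a bot, ignore the guess
        self.votes[unknown_id][typed_unknown.strip().lower()] += 1
        return True

    def accepted_reading(self, unknown_id: str):
        """Return the agreed transcription once enough humans match, else None."""
        tally = self.votes[unknown_id]
        if not tally:
            return None
        word, count = tally.most_common(1)[0]
        return word if count >= self.votes_needed else None


# Four humans pass the control word "upon"; three of them agree on the blurry word.
challenge = TwoWordChallenge()
for guess in ["morning", "morning", "moming", "morning"]:
    challenge.submit("upon", "upon", "scan-0001", guess)

# A bot misses the control word, so its guess is thrown away.
challenge.submit("upon", "xyz", "scan-0001", "marring")

print(challenge.accepted_reading("scan-0001"))  # prints "morning"
```

The point of the design is that nobody taking the test knows which word is the control, so the easiest way to pass is to read both words honestly.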
In September 2009, reCaptcha was purchased by Google. The master of search engine technology is using reCaptcha’s translating power to improve Google Books. Prior to the purchase, the translated scans were used by the Open Content Alliance, a nonprofit group, to help build the Internet Archive's collection of books. The Internet Archive provides free access to over 1.25 million books. These are limited to public domain works, meaning books whose copyrights have expired.
While continuing to catalog public domain titles, Google Books has also expanded access to include some in-copyright and out-of-print books. Previews will be provided and the books will be available for purchase. Additionally, Google Books is working with libraries from Cornell, Harvard, and Oxford Universities, among many others, to allow full on-line access to their collections.
Ten seconds at a time, we are building our own digital library. Generations of works are being preserved and offered up for all people to access.
What a legacy.
- The Internet Archive
Explore what your ten seconds of donated labor have created.
- Google Books Settlement Agreement
Google has settled a class action lawsuit filed by the Authors Guild, the Association of American Publishers, and other authors and publishers. Please read the details here.





jacobkuttyta says:
2 months ago
Nice hub, well written.