reCAPTCHA

Unless you haven’t used the internet in the past three years or so, you are probably already well acquainted with CAPTCHAs: the distorted letters or numbers you see on blogs and signup forms that are supposed to prevent spam. They are a product of research by Luis von Ahn, a “human-based computation” and cryptography researcher at Carnegie Mellon University.

CAPTCHAs solve a simple problem. Spammers can write small applications that scan for input forms and fill them out automatically and repeatedly. If you run your own blog, you have probably noticed this (spam usually isn’t written by people 🙂 ). The answer is also simple: if spammers use programs to propagate spam automatically, make part of the form something that programs cannot recognize. So a CAPTCHA picks a sequence of characters, renders it as an image, and distorts that image so that software cannot read the characters back from it. The generating program knows the answer only because it chose the characters; given just the image, it could not read it either, and hence neither can other programs. We, being mostly human, can read and decode the distorted text, and by solving these little problems we assure the web application receiving the form that we are indeed humans and not spam applications.

Needless to say, CAPTCHAs were a big hit. They sprang up all over the web and are now used everywhere. Recent studies have estimated how many seconds the average web surfer spends each day solving CAPTCHAs. Although the amount is small for a single person, adding it up over all internet users gives a huge number of wasted “human cycles”. So Luis decided to put that time to use.

As many of you know, there is a huge effort to digitize old books for which no electronic copies exist. Digitizing these books relies on optical character recognition (OCR), the same process that comes with your average home scanner. Unfortunately, OCR is a bit error-prone, and some words are not recognized correctly. However, given the awesome power of recognition that people possess, we can read certain words even when OCR programs cannot. But who would be willing to sit in an office all day “recognizing” words? This is a mundane job, to say the least. Can we somehow outsource it to the masses (without paying them 🙂 )?

Introducing reCAPTCHA

Connecting the dots seems easy now: re-engineer the CAPTCHA, and out comes reCAPTCHA (again by Luis and his team). A reCAPTCHA is just like an ordinary CAPTCHA, but instead of one distorted word you now get two. The server already knows the answer for one of these words; the other is a word that the OCR program could not recognize. The thought is: if the entity filling out the form knows the answer to the word I already know, then it is highly likely that it also knows the answer to the word I don’t know.
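The two-word check can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the actual reCAPTCHA implementation: the names `Challenge`, `TwoWordCaptcha`, and `normalize` are made up for this post, and the vote counting with a `min_votes` threshold is my own assumption about how answers for the unknown word might be aggregated.

```python
from collections import Counter
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


@dataclass
class Challenge:
    control_word: str  # the word whose answer the server already knows
    unknown_id: str    # hypothetical identifier of the word OCR could not read


class TwoWordCaptcha:
    def __init__(self) -> None:
        # unknown_id -> tally of answers typed by users who passed the control word
        self.votes: dict[str, Counter] = {}

    def check(self, challenge: Challenge, control_answer: str, unknown_answer: str) -> bool:
        """Accept the form only if the known (control) word was typed correctly."""
        if normalize(control_answer) != normalize(challenge.control_word):
            return False
        # The user read the control word, so their guess for the unknown word
        # is probably right too: record it as one vote.
        tally = self.votes.setdefault(challenge.unknown_id, Counter())
        tally[normalize(unknown_answer)] += 1
        return True

    def consensus(self, unknown_id: str, min_votes: int = 3):
        """Return the most common answer once enough users agree (assumed rule)."""
        tally = self.votes.get(unknown_id)
        if not tally:
            return None
        word, count = tally.most_common(1)[0]
        return word if count >= min_votes else None
```

Note that the server can only ever verify the control word. The answer for the unknown word is taken on trust, which is why some form of agreement between several users seems necessary before treating it as digitized.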

So now, every time you fill out one of these reCAPTCHA forms (or comment on this blog 🙂 ), you are helping digitize one word. If this spreads widely across the internet, we can have our old libraries digitized in no time.

Anyway, after this long introduction and motivation, I would just like to announce that I am adding reCAPTCHA to my blog. There is a ready-made WordPress plugin for it, so I urge you all to add it if you can.

14 Comments.

  1. … and it works 🙂

  2. Cool idea. I hope Blogger adopts it soon.

  3. That was educational….:)

  4. I’m not really a big fan of Captchas, proponents have solved all the issues of accessibility …etc
    But it’s only a matter of time until these get broken too. There’s a lot of research on computer applications breaking other computer applications. It will eventually be broken.
    What spammers are resorting to these days is another very simple method. Grab the image from your captcha, present it to people in exchange for porn, store the results in a database and done! The whole thing is going to crumble sooner or later because it’s based on a faulty premise:
    “Computer generated image that the computer can’t understand”
    I think everyone needs to get a little bit more intelligent, and deprive the spammers of their sources. Akismet are doing a great job. Others have similar initiatives (I keep a database that is 6 times larger than Akismet’s specifically for this problem). That’s why I get so little spam even though, just like everyone else, I get targeted by hundreds of thousands of spam attempts.

    And yeah, Captchas suck! They’re ugly, they’re in your face, and they annoy the hell out of people

  5. alajnabiya, Dandoon: Thanks.

    Qwaider: Yeah, Captchas are not the ultimate solution, but they are something. Actually, Luis talked about tracking “Captcha sweatshops” where people were paid to sit all day and solve Captchas. His solution was simple: since these spammers can only type a certain number of words per minute, the servers would just send the spammer’s IP address longer and longer Captchas to solve :-). In reCaptcha, that just means the spammers are actually helping more in digitizing books. So they are being used for a humanitarian cause 🙂

    However, I don’t think that spammers are capable of solving all possible Captcha images and storing them in a database. Assume that Captcha lengths range from 3 to 7 characters over digits, lowercase letters, and uppercase letters (62 symbols). That gives about 3.6 x 10^12 possible unique strings. If each string can be rendered as about a hundred distinct Captchas, that is a total of about 3.6 x 10^14 unique Captcha images and their solutions. Spammers would have to spend a ton of time and effort to enumerate them all.

  6. That’s the extreme case that you’re talking about. You can’t use words like “I”, “me” or “SlkjuXkuerSSFSJHFLdjh” in any captcha. Most likely the captchas are going to be like the words I have below “notes” and “mercy”
    Just like there are databases of, say, MD5 hashes being used by hackers all over the world, it’s not really far-fetched to create a database for images. Or, the better choice, create more intelligent OCR. It’s obvious that these people have the means, the resources, and the motives. It will happen, there’s no escaping it. It’s just a matter of time.

  7. Thanks for the heads up–interesting concept.

  8. Qwaider: Yeah, you are right. I guess with reCaptchas the words are limited to those of the English dictionary, and thus with lots of time and effort one can enumerate all possible images and their answers. But I guess even if a well-funded group of spammers does that, their efforts would actually have helped digitize a lot of books 🙂 . But yeah, I agree with you. Captchas in general treat the symptom rather than the cause. Thus they should not be considered THE answer to spam, but rather a temporary solution. A true solution would fight the problem at the source.

    Ruby: Thanks 🙂

  9. I have been working on this for so long that I came to this conclusion
    There’s no single way that can combat spam on its own. Starting with depriving spammers of their beloved zombies and proxies, then going along the nofollow lines, and combining that with contextual smartness and, most importantly, booby-trapping them inside the HTML code. All of these help eliminate spam.
    (I use all these techniques on my blog, plus a few other cute tricks, and the results have been very impressive so far.) Sadly, not many people or developers are able to do this in an efficient way yet. And I’m afraid the same issues we have with email are just waiting to be translated to comment spam.
    A couple of noteworthy projects: Project Honeypot actually tracks and captures spammers and provides help for people who want to look them up. Very noble idea. And “Stop forum spam” (I think) is another very interesting endeavour to stop these spammers.
    One day I will share my list too 🙂

  10. Qwaider: Oh cool. I always wondered why you don’t have any “visible” anti-spam stuff. But it turns out it’s all “under the hood” 🙂 .

    Funny that you mention honeypots, I almost got to work on them in undergrad. The goal was to make sure that a honeypot’s activity footprint mimics that of machines that are actually in use.

    Can’t wait to see your list. Personally, I don’t have much on this site. Just Akismet and reCaptcha. Actually, Akismet has not been catching any spam since reCaptcha was installed 🙂

  11. By the way, just to let you know. I just implemented recaptcha for my registration routines 🙂 based on your recommendation. Yalla .. 3eesh 🙂

  12. Qwaider: Hahaha .. yislamo

  13. hmmm, it’s then relatively easy to guess which one of the words is in the database, and screw with the recaptcha guys. The last one was ‘Sleigh thongs’, but ‘Xxxxxx thongs’ works fine too…

  14. Which also means that the chances to defeat any given captcha are significantly better than they appear at a glance.
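For anyone who wants to check the back-of-the-envelope estimate from comment 5, here is a small Python sketch. It assumes what that comment assumes: random strings of 3 to 7 characters drawn from digits, lowercase letters, and uppercase letters (62 symbols), with about a hundred distorted renderings per string. The function name `count_strings` is made up for this illustration.

```python
ALPHABET_SIZE = 10 + 26 + 26  # digits + lowercase + uppercase letters


def count_strings(min_len: int, max_len: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Number of distinct strings with length between min_len and max_len."""
    return sum(alphabet_size ** n for n in range(min_len, max_len + 1))


strings = count_strings(3, 7)
images = strings * 100  # assumed: about a hundred renderings per string

print(f"{strings:.2e} strings")  # 3.58e+12 strings
print(f"{images:.2e} images")    # 3.58e+14 images
```

As comment 6 points out, this is the extreme case: if the captchas are dictionary words rather than random strings, the space to enumerate is far smaller.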
