A few days ago, I discovered what “Stop spam, read books” means.

You may have seen one of these annoying boxes when you tried to submit a comment to a blog or a forum. Of course, these boxes are used to stop spam. The “read books” part comes from the fact that the box shows two words: one is computer-generated, and the other is scanned from an old text.

You really only have to get the computer-generated word right in order to post your comment, but most people try to answer both of them anyway. Since the other word is being used to digitize old books, reCAPTCHA doesn’t yet know what that word is. You can enter it completely wrong and it will still take it.

So, someone can completely mess up the word digitizing through this process:

  1. Look at the two words and work out which one’s from a book and which one’s computer-generated. Usually, the computer-generated word is more distorted.
  2. Enter the two words, BUT… for the word that’s not computer generated, enter some garbage, like “kMMMMZy6%dh555Z{}[vn}” (no, that’s not my password.) For example, on the box shown in this blog post, one could enter “huntress fdupgo7x8” and pass.
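The steps above rely on the check looking only at the computer-generated word. Here is a minimal sketch of that idea in Python. It is a toy model of the behavior described in this post, not reCAPTCHA's actual code, and the function name `toy_check` and its parameters are made up for illustration.

```python
def toy_check(control_word: str, typed_control: str, typed_unknown: str) -> bool:
    """Toy model: accept the submission if the known (control) word matches.

    Assumption: the scanned book word cannot be verified yet, so whatever
    the user typed for it is ignored when deciding pass or fail.
    """
    # typed_unknown is deliberately unused: there is nothing to compare it to.
    return typed_control.strip().lower() == control_word.strip().lower()


# The example from the post: the right control word plus garbage still passes.
print(toy_check("huntress", "huntress", "fdupgo7x8"))
```

Under this model, the garbage typed for the book word never affects whether the comment gets posted, which is exactly the hole being described.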

Of course, this is just a theory. There might be some gnomes behind the scenes in the reCAPTCHA office scrutinizing user inputs, or they might use Google’s technology to filter out this garbage.

Anyway, that’s one of the holes in the reCAPTCHA system. Another is that word recognition software can decipher these words. Maybe.

  1. Yeah, I’ve been writing profanity into digital books this way for years now.
    However, it recently occurred to me that this has probably occurred to them, and that they have probably enacted a very simple countermeasure: feed the word in question into multiple reCAPTCHAs, and take the most common input.
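The countermeasure suggested in the comment above (show the same unknown word to several people and keep the most common answer) could look roughly like this. It is only a sketch of the commenter's guess, not how reCAPTCHA is known to work. The name `consensus` and the thresholds `min_votes` and `min_share` are hypothetical values picked for the example.

```python
from collections import Counter


def consensus(answers, min_votes=3, min_share=0.6):
    """Return the most common answer if enough people agree, else None.

    min_votes and min_share are made-up thresholds for illustration only.
    """
    # Normalize case and whitespace, and drop empty submissions.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if len(normalized) < min_votes:
        return None  # not enough answers collected yet
    word, count = Counter(normalized).most_common(1)[0]
    # Accept the winner only if it holds a large enough share of the answers.
    return word if count / len(normalized) >= min_share else None


# One prankster among three honest users does not change the result.
print(consensus(["morning", "Morning", "fdupgo7x8", "morning "]))
```

With a scheme like this, a single garbage entry is simply outvoted, so the attack would only work if most of the people shown a given word typed the same wrong thing.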

