This all started a couple of months ago. Our little corner of the blogosphere was in its periodic slump, and i thought i would spice things up. Being the mischievous kid that i am, i decided to write a little applet to break Qwaider’s famed and/or touted spam protection mechanism. (Un)fortunately, i got extremely busy immediately thereafter, and dropped the idea.

~~ Anti-Spam in Blog Engines ~~

While i no longer plan to test Qwaider’s spam system, I think the idea and the discussion are still valid. Before i begin, i must state that i am not an expert in spam fighting, so i simply can’t go into the messy details of this field. However, for systems such as blog comments, one can imagine a set of techniques that are used to detect whether something is spam or not:

  • Content: This is simple; if the comment contains certain flag words, or if the comment body contains multiple outbound links, then this is spam with high probability. While most systems, such as WP or Blogger, allow one link to be part of the comment body, some immediately send any comment with a link to moderation.
  • Embedded Tags: WP and Blogger allow you a restricted set of HTML formatting tags in your comment body. But this could also be malicious if you add script tags. Again, some engines take the defensive strategy of stripping out all HTML tags.
  • IP Address: Many systems build and/or rely on databases of IP addresses that are known to produce spam, so if your comment comes from one of these addresses, then it is marked as spam.
  • DNS Checks: Many systems check whether the domain from which the comment originates is known to be a “safe” domain and whether it matches the other IP info.
  • Speed: It is unlikely that a human being will comment on 10s or 100s of blog posts within a few seconds, so if the system sees multiple comments from the same machine arriving in quick succession, then that signals spam.
  • Reputation: After your system ages a bit, it can assign reputation to the different addresses (clients) that communicate with it. People who frequently comment on a certain blog would gain reputation in the system for being trusted clients. Through their many submissions, the system can assign them a pool of addresses from which they are likely to send their comments. That, along with their unique system identifiers (usernames, email addresses, and their internal system GUIDs), would allow the anti-spam engine to trust comments that match their IDs. This can allow the system to relax the strict rules for reputable users, and enforce them for the known bad guys.
  • Manual vs. Automatic Submissions: One can quickly write a program that issues the HTTP requests directly to submit a comment to a blog. However, legitimate users are unlikely to do that. So one can instrument the comment HTML form (via the submit button or JS handlers triggered by a keydown in the comment box) so that it flags that the comment has been typed by a human rather than sent automatically by a malicious script.
  • Page Rendering Triggers: Humans will need to view the post before they can comment on it. So the server can generate a timed unique ID for a page when it is requested by a client, and require the comment forms to use that ID in sending their comments.
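
As a rough sketch of how three of the checks above (content, speed, and page rendering triggers) might fit together, here is a small Python example. Every name, threshold, and flag word in it is made up for illustration; it is not how WP, Blogger, or Qwaider’s engine actually works, and a real engine would combine many more signals.

```python
import hashlib
import hmac
import re
import time
from collections import defaultdict, deque

# All names and thresholds below are hypothetical, chosen only for illustration.
FLAG_WORDS = {"viagra", "casino"}
MAX_LINKS = 1             # one link allowed in the comment body
WINDOW_SECONDS = 10.0     # speed check: look-back window
MAX_PER_WINDOW = 3        # comments allowed per address inside the window
TOKEN_TTL = 3600.0        # how long a page token stays valid, in seconds
SECRET = b"server-side-secret"

LINK_RE = re.compile(r"https?://", re.IGNORECASE)
_recent = defaultdict(deque)  # address -> timestamps of recent comments


def content_suspicious(body):
    """Content check: a flag word, or more links than allowed."""
    words = set(re.findall(r"[a-z']+", body.lower()))
    return bool(words & FLAG_WORDS) or len(LINK_RE.findall(body)) > MAX_LINKS


def too_fast(address, now):
    """Speed check: too many comments from one address inside the window."""
    stamps = _recent[address]
    while stamps and now - stamps[0] > WINDOW_SECONDS:
        stamps.popleft()
    stamps.append(now)
    return len(stamps) > MAX_PER_WINDOW


def _sign(post_id, issued):
    message = f"{post_id}:{issued}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def issue_page_token(post_id, now):
    """Page rendering trigger: timed ID handed out when the post is served."""
    issued = str(int(now))
    return f"{issued}:{_sign(post_id, issued)}"


def token_valid(post_id, token, now):
    """Accept a comment only if its token is ours and has not expired."""
    issued, _, sig = token.partition(":")
    expected = _sign(post_id, issued)
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    return 0 <= now - int(issued) <= TOKEN_TTL


def is_spam(address, post_id, body, token, now=None):
    """Flag the comment if any one of the three checks fails."""
    now = time.time() if now is None else now
    return (not token_valid(post_id, token, now)
            or content_suspicious(body)
            or too_fast(address, now))
```

The page token is signed with a server-side secret so that a client cannot forge one without first requesting the post, and the timestamp inside it lets the server reject stale pages. Note that a comment rejected by an earlier check is not counted by the speed check in this sketch.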

~~ Limitations ~~

These are probably the major techniques that many blog engines (such as Qwaider’s) use in their anti-spam engines. Of course, isolated blog engines cannot use some of the other techniques that are used by anti-spam email engines, and even anti-spam hardware. Unfortunately, in a blogging environment, you do not have a contact list to assist you in flagging messages, you cannot (correctly) anticipate who will post a comment on a particular post, you cannot use in-hardware checks on the actual IP messages, and you cannot compute checksums on the data and such.

~~ Attack Strategy ~~

At this point, the strategy for a successful “attack” applet is simple. It has to mimic the actions of an actual community of human readers. This forces us to re-examine the goals of the applet: do we intend to overload the server and bring down the site? Do we intend to add a bunch of links to undesired sites? Or do we intend to litter the site with a bunch of meaningless “comments” to annoy the owner and readers? This turns out to be an important point that i’ll get back to later. However, for now, let us assume that our goal is to just litter the site with a bunch of garbage comments.

Given the goal and strategy above, the implementation is almost straightforward. One could write a small applet and link to it from a site within the community of the target blogger. For example, i could link to that applet from here since most of my readers read Qwaider’s blog as well. Once the applet is loaded on a bunch of machines, it can go in a “round-robin” fashion and each time pick an instance to deliver the spam comment. For example, if the applet was running on my machine, Hani’s, KJ’s, and Tooteh’s, at each epoch only one of us would deliver that spam (assuming that the epochs are well spaced). Let’s say that it is my turn to deliver the spam: my applet would request a random recent post from Qwaider’s blog and load it in an internal browser (the same kind that is used to automate web UI testing). It would then grab a random sentence from the post, simulate a user input event in the comment form, and submit a comment with that random quote from the article. These automated browsers exist for testing web UI and AJAX, and because they can fake the browser identification, the server cannot tell whether it served a request to a fake browser or a real one.

From the server’s perspective, the distributed nature of these applets ensures that each spam comment legitimately originates from a different address. The content of each message is not suspicious because it uses the same credentials as the users it is mimicking. It also simulates browser behavior, so that it is indistinguishable from normal activity.
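
To make the first point concrete, here is a toy simulation (not a real attack tool, and not any engine’s actual check) of a per-address speed check looking at the same twelve comments twice: once all from a single address, and once spread round-robin over four addresses. The window, limit, epoch length, and address labels are all assumptions picked for the example.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0   # hypothetical look-back window
MAX_PER_WINDOW = 2      # hypothetical per-address limit


def count_flagged(events):
    """Count comments that a per-address sliding-window check would flag.

    `events` is a list of (timestamp, address) pairs in time order.
    """
    recent = defaultdict(deque)
    flagged = 0
    for now, address in events:
        stamps = recent[address]
        # Forget timestamps that have fallen out of the window.
        while stamps and now - stamps[0] > WINDOW_SECONDS:
            stamps.popleft()
        stamps.append(now)
        if len(stamps) > MAX_PER_WINDOW:
            flagged += 1
    return flagged


EPOCH = 20.0  # hypothetical seconds between comments
ADDRESSES = ["addr-a", "addr-b", "addr-c", "addr-d"]  # placeholder clients

# Twelve comments, all sent from a single address.
single = [(i * EPOCH, "addr-a") for i in range(12)]
# The same twelve comments, with the clients taking turns round-robin.
spread = [(i * EPOCH, ADDRESSES[i % len(ADDRESSES)]) for i in range(12)]

print(count_flagged(single), count_flagged(spread))
```

With these numbers the single-address run trips the limit repeatedly, while the round-robin run is never flagged, because each address only shows up once every four epochs. The total volume seen by the server is identical in both cases.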

~~ What is Spam? ~~

This raises the question of whether this really is spam or not. After all, the spammers did not gain anything from their actions except the annoyance of the blog owner and readers. However, what is spam really? If you as a blog owner are getting a flood of meaningless “comments” that are obstructing the dialog between you and your readers, then that is logically classified as spam. Notice that this does not rely on computational or statistical properties of the “spam” messages, but rather on their purpose and effect. So this means that all those automated anti-spam checks will never be able to catch all spam messages. Do “human tests” and CAPTCHAs work? No, because by this definition humans could just as easily type spam messages into comment boxes by hand.

In the Computer Science literature, there is something known as the “Turing Test”, in which a human chats on his machine with another entity. The test is passed if the human chatter cannot tell whether he is talking with a computer or another human. Currently no program can pass the Turing test. Does this mean that we will only be able to solve the spam problem completely when we finally pass the Turing test? Actually, no! For those of us who live in areas with mail services, how many times do we get a letter in the mail and throw it out because we think it is just junk, only to discover later on that it was actually something “real” and important? This actually happens quite often. So if even humans cannot always distinguish spam correctly and accurately, then it is really impossible to solve the spam problem completely, correctly, and accurately. Then again, is there a single source that can say whether something is spam or not? What you might consider a spam message, i might consider a legitimate and important message. Take the weekly local sales flyers that you get in the mail, for example. To me, that is always unsolicited junk mail. To others, that is valuable info. So truly, spam is in the eye of the beholder.

~~ So is Working on Spam Problems Hopeless? ~~

Although we can never identify all spam correctly and accurately, there is still a lot of work to be done in identifying messages that are obviously spam. That is where the crux of the anti-spam work and research is focused, and we still have a long way to go.

Anyway, these are my thoughts on this subject. I picked Qwaider as an example only because he is relatively well-known in our community, he built a good anti-spam engine, and i think he can take being picked on :-)