Spam is in the Eye of the Beholder

This all started a couple of months ago. Our little corner of the blogosphere was in its periodic slump, and i thought to spice things up. Being the mischievous kid that i am, i decided to write a little applet to break Qwaider’s famed and/or touted spam protection mechanism. (Un)fortunately, i got extremely busy immediately thereafter, and dropped the idea.

~~ Anti-Spam in Blog Engines ~~

While i no longer plan to test Qwaider’s spam system, i think the idea and the discussion are still valid. Before i begin, i must state that i am not an expert in spam fighting, so i simply can’t go into the messy details of this field. However, for systems such as blog comments, one can imagine a set of techniques that are used to detect whether something is spam or not:

  • Content: This is simple; if the comment contains certain flag words, or if the comment body contains multiple outbound links, then this is spam with high probability. While most systems, such as WP or Blogger, allow one link to be part of the comment body, some immediately send any comment with a link for moderation.
  • Embedded Tags: WP and Blogger allow you a restricted set of HTML formatting tags in your comment body. But this could also be malicious if you add script tags. Again, some engines take the defensive strategy of stripping out all HTML tags.
  • IP Address: Many systems build and/or rely on databases of IP addresses that are known to produce spam, so if your comment comes from one of these addresses, then it is marked as spam.
  • DNS Checks: Many systems check whether the domain from which the comment is originating is known to be a “safe” domain and whether it matches the other IP info.
  • Speed: It is unlikely that a human being will comment on 10s or 100s of blog posts within a few seconds, so if the system sees multiple comments from the same machine being issued in quick succession, then that signals spam.
  • Reputation: After your system ages a bit, it can assign reputation to the different addresses (clients) that communicate with it. People who frequently comment on a certain blog would gain reputation in the system for being trusted clients. Through their many submissions, the system can assign them a pool of addresses from where they are likely to send their comments. That along with their unique system identifiers (usernames, email addresses, and their internal system GUIDs) would allow the anti-spam engine to trust comments that match their IDs. This can allow the system to relax the strict rules for reputable users, and enforce them for the known bad guys.
  • Manual vs. Automatic Submissions: One can quickly make a program to directly invoke the HTTP commands to submit a comment to a blog. However, legitimate users are unlikely to do that. So one can overload the comments HTML form (via the button or JS commands triggered by a keydown in the comment box) so that it flags that this comment has been typed by a human rather than sent automatically by a malicious script.
  • Page Rendering Triggers: Humans will need to view the post before they can comment on it. So the server can generate a timed unique ID for a page when it is requested by a client, and require the comment forms to use that ID in sending their comments.
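
To make three of these checks a bit more concrete (content, speed, and page rendering triggers), here is a minimal Python sketch. It is only an illustration under my own assumptions, not how WP, Blogger, or Qwaider’s engine actually work: the function names, the secret key, and all thresholds (one link, three comments per minute, one-hour token lifetime) are hypothetical.

```python
import hashlib
import hmac
import re
import time
from collections import defaultdict, deque

SECRET = b"change-me"      # hypothetical server-side secret
TOKEN_MAX_AGE = 3600       # assumed lifetime of a page token, in seconds
MAX_LINKS = 1              # assumed number of links tolerated in a comment
WINDOW = 60                # assumed length of the rate window, in seconds
MAX_PER_WINDOW = 3         # assumed comments allowed per address per window

_recent = defaultdict(deque)  # address -> timestamps of its recent comments


def _sign(post_id, ts):
    """HMAC over the post ID and the time the page was rendered."""
    msg = f"{post_id}:{ts}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def issue_token(post_id, now=None):
    """Timed ID handed to the client when the post page is rendered."""
    ts = str(int(time.time() if now is None else now))
    return f"{ts}:{_sign(post_id, ts)}"


def token_ok(post_id, token, now=None):
    """True if the token was issued for this post and has not expired."""
    now = time.time() if now is None else now
    try:
        ts, sig = token.split(":", 1)
        age = now - int(ts)
    except ValueError:
        return False
    same = hmac.compare_digest(sig.encode(), _sign(post_id, ts).encode())
    return same and 0 <= age <= TOKEN_MAX_AGE


def too_fast(address, now=None):
    """Record this comment and report whether the address exceeds the rate."""
    now = time.time() if now is None else now
    q = _recent[address]
    while q and now - q[0] > WINDOW:
        q.popleft()
    q.append(now)
    return len(q) > MAX_PER_WINDOW


def looks_like_spam(post_id, address, body, token, now=None):
    """Combine the three checks; any single failure flags the comment."""
    links = len(re.findall(r"https?://", body, flags=re.IGNORECASE))
    bad_content = links > MAX_LINKS
    bad_token = not token_ok(post_id, token, now)
    bad_speed = too_fast(address, now)  # always called so the rate is tracked
    return bad_content or bad_token or bad_speed
```

A real engine would weigh these signals together with reputation and IP databases instead of rejecting on any single failure; the sketch only shows how cheap each individual check is to compute.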

~~ Limitations ~~

These are probably the major techniques that many blog engines (such as Qwaider’s) use for their anti-spam checks. Of course, isolated blog engines can not use other techniques that are used by anti-spam email engines, and even anti-spam hardware. Unfortunately, in a blogging environment, you do not have a contact list to assist you in flagging messages, you can not (correctly) anticipate who will post a comment on a particular post, you can not use in-hardware checks on the actual IP messages, and you can not compute checksums on the data and such.

~~ Attack Strategy ~~

At this point, the strategy for a successful “attack” applet is simple. It has to mimic the actions of an actual community of human readers. This forces us to re-examine the goals of the applet; do we intend to overload the server and bring down the site ? do we intend to add a bunch of links to undesired sites ? or do we intend to litter the site with a bunch of meaningless “comments” to annoy the owner and readers ? This turns out to be an important point that i’ll get back to later. However, for now, let us assume that our goal is to just litter the site with a bunch of garbage comments.

Given the goal and strategy above, the implementation is almost straightforward. One could write a small applet and link to it from a site within the community of the target blogger. For example, i could link to that applet from here since most of my readers read Qwaider’s blog as well. Once the applet is loaded on a bunch of machines, it can go in a “round-robin” fashion and each time pick an instance to deliver the spam comment. For example, if the applet was running on my machine, Hani’s, KJ’s, and Tooteh’s, at each epoch, only one of us will deliver that spam (assuming that the epochs are well spaced). Let’s say that it is my turn to deliver the spam; my applet would request a random recent post from Qwaider’s blog and load it in an internal browser (the same kind that is used to automate web UI testing). It then grabs a random sentence from the post, simulates a user input event in the comment form, and submits a comment with that random quote from the article. These automated browsers exist for testing web UI and AJAX, and because they fake the browser identification, the server can not know whether it served a request to a fake browser or a real one.

From the server’s perspective, the distributed nature of these applets ensures that each spam comment is legitimately originating from a different address. The content of each message is not suspicious because it uses the same credentials of the users it is mimicking. It also simulates browser behavior so that it is indistinguishable from normal activity.

~~ What is Spam ? ~~

This raises the question of whether this really is spam or not. After all, the spammers did not gain anything from their actions except the annoyance of the blog owner and readers. However, what is spam really ? If you as a blog owner are getting a flood of meaningless “comments” that are obstructing the dialog between you and your readers, then that is logically classified as spam. Notice that this does not rely on computational or statistical properties of the “spam” messages, but rather on their purpose and effect. So this means that all those automated anti-spam checks will never be able to catch all spam messages. Do “human-tests” and Captchas work ? No! Because by this definition, humans could just as easily type spam messages into comment boxes manually.

In the Computer Science literature, there is something known as the “Turing Test” in which a human chats on his machine with another entity. The test is passed if the human chatter can not distinguish whether he is talking with a computer or another human. Currently no program can pass the Turing test. Does this mean that we will only be able to solve the spam problem completely when we finally pass the Turing test ? Actually, No! For those of us who live in areas with mail services, how many times do we get a letter in the mail and throw it out because we think it is just junk, only to discover later on that it was actually something “real” and important ? This actually happens quite often. So if even humans can not always distinguish spam correctly and accurately, then it is really impossible to solve the spam problem completely, correctly, and accurately. Then again, is there a single source that can say whether something is spam or not ? What you might consider a spam message, i might consider a legitimate and important message. Take the weekly local sales flyers that you get in the mail for example. To me, that is always unsolicited junk mail. To others, that is valuable info. So truly, spam is in the eye of the beholder.

~~ So is Working on Spam Problems Hopeless ? ~~

Although we can never identify all spam correctly and accurately, there is still a lot of work to be done in identifying messages that are obviously spam. That is where the crux of the anti-spam work and research is focused, and we still have a long way to go.

Anyway, these are my thoughts on this subject. I picked Qwaider as an example only because he is relatively well-known in our community, he built a good anti-spam engine, and i think he can take being picked on 🙂

12 Comments.

  1. There are a bunch of things that could help that are not mentioned here. A couple of them are “spam traps”, where the spammer indirectly identifies himself while attempting spam for the first time, causing himself and his IPs to get flagged for further processing.

    The second thing is white-list processing, for IPs and identities, which is what many of the very popular sites (ZDnet, Cnet, Gizmodo…etc) use. The person has to be cleared to comment, either by forcing them to register, or by adding their IP (manually or automatically) to a shortlist of users that are allowed to comment.

    There are many tricks up the sleeves of the anti-spam world, some being developed right now, that identify heuristics and utilize telemetry information. These, unlike what you have concluded above, can utilize the underlying network infrastructure to trace the comment back to its original site and see if, say, there’s a shared router that has previously spammed the site, taxing it with a higher spam mark just because it’s shared with someone who has spammed before.

    What you describe above of Hani, KJ and you doing the spamming is EXACTLY the scenario of a bot-net: a network of compromised machines carrying out a coordinated attack that might disrupt the service. Luckily, those types of attacks are getting simpler and simpler to detect and stop. If I recall correctly, Google, Hotmail and others are able to detect and react in less than 45 seconds (which is still plenty of time to send a few million spams). Spammers then resorted to what’s known as “fast flux” networks, which are bot-nets that keep changing IPs very rapidly. Again, these are being identified and blocked in a reasonable amount of time.

    Spam is not hacking, it’s not using the system beyond what it’s designed to do. In fact, it’s using the system in EXACTLY what it was designed to do. No one would dare say, “spam days are numbered” but I can confidently say that Captcha’s days are numbered (just today, one spammer broke my reCaptcha) Luckily, I had other precautions and the spam didn’t get through

    If you’re really bent on spamming my blog, you don’t have to go through a lot of hoops to do it, just copy-paste and you should be good to go. I seriously encourage you to go ahead and try, I mean, what better testing can I get?

    Go for it :), I’m a good sport…. I promise!

  2. Qwaider: Thanks for your comment.
    Yeah, i bet there are a lot of things that are going on in the anti-spam world that i don’t know much about, and i think it is a really fascinating and interesting field.

    A couple of comments on your additions:

    #1: I really hate whitelisting-through-registration approach. I hate being forced to register on a news site for example, if i only just wanted to leave one comment. I don’t want to have a million usernames and passwords for the many sites i visit. I guess a global ID (openID) and single sign-on would solve this, but this is not there yet.

    #2: As for tracing the back-routes and utilizing the underlying infrastructure. I guess i meant for people like me who have their engine on a shared server (separate from WP), or for people who build and host their own blogging engines, it is probably still infeasible to really do nifty infrastructure checks in an efficient manner (specially since the hosting resources might not be that good to begin with).

    #3: I am interested in your input on the bot-net attack that i described above. In my mind, i didn’t have these participating nodes sending “spam” at a high speed. If i let each of them post a spam message every 30 minutes or so, the server probably won’t notice a storm of activity, but rather a normal flow, yet the annoyance will still be delivered to the blog owner and readers. What do you think ?

    And by the way, i agree with you; Captchas are just a temporary solution and their days are numbered.

  3. What you’re describing in #3 is called remaining within the threshold so you don’t trigger antispam. Fortunately, that means that the level of noise will remain very low, totally defeating the purpose and the feasibility model of most spammers (spam as many as you can in as fast as you can because conditions change)
    So yeah, it might be annoying but it’s not that big of a deal. Especially when it doesn’t contain any special links (or in other words, the bread and butter of spammers)

  4. What I don’t understand about the spammers is that they seem to think it’s effective.

    There must be someone out there clicking their links. On both emails and wordpress comments, as soon as I get spam, I make sure I do my part by pressing the report spam button. WordPress reports it to akismet, and my webmail is gmail, which I’ve found handles spam much better than yahoo. In fact yahoo sends their own spam, the bastards (particularly if you didn’t go deep into their preferences and configure the marketing settings). I haven’t tried hotmail.

  5. Qwaider: Yeah, you are right about that. I guess i was going for something subtle and annoying .. but then again, it seems that it will have to do more in order to be *really* annoying.

    Hani: I know i know .. and this is what kills me. You’d think by now that all internet users can recognize spam and ignore it. But it seems that every now and then somebody clicks on a link somewhere. Specially if the true link address was hidden and not visible directly. For example, i might be telling you about a super cool and awesome site, but it turns out to be a spam site. 🙂 I don’t know. But somehow, spammers still find value in spam.

  6. Oh it’s effective alright, folks you have to understand the magnitude of these attacks to realize that they’re playing the numbers game to their advantage.
    So if they manage to get 1 in 1,000 spams through. That’s few million sites with tons of spam.

    There is still hope, there are very intelligent people working around the world to keep this under control, but again, the numbers are against them. For every PhD working to combat spam, there’s 100 in Russia, china, turkey, Israel and many other places working against them.

  7. Some of the leading anti-spam programs come from Israel.

    Know why?

    1. Because some of the top people in high tech live in Israel.

    2. Because Muslim terrorists try to flood Israel with it:

    http://www.ynetnews.com/articles/0,7340,L-3324623,00.html

  8. Emet, And
    3. Because a lot of spam comes from Israel also
    4. Because any site that has any anti-Israeli-oppression/pro-palestinian-rights and is NOT terrorist by nature is attacked by Israeli spammers/hackers

    So with home grown bastards like that, why wouldn’t there be some good anti-spam programs coming from Israel? They have the perfect breeding ground

  9. Like the article says, FAR more is targeted at Israel by Islamic cyber-terrorists than coming from Israel (and it is a crime here- unlike in the barbaric Jahiliyahland around us.

  10. Can you point me to the article in the Israeli law that prohibits terrorism? Because apparently it’s not being practiced at all. Or maybe it’s like when Israeli’s murder little children it’s self defence while when a militant attacks the Israeli army he’s a terrorist!
    Oh and does Israel has any anti-cyber-terrorism law enforcement agency? Because I have a ton of IPs with all documented violations that I would like to submit to them

  11. There are lots of laws that prohibit terrorism. It is being practiced, which is why you people are so sad.

    Terrorism and counter-terrorism are not the same thing. You need to learn that- while you still can.

    For instance, Arab terrorists use their own women and children as human shields (or combatants, but that isn’t an issue). While trying to murder Israeli civilians, Arab civilians die. Legally and morally, ALL resulting deaths are on the heads of the Arab terrorists.

    EVEN THE PALEOS ADMIT THAT THEY USE THEIR OWN AS HUMAN SHIELDS

    Press Releases
    Ref: 54/2008
    Date: 15 June 2008
    Time: 09:30 GMT

    Stop This Tragedy!

    PCHR Concerned Over Casualties Caused by Continuing Internal Explosions: 8 Palestinians, Including a Child, Killed and More than 40 Injured After Explosion in Beit Lahia

    PCHR is deeply concerned about the recurrence of internal explosions as a result of weapons being manufactured, and stored, in areas populated by civilians. These actions are threatening the lives and property of Palestinian civilians. PCHR calls upon Palestinian resistance groups to take immediate measures to ensure the non-recurrence of such explosions. The most recent explosion, in Beit Lahia on June 12, killed eight Palestinians, including an infant, and injured at least forty others.

    According to investigations conducted by PCHR and eye-witness testimonies, at approximately 13:30 on Thursday, 12 June, a huge explosion occurred in a 400-square-meter house belonging to ‘Abdul ‘Azim Khaled Hammouda in the centre of Beit Lahia town, in the northern Gaza Strip. The house was completely destroyed and dozens of neighboring houses were also damaged, five of them seriously. Ambulances and civil defense crews rushed to the area, and removed victims’ bodies from beneath the ruins of the destroyed house, and neighbouring houses. The victims included 4-month-old Nour Majdi Hammouda, who was killed whilst inside her family home, and 16-year-old Mahmoud ‘Ataya Hammouda, who was killed whilst walking near the site of the explosion.

    The ‘Izziddin al-Qassam Brigades (the armed wing of Hamas) stated in a press release issued on 13 June, that 6 of its members were killed “while they were in the final stage of preparation for a special Jihad mission.” Those members have been identified as:

    1) Ashraf Na’im Mushtaha, 32;
    2) Hassan Mohammed Abu Shaqfa, 28;
    3) Majid ‘Aadel Hammouda, 28;
    4) Mohammed Sabri Abu Naja, 25;
    5) Mohammed Hamdan Miqdad, 22;
    6) Ahmed Muneer Subaih, 20.

    In light of the above, PCHR:

    1) Warns of the dangers caused by continued manufacturing or storage of explosive devices by Palestinian resistance groups in civilian-populated areas, which threaten the lives of Palestinian civilians and violate international humanitarian law.

    2) Calls upon Palestinian resistance groups to take effective steps to ensure the non-recurrence of such incidents.

  12. Makes me sick to just read your propaganda … Do you even think before you utter these words?
    Using women and children for terrorism? Do you even have half a mind to think with? Or has it all be brainwashed in the Zionist propaganda!
    When Israel is KILLIN in cold blood it’s not being viewed as Terrorism. I have no idea how you even call yourself HUMAN. It disgusts me to even think about it. While people who are OCCUPIED and have EVERY RIGHT TO DEFEND THEMSELVES are being branded Terrorists. But the one cowerdly sitting inside his tank Blowing the family of Ghalya to shreads is not a terrorist
    You people disgust me! Seriously. And I don’t mean Jews or Zionists or Israelis. I mean the people who justifyin murdering innocent people while they curse others who are self defending themselves!
    You’re a shame to humanity!
