Spam topics
Colin Fahey

1. Introduction

A "spam message", in the context of electronic communication, typically has the following definition:
spam message : an uninvited message sent to a large number of recipients
It is commonly believed that a spam message for a given product or service, sent to thousands or millions of people, will generate very few actual customers for the sender of the message.  Even if the ratio of attracted customers to total spam message recipients is only 1/10000, or even much lower, the overhead cost of spamming is so low that a net profit is possible.  In fact, for many spam products, selling a single product, or scamming a single victim, might be the break-even point for the business model. 
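As a rough illustration of the arithmetic behind this claim, the following C sketch computes the net profit of a campaign and the response ratio at which it breaks even.  The function names and every number used with them are hypothetical assumptions chosen for illustration; they are not measurements of any real spam campaign.

```c
#include <assert.h>

/* Net profit of a hypothetical spam campaign, in the same currency
   unit as the inputs.  All parameters are illustrative assumptions. */
double campaign_net_profit( double recipients,
                            double response_ratio,
                            double profit_per_sale,
                            double campaign_cost )
{
    assert( recipients >= 0.0 && response_ratio >= 0.0 );
    return recipients * response_ratio * profit_per_sale - campaign_cost;
}

/* Response ratio (customers / recipients) at which the campaign
   exactly recovers its cost. */
double break_even_ratio( double recipients,
                         double profit_per_sale,
                         double campaign_cost )
{
    assert( recipients > 0.0 && profit_per_sale > 0.0 );
    return campaign_cost / (recipients * profit_per_sale);
}
```

For example, assuming 1000000 recipients, a response ratio of 1/10000, a profit of 20 USD per sale, and a total campaign cost of 500 USD, campaign_net_profit() gives 1500 USD, and break_even_ratio() gives 1/40000.  Different assumed costs change the numbers, but not the shape of the argument.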
Several factors contribute to the perception of the spam message phenomenon as an important social and technological problem:
(1) Spam messages waste millions of hours of humanity's time each day on the task of differentiating between spam messages and "legitimate" messages; 

(2) Spam messages consume a significant fraction of total Internet bandwidth, which both slows other traffic and possibly raises overall bandwidth cost; 

(3) Spam messages consume a large amount of storage space on mail servers, sometimes actually making it temporarily impossible for "legitimate" messages to be received; 

(4) Spam messages can be used for campaigns to attempt "identity theft" or other types of fraud.  Spam messages can also be used to propagate computer viruses; 
This document describes the flaws of many of the futile and harmful methods proposed or employed in past attempts to reduce the problems listed above. 

This document offers an alternative solution to the problems of spam messages: a solution that is simple and reliable, and that avoids censorship, elimination of anonymity, imposed restrictions, demands for payment, and centralized services.

2. The original "spam"

"Spam" is a pork and ham food product produced by Hormel Foods ( http://www.hormel.com )
spam_hormel_foods_product.jpg
"Spam Classic is a conveniently packaged canned meat product made of 100 percent pure pork and ham.  Spam Classic contains 180 calories per two-ounce serving.  Spam luncheon meat first was produced in 1937.  It was one of the first convenient, moderately priced and great tasting meat products on the market."
Hormel Foods has written an article commenting on the use of the term "spam" to refer to uninvited e-mail: http://www.spam.com/ci/ci_in.htm
From the article:
Ultimately, we are trying to avoid the day when the consuming public asks, "Why would Hormel Foods name its product after junk e-mail?"
The use of the word "spam" to mean "making insane with relentless, monotonous bombardment" is directly attributed to a "Monty Python's Flying Circus" (humorous BBC television series) skit that celebrates Spam.  A restaurant patron discovers, with chagrin, that everything on the menu contains Spam.  For example, "[...] spam spam spam egg and spam; spam spam spam spam spam spam baked beans spam spam spam...".  The mention of Spam rouses Viking restaurant patrons to begin singing: "Spam, spam, spam, spam. Lovely Spam, wonderful Spam!"  The whole experience causes the frustrated restaurant patron to become insane. 
spam_monty_python_flying_circus_spam_skit.gif
"Spam" skit by "Monty Python's Flying Circus"
A detailed article on the "Origin of the term 'spam' to mean [network] abuse" was written by Brad Templeton : http://www.templetons.com/brad/spamterm.html

3. The reason for this document

I am very concerned by various "solutions" to the spam phenomenon that involve the following:
(1) Invasion of privacy; 

(2) Censorship; 

(3) Payment and cooperation with a commercial entity; 

(4) Making certain types of Internet activity illegal; 
The day I started writing this article (2004.03.29) I heard a report on the BBC World Service (rebroadcast on a local public radio station), featuring an interview with a person affiliated with a company that was offering a "new kind of spam filtering" on a paid membership basis. 

The method relied on monitoring Internet traffic, searching for "identical" e-mail messages coming from a common source.  Suspected spam e-mail is analyzed further to discover any links to Internet sites previously associated with spam efforts. 

This service, and similar mechanisms, will fail due to various scenarios described in this document. 

However, my concern when hearing proposals similar to the one mentioned in the news broadcast is that the public will embrace the proposed solution without fully considering the consequences, which might involve: invasion of privacy, censorship, corporate interests, or making certain kinds of Internet activity illegal.

Clients of various Internet services are subject to the contracts of the service providers.  I have no complaint with that because I can choose to avoid service providers with terms I do not like.  My concern is that the current conversations about solutions to the spam phenomenon will lead to a wide acceptance of terms that go against principles I consider important.  I believe that a significant fraction of the people who would willingly accept such terms might not be so accepting if the impact of such terms on privacy, freedom of communication, and freedom from the influence of corporate interests were described in a way that makes the issues very relevant and personal. 

4. Gallery of spam messages

This section presents contemporary examples of spam, with some analysis and related information.  Although this section is based on spam I have personally received, I believe my experience is typical of users of e-mail. 

This section is intended to sketch the basic principles of spam.  An attempt at a formal definition of the term "spam" will be postponed until the next section.  Presenting examples in this section will make subsequent formal discussion less abstract. 

4.1 Spam messages which I have received

Over the past several months I have received an average of approximately 100 uninvited messages each day, and I generally receive several computer viruses as e-mail message attachments each day. 

Earlier this year, from 2004.01.15 through 2004.02.08, a period of 25 days, I received 2872 spam messages, of which 207 carried computer viruses; this corresponds to an average of approximately 115 spam messages each day, and an average of 8 computer virus attachments each day. 
spam_typical_inbox.jpg
A portion of my e-mail "Inbox" on 2004.03.29 as displayed by the "Microsoft Outlook Express 5" computer program.  On this date I received 9 "legitimate" messages, 77 spam messages, and 2 computer virus attachments.

4.2 The sender name and the message subject of a spam message

One of the striking features of most spam messages is that the disingenuousness starts almost immediately with the alleged sender's name.  The fact that almost every spam message has a fake sender name cheapens the whole concept of the sender name.  Of course that is merely the beginning of the erosion of trust, but I nonetheless pause and consider the bizarre act of a spammer producing a fake sender name.  Spam messages promoting "male sexual performance" drugs or pornography often have sender names that are female.

Interestingly, the subject associated with a spam message often really does contain an accurate summary of the spam message.  But, as one can see in the small set of subject items above, some spammers believe that sensible descriptions of e-mail messages are not necessary. 

Eventually both the sender name and subject line will be recognized by the public at large as totally meaningless claims associated with the messages, which is a reflection of the actual technical fact: these fields are totally unreliable for determining the origin and content of e-mail messages. 

4.3 What is the spam all about?

The following list indicates the number of spam messages I received on three recent dates:
(1) 2004.03.29 : 77 spam messages total; 
(2) 2004.03.30 : 98 spam messages total; 
(3) 2004.03.31 : 121 spam messages total; 
The following is an approximate classification of the spam messages I received on those three dates:
MEDICATION:
  -------------------------------------------------------------------
  PENIS-ENLARGEMENT:
    Viagra, Cialis, NaturalGain,
    "Weekend Pill", Viagra Patch:       18/77,  17/98,  16/121
  ALTERNATIVE-SOURCE PRESCRIPTION
    MEDICATIONS/PSYCHOTROPIC DRUGS:
    Levitra, Phentermine, Vicodin,
      Valium, Ambien, Xanax, Tramadol,
      Lipitor, Propecia, Zocor:         14/77,  18/98,  19/121
    Marijuana-like product/
      Mood Enhancers/Herbal Meds:        1/77,   0/98,   0/121
  DIET/NUTRITION:
    Diet Pills/Patch:                    3/77,   3/98,   3/121
    Anti-Aging/HGH:                      1/77,   0/98,   1/121
  SMOKING:
    Cigarettes:                          1/77,   1/98,   3/121
  HEALTH AID:
    Snoring Control:                     1/77,   0/98,   0/121
  -------------------------------------------------------------------
                         TOTAL:  39/77(51%), 39/98(40%), 42/121(35%)


FINANCIAL:
  -------------------------------------------------------------------
  LOANS/CREDIT:
    Refinance Mortgage/Equity Loan:     13/77,  12/98,  11/121
    "Cancel Debt" (somehow):             0/77,   1/98,   8/121
    Car Loans:                           0/77,   2/98,   1/121
    Payday Cash Advance:                 1/77,   1/98,   0/121
    Unsecured MasterCard/Credit:         1/77,   0/98,   1/121
  INVESTING:
    Investor/Stock Alert:                5/77,   5/98,   3/121
  INSURANCE:
    Life Insurance:                      1/77,   1/98,   2/121
    Healthcare:                          1/77,   0/98,   0/121
    Auto/Warranties:                     1/77,   0/98,   0/121
  BUSINESS OPPORTUNITIES:
    "Work" on eBay:                      1/77,   6/98,   4/121
    Own Resort:                          1/77,   0/98,   0/121
    "Network Marketing":                 0/77,   0/98,   1/121
    Real-Estate Auctions:                0/77,   0/98,   1/121
  GAMBLING:
    Poker/"Earn Money Playing Lotto!":   0/77,   1/98,   2/121
  SPAMMING:
    Spam 27 million people:              0/77,   1/98,   0/121
  -------------------------------------------------------------------
                         TOTAL:  25/77(32%), 30/98(31%), 34/121(28%)


SOFTWARE/CONTENT:
  -------------------------------------------------------------------
  PORNOGRAPHY:
    Porn (farm sex, schoolgirls,
      girls gushing, web cam,
      monster cocks):                    1/77,   1/98,   6/121
  PARANOIA/SNOOPING:
    Software to Learn about People:      1/77,   0/98,   0/121
    Scan PC:                             1/77,   0/98,   0/121
    Keyboard Logger:                     0/77,   1/98,   0/121
  PIRACY:
    Cheap software/OS:                   2/77,   8/98,   5/121
    DVD copying:                         0/77,   2/98,   0/121
    Cable Descrambling/
      Free "Pay-Per-View"(!):            0/77,   2/98,   0/121
  -------------------------------------------------------------------
                         TOTAL:      5/77(6%), 14/98(14%), 11/121(9%)


MALICIOUS/FRAUD:
  -------------------------------------------------------------------
  VIRUS:
    Virus (Mail "Delivery Failed" type
      with attachment):                  2/77,   0/98,   1/121
  IDENTITY THEFT:
    Web-based "verification"
      (PayPal,eBay,Fleet Bank):          2/77,   2/98,   0/121
  -------------------------------------------------------------------
                         TOTAL:      4/77(5%), 2/98(2%), 1/121(1%)


MISCELLANEOUS:
  -------------------------------------------------------------------
  Unknown:                               2/77,   6/98,  18/121
  Blind date/dating:                     0/77,   0/98,   5/121
  Earn Degree/Degree without Tests:      0/77,   1/98,   3/121
  "Colin, Grow 2 Cup Sizes -- FREE!",
    Bigger Breast From Pill:             0/77,   1/98,   2/121
  Vacation Deals:                        1/77,   1/98,   0/121
  Your Opinions might make you 1000:     0/77,   1/98,   1/121
  Hair Transplants:                      0/77,   1/98,   1/121
  Misc. Deals:                           1/77,   0/98,   0/121
  Luxury Sheets:                         0/77,   1/98,   0/121
  Free Samsung Mobile Phone:             0/77,   1/98,   0/121
  Hypnotic MP3 for Depression,
    Self-Esteem, Motivation:             0/77,   0/98,   1/121
  Wristwatches (Rolex,etc):              0/77,   0/98,   1/121
  Print Own Postage:                     0/77,   0/98,   1/121
  -------------------------------------------------------------------
                        TOTAL:      4/77(5%), 13/98(13%), 33/121(27%)


SUMMARY:
-----------------------------------------------------------------------
MEDICATION       TOTAL:   39/77( 51% ),   39/98( 40% ),   42/121( 35% )
FINANCIAL        TOTAL:   25/77( 32% ),   30/98( 31% ),   34/121( 28% )
SOFTWARE/CONTENT TOTAL:    5/77(  6% ),   14/98( 14% ),   11/121(  9% )
MALICIOUS/FRAUD  TOTAL:    4/77(  5% ),    2/98(  2% ),    1/121(  1% )
MISCELLANEOUS    TOTAL:    4/77(  5% ),   13/98( 13% ),   33/121( 27% )
-----------------------------------------------------------------------
                 TOTAL:   77/77(100%*),   98/98(100%*),  121/121(100%*)

                (*...Percentages in this table are rounded and do not
                     add to 100% with shown precision.)
Analysis
Medication is the most frequent topic of spam messages during this three-day sample.  Two types of medication supply services dominate in this category of spam messages: (1) Penis enlargement; (2) General pharmacy "needs" (often drugs that are expensive in the domestic US market, and drugs which reputable doctors might be hesitant to prescribe due to lack of medical justification and potential for abuse).  Spam messages promoting penis-enlargement drugs are typically very informal, using phrases like: "Haha, U Have A Real Small Pe-nis", "Is Your Me.mber too Teeny?", "Screw ur lover like never before", etc.
Financial topics were very common among the spam messages during this three-day sample.  Home mortgage loans and refinancing offers dominate this category of spam messages.  Investor "stock alerts" are also common.  During this period, the "making a fortune on eBay" plan was significantly promoted.  My personal favorite scam concept in this category arrived with the subject: "Earn Money Playing Lotto!"
Software and media content are popular spam topics.  Offers of inexpensive software dominate this category; there is no doubt that this software is pirated, despite explanations of how, for example, one can buy Windows XP for $32 USD instead of paying $286 USD.  Spam promoting pornographic web sites is also common in this category.  My personal favorite offer is for a product that will give a person "Free [Pay-Per-View]" -- an oxymoron if one doesn't consider the fact that the product itself actually costs money.  Another really interesting sub-category in spam regarding software products is software designed to address a person's paranoia -- such as software to scan a person's personal computer (PC) for "spyware", or software to spy on children and spouses using the family computer, or software to learn about public records on others (or oneself!).  The irony is that installing such software will lead to the very things the target spam recipients fear most. 
Of the miscellaneous topics of other spam messages, alleged "blind dates" are frequent, along with offers to earn various college degrees (often by only paying a small fee; no testing or qualifications necessary!).  My personal favorite is an offer with the subject: "Colin, Grow 2 Cup Sizes -- FREE!".  I don't think breast enlargement is a good idea for me! 

4.4 Notable spam from the years 2001-2003

The following images are from spam messages I received during the years 2001-2003.
spam_2001sep21_political.jpg
I received this spam message on 2001.09.21, 10 days after the World Trade Center buildings were destroyed by fires after airplanes were intentionally crashed into the buildings by terrorists. 

This spam message, offering, among other things, a bumper sticker that advocates a plan to "Nuke Afghanistan", demonstrates that spam can be very political.  Following the US initiation of the war on Iraq in 2003, spam offering "Terrorist 'Most-Wanted'" playing cards, depicting 52 people targeted by the US anti-terrorism effort, arrived almost daily in my e-mail "Inbox" for many months. 

It is important to be aware that some spam is motivated by social or political interests.  Such spam benefits an idea or a social agenda, rather than an easily identified business or person. 
spam_2002_funny_camera_ad.jpg
This creepy spam message, like most spam messages, addresses some type of personal insecurity.  This same product, in a funny coincidence, was also promoted elsewhere as a way to spy on naked women, which implies that this device is a method of violating personal security! 
spam_2002_viagra_racy_ad.jpg
This spam message, which I received in the year 2002, makes an indirect reference to the film "The Matrix": 

[Morpheus offers Neo a choice between two pills: a blue pill and a red pill.]  "You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe."  "You take the red pill...and I show you how deep the rabbit hole goes."

Although in the film it is the red pill (and not the blue pill) that will result in being shown "how deep the rabbit hole goes", the humor of this spam message for Viagra is not diminished. 

Compared to other penis-enlargement product spam messages of 2003 and 2004, this spam message is fine art! 
spam_funny_iambigbrother.jpg
This spam message, which I received some time in the year 2002, is the most outrageous invitation for irony I have ever seen.

The idea of promoting a sense of security by installing an application that is completely invisible and secretly records all instant messages, all chat, all e-mail, all web sites visited, etc, is perverse.  Naming the software "IamBigBrother" (I am big brother) is hilarious! 
spam_final_norton_internet_security01.jpg
The product which a person might receive after responding to this offer is likely to actually contain computer viruses rather than prevent them! 

However, even more hilarious is the hypocrisy of a spam message that implies that the sender of the message wants to help the recipient of the message reduce unwanted e-mail messages! 

4.5 Examples of spam from the year 2004

The following images are from spam messages I received during the year 2004. 
spam_final_funny_outrageous_taunts.jpg
This spam message is an invitation to start a career in sending spam messages!  (Obviously, visiting the specified Internet site address could invite computer viruses, such as spyware or a trojan spam mailing program.) 

I love the bravado of the author of this message!  This message seems very "human" to me, with its boldness and its desperation.  I resonate with the emotional content of the message, even though I am not interested in the idea of the message.
spam_final_email_27_million_people.jpg
This spam message promotes a service to enable a person to send spam messages to "27 million people".  The fact that I received this message is itself evidence (at least 1/27000000 of a full proof) that the sender of the message can do what is promised.
spam_final_funny_yet_another_domain_name.jpg
I like the domain name: "YetAnotherDomainName.com" (created 2004.01.29, and resolving to 216.177.88.181 at the time of this writing, 2004.04.03) 

This example of a domain name reflects the spirit in the spammer realm, where purchasing disposable domain names to launch the next spam campaign is a small price to pay to avoid spam obstacles.  Creating new domain names for each short-term spam campaign helps avoid "IP address blacklisting", or cancellation of service by the Internet hosting provider (who discovers, too late, that a host was rented for use in a spam effort). 

Making Internet domain name allocations more difficult does not solve anything, and instead makes anonymity and free speech more difficult to maintain on the Internet.  The solution to spam has nothing to do with restricting traffic that flows on the Internet, but instead has to do with detecting human senders and approved senders. 
spam_final_funny_no_work.jpg
I like this one.  Simply enroll in the program and start making money -- while doing "Absolutely Nothing!".  Scams are based on greed, and this example is one of the purest appeals to greed that I have ever seen. 
spam_final_help_fellow_spammers.jpg
This spam message promotes a book which might not actually exist. 

I dislike the items "How to get fake identity documents" and "How to hack into other [people's] computers remotely".  However, I believe that there should be no restrictions on the possession and distribution of information that is not associated with individual persons. 

I also believe that being able to do something privately or anonymously is an important part of human justice and progress.  A democracy would be severely compromised if people could not vote in private, because only with privacy can a person vote entirely in accordance with the person's own beliefs.  Similarly, only with an assurance of privacy can a person explore ideas without fear.  Therefore, when technology enables governments and corporations to monitor the thoughts or actions of individual persons, I believe that humans are entirely justified in pursuing methods to avoid being monitored. 
spam_regretful_of_having_little_diccky.jpg
This spam message teases me with one of the central mysteries of spam: How could anyone buy medication (which enters and affects a person's own body) from people who think it is acceptable to use a fake sender name, and use a humiliating taunt as a subject, and make numerous spelling errors in the promotion, and conclude with a collection of random words?!? 

However, I try to consider another perspective.  Suppose an honest person doesn't believe that certain medications should be restricted by laws.  What method other than spam messages could be used to access a market that is otherwise closed by the government?  This line of reasoning makes me think that spam might be one of the ultimate examples of freedom. 

4.6 Examples of attempted fraud and "Identity theft" from the year 2004

The following images are from spam messages I received during the year 2004, showing examples of attempted fraud and "Identity theft". 

The basic idea is to convince the message recipient that it is necessary to gather personal information, often to "prevent an account from expiring", or for the recipient's "security" and "protection".  This is pure irony, because providing the requested information would eliminate security and cause account trouble. 
spam_final_scam_paypal.jpg
This spam message is admirable in its professional appearance and for its outrageous inclusion of phone numbers and Internet site addresses to help the victim gather information to be robbed more efficiently and completely. 
spam_final_scam_paypal_code.jpg
The PayPal scam e-mail message in the previous image includes the JavaScript code shown in the image directly above.  This JavaScript code repeatedly writes the text "http://www.paypal.com" into the browser status bar (lower-left border of Internet Explorer, for example).  Thus, when the user hovers the mouse cursor over the critical links in this spam message, the actual link (which would be a hint that this is a scam) is quickly clobbered by the text "http://www.paypal.com".  Only someone watching the status bar carefully while moving the mouse cursor would see the brief flash of the real Internet site address.  Future browsers will probably eliminate this obvious kind of abuse. 
spam_idtheft_citibank.jpg
This "identity theft" scam, masquerading as a message from Citibank, which I received on 2004.04.04, makes a direct request for a debit-card personal identification number (PIN), which can be used to withdraw cash. 

This is so unprofessional that it is absolutely hilarious!  However, preparing this message and Internet site did indeed require some skill and effort.  So, I am confused.  Why not try to spell words correctly in the message?  Was this message secretly sent by banks, to their customers, to determine how gullible each customer might be?  Maybe clicking on the Internet link automatically reduces a person's "credit score". 
spam_idtheft_citibank_analysis.jpg
The "Citibank" scam spam message in the previous image has the HTML code shown directly above.
spam_final_scam_ebay.jpg
This eBay scam is not as elaborate as the PayPal scam above, but it probably looks sufficiently professional to be effective. 

4.7 Examples of computer virus message attachments from the year 2004

The following images are from spam messages I received during the year 2004, showing examples of messages with computer virus message attachments.  If I had a spare computer I would be tempted to download as many computer viruses as possible and have all the computer viruses fight for control of my computer's resources.  "Ready...FIGHT!" 

There was once a computer virus that included a popular anti-virus program as part of its code, so that it could eliminate competing viruses on the computer, and would thus be able to more efficiently do its job of sending spam messages!  Hilarious!  The fact that the virus includes the stolen anti-virus software seems to validate in some small way the effectiveness of anti-virus software, but, at the same time, the fact that the anti-virus software is used merely as a part of a virus is really perverse.  (There must be examples of this strategy in biological organisms, such as bacteria that exude chemicals that we might generally regard as "antibacterial" with the result that there are no other bacteria competing for the available resources.) 
spam_ie_virus_attachment.jpg
I must say, this message with a computer virus attachment is a contemporary classic.  I've never actually tried getting infected by the computer virus, to determine if it suited my lifestyle, but, hey, "500 000" people can't be wrong!
spam_paypal_virus_attachment.jpg
Wow!  This computer virus message attachment has quite a background story!

4.8 Examples of simple obfuscation from the year 2004

The following images are from spam messages I received during the year 2004, showing examples of messages with simple obfuscation. 
spam_text_trivial_obfuscation.jpg
This is a trivial form of obfuscating the content of HTML, to thwart message filters based upon text analysis.  The fake HTML tags divide the text that will ultimately appear in the HTML document, making it difficult to determine the text that will actually be seen on the computer screen.  One countermeasure is to eliminate HTML tags, and another countermeasure is to somehow consider the visual effect of HTML tags before scanning for spam-indicating words.  However, such countermeasures only solve one of the many basic ways that spam can defeat any attempts at automated filtering based on message text. 
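The first countermeasure mentioned above (eliminating HTML tags before scanning) can be sketched in a few lines of C.  This is only a minimal sketch under simple assumptions: everything between '<' and '>' is treated as a tag, and HTML comments, character entities, and styling tricks are ignored.  The function names strip_html_tags() and mentions() are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy 'html' to 'text', omitting everything between '<' and '>'.
   'text' must have room for strlen(html)+1 characters.
   Returns the number of characters written, excluding the NUL. */
size_t strip_html_tags( const char * html, char * text )
{
    assert( html != NULL && text != NULL );
    size_t n = 0;
    int insideTag = 0;
    for ( const char * p = html; *p != '\0'; p++ )
    {
        if ('<' == *p)                   { insideTag = 1; }
        else if ('>' == *p && insideTag) { insideTag = 0; }
        else if (!insideTag)             { text[n++] = *p; }
    }
    text[n] = '\0';
    return n;
}

/* Returns 1 if 'word' occurs anywhere in 'text', otherwise 0. */
int mentions( const char * text, const char * word )
{
    return ( strstr( text, word ) != NULL );
}
```

With the input "V<x>ia<y>gra", a scan of the raw text for "Viagra" finds nothing, but a scan of the stripped text finds the word.  As noted above, this defeats only one of many possible obfuscation methods.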

4.9 Examples of Unicode character abuse from the year 2004

The following images are from spam messages I received during the year 2004, showing examples of Unicode character abuse. 
spam_final_real_unicode_characters.jpg
"Unicode" characters allow the characters of major world languages to be encoded in files and data streams, such as HTML documents encoded with UTF-8.  The spam message shown above, which I received in 2004.03, shows a conventional use of Unicode characters -- in this example, to represent letters of the Russian (Cyrillic) alphabet. 
spam_final_unicode_characters1.jpg
People who send spam messages have found another use for Unicode characters: displaying characters that look like English letters, but in fact are letters and symbols from other world languages.  Thus, human readers of English have no trouble reading the text visually, but automated text scanners will fail to detect the presence of "spam-indicating" words. 

One solution is to build up a table of how Unicode characters visually relate to English letter and number characters.  But, given the large number of Unicode characters that are "visually compatible" with various English letter and number characters, this effort is likely to be impractical.  Combine this with strategic misspellings and random interjection of punctuation, and the text filters are doomed to fail. 
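To make the idea of such a table concrete, here is a minimal C sketch that maps a handful of Cyrillic code points to the English letters they resemble.  It assumes the message has already been decoded from UTF-8 in to an array of Unicode code points, and the names Lookalike, fold_to_ascii(), and fold_text() are hypothetical names for this example.  The table covers only a few characters, which is exactly the practical difficulty: a usable table would need many more entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t codePoint; char ascii; } Lookalike;

/* A very small sample of Cyrillic letters that resemble English letters. */
static const Lookalike lookalikes[] =
{
    { 0x0410, 'A' }, { 0x0415, 'E' }, { 0x041E, 'O' }, { 0x0420, 'P' },
    { 0x0421, 'C' }, { 0x0425, 'X' },
    { 0x0430, 'a' }, { 0x0435, 'e' }, { 0x043E, 'o' }, { 0x0440, 'p' },
    { 0x0441, 'c' }, { 0x0445, 'x' }, { 0x0456, 'i' }
};

/* Returns the ASCII look-alike of a code point, or '?' if unknown. */
char fold_to_ascii( uint32_t codePoint )
{
    if (codePoint < 0x80) return (char)codePoint;  /* Already ASCII */
    size_t total = sizeof( lookalikes ) / sizeof( lookalikes[0] );
    for ( size_t i = 0; i < total; i++ )
    {
        if (lookalikes[i].codePoint == codePoint) return lookalikes[i].ascii;
    }
    return '?';
}

/* Folds 'count' code points in to the string 'out' (count+1 chars). */
void fold_text( const uint32_t * codePoints, size_t count, char * out )
{
    assert( codePoints != NULL && out != NULL );
    for ( size_t i = 0; i < count; i++ )
    {
        out[i] = fold_to_ascii( codePoints[i] );
    }
    out[count] = '\0';
}
```

For example, the code points 'V', 0x0456, 0x0430, 'g', 'r', 0x0430 fold to the plain text "Viagra", which a simple text scanner could then recognize.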

I suppose an isolationist American could block all e-mail containing Unicode, but even plain English characters can be used in creative ways that humans have no trouble reading but create an intractable problem for text scanners.  A filter which rejected ungrammatical text, or which rejected text with many misspellings, would likely block a large fraction of "legitimate" messages!  Spelling and grammar have been going out of style ever since they were invented! 

4.10 Examples of messages with text intended to thwart filters that are based on statistical text analysis, from the year 2004

The following images are from spam messages I received during the year 2004, showing examples of messages with text specifically intended to thwart filters that are based on statistical text analysis. 
spam_final_off_topic_text.jpg
This spam message includes a paragraph from a formal text.  This example includes text from a Travel Warning issued from the United States Department of State on 2004.03.23 : http://travel.state.gov/israel_warning.html (an Internet search for "curfew should remain indoors" revealed the source of the text). 

The meaning of the added text is not as important as the fact that the text is: grammatical, potentially interesting or important to the recipient, and has enough words to greatly "outweigh" any spam indicators that might be detected elsewhere in the message. 
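The "outweighing" effect can be shown with a deliberately naive scoring function.  The following C sketch scores a message by the fraction of its words that appear in a short list of "spam-indicating" words.  The word list and the function names are hypothetical, and real statistical filters are more sophisticated than this, but the dilution effect of added text is the same in principle.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list of "spam-indicating" words for this sketch. */
static const char * spamWords[] = { "viagra", "mortgage", "free" };

/* Returns 1 if the 'length' characters at 'word' match a listed word. */
static int is_spam_word( const char * word, size_t length )
{
    size_t total = sizeof( spamWords ) / sizeof( spamWords[0] );
    for ( size_t i = 0; i < total; i++ )
    {
        if (strlen( spamWords[i] ) == length
            && strncmp( spamWords[i], word, length ) == 0) return 1;
    }
    return 0;
}

/* Fraction (0.0 to 1.0) of whitespace-separated words that are listed. */
double spam_word_fraction( const char * text )
{
    assert( text != NULL );
    size_t totalWords = 0;
    size_t hits = 0;
    const char * p = text;
    while ('\0' != *p)
    {
        size_t length = strcspn( p, " \t\r\n" );  /* Length of next word */
        if (length > 0)
        {
            totalWords++;
            hits += (size_t) is_spam_word( p, length );
        }
        p += length;
        p += strspn( p, " \t\r\n" );              /* Skip separators */
    }
    return (0 == totalWords) ? 0.0 : ((double)hits / (double)totalWords);
}
```

The text "free viagra now" has a score of 2/3.  Appending the nine words "persons under curfew should remain indoors at all times" lowers the score to 2/12, even though the promotional content has not changed.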

4.11 Examples of messages with base-64 encoding, from the year 2004

"Base-64 encoding" is a method of representing sequences of byte values by a sequence of ASCII characters within the following set of 64 ASCII characters: 
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
Thus, the ASCII character 'A' corresponds to an integer value 0 (zero), and the ASCII character '/' corresponds to an integer value 63 (111111 in binary).  Groups of three bytes from an input sequence are regarded as a sequence of 24 bits.  Four 6-bit values are extracted and converted to the corresponding characters in the set above.  If the input sequence has a total number of bytes that is not a multiple of three, the final group is padded with bytes of value 0 (zero), and each output character that corresponds entirely to padding is replaced by an '=' character, so that the output ends with one or two '=' characters. 
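The encoding steps described above can be sketched in C as follows.  This is a minimal sketch of the encoding direction only; the function name base64_encode() and its buffer-size convention are assumptions made for this example, and line wrapping of the output (as used in e-mail) is omitted.

```c
#include <assert.h>
#include <stddef.h>

static const char base64Chars[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode 'length' bytes from 'in' as a NUL-terminated string in 'out'.
   'out' must have room for 4 * ((length + 2) / 3) + 1 characters.
   Returns the number of characters written, excluding the NUL. */
size_t base64_encode( const unsigned char * in, size_t length, char * out )
{
    assert( in != NULL && out != NULL );
    size_t n = 0;
    for ( size_t i = 0; i < length; i += 3 )
    {
        size_t remaining = length - i;

        /* Collect up to three bytes as 24 bits; missing bytes are zero. */
        unsigned long bits = ((unsigned long) in[i]) << 16;
        if (remaining > 1) bits |= ((unsigned long) in[i + 1]) << 8;
        if (remaining > 2) bits |= ((unsigned long) in[i + 2]);

        /* Extract four 6-bit values; pure padding becomes '='. */
        out[n++] = base64Chars[ (bits >> 18) & 0x3f ];
        out[n++] = base64Chars[ (bits >> 12) & 0x3f ];
        out[n++] = (remaining > 1) ? base64Chars[ (bits >> 6) & 0x3f ] : '=';
        out[n++] = (remaining > 2) ? base64Chars[ bits & 0x3f ] : '=';
    }
    out[n] = '\0';
    return n;
}
```

For example, the three bytes "Man" encode to "TWFu", the two bytes "Ma" encode to "TWE=", and the single byte "M" encodes to "TQ==".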

Base-64 encoding is typically used to enable binary data, such as binary file message attachments (with file types ZIP, JPG, MP3, DOC, EXE, etc), to be contained in the plain-text body of an ordinary e-mail.  Thus, text-based operations can be conducted on mail archives without worrying about encountering non-ASCII characters, or problematic ASCII "control characters" such as 0 (Null, NUL, 0x00, ^@), and 4 (End of Transmission, EOT, 0x04, ^D). 

However, people who send spam messages have used base-64 encoding as a simple method to obfuscate their HTML content.  Thus, very simple text filters, or human readers, cannot easily examine the content of such spam messages.  It would be simple to add a base-64 decoding stage to a spam filter so that the filter could analyze messages with base-64 encoding, but this is yet another example of the unlimited complexity of automated spam detection.  Filtering spam using message analysis is futile. 
The following C code compiles to a very simple base64-to-text conversion program.  A person must manually place a base-64 block of text, by itself, in a text file, and then use this utility to generate text output.  The output can be directed to an output file by operators on the command line.  I wrote this code as a simple demonstration. 
// Convert base-64 to plain text (Usage: base64decoder.exe [file name])
//
// The specified file must only contain a block of base-64 data
// and optional whitespace (space, carriage return, newline, tab).

#include <stdio.h>   // printf(), fopen(), fseek(), ftell(), fread(), fclose()
#include <stdlib.h>  // malloc(), free()

int main ( int argc, char * argv[] )
{
    char * base64Table   =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    if (argc!=2){printf("USAGE: %s [filename]\n",argv[0]); return(-1);}
    FILE * fp = fopen( argv[1], "rb" );
    if (NULL==fp){printf("ERROR: Failed to open:%s\n",argv[1]);return(-2);}
    fseek( fp, 0, SEEK_END );
    int fileSizeInBytes = (int)( ftell( fp ) );
    fseek( fp, 0, SEEK_SET );
    if (fileSizeInBytes <= 0)
        {printf("ERROR: Seek failed in:%s\n",argv[1]);fclose(fp);return(-3);}
    char * fileData = (char *) malloc( (size_t)(fileSizeInBytes) );
    if (((char *)(0)) == fileData)
        { printf( "ERROR: Allocate %d bytes failed.\n", fileSizeInBytes );
          fclose( fp ); return(-4); }
    fread( ((void *)(fileData)), 1, fileSizeInBytes, fp );
    fclose( fp );

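    int count = 0;      // Number of 6-bit values collected for this group
    int padding = 0;    // Number of '=' characters seen in this group
    int indices[ 4 ];
    char out[ 3 ];
    for (int dataIndex = 0; dataIndex < fileSizeInBytes; dataIndex++)
    {
        char in = fileData[ dataIndex ];
        int found = (-1);
        for ( int trial = 0; ((trial < 64) && ((-1)==found)); trial++ )
            { if (base64Table[trial] == in) found=trial; }
        // '=' is padding: it holds a place in the group but carries no data.
        if ('='  == in    ) { indices[count] = 0;     count++; padding++; }
        if ((-1) != found ) { indices[count] = found; count++; }
        // Any other character (e.g., whitespace) is ignored.
        if (4 == count)
        {
            // Reassemble three bytes from four 6-bit values.
            out[0] = (char)(((indices[0]<<2)|(indices[1]>>4))&0xff);
            out[1] = (char)(((indices[1]<<4)|(indices[2]>>2))&0xff);
            out[2] = (char)(((indices[2]<<6)|(indices[3]   ))&0xff);
            // Each '=' means one fewer decoded byte in the final group.
            for (int k = 0; k < (3 - padding); k++) printf( "%c", out[k] );
            count = 0;
            padding = 0;
        }
    }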

    free( (void *)fileData );
    return( 0 );
}
Download a project for Microsoft Visual C++ 2005 that includes the source code: 
base64decoder.zip
C code for base-64 decoder, for Microsoft Visual C++ 2005
78138 bytes
MD5: 8401dde35a54fbaa1b7db0a6e2c3147f
The following is a C# code version of the C program. 
// Convert base-64 to plain text (Usage: base64decodercs.exe [file name])
//
// The specified file must only contain a block of base-64 data
// and optional whitespace (space, carriage return, newline, tab).

namespace base64decodercs
{
    class Program
    {
        static void Main( string[] args )
        {
            if (args.Length != 1)
            { System.Console.WriteLine( "Specify file name" ); return; }

            try
            {
                System.String fileText = System.IO.File.ReadAllText( args[0] );
                byte[] b = System.Convert.FromBase64String( fileText );
                System.String outputText = System.Text.Encoding.ASCII.GetString( b );
                System.Console.WriteLine( outputText );
            }
            catch (System.Exception exception)
            {
                System.Console.WriteLine( exception.ToString() );
            }
        }
    }
}
Download a project for Microsoft Visual C# 2005 that includes the source code: 
base64decodercs.zip
C# code for base-64 decoder, for Microsoft Visual C# 2005
6697 bytes
MD5: e3c424906bf95c7f6b28e23b5fc4e324
The following images are from spam messages I received during the year 2004, showing examples of messages with base-64 encoding. 
spam_base64_example_mail.jpg
This is the plain-text appearance of a spam message containing HTML encoded as base-64. 
The following shows the decoded version of the base-64 data. 
spam_base64_example_decoded.jpg
This is the decoded version of the base-64 data that was contained in a spam message. 
The decoded base-64 data reveals the web site address promoted by the spam message.  The decoding also reveals random words designed to "dazzle" Bayesian spam filters.  It is important to notice that this spam filter countermeasure was placed within a base-64 encoded block -- which means that the person who sent the spam message assumes that base-64 blocks might be decoded and analyzed. 

However, the most important thing to observe about this spam example is that the spam message is totally contained in the image specified by the HTML tag.  Thus, unless a content-based filter notices the "tabs" and "pills" parts of the Internet site address and file paths, this message appears totally benign.  It is trivial to eliminate any evidence that the message is spam.  A person could block messages that only contain HTML image tags, but the risk of blocking "legitimate" messages that only contain images is too significant. 

Also, if a client program downloads the image from the server, then the person who sent the spam message knows that a specific recipient of the message exists and has actually viewed the message, at a known time, from a known IP address (and, therefore, an approximate geographical location).  This information is revealing, even though the recipient of the message did nothing more than look at the message.  More secure message programs have options to disable previewing of images contained in messages, thus avoiding giving away information to other people. 

4.12 Examples of determining the origin of a spam message, from the year 2004

In some situations it might be possible to determine the origin of a spam message.  However, the origin of a spam message might not be useful information.  For example, some spam messages are sent by computer viruses located on thousands of random infected machines around the world.  In such a situation, a spam message might have originated from a computer whose owner is totally unaware that the computer is involved in sending spam messages. 

Also, the method of determining the origin of a spam message described in this section is only useful when the spam message includes an Internet address that is actually associated with the person or company that sent the message.  In some situations, the spam message might not be authorized by the owners of any of the Internet addresses mentioned in the message.  In other situations, the spam message might be made to appear to be sent on behalf of a particular person or company, but was in fact sent by an unaffiliated party -- sometimes with the intention of making the apparent sender seem unethical, or sometimes with the intention of distorting the opinions and goals of the apparent sender.  There are many possible reasons why the Internet addresses appearing in a spam message might not have any relation to the actual sender of the spam message. 

The method of determining the origin of a spam message described here is not likely to be reliable.  However, the method described here is very easy to do, and the resulting information might be useful. 
As an example, consider the following spam message, which I received in the year 2004. 
spam_piracy_buye_soft_biz_spam.jpg
Example spam message, with HTML code having links to "buye-soft.biz"
Looking at the HTML code for this spam message reveals links to the Internet domain "buye-soft.biz". 
A person can learn about a registered Internet domain, such as "buye-soft.biz", by doing a "domain name registration" query.  This was historically called a "whois" query.  The InterNIC Internet site is one of many sites offering a "whois" query service: http://www.internic.net/whois.html 
The following image shows the results of a "whois" query for information about the domain "buye-soft.biz", performed by the InterNIC Internet site. 
spam_piracy_buye_soft_biz_whois.jpg
Results of a "whois" query for information about the domain "buye-soft.biz", performed by the InterNIC Internet site. 
I assume all of the information returned by this "whois" query is bogus -- except for the trivial details: domain name, domain ID, domain status, registrant ID, name server, created by registrar, last updated by registrar, and the dates. 

The only information that I think is interesting is the "domain registration date", which in this case indicates that the domain was created less than one week prior to my receipt of the spam message that contained links to an Internet site within this domain. 
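The comparison between the domain registration date and the date of receipt is simple date arithmetic.  The following C code is a rough sketch of that check; the function names and the one-week threshold are assumptions made for this example, and the dates must be copied by hand from the "whois" results and from the message headers.

```c
// Days since 1970-01-01 for a Gregorian calendar date.
// This uses a well-known era-based day-counting method; m is 1..12,
// d is 1..31.
static long daysFromCivil( int y, int m, int d )
{
    y -= (m <= 2);  // Treat January and February as part of the prior year
    long era = (y >= 0 ? y : y - 399) / 400;
    long yoe = y - era * 400;                                  // 0..399
    long doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1; // 0..365
    long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;          // 0..146096
    return era * 146097 + doe - 719468;
}

// Hypothetical helper: number of days between the domain registration
// date and the date on which the spam message was received.
long domainAgeInDays( int regYear, int regMonth, int regDay,
                      int rcvYear, int rcvMonth, int rcvDay )
{
    return daysFromCivil( rcvYear, rcvMonth, rcvDay )
         - daysFromCivil( regYear, regMonth, regDay );
}

// Hypothetical helper: returns 1 if the domain was registered fewer than
// thresholdDays days before the message arrived, otherwise 0.
int isRecentlyRegistered( long ageInDays, long thresholdDays )
{
    return ((ageInDays >= 0) && (ageInDays < thresholdDays)) ? 1 : 0;
}
```

With made-up dates, domainAgeInDays( 2004, 2, 27, 2004, 3, 1 ) gives 3 (the year 2004 is a leap year), and isRecentlyRegistered( 3, 7 ) gives 1.  A recent registration date is only a hint, not proof; many legitimate domains are also new.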

This is a common pattern: (1) Register a new Internet domain name ($5 USD); (2) Start a web server, and have the new domain information refer to the new server ($10 USD); (3) Wait 48 hours to ensure that the new domain information has enough time to propagate to various domain name service (DNS) servers around the world; (4) Send thousands or millions of spam messages, such that each message has links to the new domain name. 

Thus, for a very small price, a person can establish a new Internet domain name, and can start a web server, and can send thousands or millions of spam messages -- all for an isolated attempt to spread a message, or to collect money, or to collect information, etc.  The person does not need to worry about the future of the domain name or of the web server.  Simply spreading the spam message might already, by itself, easily justify the small cost.  However, even if the spam campaign relies on the web server remaining active for at least several days -- so that money or information can be collected from people -- there is only a small chance of being stopped by complaints to the web site hosting provider or being stopped by law enforcers.  If the web server is stopped, then the person who sent the spam message can simply pay for more domain names and more server hosting contracts.  In many cases, if only a single person in the whole world pays money to the person who sent the spam message, then the whole cost of the spam process might be worthwhile! 

The obvious responses to this abuse of the Internet are to attempt to make the process of registering domain names more difficult, and to attempt to make the process of buying web hosting contracts more difficult.  However, those responses would be futile, and would hurt more people than they would help. 

5. Definition of "uninvited message"

Consider the following simple definition of a "spam message": 
spam message : an uninvited message sent to a large number of recipients
That definition depends on the definition of an "uninvited message". 
Messages from family, friends, and acquaintances, are implicitly "invited". 
If a person broadcasts a message inviting feedback from the public, such as inviting people to add messages to an Internet forum (e.g., a blog), then there is an explicit invitation for messages from the public.  However, typically there is also an implicit expectation that the messages will not be advertisements, and will not be extremely irrelevant, and will not interfere with the ability of other people to enjoy using the Internet forum (i.e., the messages will not be enormous, and will not contain computer viruses, and will not be disgusting to the sensibilities of the forum community, and will not be inflammatory or hateful).  Some Internet forums explicitly specify the expectations or rules of using the forum. 
If a person wishes to use an Internet service (such as being allowed to submit messages to an Internet discussion forum, etc), the service provider often requires the person to submit personal information, such as the person's e-mail address.  The process of submitting personal information to get access to a service is often called "registration".  Sometimes the service provider will verify the validity of a person's e-mail address by sending a message to the specified e-mail address and requiring an indication that the person received the message (by clicking a link with unique characteristics within the message, or by submitting unique information contained within the message). 
If an Internet service provider requires a person to submit a personal e-mail address as a requirement of using a service, then the service provider might eventually send messages to the person.  Some service providers clearly indicate how they will use any information that the service provider collects from a person.  Sometimes the service provider will allow the person to choose whether or not the service provider is permitted to send messages to the person.  However, in some cases, the service provider's description of how they will use personal information is vague or ambiguous.  Also, unfortunately, sending e-mail "notifications" and advertisements (or "offers") from "partners and affiliates" is often part of online service contracts.  Therefore, some people create temporary e-mail accounts specifically for the purpose of registering for Internet services, and thus avoid any abuse of the trust between the person and the service provider. 
Abuse of the interpretation of "opting in"
During the years 2000-2002, many spam messages contained text similar to the following disclaimer: 
"This message is not spam.  You are receiving this message because you requested this message from this service, or opted-in to mailings from one of our affiliates." 
It is not difficult to imagine that a giant corporation might have a slightly less monitored affiliate with slightly lower standards of business ethics.  And it is easy to imagine that through the corporate equivalent of "Six Degrees of Separation" (i.e., the theory that any person on the planet is connected to any other person on the planet by way of, at most, six personal relationships) that eventually personal data submitted to, say, any giant corporation might actually, through a chain of "affiliation", be accessible to any arbitrary business, or outright criminals, on our planet. 

6. Spam messages in other media types

It is interesting to consider that billboards on the sides of freeways, buses, and taxi cabs, might qualify as a kind of government approved and socially approved visual spam.  However, I believe that if a proposition to eliminate all billboards were placed on a state ballot, the overwhelming majority of people would vote in favor of the proposition.  The fact that billboards clutter the visual spaces in many cities proves that there is an ideological gap between the local governments and their constituents. 

Billboards radiate data via photons in all directions without regard for the wishes of potential recipients.  Audio loudspeakers radiate data via sound waves in many directions without regard for the wishes of potential recipients.  Postal mail bulk advertisements can be regarded as a physical form of spam messages.  Some spam messages have been transmitted to facsimile machines.  Automatic telephone dialers with recorded messages have been used to send audio spam messages directly into individual homes. 

Some of these "spam message" variations rely on the proximity of potential recipients, and thus the "technology" to avoid such spam messages is simply to move away from the emitter.  But other variations of spam messages essentially bring the message very close to the target recipients, such as spam messages sent by postal mail or by a telemarketing telephone call.  Here, the "technology" to avoid being distracted with the task of differentiating between spam messages and desired messages is limited to "requests to block bulk postal mail" and registering with the "national 'do-not-call' list" (which relies on vigilant consumers, and laws to act as a deterrent for would-be violators). 

7. Definition of "message"

Obviously, when defining "message" for the purposes of defining "spam message", the definition of "message" must be based upon the idea of the "intention" of the data received by each recipient.  Otherwise, simply permuting the sentences of a message might be regarded as producing a new, distinct "message". 

Almost all spam messages today that rely on plain text to convey the message (instead of using images to convey the message) contain procedurally generated text that is unique per recipient. 

Some of this procedural text is comprised of words randomly selected from a dictionary (to defeat word-frequency filtering, or word-pair Bayesian filtering), or is comprised of random grammatical sentences (to defeat filters that check for basic grammar).  Some of this procedural text is comprised of paragraphs of text selected randomly from various sources, including from Internet news sites, and reference articles (e.g., from Wikipedia), and classic texts (e.g., from Project Gutenberg or the Bible); i.e., text which will outweigh any "spam indicators" found elsewhere in the spam message. 

Also, the core text of a spam message (i.e., the intended communication of the spam message), can be procedurally modified to be unique per intended recipient.  This can occur at the character level, and word level, and sentence level, and paragraph level.  Misspellings can be introduced, especially using look-alike characters (e.g., '0' versus 'O'; or using the many similar-looking characters in the Unicode character set).  Transposing adjacent letters within a word ("Viagra" versus "Vaigra") will not interfere with human understanding, but will increase the work required to detect "spam indications".  Sentences can be permuted to further complicate any attempt to recognize similar messages. 
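The following C code is a small sketch of why such modifications are effective against a simple keyword filter.  The function names and the short table of look-alike characters are assumptions made for this example; it handles only a few ASCII substitutions and makes no attempt to handle Unicode.

```c
#include <ctype.h>   // tolower()
#include <stddef.h>  // size_t, NULL
#include <string.h>  // strstr()

// Hypothetical helper: copies text to out (capacity outSize), converting
// letters to lower case and mapping a few ASCII look-alike characters
// back to the letters they imitate.
void normalizeLookAlikes( const char * text, char * out, size_t outSize )
{
    size_t i = 0;
    if (0 == outSize) return;
    for (; (text[i] != '\0') && ((i + 1) < outSize); i++)
    {
        char c = (char) tolower( (unsigned char) text[i] );
        if      ('0' == c) c = 'o';
        else if ('3' == c) c = 'e';
        else if ('@' == c) c = 'a';
        else if ('$' == c) c = 's';
        out[i] = c;
    }
    out[i] = '\0';
}

// Hypothetical naive filter: returns 1 if the lower-case keyword appears
// in the normalized text, otherwise 0.  Text beyond 255 characters is
// ignored in this sketch.
int naiveFilterMatches( const char * text, const char * keyword )
{
    char normalized[ 256 ];
    normalizeLookAlikes( text, normalized, sizeof(normalized) );
    return (NULL != strstr( normalized, keyword )) ? 1 : 0;
}
```

With this sketch, "Cheap S0FTWARE" matches the keyword "software" after normalization, but "Buy Vaigra now" does not match the keyword "viagra", even though a human reader understands it immediately.  Each new trick requires another special case in the filter.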

Obviously, defining "message" as a literal sequence of characters (or bytes) cannot be used to define "spam message" because every single spam transmission intended for each individual recipient can contain unique text, despite the fact that all of the spam transmissions are intended to convey the same "message" or "idea". 

8. Methods which fail to significantly reduce the amount of spam messages

8.1 Laws

The creation of laws to indirectly cause the reduction of the amount of spam messages is probably based on the following assumptions: 
(1) The existence of tough laws against sending spam messages will serve as a sufficient deterrent for potential senders of spam messages;

(2) The person responsible for sending the spam messages can be identified;
Reasons why laws cannot significantly reduce spam messages include:
(1) Spam messages can originate in countries which do not have laws against spam messages.  Or, spam messages can originate in countries which do not have sufficient resources to enforce laws against spam messages. 


(2) Spam messages can originate from any of the billions of people on our planet with access to the Internet.  Although laws might be a sufficient psychological deterrent for the vast majority of the world population, only a few courageous people are required to generate billions of spam messages per day. 


(3) The connection between businesses and spam message campaigns will be increasingly difficult to make, especially if there are a few cases in which competitors or hackers seek to implicate a company as a sender of spam messages (by secretly initiating a spam message campaign on that company's "behalf").  Such a scenario, among others, would introduce doubt regarding the simplistic argument that the person or company which benefits from a spam message must have sent the spam message.  There is, rightly, a large amount of plausible deniability in this context. 

In particular, laws cannot prevent spam message campaigns started spontaneously by a person on his or her own personal initiative on behalf of a political agenda, or on behalf of a publicly traded stock, or on behalf of a social agenda, etc.  The person can initiate the spam message campaign totally anonymously, possibly by distributing computer viruses which will eventually transmit spam messages.  A person with a modest amount of programming ability can compromise an e-mail system and use it to spread a message that was not endorsed by the organization that might ultimately benefit.  A person who initiates such a spam message campaign can promote an agenda, which, if successful, will somehow indirectly benefit the person who initiated the spam message campaign.  For example, if the spam message campaign promotes a publicly traded stock in which the person who initiated the campaign (while remaining totally anonymous) is invested, then the person will benefit from any increase in the stock price, as will thousands of other unrelated people invested in the same stock.  Such spam message campaigns achieve a kind of "advocacy laundering", relying on the large number of potential advocates and the large number of potential beneficiaries. 


(4) Although this is more of an observation than an explanation, it is interesting to consider that many spam messages are far more illegal than simply wasting people's time with unwanted advertisements.  Spam messages can be used to distribute viruses (for mere destruction, or for surveillance and spying, or for propagating more spam messages, or for using computing resources to solve difficult computing problems).  Other spam messages are used as part of an "identity theft" campaign (such that the spam message directs a person to a fake clone of a banking Internet site or commerce site (e.g., eBay) which requests and collects confidential information, often, ironically, with the claim of "increasing security" through "verification").  The people who send such spam messages are already aware that they are disobeying laws!  Indeed, such people hope to commit crimes which are far worse than the crime of sending spam messages.  Therefore, laws against sending spam messages will not serve as a deterrent to such people. 

8.2 IP address blacklists, or sender e-mail address blacklists

The following describes the idea of an Internet e-mail blacklist service:
An Internet e-mail blacklist service is a service that manages a list of IP addresses of servers or Internet domains from which alleged spam messages have been sent recently.  The service offers the list for anyone to download at any time.  A person can download this list from the service daily.  When a person receives an e-mail message, the person can check the IP address from which the message apparently originated, and if the IP address is one of the IP addresses in the list, then the message is classified as a possible spam message.  Also, if a person receives a message and decides that the message qualifies as a spam message, then the person can submit the IP address to the blacklist service, so that the IP address can be added to the current list. 
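The lookup step of such a service can be sketched in C as follows.  The function name and the exact-match comparison of address strings are assumptions made for this example, and the addresses in the usage note are reserved documentation addresses, not real entries from any blacklist.

```c
#include <stddef.h>  // size_t
#include <string.h>  // strcmp()

// Hypothetical helper: returns 1 if senderIp exactly matches one of the
// entries in the downloaded blacklist, otherwise 0.
// Addresses are compared as text, e.g., "192.0.2.10".
int isBlacklisted( const char * senderIp,
                   const char * const * blacklist, size_t blacklistCount )
{
    for (size_t i = 0; i < blacklistCount; i++)
    {
        if (0 == strcmp( senderIp, blacklist[i] )) return 1;
    }
    return 0;
}
```

For a list containing "192.0.2.10" and "198.51.100.7", a message apparently sent from "198.51.100.7" is classified as possible spam, and a message from "203.0.113.5" is not.  The check knows nothing about who currently uses an address; it only knows that the address appears in the list.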
There are many reasons why any attempt to reduce the number of spam messages using an Internet e-mail blacklist service will fail, and will instead cause many new, extremely bad problems: 
(1) Increasingly, spam messages for a specific spam campaign are not sent by a single Internet server (with a readily identified IP address), but are instead sent by thousands or millions of random personal computers (PC) infected with computer viruses.  The computer viruses are controlled and coordinated to send spam messages for a specific spam message campaign.  In such a situation, a blacklist service cannot possibly do anything good.  A blacklist which included the Internet domains or the dynamically allocated IP addresses of individual people using cable modems at home or at the office would cause widespread, and seemingly random, communication difficulties. 

A computer virus which sends spam messages can be embedded into any computer program.  Millions of people, and perhaps even billions of people, have computer software which has been downloaded from various Internet sites.  There are thousands of sources for each of the many popular software files on the Internet, and determining the reputation of a site on the Internet is often difficult.  Even Internet sites with good reputations might unknowingly allow some software with computer viruses to be offered for download for some time.  Any of the thousands of popular computer programs offered on the Internet can contain computer viruses that will be used to send spam messages from any infected personal computer.  Also, some people download software that would ordinarily cost money but which has been modified by computer hackers to enable people to use the software without paying -- and, obviously, the modified software can contain a mechanism to send spam messages.  In all of the situations mentioned here, the computer viruses did not need to break through any security defenses that a person might have on their personal computer; instead, the computer viruses are contained in software that was willingly invited into the computer (and such computer viruses are therefore called "Trojan" viruses).  Relying on "anti-virus" software to detect and eliminate such computer viruses is futile, because new computer viruses are created every hour, and even if an "anti-virus" program consulted an Internet database of viruses every hour, there would still be some time for infections to occur, and viruses could therefore have some opportunity to send thousands or millions of spam messages. 


(2) The previous item (i.e., (1)) makes sending spam messages from a single Internet server (with a readily identified IP address) obsolete.  But even attempting to blacklist individual Internet servers which have been observed sending spam messages is a futile and dangerous idea.  By the time a blacklist is updated to include each new origin of spam messages, the spam campaign will already have finished, and the IP address will be discarded.  Meanwhile, an innocent person can soon inherit the discarded IP address, and it would thus be unfair to continue to include the IP address in the blacklist.  Although there might be a mechanism to protest the inclusion of an IP address in a blacklist, this might be impractical if there are many independent blacklists around the world, each with their own complaint resolution process.  (Nobody would be able to trust a single, centralized blacklist! The temptation to accept bribes to temporarily blacklist domains would be huge! News stories could be suppressed, and hence stock prices or government votes could be influenced, by brief blacklist campaigns.)  Domain name registration can cost as little as $10, and renting the use of an Internet web server in a data center can be very inexpensive and can be without any significant commitment.  Do a "whois" lookup on the domain name associated with a spam message (if any such links exist in the body of the message), and you might discover that the domain was registered only a few days prior to your receipt of the spam message -- a delay only long enough for the domain name registration to propagate to DNS servers around the world.  Even if the source of a spam message is added to a blacklist on the same day as a spam message campaign, the spam message has already reached most of the intended recipients.  Even if the messaging mechanism relies on the messages being stored on the sender side of the communication, there will still be some time during which people can access the spam message.


(3) Malicious people can cause the blacklist to include arbitrary Internet addresses, and can therefore cause arbitrary, innocent Internet sites or services to be blocked.  By studying how entries are added to a blacklist, a malicious person can invent a mechanism to add arbitrary entries.  A blacklist is a very dangerous idea, because it has the power to block things, without any chance for appeal or review, and often without any evidence that anything was blocked.  Power is always valuable, and there will always be people who would be willing to pay money or trade favors to access power.  There are many people who would benefit financially or politically by being able to block the flow of information, even if the blocking occurs only for several minutes or hours, because after stocks have been traded, and votes cast, the benefit of blocking has already been realized.  Even computer hackers who only want a thrill would attempt to add anything and everything to any blacklist they could find! 

An automated blacklist system which attempted to determine the beneficiary of a spam message (by identifying all Internet addresses mentioned in a spam message; e.g., by each link URL in the HTML) could be exploited by malicious people by simply sending many spam messages such that each spam message contains links to innocent Internet addresses.  Thus, a malicious person can include links to reputable Internet sites within a spam message, even though the spam message is not actually affiliated with any of the mentioned reputable Internet sites. 


(4) Creating a blacklist creates the possibility that legitimate messages will be blocked.  If a blacklist contains invalid entries, or if the entries in a blacklist lack precision, then the effect might be a broad interference with overall communication or a reduction of freedom to communicate.  This would be a disaster scenario. 

Some countries (e.g., China) use Internet blacklists to prevent their citizens from learning the truth about history, and to prevent their citizens from discussing certain ideas. 

Some corporate Internet web sites refuse traffic referred by other competing sites. 

Internet search services, which many people rely on as an "unbiased representation of Internet content", must actually provide a biased view of the whole Internet -- a bias that prioritizes information in accordance with the specific search being performed.  However, such search services might prioritize search results according to more than mere relevance; for example, a person can pay the search service to increase the priority of specific search results.  Also, a search service might be compelled, by domestic and international laws, to block certain results from appearing among the search results. 

Blacklists hurt democracy, because a democracy depends on being able to gather or receive information without any bias in the mechanism of gathering or receiving.  When major "news" sources fail to deliver information without bias, the Internet might be our only recourse in the quest for the facts.  Any mechanism that interferes with such a quest is opposed to democracy.  Even "gatekeepers" with good intentions can cause a condition in which freedom of expression and reception of ideas is greatly diminished without a corresponding degree of benefit from the act of filtering information. 

Blacklists hurt progress in science, and technology, and all other human activities, because progress depends on making accurate or optimal decisions, and making accurate or optimal decisions depends on having the most information available.  Blacklists block some of the available information. 

Thus, using a blacklist to block potential or past sources of spam messages introduces the huge risk of interference with overall communication. 

8.3 Detecting patterns within the text of a message which indicate that the message is probably a spam message

The following describes the idea of detecting patterns within the text of a message which indicate that the message might be a spam message. 
The text of a message is analyzed, to find any patterns which indicate that the message might be a spam message.  For example, if the text describes popular medications or very inexpensive software, then the message might be a spam message.  If the text of the message includes Internet links to servers in foreign countries, then the message might be a spam message.  If the text of the message includes words relating to money, or buying, or selling, or bargains, or discounts, etc, then the message might be a spam message.  If the text includes many misspellings and grammatical errors, then the message might be a spam message.  All of this evidence can be combined to make a guess about whether or not the message is likely to be a spam message. 
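The idea of combining evidence can be sketched in C as a weighted keyword score.  The words, the weights, the threshold, and the function names below are all assumptions made for this example; real filters use far larger tables or statistical models.

```c
#include <stddef.h>  // size_t, NULL
#include <string.h>  // strstr()

// One piece of evidence: a text pattern and how strongly it suggests spam.
typedef struct { const char * pattern; int weight; } SpamIndicator;

// Made-up indicator table, for illustration only.
static const SpamIndicator spamIndicators[] =
{
    { "discount", 2 },
    { "bargain",  2 },
    { "buy",      1 },
    { "pills",    3 },
    { "software", 1 },
};

// Hypothetical helper: adds the weight of every indicator pattern found
// in the text.  The text is assumed to be lower case already.
int spamScore( const char * lowerCaseText )
{
    int score = 0;
    size_t total = sizeof(spamIndicators) / sizeof(spamIndicators[0]);
    for (size_t i = 0; i < total; i++)
    {
        if (NULL != strstr( lowerCaseText, spamIndicators[i].pattern ))
            score += spamIndicators[i].weight;
    }
    return score;
}

// Hypothetical helper: returns 1 if the combined score reaches the
// threshold, otherwise 0.
int isProbablySpam( const char * lowerCaseText, int threshold )
{
    return (spamScore( lowerCaseText ) >= threshold) ? 1 : 0;
}
```

With this table, the text "buy discount pills today" has a score of 6, and the text "your order of software has shipped" has a score of 1.  The result depends entirely on the chosen words, weights, and threshold, and the sender of a message is free to choose different words.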
Reasons why analyzing the text of a message to guess if the message is likely to be a spam message will either fail or be futile include:
(1) The intended communication might be entirely conveyed by images instead of by text.  In such cases, the text contained in the message, if any, could be designed to appear to be important or interesting. 


(2) Every letter in the text of a message can be converted to a visually similar glyph in the Unicode character set.  Thus, searching for specific words will fail.  However, a human will be able to read the message without any difficulty. 


(3) Some messages describing transactions and commerce are actually not spam messages.  For example, a message describing an upcoming delivery of products that a person has purchased via an Internet store is not a spam message.  A message containing a bank statement or a bill is not a spam message.  Indeed, such messages are among the messages a person regards as most important. 


(4) Judging a message by the number of spelling errors and the number of grammatical errors contained in the text of the message is likely to cause many problems in our modern world!  Adding "LOL", "WTF", "OMG", "t3h", "1337", "noob", and ":-)" to the spelling dictionary might reduce the false spam alerts.  However, there are many other ways in which a legitimate message might contain an excessive number of spelling and grammatical errors.  Therefore, filtering based on spelling and grammatical patterns or errors is unreliable. 


(5) A person wishing to send a spam message to a large audience can do some testing before attempting to send the spam message.  The person can create a test message, and then send it to himself or herself through each of the major spam message filtering services.  If any spam message filtering service marks the message as a spam message, then the person can modify the test message and repeat the experiment.  Eventually the person can discover which words convey the desired idea without triggering the spam filter.  Also, the person can use additional software to ensure that each message is unique (by using random synonyms for many random words in the message text, and by randomly changing the grammar or sentence order within the message).  Such methods defeat any filters that rely on many message recipients receiving messages that are statistically similar (i.e., similar vocabulary, similar grammar, similar frequencies of uncommon words, etc., much like plagiarism detectors used by some colleges and universities). 


(6) If a company is performing spam filtering, then the company might be enticed by a payment from another company to not filter messages originating from that other company or its affiliates.  Thus, the effectiveness of the spam filtering is limited by the willingness of companies to make deals with other companies.  In many cases, choosing an e-mail service is essentially choosing one channel of spam instead of a different channel of spam, based on the business deals and affiliations of the companies providing e-mail service. 
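Item (2) above, the substitution of visually similar Unicode glyphs, can be demonstrated with a short sketch.  The function name and the sample text are hypothetical.  The sketch swaps only two Latin letters for Cyrillic look-alikes, which is already enough to defeat a simple word search.

```python
def contains_word(text, word):
    """Naive filter check: is the word present in the lowercased text?"""
    return word in text.lower()


plain = "big discount today"

# Replace Latin "i" and "o" with the Cyrillic look-alikes U+0456 and U+043E.
# In most fonts the disguised text is visually indistinguishable from the
# plain text, but the underlying code points differ.
disguised = plain.replace("i", "\u0456").replace("o", "\u043e")

found_in_plain = contains_word(plain, "discount")
found_in_disguised = contains_word(disguised, "discount")
```

Here found_in_plain is True and found_in_disguised is False, even though a human would read both strings as the same words.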

8.4 Other undesirable methods of reducing the number of spam messages

(1) Internet packet tracking :  This method does not help identify people responsible for initiating spam message campaigns, because modern spam message campaigns are conducted by thousands or millions of personal computers infected with computer viruses.  The tracking would only implicate the innocent.  Meanwhile, Internet packet tracking would destroy any possibility of doing innocent things anonymously on the Internet.  Anonymity is crucial to human justice and progress.  Internet packet tracking would destroy any trust that anyone might have in the possibility of doing things anonymously on the Internet.  Nobody would trust promises by the government or businesses that tracking records would only be used to solve crimes or would only be used for data analysis of population trends or market analysis. 


(2) Paid postage for sending each e-mail message :  Who is entitled to the money that is collected?  If money is the only obstacle, then some companies would be willing to pay to send their spam messages.  Some people will want to avoid paying the money, and will simply switch to a message system that does not involve paying to send messages -- and the spam message problem will simply move over to the new message system.  Charging money for a specific kind of data going through the Internet sets a dangerous precedent, and it would eventually seem quite natural for companies to charge fees for other kinds of data flowing through the Internet.  Companies participating in this paid postage system will likely make deals among themselves to offer reduced postage rates for certain circumstances, and the whole system will thus be corrupted. 

8.5 Regulating communication speed by requiring message senders to compute the solutions of difficult mathematical problems

The following describes the idea of regulating communication speed by requiring message senders to compute the solutions of difficult mathematical problems. 
When a message sender initiates communication with a message receiver (either a receiver that will act as a relay, or a receiver that is the intended ultimate recipient), the message receiver responds with a mathematical problem of a predictable difficulty.  The message sender computes the solution to the mathematical problem and submits the solution to the message receiver, along with the message itself.  If the message receiver verifies that the submitted solution is correct, then the message receiver accepts the message itself; otherwise the message is rejected.  This protocol ensures that a message sender has expended a certain amount of computational resources (mostly CPU cycles) to send each message.  Computations require time to perform, and thus the rate of communication is slowed.  Individual computers would naturally be limited to sending a certain number of messages per hour because each computer can only perform a certain number of calculations per hour.  Thus, the expended computational resources serve as a kind of virtual currency, or as a kind of payment for postage, but with payments that are not collected, but are instead merely verified.  For this protocol to achieve the desired result (i.e., the slowing of communication), the computational tasks must actually represent a significant investment of computational resources on the part of the message sender; otherwise the message sender can easily send many messages per hour. 

The message receiver, for example, can multiply two prime numbers together, such that the result has a specific number of digits (e.g., 16 digits or more), and then send the product to the message sender.  The message sender can then do the work to find the two prime numbers which when multiplied together produce the specified product.  The message sender then sends the two prime numbers to the message receiver, and then the message receiver can easily verify that the numbers are correct.  Although the task of factoring a composite number into prime factors will require a different amount of time for different types of computers, the time required will probably be within a certain acceptable range for most types of contemporary computers.  The computational work keeps the message sender computer very busy, which, in principle, prevents the message sender computer from sending many messages per hour. 
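The factoring exchange can be sketched as follows.  This is only an illustration under stated assumptions: the function names are hypothetical, the sender uses simple trial division, and the default primes are kept small (five digits each) so that the sketch runs quickly.  A deployed system would need much larger numbers to impose a meaningful cost.

```python
import random


def is_prime(n):
    """Primality test by trial division (adequate for small numbers)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def random_prime(low, high, rng):
    """Pick random candidates in [low, high) until one is prime."""
    while True:
        candidate = rng.randrange(low, high)
        if is_prime(candidate):
            return candidate


def make_challenge(rng, low=10_000, high=100_000):
    """Receiver side: return the product of two random primes."""
    return random_prime(low, high, rng) * random_prime(low, high, rng)


def solve_challenge(n):
    """Sender side: find a factor pair by trial division (the costly step)."""
    if n % 2 == 0:
        return 2, n // 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 2
    raise ValueError("n has no nontrivial factors")


def verify_solution(n, p, q):
    """Receiver side: cheap check that p and q are primes with p * q == n."""
    return p > 1 and q > 1 and p * q == n and is_prime(p) and is_prime(q)
```

The asymmetry is the point of the design: the sender must search for a factor, while the receiver only multiplies two numbers and checks that they are prime.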
Regulating communication speed in the manner described above could be used to reduce the number of spam messages that a person receives.  However, the need to waste computer resources to send messages is unfortunate.  Also, legitimate bulk mail would need to be given a way to bypass this mechanism, otherwise bulk mail would be impractical; i.e., a bank could not send bank statements to its millions of customers without a large array of computers to solve all of the computational problems posed by the message receivers. 

A better, less wasteful, method of spam message reduction is presented in the next section. 

9. Spam message reduction proposal

9.1 Introduction

The following sections describe a proposal for software and social practices which, when combined, are likely to almost completely eliminate spam messages for everyone. 

9.2 Benefits of the proposed method

(1) Any human can send a message to any other human, even when one human does not yet know anything about the other human.  Spontaneous, anonymous communication is possible. 

(2) A person can choose to receive messages from automated systems (such as messages from banks, and from businesses, and from other organizations). 

(3) A message receiver can easily create channels of trust.  A trusted source of messages can be given access to a channel of trust.  Access to a channel of trust means that a message sender can send a message to a message receiver directly, without any challenge.  A message receiver can easily detect violations of any channel of trust (e.g., a trusted message source evidently sharing access to the channel of trust with an untrusted source).  A channel of trust can easily be permanently disabled by the message receiver.  Therefore, the practice of companies sharing customer information with affiliates can easily be detected, and complaints can immediately be made, and the violation of trust can immediately be eliminated. 
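This document does not specify how a channel of trust is implemented.  One possible reading, offered purely as an assumption, is that the message receiver gives each trusted source its own unique secret token.  A message arriving with a token that was issued to a different source then reveals that the token was shared.  The class name, method names, and return values below are hypothetical.

```python
import secrets


class TrustChannels:
    """Hypothetical per-sender tokens kept by a message receiver."""

    def __init__(self):
        self._owner_by_token = {}

    def create_channel(self, trusted_sender):
        """Issue a unique token to one trusted sender."""
        token = secrets.token_hex(16)
        self._owner_by_token[token] = trusted_sender
        return token

    def disable_channel(self, token):
        """Permanently disable a channel; later messages need a challenge."""
        self._owner_by_token.pop(token, None)

    def classify(self, token, apparent_sender):
        """Decide how to treat a message that presents the given token."""
        owner = self._owner_by_token.get(token)
        if owner is None:
            return "challenge"  # unknown or disabled token
        if owner != apparent_sender:
            return "violation"  # token was evidently shared with another party
        return "accept"
```

For example, a token issued to "bank.example" yields "accept" for messages from that source, "violation" when the same token arrives from "ads.example", and "challenge" after the receiver disables the channel.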

9.3 Software requirements of the proposed method

(1) The proposed method requires that a message receiver have a personal Internet web site, or a similar Internet web service that is always accessible and which has a persisting name (e.g., a persisting Internet domain name).  Google, Yahoo!, and Microsoft can also modify their respective e-mail message services to implement the proposed method. 

(2) The proposed method requires automated message sending systems to use a new protocol for sending messages, and requires that the automated systems connect directly with each intended message recipient system. 

(3) The proposed method requires e-mail client software to be modified to be able to access the channels of trust of the proposed method.  Otherwise, messages cannot be sent freely to intended message receivers. 

9.4 Things which will be made obsolete by the proposed method

(1) The practice of sending spam messages will be made obsolete by the proposed method.  The spam message phenomenon for electronic messaging will end. 

(2) All existing methods of identifying and blocking spam messages will be made obsolete by the proposed method.  For example, IP blacklists (e.g., Spamhaus, etc) will be obsolete.  Algorithms to analyze the text of messages for spam message indicators will be obsolete. 

(3) Laws regarding sending spam messages will become obsolete (in addition to already being futile).  Perhaps such laws will be eliminated. 

9.5 The idea of a generic human skill challenge

The following describes the idea of a generic human skill challenge: 
If a resource, such as a communication channel, is only intended for use by humans, and is not intended for use by automated systems, then a challenge can be designed to determine if a request for access to the resource is made by a human. 

A generic human skill challenge is a task that the overwhelming majority of adult humans can easily accomplish without any prior knowledge about the challenge, while, at the same time, any practical automated system would consume resources (time, space, cash, etc) far beyond the value of the resource protected by the challenge. 
Most Internet forums (e.g., "blogs" and social discussion forums) now use a generic human skill challenge to determine whether or not to accept a submission (e.g., a "post") to the forum.  Such discussion forums are intended for individual humans to participate in text-based conversations with other individual humans.  The generic human skill challenge ensures that automated systems are not able to submit text to the discussion. 
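The bookkeeping behind such a challenge can be sketched as follows.  This is a minimal sketch with hypothetical names: the class issues a random answer string and accepts each challenge at most once.  Rendering the answer as a visually distorted image, which is the part that actually resists automation, is assumed to happen elsewhere and is not shown.

```python
import secrets

# Characters that are hard to confuse with each other (no 0/O, 1/I).
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


class ChallengeGate:
    """Hypothetical issue-and-verify bookkeeping for a human skill challenge."""

    def __init__(self):
        self._answers = {}

    def issue(self, length=6):
        """Create a challenge; the answer would be shown as a distorted image."""
        challenge_id = secrets.token_hex(8)
        answer = "".join(secrets.choice(ALPHABET) for _ in range(length))
        self._answers[challenge_id] = answer
        return challenge_id, answer

    def check(self, challenge_id, response):
        """Accept a correct response once; each challenge is single-use."""
        expected = self._answers.pop(challenge_id, None)
        return expected is not None and expected == response.strip().upper()
```

Removing the stored answer on the first attempt means that a solved challenge cannot be replayed, and a wrong guess cannot be retried against the same challenge.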

Individual humans can choose to submit text on behalf of an advertising campaign or on behalf of a social campaign.  However, by requiring humans to create "accounts" for specific discussion forums, and by requiring a person to wait 24 hours before the account is created, and by limiting the number of account creations per day from a particular IP address, and by putting one or more generic human skill challenges in the account creation process, the abuse of discussion forums by humans can be reduced. 

The following image shows a specific kind of generic human skill challenge: reading visually distorted letters and digits, and then typing those letters and digits into a text entry area.  Computers cannot easily or economically perform this task. 
spam_yahoo_mail_registration.jpg
Generic human skill challenges can be compromised by automating a system of getting actual humans to unwittingly solve such challenges on behalf of the automation.  However, recruiting humans to solve such challenges can be costly, and the rate at which a team of humans can solve such challenges is certainly not fast enough to make sending thousands or millions of messages per day possible or economical. 
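The scale of the human effort required can be estimated with simple arithmetic.  The figures used here (10 seconds per challenge, 8 working hours per day, one challenge per message) are hypothetical assumptions for illustration, not measurements.

```python
import math


def challenges_per_person_per_day(seconds_per_challenge, hours_per_day):
    """Number of challenges one person can solve in a working day."""
    return int(hours_per_day * 3600 // seconds_per_challenge)


def people_needed(messages_per_day, seconds_per_challenge, hours_per_day):
    """People required if every message costs one solved challenge."""
    per_person = challenges_per_person_per_day(
        seconds_per_challenge, hours_per_day
    )
    return math.ceil(messages_per_day / per_person)


# Hypothetical figures: 10 seconds per challenge, 8 hours of work per day.
per_person = challenges_per_person_per_day(10, 8)   # 2880
team_size = people_needed(1_000_000, 10, 8)         # 348
```

Under these assumptions, one person solves 2880 challenges per day, so sending one million messages per day would occupy 348 people full time.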

In the year 2003, some ingenious person found an effective (and hilarious) method of recruiting humans to solve generic human skill challenges and unwittingly help other people to send spam messages.  An Internet site was created which would allow a human visitor to look at a sequence of pornographic images.  After looking at several pornographic images, the human would be presented with a generic human skill challenge, which the human must solve to be given permission to view more pornographic images.  The generic human skill challenges are generally quite easy for humans to solve quickly, and so the humans visiting this particular Internet site would gladly solve the challenges in order to see more pornography.  The generic human skill challenges, however, actually originated from active Internet sessions on other Internet sites, and were simply copied and presented in the context of the pornography Internet site.  When the human solved the challenge on the pornography site, the solution was then submitted to the other Internet site (i.e., the target of a spam message campaign) to do something like create an e-mail account or to submit a message to a forum. 

Despite the possibility of recruiting humans to solve generic human skill challenges on behalf of a spam message campaign, this problem will never be very extreme.  This is because of the limited number of humans on the planet, and because of the relatively slow rate at which humans can solve the challenges, and because of the costs required to entice or employ humans to solve the challenges.  (Thus, it would be a mistake to create a particular generic human skill challenge that was amusing and enjoyable, otherwise solving them would become a form of recreation!) 
* * * * * THIS DOCUMENT IS NOT YET COMPLETE. * * * * *

( DOCUMENT CONVERSION TEMPORARILY SUSPENDED ON 2008.10.02. )

If it is more than a week beyond the date mentioned above,
then send me an e-mail message, because I must have forgotten
to finish the document conversion!  LOL!
colinfahey.com