Monday, September 22, 2008

“Name and Shame”, or socially responsible use of your log data

Your logs contain an ever-growing mass of data on spammers. How about making an effort to make that data useful to others?

Those of us who run email services know, from sometimes painful experience, what it takes to ensure that the minimum possible amount of unwanted advertising and scams that may turn out to be security hazards reaches our users' inboxes.

Email: This should have been very simple
Handling email should really be quite simple: The server is configured to know what domains it receives mail for and what users actually exist in those domains. When a machine makes contact and indicates that it intends to deliver email, the server checks whether the recipient is a valid user. If the recipient is valid, the message is received and put in the relevant user's mailbox. Otherwise, a message about the failed delivery, optionally including the reason for the failure, is sent to the user specified as the sender.

If they were all honest people
In each part of the process, the underlying premise is that the communicating partners offer each other correct information. Frequently that is the case, and we have legitimate communications between partners with a valid reason for contacting each other. Unfortunately there are other cases where the implicit trust is abused, such as when email messages are sent with a sender address other than the real one, quite likely a made-up one in a domain that belongs to other people. Some of us occasionally receive delivery failure messages for messages we verifiably did not send[1]. If we take the time to study the contents of those messages, in almost all cases we will find that the messages are spam, sometimes the scamming kind and perhaps part of an attempt to take control of the recipient's computer or steal sensitive data.

What do the ones in charge do, then?
If you ask a typical system administrator what measures are in effect to thwart attempts at delivering unwanted or malicious messages to their users, you will most likely get a description that says, essentially, that the messages are filtered through systems that inspect message contents. If the message does not contain anything known to be bad (known spam or malware) or something sufficiently similar to a known bad, the message is delivered to the user's mailbox. If the system determines that the message contents indicate it should not be delivered, the message is thrown away undelivered, and some system administrators will tell you that the system also sends a message about the decision not to deliver the message to the stated sender address.

Large parts of this are likely part of moderately educated users' passive knowledge, and most of us are likely to accept that content filtering is all we can do to keep dubious or downright criminal elements out of our working environment. For the individual end user, only minor adjustments to this are likely to be possible.

Measures based on observed behavior
But those of us who actually run the service also have the opportunity to study the automatically generated log data from our systems and use spammers' (that is, senders of all types of unwanted mail, including malware) behavior patterns to remove most of the unwanted traffic before actual message content is known. In order to do that, it is necessary to go to a more basic level of network traffic and study sender behavior on the network level.

One of the simpler forms of behavior-based measures emerged in 2003 in the form of a technique called greylisting. The technique is based on a slightly pedantic and rather creative interpretation of established standards. The Internet protocol for email transfer, SMTP (the Simple Mail Transfer Protocol), allows servers that experience temporary problems that make it impossible to receive mail to report a specific 'temporary local problem' status code to correspondents trying to deliver mail. Correctly configured senders will interpret and act on the status code and delay delivery for a short time. In most circumstances, the delivery will succeed within a short time. It is worth noting that this part of the standard was formulated to help the mail service's reliability. At most times, the retries happen without alerting the person who wrote and sent the message. The messages generally reach their destination eventually.
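The sender's side of this arrangement can be sketched in a few lines of Python. This is only an illustration of how SMTP reply code classes are conventionally interpreted (2xx success, 4xx temporary failure, 5xx permanent failure); the function name and the returned strings are made up for this example and are not taken from any particular mail server.

```python
def sender_action(reply_code):
    """Map a final SMTP reply code to what a well-behaved sender does next."""
    if 200 <= reply_code < 300:
        return "delivered"
    if 400 <= reply_code < 500:
        # Temporary failure: keep the message queued and try again later.
        return "retry later"
    if 500 <= reply_code < 600:
        # Permanent failure: give up and report back to the stated sender.
        return "bounce to sender"
    return "not a final delivery reply"


if __name__ == "__main__":
    for code in (250, 451, 550):
        print(code, "->", sender_action(code))
```

A 'temporary local problem' reply falls in the 4xx class, so a correctly configured sender simply keeps the message in its queue and tries again.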

Lists of grey and black, little white lies
Greylisting works like this: the server reports a temporary local problem to all attempts at delivery from machines the server has not exchanged mail with earlier. Experience shows that the pre-experiment hypothesis was mainly correct: Essentially all machines that try to deliver valid email are configured to check return codes and act on them, while almost all spam senders dump as many messages as possible, and never check any return codes. This means that somewhere between eighty and the high nineties percent of all spam volume is discarded at the first delivery attempt (before any content filtering), while legitimate email reaches its intended recipients, occasionally with delayed delivery of the initial message from a new correspondent.
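For readers who prefer code to prose, here is a minimal Python sketch of the greylisting decision. It is a toy model, not how spamd or any other real greylister is implemented: it keys on the sending IP address only, the class name and the 300 second waiting period are assumptions chosen for the example, and 451 is used as the 'temporary local problem' reply.

```python
import time


class Greylist:
    """Toy greylisting decision keyed on the sending IP address only."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay  # seconds a new sender must wait (assumed value)
        self.first_seen = {}        # ip -> time of the first delivery attempt

    def check(self, ip, now=None):
        """Return 451 (temporary failure, try later) or 250 (accept)."""
        now = time.time() if now is None else now
        first = self.first_seen.setdefault(ip, now)
        if now - first < self.min_delay:
            return 451  # unknown or too-eager sender: ask it to come back later
        return 250      # the sender waited and retried, so let the mail through


if __name__ == "__main__":
    g = Greylist()
    print(g.check("192.0.2.1", now=0))    # first contact
    print(g.check("192.0.2.1", now=600))  # retry after ten minutes
```

A sender that never retries never gets past the first 451, which is exactly the behavior the technique counts on.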

One other behavior-based technique that predates greylisting is the use of 'blacklists' - lists of machines that have been classified as spam senders - and rejecting mail from machines on such lists. Some groups eventually started experimenting with 'tarpits', a technique that essentially means your end of the communication moves along very slowly. A much-cited example is the spamd program, released as a part of the free operating system OpenBSD in May of 2003. The program's main purpose at the time was to answer email traffic from blacklisted hosts at a rate of one byte per second, never leaving a blacklisted host any real chance of delivering messages.
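The tarpit idea itself is simple enough to sketch. The following Python fragment is an illustration only and has nothing to do with spamd's actual implementation; the function name and its parameters are invented for the example. It hands a reply to a send function one byte at a time, pausing after each byte.

```python
import time


def stutter(send, data, delay=1.0, sleep=time.sleep):
    """Pass data to send() one byte at a time, pausing delay seconds after each."""
    for i in range(len(data)):
        send(data[i:i + 1])
        sleep(delay)
    # Minimum number of seconds the other end was kept waiting.
    return len(data) * delay


if __name__ == "__main__":
    chunks = []
    # A very short delay here, only so the demonstration finishes quickly.
    stutter(chunks.append, b"220 hello\r\n", delay=0.01)
    print(chunks)
```

With a real connection, send could be something like a connected socket's sendall method, and at one byte per second even a short greeting line keeps the other end occupied for a noticeable time.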

The combination of blacklists and greylisting proved to work very well, but the quest for even more effective measures continued. Yet again, the next logical step grew out of observing spammer behavior. We saw earlier that spammers do not bother to check whether individual messages are in fact delivered.

Laying traps and bait
By early 2005, these observations led to a theory that was soon proved useful: If we have one or more addresses in our own domains that are certain to never receive any valid mail, we can be almost a hundred percent certain that any mail addressed to those addresses is spam. The addresses are spamtraps. Any machines that try to deliver spam to those addresses are placed in a local blacklist, and we keep them busy by answering their traffic at a rate of one byte per second. The machines stay on the blacklist for 24 hours unless otherwise specified.
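The bookkeeping described here can be modelled in a few lines of Python. This is a hedged sketch of the idea rather than spamd's own code: the class and method names are hypothetical, and the only figure taken from the text is the 24 hour lifetime of a blacklist entry.

```python
TRAP_SECONDS = 24 * 60 * 60  # blacklist lifetime from the text: 24 hours


class Greytrap:
    """Toy spamtrap bookkeeping: trap hosts that mail known-bad addresses."""

    def __init__(self, spamtraps):
        self.spamtraps = {addr.lower() for addr in spamtraps}
        self.trapped = {}  # ip -> time at which the blacklist entry expires

    def saw_recipient(self, ip, rcpt, now):
        """Record a delivery attempt; return True if it hit a spamtrap."""
        if rcpt.lower() in self.spamtraps:
            self.trapped[ip] = now + TRAP_SECONDS
            return True
        return False

    def is_trapped(self, ip, now):
        """Return True while the host's blacklist entry has not expired."""
        expiry = self.trapped.get(ip)
        if expiry is None:
            return False
        if now >= expiry:
            del self.trapped[ip]  # entry has aged out of the local blacklist
            return False
        return True


if __name__ == "__main__":
    trap = Greytrap(["no-such-user@example.com"])
    trap.saw_recipient("192.0.2.9", "no-such-user@example.com", now=0)
    print(trap.is_trapped("192.0.2.9", now=3600))
```

Hosts for which is_trapped returns True are the ones that would be answered at tarpit speed.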

The new technique, dubbed greytrapping, was launched as part of the improved spamd in OpenBSD 3.8, released May 2005. In early 2006, Bob Beck, one of the main spamd developers, announced that his greytrapping hosts at the University of Alberta generate a downloadable blacklist based on the greytrap data, updated once per hour, ready for inclusion in spamd setups elsewhere. This is obviously useful. Machines that try to deliver mail to addresses that were never deliverable most likely do not have any valid mail to deliver, and we are doing society at large a favor by delaying their deliveries and wasting their time to the maximum extent possible.

It is worth mentioning that during the period we have used the University of Alberta blacklist at our site, it has contained a minimum of twenty-some thousand IP addresses, and during some busy periods has reached almost two hundred thousand.

You can help, too
Fortunately you do not need to be a core developer to be able to contribute. The exact same tools Bob Beck uses to generate his blacklist are available to everybody else as part of OpenBSD, and they are actually not very hard to use productively.

Here at BSDly.net and associated domains we saw during the (Northern hemisphere) summer of 2007 a marked increase in email sent to addresses that have never actually existed in our domains. This was clearly a case of somebody, one or more groups, making up or generating sender addresses to avoid seeing any reactions to the spam they were sending. This in turn led to us starting an experiment that is still ongoing. We record invalid addresses in our own domains as they turn up in our logs. From these addresses we pick the really improbable ones, put them in our local spamtrap list and publish the list on a specific web page on our server[2].

Experience shows that it takes a very short time for the addresses we put on the web page to turn up as target addresses for spam. This means that we have succeeded in feeding the spammers data that makes it easier for us to stop their attempts, and frequently we make spam senders use significant amounts of time communicating with our machines with no chance of actually achieving anything. The number of spamtrap addresses has reached fifteen thousand, and we have at times observed groups of machines that spend weeks working through the whole list, with average time spent per unsuccessful delivery attempt clocked at roughly seven minutes.

As a byproduct of the active spammer trapping we started exporting our own list of machines that had been trapped via the spamtrap addresses during the last 24 hours and making the list available for download. This list's existence has only been announced via the spamtrap addresses web page and a few blog posts, but we see that it's retrieved, most likely automatically, at intervals and is apparently used by other sites in their systems.

At this point we have established that it is possible to create a system that makes it very unlikely that spam actually makes it through to users, while at the same time it is quite unlikely that legitimate mail is adversely affected. In other words, we have the cyberspace equivalent of good fences around our property, but spammers are still out there and may create serious problems for those who are without adequate protection.

Collecting evidence, or at least seeking clarity
We would have loved to see law enforcement take the spammer problem seriously. This is not just because the spam that reaches its targets is irritating, but rather because almost all spam is sent via equipment that spammers use without the legal owners' consent. We would have liked to see resources allocated in proportion to the criminal activity the spam represents. We would have liked to help, but it might seem that we would not have usable evidence available due to the fact that we do not actually receive the messages the spammers try to deliver. On the other hand, we have at all times a list of machines that have tried to deliver spam, identified with an almost hundred percent certainty based on the spammer trapping addresses. In addition, our systems routinely produce logs of all activity, with the level of detail we set ourselves. This means that it is possible to search our logs for the IP addresses that have tried to deliver spam to our systems during the last 24 hours, and get a summary of what those machines have done.

A search of this kind typically yields a result like this:

Aug 10 02:34:29 skapet spamd[13548]: 190.20.132.16: connected (4/3)
Aug 10 02:34:41 skapet spamd[13548]: (GREY) 190.20.132.16: <kristie@iland.net> -> <asasaskosmicki@bsdly.net>
Aug 10 02:34:41 skapet spamd[13548]: 190.20.132.16: disconnected after 12 seconds.
Aug 10 03:41:42 skapet spamd[13548]: 190.20.132.16: connected (14/13), lists: spamd-greytrap
Aug 10 03:42:23 skapet spamd[13548]: 190.20.132.16: disconnected after 41 seconds. lists: spamd-greytrap
Aug 10 06:30:35 skapet spamd[13548]: 190.20.132.16: connected (23/22), lists: spamd-greytrap becks
Aug 10 06:31:16 skapet spamd[13548]: 190.20.132.16: disconnected after 41 seconds. lists: spamd-greytrap becks


The first line here states that 190.20.132.16 contacts our system at 02:34:29 AM on August tenth, as the fourth active SMTP connection, three of them blacklisted. A few seconds later it appears that this is an attempt at delivering a message to the address asasaskosmicki@bsdly.net. That address was already one of our spamtraps, most likely one that was harvested from our logs and was originally made up somewhere else. After 12 seconds, the machine disconnects. The attempted delivery to a spamtrap address means that the machine is added to our local spamd-greytrap blacklist, as indicated in the entry for the next attempt about one hour later. This second attempt lasts for 41 seconds. The third try in our log material happens just after 06:30, and the addition of the list name becks indicates that in the meantime the machine has tried to deliver to one of Bob Beck's spammer trap addresses and has entered that blacklist, too.
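Log lines of this shape are regular enough to pick apart automatically. The Python sketch below is one way to do that. The regular expressions are written against the excerpt shown above only, so they are an assumption about the format rather than a complete description of everything spamd can log, and the field names (total, black and so on) are invented here.

```python
import re

# Common syslog prefix, e.g. "Aug 10 02:34:29 skapet spamd[13548]: "
PREFIX = r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ spamd\[\d+\]: "

CONNECT = re.compile(
    PREFIX + r"(?P<ip>[\d.]+): connected \((?P<total>\d+)/(?P<black>\d+)\)"
    r"(?:, lists: (?P<lists>.+))?$")
DELIVERY = re.compile(
    PREFIX + r"\((?P<kind>GREY|BLACK)\) (?P<ip>[\d.]+): "
    r"<(?P<sender>[^>]*)> -> <(?P<rcpt>[^>]*)>$")
DISCONNECT = re.compile(
    PREFIX + r"(?P<ip>[\d.]+): disconnected after (?P<secs>\d+) seconds\."
    r"(?: lists: (?P<lists>.+))?$")

PATTERNS = (("connect", CONNECT), ("delivery", DELIVERY),
            ("disconnect", DISCONNECT))


def parse_line(line):
    """Return (event, fields) for a recognized spamd log line, else None."""
    line = line.strip()
    for event, pattern in PATTERNS:
        m = pattern.match(line)
        if m:
            fields = m.groupdict()
            if "lists" in fields:
                # Blacklist names are space separated; none listed means [].
                fields["lists"] = (fields["lists"] or "").split()
            return event, fields
    return None


if __name__ == "__main__":
    sample = ("Aug 10 06:30:35 skapet spamd[13548]: 190.20.132.16: "
              "connected (23/22), lists: spamd-greytrap becks")
    print(parse_line(sample))
```

Lines the patterns do not recognize, such as header lines logged for blacklisted hosts, simply come back as None, so the function can be run over a whole log file without special cases.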

Unfortunately, it is unlikely that logs of this kind are sufficient as evidence for criminal prosecution purposes, but the data may be of some use to those who have an interest in keeping machines in their care from sending spam.

“Name And Shame”, or just being neighborly?
After some discussions with colleagues I decided in early August 2008 to generate daily reports of the activities of machines that had made it into the local blacklist on bsdly.net and publish the results. If all we have is the fact that a machine has entered a blacklist as an IP address (such as 24.165.4.190), and there is no supporting material, it is fairly easy for whoever is in charge of that address range to just ignore the entry as an unsupported allegation. We hope that when whoever is responsible for the network containing 24.165.4.190 sees a sequence like this,

Host 24.165.4.190:
Aug 10 02:57:40 skapet spamd[13548]: 24.165.4.190: connected (9/8)
Aug 10 02:57:54 skapet spamd[13548]: (GREY) 24.165.4.190: <hand@itnmiami.com> -> <kimberlee.ledet@ehtrib.org>
Aug 10 02:57:55 skapet spamd[13548]: (GREY) 24.165.4.190: <hand@itnmiami.com> -> <kimberliereffett@ehtrib.org>
Aug 10 02:57:56 skapet spamd[13548]: 24.165.4.190: disconnected after 16 seconds.
Aug 10 02:58:16 skapet spamd[13548]: 24.165.4.190: connected (8/6)
Aug 10 02:58:30 skapet spamd[13548]: (GREY) 24.165.4.190: <brunson@jebconet.com> -> <kimberlee.ledet@ehtrib.org>
Aug 10 02:58:31 skapet spamd[13548]: (GREY) 24.165.4.190: <brunson@jebconet.com> -> <kimberliereffett@ehtrib.org>
Aug 10 02:58:32 skapet spamd[13548]: 24.165.4.190: disconnected after 16 seconds.
Aug 10 02:58:39 skapet spamd[13548]: 24.165.4.190: connected (7/6), lists: spamd-greytrap
Aug 10 03:02:24 skapet spamd[13548]: (BLACK) 24.165.4.190: <aarnq@abtinc.com> -> <kimberlee.ledet@ehtrib.org>
Aug 10 03:03:17 skapet spamd[13548]: (BLACK) 24.165.4.190: <aarnq@abtinc.com> -> <kimberliereffett@ehtrib.org>
Aug 10 03:05:01 skapet spamd[13548]: 24.165.4.190: From: "Preston Amos" <aarnq@abtinc.com>
Aug 10 03:05:01 skapet spamd[13548]: 24.165.4.190: To: kimberlee.ledet@ehtrib.org
Aug 10 03:05:01 skapet spamd[13548]: 24.165.4.190: Subject: Wonderful enhancing effect on your manhood.
Aug 10 03:06:04 skapet spamd[13548]: 24.165.4.190: disconnected after 445 seconds. lists: spamd-greytrap

they will find that to be sufficient grounds for action of some kind. The material we generate is available via the “The Name And Shame Robot” web page. The latest complete report of log excerpts is available via links at that page. Previous versions are archived offline, but will be made available on request to parties with valid reasons to request the data.
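For anyone who wants to produce similar per-host excerpts from their own logs, the grouping step might look something like the Python sketch below. To be clear, this is not the script that generates the reports mentioned above; the function name, the regular expression and the output layout (modelled on the "Host ...:" excerpt shown earlier) are assumptions made for the example.

```python
import re

# Find the IP address that follows the "spamd[pid]: " prefix, with or
# without a leading "(GREY) " or "(BLACK) " marker.
IP_FIELD = re.compile(
    r"spamd\[\d+\]: (?:\((?:GREY|BLACK)\) )?(\d+\.\d+\.\d+\.\d+):")


def build_report(log_lines, trapped_ips):
    """Group log lines by host, keeping only hosts in trapped_ips."""
    by_host = {}
    for line in log_lines:
        m = IP_FIELD.search(line)
        if m and m.group(1) in trapped_ips:
            by_host.setdefault(m.group(1), []).append(line.rstrip("\n"))
    sections = ["Host %s:\n%s" % (ip, "\n".join(lines))
                for ip, lines in by_host.items()]
    return "\n\n".join(sections)


if __name__ == "__main__":
    log = [
        "Aug 10 02:57:40 skapet spamd[13548]: 24.165.4.190: connected (9/8)",
        "Aug 10 02:57:56 skapet spamd[13548]: 24.165.4.190: disconnected after 16 seconds.",
    ]
    print(build_report(log, {"24.165.4.190"}))
```

The set of trapped addresses would come from whatever list your own setup maintains, for instance the hosts that have hit a spamtrap during the last 24 hours.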

“The Name And Shame Robot” is rather new, and it is too early to say what effect, if any, the publication has had. We hope that others will do similar things based on their local log data or even synchronize their data with ours. If you are interested in participating, please make contact.

Regardless of other factors, we hope that the data can be useful as indicators of potential for improvement in the networks that appear regularly in the reports as well as material for studies that will produce even better techniques for spam avoidance.

A shorter version of this article in Norwegian was published in Computerworld's Norwegian edition on August 22, 2008; the longer Norwegian version is available as an earlier blog post.

[1] A collection of such failure messages collected earlier this year is available at http://www.bsdly.net/~peter/joejob-archive.2008-07-28.txt.

[2] See http://www.bsdly.net/~peter/traplist.shtml. References at that page lead to my blog, which consists of public field notes, as well as other relevant material.

About the author
Peter N. M. Hansteen (peter@bsdly.net) is a consultant, system administrator and writer, based in Bergen, Norway. In October 2008, he joined FreeCode, the Norwegian free software consultancy. He has written various articles as well as "The Book of PF", published by No Starch Press in 2007, and lectures on Unix- and network-related topics. He is a main organizer of BLUG (Bergen (BSD and) Linux User Group), vice president of NUUG (Norwegian Unix User Group) and an occasional activist for EFF's Norwegian sister organization EFN (Elektronisk Forpost Norge).