Real-world experiences with Challenge-Response email filtering

Enno Davids


The problem of spam is one that seems to have no end as yet. Many solutions, both technical and legislative, have been proposed and work with varying degrees of efficacy. This paper discusses some real-world experiences with a type of filtering known as challenge-response filtering and why you might want to consider it as a possible spam fighting solution.

The Mission

SMTP based email systems were invented in a simpler and possibly more naive time. The focus was on getting mail delivered at all, and the sheer utility of being able to swap non-paper based mail with people near instantaneously across cities, states, countries and continents is hard to describe to those of you who came along later.

Somewhere along the way, people discovered that they could send mail in a broadcast manner as well as a directed manner, especially if the underlying networks co-operated. (Some of you will remember mail being sent to root@*.oz successfully, for instance.) It didn't take long for budding if lazy entrepreneurs to decide to leverage this new medium for advertising and thus was born the modern scourge of spam.

Wind forward to the present day and the email environment has changed radically. It is not unheard of for individuals to try to deal with thousands of emails a day, and servers that deal with two or three orders of magnitude more are common. But over and above this, a large proportion of the email that is 'handled', typically between 1/3 and 1/2 at the time of writing, is spam. Today's mail system has thus over time suffered a new indignity. Not only do we want it to deliver mail quickly, efficiently and reliably, but we have also decided to make it responsible for not delivering some of the incoming mail flood at all. This is the new mission statement for spam filtering solutions.

  • Stop SPAM.
  • Don't stop HAM.

There is also a question here of exactly what level of tolerance we have for spam that makes it past the filtering facilities. In the 'good old days'(tm) we used to see only a handful of spams and our tolerance for the few that slipped through the net was high.

As time passed, the spammers became more adept at slipping their messages through and as they did we became less and less prepared to accept any real amount of spam. The filters became more aggressive.

Now we entered an arms race period (which we are still in). Spammers find new techniques to slip the cordon, we invent ever more subtle and ever more aggressive filtering. Along the way we have gone from tolerating a few spams that get through to requiring fewer spams making it through the cordon at the expense of tolerating a little real email being stopped. People will tend to err on one side or the other depending on their level of dislike for spam (or perhaps how long they've suffered).

But it pays to consider the effects of your filter on the overall behaviour of your email system. Unsurprisingly many people don't consider the effects of the filter well. After all, their email system is mostly about delivery not about non-delivery.

So here's a thought exercise for you. The CEO of a large corporate complains that he is tired of seeing penis enlargement ads in his inbox. This is not uncommon. I've seen it firsthand in a few corporate environments now. The CEO announces he requires stronger filtering. He also notes though that he will not tolerate mail from the board of directors to him being stopped. He is required by law to respond to all enquiries by the stock exchange on behalf of shareholders and in some environments to enquiries from shareholders themselves (certainly from large institutional investors if he knows what's good for him).

So the challenge in such an environment is to build a filter that works at all: a filter with absolutely NO false negatives and NO false positives. Hiring extra PAs to read his unfiltered email stream is not an acceptable option. (And in some legal environments it may be borderline illegal to subject young female staff to a barrage of penis enlargement ads and porno spam, even if you thin them out with bogus mortgage offers, phishing attacks and the like.)

What is SPAMminess?

So the first question then becomes how do we judge spamminess. The original approaches were fairly mundane keyword matching systems. Say the wrong thing and your message is relegated to the bit bucket...

As things progressed, phrases took the place of simple keywords, and after phrases came statistical methods based on word probabilities, such as Bayesian analysis. The top of the heap are now composite systems that apply a range of spam categorization techniques and then combine them in some manner. (Weighted scoring systems are the most common, in the guise of the SpamAssassin open source system.)
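
As a rough illustration of the weighted scoring idea, the sketch below sums the weights of whichever rules match a message and compares the total to a threshold. The rule names, patterns, weights and threshold are all invented for illustration; they are not taken from SpamAssassin or any other real ruleset.

```python
import re

# Hypothetical rules: (name, pattern, weight). Not from any real ruleset.
RULES = [
    ("pharma_keyword", re.compile(r"v[i1]agra", re.IGNORECASE), 3.0),
    ("urgency_phrase", re.compile(r"act now", re.IGNORECASE), 1.5),
    ("money_phrase", re.compile(r"free money", re.IGNORECASE), 2.0),
]

# Assumed cut-off; a real system would make this tunable.
THRESHOLD = 5.0


def spam_score(message):
    """Sum the weights of every rule whose pattern appears in the message."""
    return sum(weight for _name, pattern, weight in RULES if pattern.search(message))


def is_spammy(message, threshold=THRESHOLD):
    """Make the binary spam/ham decision from the composite score."""
    return spam_score(message) >= threshold
```

No single rule is enough to condemn a message here; it takes several weak signals together to cross the threshold, which is the point of combining techniques.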

The real problem we have with categorizing spam though is hidden in the name we've chosen. There is of course an alternate nomenclature and it makes our problem a little easier to see.

The alternate naming is:

  • UCE - unsolicited commercial email
  • UBE - unsolicited bulk email

As it happens, filters are getting pretty good at spotting the 'B' and 'C' characteristics. Properly trained Bayesian filters have good accuracy and repeatability in this area. (Which makes the question of how to properly train one all the more interesting...) It's the 'U' component of the name our filters have more trouble with. How does a filter test for solicitation after all? For an example here, try getting your filter to distinguish real email from your bank from a phishing attack that uses most of your bank's verbiage and web site links. At this point in time the solutions are, I'm afraid, fairly ugly, namely all mail that triggers the 'B' or 'C' tests is commonly assumed to be spam. Try signing up to a new mailing list and then face the task of re-training your Bayesian filter when nothing arrives. And this doesn't yet deal with the spammers modulating their message to try fooling the filters ('V14Gr4' in place of that popular recreational pharmaceutical... this tactic clearly no longer works, but there was a time when it did of course.)

Over-riding your filter

The way we now deal with some of these problems is of course to put in some mechanism that allows us to override our filtering. These are the traditional overrides:

  • blacklists - mail we never want to see delivered, identified by keywords or one of the other techniques we saw earlier.
  • whitelists - mail that must be delivered (the board's email addresses from our earlier examples).

While this is a great solution (it deals with the problem we had in a direct and efficacious manner), it suffers from one remaining drawback. Namely, how do we seed these files? Most typically we fill them or add to them immediately after something goes wrong. Thus we whitelist as soon as someone complains of non-delivery. We blacklist addresses immediately after we see there is spam coming from there. It's a nice mechanism, but it's reactive and by definition it reacts just one moment too late.

It's also worth noting that the world doesn't really cooperate with us here. People who operate mailing lists for instance could tell us on the sign up page that "list mail will always have the following sender address on it", allowing us to whitelist that address prior to the first email arriving. No one does this of course and the population of people who could use this information (us I guess) is equally small.
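
A minimal sketch of how such overrides might sit in front of a scoring filter follows. The function name, the score threshold and the choice to let the whitelist win over the blacklist are assumptions made for the example, not a description of any particular product.

```python
def disposition(sender, score, whitelist, blacklist, threshold=5.0):
    """Return 'deliver' or 'spam', letting the lists override the filter score."""
    address = sender.strip().lower()
    if address in whitelist:
        # Mail that must be delivered, whatever the filter thinks of it.
        return "deliver"
    if address in blacklist:
        # Mail we never want to see, however innocent it looks.
        return "spam"
    # Everyone else is judged on the filter's score alone.
    return "spam" if score >= threshold else "deliver"
```

For example, `disposition("chair@board.example", 9.0, {"chair@board.example"}, set())` delivers the message despite its high score. Nothing in the sketch helps with seeding the two sets, which remains the reactive step.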

What do you do when SPAM is found?

This leads to the next obvious question. When your filter finds spam, or at least something it thinks could be spam, what does it do? There are a couple of common actions:

  • drop - how will users know?
  • block & hold - how will users know?
  • challenge - taunt the sender?
  • label/advise - defer to MUA policy? (i.e. let the user decide to quarantine/discard/etc.)
  • nothing - historical

You may have seen most of these in action. They ALL have their drawbacks I'm afraid. In slightly more detail then:

  • Dropping mail is the now classic way of making your mail system appear unreliable. Especially if it drops email silently. The difference between this policy and a buggy mail system can be quite subtle.

  • Block & Hold is the action where suspect mail is held in a quarantine area against the chance that someone might come looking for it. The real problem here is that the sender typically doesn't know the mail has been stopped en route and the recipient likely doesn't know he or she has failed to receive something. This is where you then fall back to the "did you get my email?" phone call and often just discuss the subject on the phone and forget about the email altogether. Big win for electronic communications.

  • Challenge then adds the step of advising the sender that his or her email was stopped. The sender, if you think about it, is well placed to want to know. Often the sender is the only one who cares that an email arrives promptly. (c.f. the return receipt mechanism so they can tell too.)

  • Label/advise is the mode where the spam filter just labels the email with a spamminess score or tag. This relies on the user's mail agent to do something with the tags (e.g. send them to a spam folder, etc.) In fact, the actions available to the mail user agent are also pretty much those listed above. All that the label/advise mechanism does is centralise the computational load of determining spamminess (when clearly spreading it out might be smarter...). Note that we can't build hybrids that, for instance, quarantine the email and then mail the recipient that we've done so, as we don't reduce the spam load in that scenario. (Although daily summaries can be used to good effect in this and other scenarios, against the cost of slowing delivery.)

  • Doing nothing is of course still a widespread option. For instance, it is completely normal (and desirable) for your ISP to not spam filter your email, especially if they aren't where your email gets stored prior to it being read. More generally any SMTP host which is acting only as a relay shouldn't also be filtering. This has become less important with the demise of the open relay and the older store and forward model of cooperative mail delivery in networks like UUCP, ACSnet and to a lesser extent the old pre-spam Internet as a whole.

So, as we noted, each method has its drawbacks, but which is the lesser of the evils...

It is briefly useful to look at the actual results spam filters produce. Whilst there are scoring systems and statistical analyses and the like, somewhere in the filter a binary decision is made. Spam, or not spam. The results thus are:

  • positive - spam found
  • negative - ham found

In fact it's just a bit subtler than this:

  • true positive - spam found
  • true negative - ham found
  • false positive - ham mis-identified as spam
  • false negative - spam mis-identified as ham

The latter two are the categories we mostly need to worry about... they also happen to be the categories that most spam filters deal with only poorly.
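
To make the four categories concrete, the sketch below tallies filter decisions against the truth and derives the two error rates that bake-offs usually quote. The function names and the sample data in the usage note are made up for illustration.

```python
from collections import Counter


def outcome(is_spam, flagged):
    """Name the category for one message: what it was vs. what the filter said."""
    if flagged:
        return "true positive" if is_spam else "false positive"
    return "false negative" if is_spam else "true negative"


def tally(results):
    """Count outcomes for an iterable of (is_spam, flagged) pairs."""
    return Counter(outcome(is_spam, flagged) for is_spam, flagged in results)


def error_rates(counts):
    """Return (false positive rate over ham, false negative rate over spam)."""
    ham = counts["true negative"] + counts["false positive"]
    spam = counts["true positive"] + counts["false negative"]
    fp_rate = counts["false positive"] / ham if ham else 0.0
    fn_rate = counts["false negative"] / spam if spam else 0.0
    return fp_rate, fn_rate
```

With two spams (one caught, one missed) and four hams (one wrongly flagged), `error_rates` gives a false positive rate of 0.25 and a false negative rate of 0.5.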

In fact, the performance of a filter on the last two categories is mostly how filters are judged. You will often see bake-off style comparisons of various filter implementations and filter technologies, and the comparisons mostly deal with various interpretations of the above divisions. Filter efficacy is often seen as little more than a tuning problem and beyond that systemic behaviour is seen as a fundamental characteristic of the filter, which is to say that people believe there is nothing further that can be done. This is of course why combination filters have become so common and indirectly why what happens after the spam filtering process is seen as such an intractable problem in some quarters.

So, to some extent the question of how to recover from a false categorization of email is one that strikes at the heart of the filtering problem. Especially as we guarantee by definition that a perfect filter does not exist and never will, and the obvious corollary is that we must have a strategy for dealing with filtering failures.

The typical answers to miscategorisation are those we might expect. Human intervention, i.e. some human inspecting the email and overriding the filter's opinion of spam/ham. Hopefully there is an associated feedback mechanism here to improve the filter's performance in the future. This in turn tends to be the tuning we spoke of earlier. Tuning the scoring system in a composite filter, adding to the ham/spam corpus your Bayesian filter is trained on, adjusting black/white lists and so on. This is the reactive, 'just too late' response we noted earlier.

For traditional filters this almost always means finding spam in your inbox or trolling through a spam quarantine (either central or MUA based local) for mis-identified ham. Either way it's a thankless task and one we seldom undertake unless we are forced to.

In the challenge/response model of course we have the same problem except the task is vastly simplified as the sender is notified when the one or occasionally few emails they've sent have been tagged as spam. The sender can react promptly and the problem can be solved. Having reacted, automated mechanisms can update/tune/tweak the filter in the same manners as described above.

All of these scenarios are basically due to the single fact that we tend to trust our filters too much. They're not perfect, we know that, but they're all we have between us and the crap-flood that is attacking our inboxes and short of deploying humans (with their much subtler abilities to discern ham from spam) we have little choice but to trust in the automation we have constructed.

And when it fails, we have this conversation....

"Did you get my email today?"
"No? I'd better check my spam filter."

The view, rightly or wrongly, is that "as spam volumes have increased, the reliability of mail delivery has decreased". That somehow the sheer volume of mail is making the infrastructure shaky or non-deterministic.

The reality is of course merely that as we have increased the aggressiveness of our filtering, we have compromised the ability of our mail systems to perform their tasks. I say, "as spam filters have become more aggressive (from need) the likelihood of legitimate email being mishandled and lost has increased".

The well known quote attributed to Walt Kelly's Pogo cartoon strip seems oddly appropriate: "we have met the enemy and he is us"

Challenge based MTAs

So let's look at how a challenge based MTA deals with these problems. The differences are in fact less than most people debating the usefulness of challenge/response as a concept seem to realize.

Like traditional filters, all incoming email is checked for spamminess. Whitelisted email is exempted in the usual manner. Spammy mail is always quarantined, except that now, someone is informed. As we noted earlier, we can't really tell the recipient without simply not helping the problem. Trading a spam email for a notification that an email might have been spam and was stopped is not a win for the recipient or their mailbox. The signal (the mail they actually want) still disappears in the noise (the spam or the notifications of spam). A solution here is to only produce a summary at day's end, say, trading lots of spam email for a single notification, but the cost then is a delay to email of at least a day, more if you only check the notification infrequently.

As the name implies, in the challenge/response filter model an email is sent back to the sender of the email. This then is the first and frankly in some ways most important difference between conventional spam filtering and C-R systems. Mail is never silently stopped. Someone is notified, and it's the someone who should have the greatest interest in seeing that delivery takes place (or they wouldn't have sent the email in the first place, right?) A bit of logging goes a long way here too as an aside. A log which unequivocally shows the disposition of all incoming emails can be quite useful in the inevitable "the filter ate my mail" discussions some users insist on having.

Normally though in this MTA model, the challenge is responded to and the MTA will re-insert the email in the queue for normal delivery. Given the automated nature of this processing it can also undertake the tuning of the filter automatically and thus continually refine the filter without any need for further human intervention. This is still technically a reactive process you'll note, except the delays between the filter making a mistake and it being acted on are now as small as we can make them. This automated tuning of the filter may in fact be a first, not yet achieved by other filter technologies which have failed to close the feedback loop. Once any response has been received from a mail source address that address should be whitelisted and thus should never again be challenged by the C-R filter. Thus, the user experience for real people should be a once off inconvenience and no more than that. (In practice people tend to change email addresses and have a number of places they might send email from. The downside is that each address will likely incur this once off cost... Experience shows though that few of us change email address as often as once a month and many of us doggedly hang on to addresses for decades.)

It's also interesting to note the various discussions that suggest that a solution to the spam problem is to make the senders of email incur some computational cost, akin to paying for a stamp in real world paper mail. Like the real world, we (occasionally) require the sender to co-operate before we deliver email. (Note that the advent of distributed botnets makes the notion of computational cost stopping spammers moot, as they distribute the sending operation across as many computers as it takes to keep their throughput up and they don't care about making zombie systems churn away longer on individual emails.)
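
The flow just described (quarantine, challenge, release on response, whitelist from then on) can be sketched in a few lines. This is a toy model under stated assumptions: the class and method names are hypothetical, the spamminess verdict is passed in rather than computed, the challenge email itself is represented only by the token it would carry, and the automatic filter tuning is left out.

```python
import secrets


class ChallengeResponseFilter:
    """Toy model of the C-R flow; not a real MTA integration."""

    def __init__(self):
        self.whitelist = set()
        self.quarantine = {}  # token -> (sender, message)
        self.inbox = []       # stands in for normal delivery

    def receive(self, sender, message, looks_spammy):
        """Deliver the message, or hold it and return a challenge token."""
        sender = sender.lower()
        if sender in self.whitelist or not looks_spammy:
            self.inbox.append((sender, message))
            return None
        token = secrets.token_hex(8)
        self.quarantine[token] = (sender, message)
        return token  # would be embedded in the challenge sent to the sender

    def respond(self, token):
        """Handle a response: release the held mail and whitelist the sender."""
        held = self.quarantine.pop(token, None)
        if held is None:
            return False  # unknown or already used token
        sender, message = held
        self.whitelist.add(sender)
        self.inbox.append((sender, message))
        return True
```

After one successful `respond`, later mail from the same address is delivered directly even when it looks spammy, which is the once off inconvenience described above.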

The reader will also note that in the discussion above we've hand waved around the question of how a C-R filter decides spamminess or not of an email. In fact, C-R can use all the same methods as more traditional filters. In fact, in my opinion it should.

Much of the acrimony around C-R systems is a direct result of the fact that with C-R in place to catch mis-identified email, it is now possible to use very poor filtering technologies to identify spam, including the ultimate case of simply challenging all incoming mail and making everybody whitelist themselves once before their email will pass the filter unmolested.

It should be noted that earlier we said that each sending address needs to respond once to be whitelisted and pass through the filter unmolested from then on. Good filtering algorithms like Bayesian or scoring based ones can in fact reduce even this pain. Most non-spam will pass through this system without ever receiving a challenge, without any need to be white or blacklisted. Most users and most of their correspondents will never see the C-R mechanism, until that day when they feel compelled to share that great penis enlargement pill joke. And then, it's there to allow them to continue their conversation without half of it disappearing into the ether.

Another perhaps subtle feature of the C-R system is that we have achieved the human based review system for incoming spam, except that we have turned it into a distributed problem too. The spam mail is now inspected by a human. The human who sent it presumably. If the human didn't send it or the email address is bogus, then no response will be forthcoming and the spam will be discarded. If the sender did send the email, they need to respond only this once and they're done forever. The filter will now recognise them as legitimate and in future pass their email automatically.

What can go wrong

So having seen how it works, it makes sense to look at failure modes for this type of filter. As we noted, there are no perfect filters and this one isn't perfect either. (Just better!)

The first problem is that whitelists now allow mail through. Of course they do. It's their function. But this means that the spammer can exploit this, either by sending email from an address they expect to have been whitelisted or lucking into an address that has been whitelisted. This can be a real problem when people we correspond with get viruses and these viruses send mail out in their name. The virus email now passes through the spam filter unmolested. The solution here is that virus scanning should remain a separate pass and that it not issue challenges, although it should still notify people that it's stopping email. (Mostly so that when it does it wrongly we can detect this... the whole issue of not doing things silently so we know they happened.) It must be noted that the rate of false positives for virus scanning is orders of magnitude lower than spam filtering sees, although the false negative rate is higher (depending on how up to date your scanner's signature database and scanning heuristics are).

The biggest problem with C-R systems is perhaps unsurprisingly the human factor. In short, we see some of the following behaviours in systems such as this:

  • some people don't understand the challenge email. Typically
    • damn furriners!
    • plain old dopey buggers
    • the non-computer literate

  • some people just don't like C-R systems. These people play the role of "conscientious objectors" and will purposely try to misbehave in the presence of C-R email systems. (Usually these are the people who are objecting to the cost when their email address is hi-jacked by spammers as a sender address. A million bounces later your C-R filter's message is seen as just another provocation by them. The problem of course is the spammer in this scenario, not really the C-R system, but it's understandable their nerves have gotten frayed...)

This is a real problem -- we have constructed a mail delivery system that not only has chosen to not deliver your email but has also decided to engage you in conversation about that fact... In the words of HAL 9000, "I'm sorry Dave, I can't do that..."

Another common problem is with mailing lists. As noted earlier, many mailing lists look exactly like bulk mail. Shock horror. They of course are in some sense. But your filter happily challenges the mailing list and in extremis, your challenge goes to all the list subscribers. (We see this fairly frequently with the good old out of office message of course.)

Some list admins have thin skins here too and in some cases a single challenge from your C-R mailer is enough to get you thrown off the list. Clearly mailing lists need to be whitelisted, but as noted earlier few offer you sufficient information to allow them to be whitelisted prior to receiving that first email from them (i.e. what address does your list send from?) Sometimes there are other things about the mailing list you could try to recognize as part of your whitelisting. The "Precedence: bulk" header for example, a Mailman header perhaps or the like. The problem here is that these are fairly generic things and will represent vectors for spam to get through your filter. (There is spam now faking Mailman headers for example...)

Another failure mode, one beloved of those opposing C-R systems, is that of spammers who respond to the challenge mail. This is of course a much rarer occurrence than it first seems, as almost no spam goes out under the spammer's own address and hence few spammers are well placed to know that they need to respond. The most common spammers who manage this are those ethically challenged legitimate businesses who decide bulk email is a tool of choice in this millennium. As we noted, sometimes spam is sent with hi-jacked fraudulent from addresses (although SPF and DomainKeys may make this easier to detect and reject) and the owners of these addresses may sometimes respond to the challenge out of spite? (Allowing the spam to be delivered and whitelisted.) Some will go so far as to sign you up for various block lists. (This is apparently how one claims the moral high ground in this debate???) The real issue here is the poor bugger whose email address is gazumped in this manner and is now getting thousands of bounce messages, nastygram replies and other things in his email box as a result. If a thousand mailboxes on your system are targeted, and he is subjected to a thousand challenges, he may well believe he is being mailbombed and try some form of retribution. Once again the real problem lies elsewhere, with the bots and viruses that cause the problem, but that is small comfort. A better solution is to rate limit the challenges your C-R system sends. If it sees itself sending thousands of challenges it should stop. In general, any address which hasn't responded to a dozen or so challenges likely never will. A small state table can make you a much better neighbour here. The bots have smartened their acts up too though and now tend to share their ghastly payloads around a number of faux senders. In a world of pain we've taken to counting this as an improvement, right?

A perhaps unexpected corollary, by the way, is that your mail queue will become very large due to all the undeliverable challenge emails which your MTA tries very hard to deliver. Some automation to prune the queue of challenges after shorter timeouts (say a day or two) can be useful.
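
The small state table mentioned above might look something like the following sketch. The class name, the default limits (a dozen unanswered challenges per address, a thousand challenges per window) and the idea of resetting the overall count once per window are assumptions chosen to match the figures in the text.

```python
class ChallengeLimiter:
    """Toy state table that decides whether another challenge may be sent."""

    def __init__(self, per_address_limit=12, global_limit=1000):
        self.per_address_limit = per_address_limit
        self.global_limit = global_limit
        self.unanswered = {}  # address -> challenges sent with no response yet
        self.sent_total = 0   # challenges sent in the current window

    def may_challenge(self, address):
        """Record and allow a challenge unless a limit has been reached."""
        address = address.lower()
        if self.sent_total >= self.global_limit:
            return False  # we are sending far too many challenges: stop
        if self.unanswered.get(address, 0) >= self.per_address_limit:
            return False  # this address has ignored enough challenges already
        self.unanswered[address] = self.unanswered.get(address, 0) + 1
        self.sent_total += 1
        return True

    def responded(self, address):
        """Forget the unanswered count once an address does respond."""
        self.unanswered.pop(address.lower(), None)

    def reset_window(self):
        """Call periodically (e.g. hourly) to start a new global window."""
        self.sent_total = 0
```

Mail that is refused a challenge would simply stay in quarantine; the limiter only decides whether to bother the purported sender again.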
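
As a sketch of that pruning step, the function below drops queued challenges older than a cut-off while leaving ordinary mail alone. The queue representation (a list of dicts with `is_challenge` and `queued_at` keys) is purely hypothetical and does not correspond to any real MTA's queue format.

```python
def prune_challenges(queue, now, max_age_seconds=2 * 24 * 60 * 60):
    """Return the queue without challenges older than max_age_seconds.

    Each queue item is assumed to be a dict with 'is_challenge' (bool)
    and 'queued_at' (seconds since the epoch). Ordinary mail is kept
    regardless of age so the MTA's normal retry policy still applies.
    """
    return [
        item
        for item in queue
        if not (item["is_challenge"] and now - item["queued_at"] > max_age_seconds)
    ]
```

Run from a periodic job with `now` set to the current time, this keeps the default of two days in line with the "day or two" suggested above.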

Why should you consider a challenge based MTA

That's pretty much the case for and against C-R mail systems. The big benefit that is to be gained is that you can improve the reliability of your email. Dare we say, claw back some of the reliability. No more email disappearing into a spam bucket never to be seen again and no more awkward phone calls as people try to work out why you haven't responded to something that really demanded some reaction. This can be especially important in environments with legal ramifications, which in itself is becoming more and more important as email supplants regular mail in our daily lives.

Another reason to consider this may be that it can lower the compute costs around single pieces of email. Bayes and SpamAssassin and the various other spam categorizing systems need to invest ever larger amounts of effort in their goal of achieving accuracy and C-R may be a way to find some balance and still gain a more reliable and dependable mail delivery system with lower resource costs.

Discussion on the net

There is a fair amount of discussion on the net about C-R systems and in fact it's all fairly polar discussion. As noted there are some who have decided that any mail system that challenges is bad and they often proceed to argue backwards from that ending point. This is not unlike the classic bad science we've all seen in the rest of the world. Starting from the conclusion and working back to a set of useful facts is seldom a great way to do things, and ultimately people who write evaluations having already decided the answer ahead of time, as ever, have little of use to add to the debate. Nevertheless, many people will continue to point to these resources as supportive of flaws in the concept. So perhaps we should survey the field:

Some of the analyses above try to argue against C-R based on flaws they feel the idea has. Most in fact are really implementation flaws. They deserve a brief review though, if only to show why these flaws aren't real. (Which is also why we didn't look at them earlier in our discussion of the shortcomings of C-R.)

"dueling C-R systems"
What happens when two users, protected by C-R systems, try to email each other is the question. Surely, the first user's email gets challenged, and then the challenge gets challenged and then the challenge to the challenge is itself challenged... descend into mail meltdown.

Hard to believe but this is one of the most trivial things to program round in C-R systems. The solution goes like this. The C-R system examines outbound mail and caches/whitelists the Message-ID attached by the transport. C-R systems copy the message ID to the In-Reply-To field of the challenge, much as it is copied for any other reply to a message. The challenge arrives at the sender's C-R system where the ID is recognised in the header and the challenge is passed through to the original sender. This also has the added benefit that anyone who replies to an email you send them is never challenged, also a desirable outcome. The C-R system will likely also whitelist the recipient's email address from outbound email for the same reason.
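
The Message-ID bookkeeping can be sketched with Python's standard email package. The `OutboundCache` class and `make_challenge` helper are hypothetical names for this example, the cache is an unbounded in-memory set (a real system would expire entries), and the challenge text is a placeholder.

```python
from email.message import EmailMessage
from email.utils import getaddresses, parseaddr


class OutboundCache:
    """Remember what we sent so that replies and challenges are not challenged."""

    def __init__(self):
        self.sent_ids = set()
        self.known_recipients = set()

    def note_outbound(self, msg):
        """Record the Message-ID and recipients of a message leaving our system."""
        self.sent_ids.add(str(msg["Message-ID"]).strip())
        for _name, addr in getaddresses([str(msg.get("To", ""))]):
            if addr:
                self.known_recipients.add(addr.lower())

    def is_exempt(self, incoming):
        """True if incoming mail answers something we sent, or comes from a recipient."""
        in_reply_to = str(incoming.get("In-Reply-To", "")).strip()
        sender = parseaddr(str(incoming.get("From", "")))[1].lower()
        return in_reply_to in self.sent_ids or sender in self.known_recipients


def make_challenge(original, challenger_address):
    """Build a challenge that carries the original Message-ID in In-Reply-To."""
    challenge = EmailMessage()
    challenge["From"] = challenger_address
    challenge["To"] = str(original["From"])
    challenge["In-Reply-To"] = str(original["Message-ID"]).strip()
    challenge.set_content("Please reply to confirm that you sent this message.")
    return challenge
```

When the remote C-R system builds its challenge with `make_challenge`, the local `is_exempt` check recognises the In-Reply-To value and lets the challenge straight through, so the loop never starts.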

"mailing lists"
Mailing lists should never be challenged. It's a simple fact. As we noted, most mailing lists can be detected fairly easily with some quite simple heuristics like headers. These heuristics as noted may lead to new loopholes spammers can exploit, but the modern scoring systems can still see the spam under the header for what it is. The solution is to err on the side of caution and suppress the challenge when a mailing list may be the source of a mail that has been categorised as spammy (and note that spam does sometimes come from lists. You should just not add to the problem by challenging every person on the list...). Or the C-R system may choose to challenge the originator of the mail rather than the list address.
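
A header heuristic of this kind might be sketched as follows. The particular headers checked (List-Id, List-Unsubscribe, X-Mailman-Version and a Precedence of bulk or list) are one plausible selection for illustration, not an exhaustive or authoritative test, and as noted any of them can be faked.

```python
from email.message import EmailMessage

# Headers commonly added by list software; this selection is an assumption.
LIST_HEADERS = ("List-Id", "List-Unsubscribe", "X-Mailman-Version")


def looks_like_list_mail(msg: EmailMessage) -> bool:
    """Guess from the headers whether a message came via a mailing list."""
    if any(msg.get(header) for header in LIST_HEADERS):
        return True
    precedence = str(msg.get("Precedence", "")).strip().lower()
    return precedence in ("bulk", "list")


def should_challenge(msg: EmailMessage, looks_spammy: bool) -> bool:
    """Suppress the challenge for list mail; it only decides about challenging."""
    return looks_spammy and not looks_like_list_mail(msg)
```

Note that suppressing the challenge is not the same as delivering the message: spammy list mail can still be quarantined or scored as usual, it just doesn't trigger mail back to the list.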

"Spammers can just respond"
Also mentioned earlier is the suggestion that spammers can just respond as any other sender who is challenged might respond. This of course ignores the fact that perhaps only a fraction of a percent of spam carries the real originator's address on it anywhere, let alone in the From: or Reply-To: fields. More often completely bogus or hi-jacked addresses are used and these by definition should not lead to normal responses coming back to the system. The people suggesting these things though are adamant that spammers will, in the face of C-R MTAs, use real return addresses and auto-responders to ensure delivery of their vile cargo. Some C-R systems (mostly the commercial services) have resorted to the mangled string in a GIF that is hard to OCR as a protective mechanism against this as yet non-existent threat.

Challenge Response Systems

So, you're convinced? Where can you get a C-R system? Perhaps disappointingly, I haven't progressed to offering mine as a package. (Although if there is sufficient demand I'm happy to do so...) All is not lost though as there are C-R systems out there, both as commercial offerings and as open source solutions.

  • Mailblocks, a commercial mail proxying service with a C-R frontend.
    http://www.mailblocks.com/ - bought by AOL and it seems no longer accepting new clients

  • ChoiceMail, a Windows based personal solution

  • TMDA, "Tagged Message Delivery Agent" - open source


Email systems are becoming less reliable as users tire of sorting the wheat from the chaff and as administrators tire of seeing complaints. Filtering solutions are more aggressively scanning email than ever and rates of false categorisation are now becoming a problem.

Challenge/response email filtering prevents mail from being silently dropped or held (i.e. it is still handled this way but the sender is advised). The sender is made responsible for ensuring delivery (and until delivery, the sender is the only party that can know mail is in the pipeline in any event...). Given that for most people, C-R is a one off experience, the costs are modest compared to the benefits and C-R should be considered as a useful tool in making your email infrastructure more robust.

This presentation/paper

This paper is available online at both the AUUG web site and on my personal web site at: