From: Clifton Royston (no email)
Date: Wed Sep 04 2002 - 14:24:24 EDT
On Tue, Sep 03, 2002 at 04:52:06PM -0400, Greg A. Woods wrote:
> > This also presumes (like every other anti-spam silver bullet) that
> > spammers are completely incapable of adapting, which has been
> > repeatedly shown to be wrong.
>
> Do you have a scientific criticism of Graham's assertion about how
> robust a Bayesian filter should be even when the "attacker" knows
> exactly what algorithm is being used?

Sure. And in fact, they are *already* counter-adapting to both the
regular expression body matches and the Bayesian approach. I have seen
3 different approaches already!

Approach 1, outline of proof: the Bayesian algorithm deals with
tokens. It is trivial for an "attacker" who knows the algorithm for
the tokenization to randomly break up the text in each spam being sent
out such that the same tokens do not reappear and so do not get
"learned" and cannot be matched.

I just got forwarded a copy of one such spam in which HTML comments
embedded between words and in the middle of words are being used to
break them up. This would stop both most regex matches and break the
Bayesian algorithm as applied to the body, but the bulk of email
readers using HTML-enabled mail clients will not even see those breaks.

Once you get to the headers... well, the only thing you can trust is
what's filled in by your own site, so everything else could pretty well
be randomized too.

Once this becomes common, then in your case and probably Paul
Graham's case, you could tweak your tokenization rules and look for the
HTML comment tag to show up in your stats as an indication of spam.
However, that may not work for those who do receive HTML mail, and it
also would n't stop the s enders from j ust ra ndoml y bre aking up
wo rds l ike t his which impairs readability remarkably little but
again defeats the algorithm before it gets started.
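
To make Approach 1 concrete, here is a minimal sketch in Python. The tokenizer, the sample phrases, and the helper names (tokenize, strip_html_comments, overlap) are all assumptions made up for illustration; they are not Paul Graham's code or any real filter's. It shows how few previously learned tokens survive when words are split by HTML comments or by stray spaces, and that stripping comments before tokenizing only undoes the first trick.

```python
import re

# Assumed, simplified tokenizer: runs of letters, digits, "$" and "'".
TOKEN_RE = re.compile(r"[A-Za-z0-9$']+")
# Matches one HTML comment, including comments that span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def tokenize(text):
    """Return lowercased word-like tokens from the text."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def strip_html_comments(text):
    """Delete HTML comments so words split by them rejoin, as an
    HTML-enabled mail client would display them."""
    return COMMENT_RE.sub("", text)


# Hypothetical sample messages.
plain = "Refinance your mortgage today"
obfuscated = "Refi<!-- x9 -->nance your mort<!-- q2 -->gage today"
spaced = "Refi nance your mort gage today"

# Tokens the filter is assumed to have learned from earlier spam.
learned = set(tokenize(plain))


def overlap(text):
    """Learned tokens that still appear in the given text."""
    return sorted(learned & set(tokenize(text)))


print(overlap(obfuscated))                       # comment-split words
print(overlap(strip_html_comments(obfuscated)))  # after removing comments
print(overlap(spaced))                           # space-split words
```

In this sketch the comment-split message keeps only the unbroken words until the comments are stripped out, while the space-split message stays broken either way, which is the point of the last paragraph above.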
Approach 2, outline of proof: take a statistically significant sample
of known non-spam, build a Bayesian frequency model from it, and
construct your sample spam according to that frequency model. I've got
another recent spam in my collection that is written as a very
plausible chatty personal email in reply to someone else's email (which
it wasn't.) Talks about having moved recently, wanting to get together
sometime, blah blah blah, and then mentions the sender's home page
(URL.) The latter turns out to be a porn site entry page. In this
email there were no typical spam phrases or keywords. The Bayesian
algorithm is going to have a very hard time with something like that; I
do not believe it can readily be mechanically distinguished from a
genuine personal email which happens to mention a URL. A really
diligent attacker would take a broad sample of real personal emails,
sampled from an ISP, use them to get word frequency counts, and then
use those as a starting point for randomly permuting a text base with
synonyms, so as to avoid building high statistical counts for any few
words. There goes the statistical recognition model - any Bayesian
metric which excludes this is provably likely to also randomly exclude
real email which does not fit the profile of what the user has received
lately.
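
A rough sketch of why Approach 2 is hard on the filter, loosely modeled on the combining rule described in Graham's essay (multiply the per-token spam probabilities of the most "interesting" tokens, and treat never-seen tokens as 0.4). The probability table and the two sample messages are invented for illustration, and the function name spam_probability is my own; treat this as an assumption-laden toy, not the real implementation.

```python
from math import prod

UNKNOWN = 0.4  # probability assumed for tokens the filter has never seen


def spam_probability(tokens, token_probs, n_interesting=15):
    """Combine per-token spam probabilities, naive-Bayes style.

    Only the n_interesting tokens whose probability is farthest
    from the neutral value 0.5 are used.
    """
    probs = [token_probs.get(t, UNKNOWN) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:n_interesting]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)


# Hypothetical per-token probabilities learned by one user's filter.
token_probs = {
    "viagra": 0.99, "click": 0.95, "free": 0.90,
    "moved": 0.10, "weekend": 0.08, "lunch": 0.05, "homepage": 0.30,
}

classic_spam = ["free", "viagra", "click", "here"]
chatty_spam = ["moved", "weekend", "lunch", "homepage"]

print(spam_probability(classic_spam, token_probs))
print(spam_probability(chatty_spam, token_probs))
```

With these made-up numbers the classic spam scores close to 1 and the chatty message scores close to 0, even though both lead to the same kind of site. Nothing in the body tokens separates the second one from a real personal note.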
Approach 3 (anecdotal): One of my staff has (but unfortunately didn't
save) another spam in which there are HTML comment blocks embedding
various chunks of technical discussion. This would have some chance of
ghosting through Paul Graham's filters, or other techies' filters, on
the strengths of the positive matches from the irrelevant embedded
material. This approach seems more sketchy but it might be made to
work with better analysis of words which are likely to have high
positive scores in real technical email, and hence in the positive
metrics of users of the Bayesian filtering method.
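
For Approach 3, one possible counter-check (my own assumption, not something any existing filter is known to do) is to separate the words a reader would actually see from the words hidden inside HTML comments, and to measure how much of the message is hidden. The regular expressions, the helper names, and the sample message below are all hypothetical.

```python
import re

# Captures the text inside each HTML comment.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Crude tag matcher; good enough for a sketch, not a real HTML parser.
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[a-z0-9']+")


def words(text):
    """Lowercased word-like tokens."""
    return WORD_RE.findall(text.lower())


def split_visible_hidden(html):
    """Return (visible_words, hidden_words) for an HTML body."""
    hidden = words(" ".join(COMMENT_RE.findall(html)))
    without_comments = COMMENT_RE.sub("", html)
    visible = words(TAG_RE.sub(" ", without_comments))
    return visible, hidden


def hidden_ratio(html):
    """Fraction of all words that sit inside HTML comments."""
    visible, hidden = split_visible_hidden(html)
    total = len(visible) + len(hidden)
    return len(hidden) / total if total else 0.0


# Hypothetical message: short visible pitch, technical padding in a comment.
sample = (
    "<html><body><p>Visit my new home page</p>"
    "<!-- postfix smtpd queue relay transport header checks -->"
    "</body></html>"
)

visible, hidden = split_visible_hidden(sample)
print(visible)
print(hidden)
print(hidden_ratio(sample))
```

A filter that tokenizes the raw source would credit the message with all of the hidden technical words. Scoring only the visible words, or treating a high hidden ratio as a signal in its own right, are two ways this might be handled, though senders can be expected to adapt to either.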
Before Canter and Siegel, the Internet technical community kept
saying that the kind of person who would mechanically spam Usenet (or
email) was too dumb to figure out the technical details of how to do
it. I'm amazed that people still keep taking that viewpoint, this many
years further on.

Again, to repeat my earlier statement: I am *not* saying the Bayesian
technique is useless. I think it will do very well at classifying some
broad ranges of spam, and will be a very useful addition to the
arsenal. At the least, if it becomes widely used, it will raise the
ante on the necessary sophistication for spammer software. However, I
think that on the strength of one essay it is being hailed as a
panacea, and "panaceas are poison."

-- Clifton
--
Clifton Royston -- LavaNet Systems Architect --
"What do we need to make our world come alive?
What does it take to make us sing?
While we're waiting for the next one to arrive..." - Sisters of Mercy