From: Clifton Royston (no email)
Date: Tue Sep 03 2002 - 15:54:05 EDT

On Sun, Sep 01, 2002 at 08:32:59PM -0400, Steve Manes wrote:
> At 01:47 AM 9/2/2002 +0200, Bert Driehuis wrote:
> >Which is fine if you can afford the collateral damage, but if you then
> >get a Chinese user on your system you'd be in a spot of trouble until
> >your system relearned with his input. I think this will only work on a
> >per-user basis.
>
> You're right about generalized rules spread out over a broad population,
> but so long as each user can select his own scoring threshold I don't think
> it's that big a deal. There's no reason why a user couldn't select, say,
> 98% confidence for his general filter and then pipe the results through his
> own, more restrictive filter.
>
> It might be even more reliable on an ISP insofar as probably multiple users
> on the ISP would be targeted with the same spam, probably repeatedly
> because spammers tend to use common address lists.
>
> >And in particular, filtering on single words is outright asking for
> >collateral damage. The word that was mentioned in this thread, c.u.m.,
> >also happens to be part of the construct "c.u.m. laude".
>
> As I know well having built a job board for a community web site hosting
> company that wanted to use their standard profanity filter on inbound
> resumes. Fortunately they abandoned the idea. But that's a good example
> of where an adaptive filter would work much more reliably. Rather than
> toss every message containing that string, it would be weighed against
> other words in the message which would, at least in the case of a resume,
> drive down the scoring to probably an acceptable number.

In the real world, doing this sort of computation (and dynamically
reweighting the filters for each user with each incoming message) has
serious costs. If just running a predefined set of regexes on incoming
mail can be too expensive, how expensive is recalculating your
statistical base for each incoming mail going to be?
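
To make the per-message work concrete, here is a minimal sketch of the
bookkeeping a per-user adaptive filter would need. Everything in it is
an assumption for illustration: the UserStats class, whitespace
tokenizing, and the simple ratio used for a token's spam probability
are hypothetical and are not taken from any particular filter. It shows
only the in-memory update; it says nothing about disk I/O or locking on
a busy mail server, which is where the cost would likely show up.

```python
from collections import Counter


class UserStats:
    """Hypothetical per-user token statistics (illustration only)."""

    def __init__(self):
        self.spam_counts = Counter()  # messages containing token, spam
        self.ham_counts = Counter()   # messages containing token, ham
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        """Fold one message into the tables; return entries touched."""
        tokens = set(text.lower().split())  # naive whitespace tokenizer
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1
        return len(tokens)

    def spam_probability(self, token):
        """Share of the token's message frequency that comes from spam."""
        s = self.spam_counts[token] / max(self.n_spam, 1)
        h = self.ham_counts[token] / max(self.n_ham, 1)
        if s + h == 0:
            return 0.5  # never seen: treat as neutral
        return s / (s + h)


stats = UserStats()
touched_spam = stats.learn("free money now", is_spam=True)
touched_ham = stats.learn("lunch money", is_spam=False)
```

Each message touches one table entry per distinct token, and every user
needs a separate set of tables that must be loaded and saved around
each delivery.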

This also presumes (like every other anti-spam silver bullet) that
spammers are completely incapable of adapting, which has been
repeatedly shown to be wrong. The checksumming method was supposed to
be a silver bullet, until spammers started adding random strings or
random dictionary words to the headers and bodies to break checksums.
If the Bayesian stats approach becomes widespread, we can expect
spammers both to rewrite their spam to look more like real personal
email (I've already seen examples of this), so the statistics look more
"real", and simply to sprinkle in words drawn from a Zipf distribution
for normal text, to trigger the positive rules that users' filters may
have evolved.
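
The word-sprinkling attack can be illustrated with a small sketch. It
assumes a naive Bayes style combination of per-token spam probabilities
over every token in the message; the function name and all of the
probabilities are made up for illustration, not measured from any
corpus. A real filter may limit or weight the tokens it considers,
which this sketch does not model.

```python
def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities, naive Bayes style.

    Assumes every probability is strictly between 0 and 1.
    """
    spam = 1.0
    ham = 1.0
    for p in token_probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# Hypothetical per-token probabilities (illustration only).
SPAMMY = [0.99, 0.95, 0.90]  # tokens the filter has learned as spam
PADDING = [0.05] * 5         # ordinary words the filter sees in real mail

original_score = combined_spam_probability(SPAMMY)
padded_score = combined_spam_probability(SPAMMY + PADDING)
```

With these made-up numbers, five innocuous words are enough to drag a
message that scored above 0.99 down to below 0.05.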

Both Graham and the SpamAssassin developers also commit the "data
mining" statistical fallacy: they develop a set of classification
measures which perfectly fits a predefined historical data set, and
then proclaim its accuracy based on its performance on the very set it
was trained on. SpamAssassin used to claim something like a 0.01% false
positive rate, but its distributed rules are nowhere *near* that
accurate on the tests I've run. To the developers' credit, they seem to
have dropped those claims on the more recent version of the website.
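
A toy sketch of why measuring on the training set flatters a filter.
The rule set here is deliberately crude (flag any message containing a
word seen only in training spam), and the messages and function names
are invented for illustration; this is not how Graham's filter or
SpamAssassin derives its rules.

```python
def spam_only_tokens(ham, spam):
    """Words that appear in training spam but in no training ham."""
    ham_tokens = {t for msg in ham for t in msg.lower().split()}
    spam_tokens = {t for msg in spam for t in msg.lower().split()}
    return spam_tokens - ham_tokens


def false_positive_rate(rules, ham):
    """Fraction of legitimate messages flagged by the rule set."""
    flagged = sum(1 for msg in ham if rules & set(msg.lower().split()))
    return flagged / len(ham)


# Invented sample messages (illustration only).
train_ham = ["meeting at noon", "lunch tomorrow"]
train_spam = ["free money now", "cheap offer free"]
heldout_ham = ["free lunch offer for staff", "meeting tomorrow"]

rules = spam_only_tokens(train_ham, train_spam)
train_fp = false_positive_rate(rules, train_ham)
heldout_fp = false_positive_rate(rules, heldout_ham)
```

By construction the rules cannot flag any training ham, so the
false positive rate on the training set is zero. On the two held-out
messages, which the rules never saw, one is flagged.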

I don't think there is going to be any magic bullet; blocking spam
accurately will continue to take a whole series of layered defenses,
backed up with continuing action on the legislative and judicial fronts
nationally and internationally. Bayesian statistics will be a useful
addition to the arsenal, but will not replace the whole arsenal.
-- Clifton
--
Clifton Royston -- LavaNet Systems Architect --
"What do we need to make our world come alive?
What does it take to make us sing?
While we're waiting for the next one to arrive..." - Sisters of Mercy