sf-0.6 -- spam filter for UNIX-like systems

Copyright (C) 2006 Masahiko Ito

These programs are free software; you can redistribute them and/or modify them under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

These programs are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with these programs; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

Mail suggestions and bug reports for these programs to
"Masahiko Ito" <m-ito@myh.no-ip.org>


What's this?

This is a set of scripts for filtering spam on the client side.

Nowadays, Bayesian methods are often used for filtering spam, but I find them a little hard to understand :P. I thought there must be a simpler way to filter spam that is still effective enough, and I have made it real :)




You must install the following before installing sf-0.6.



The non-spam table (t_white) consists of terms and their counts in normal messages, and the spam table (t_black) consists of terms and their counts in spam messages.

create table t_white (
  term    text primary key,
  count   long long int
);
create table t_black (
  term    text primary key,
  count   long long int
);

An incoming message is split into terms (term1 to termN), and the occurrences of each term are counted (count1 to countN).
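
The splitting step can be sketched as follows. This is only an illustration, not the code sf uses: the README does not say how a message is cut into terms, so the rule here (lowercase runs of letters, digits, and underscores) and the function name count_terms are assumptions.

```python
import re
from collections import Counter


def count_terms(message: str) -> Counter:
    """Split a message into terms and count how often each one occurs."""
    # Assumption: a "term" is a lowercase run of letters, digits, or
    # underscores. The real sf scripts may split messages differently.
    terms = re.findall(r"\w+", message.lower())
    return Counter(terms)


# Hypothetical message text, used only to show the shape of the result.
example_counts = count_terms("Buy now! Buy cheap pills now.")
print(dict(example_counts))
```

The result maps each term (term1 to termN) to its count (count1 to countN), which is the input the scoring step needs.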

Look up term1 in t_white and calculate white_score.

              (count in t_white) x (count1)
white_score = ------------------------------
              (sum of all counts in t_white)

Look up term1 in t_black and calculate black_score.

              (count in t_black) x (count1)
black_score = ------------------------------
              (sum of all counts in t_black)

Calculate the score from white_score and black_score.

               white_score
score = ------------------------- - 0.5
        white_score + black_score

The score ranges from -0.5 to +0.5. A negative score means spam; a positive score means non-spam.

Calculate the score in the same way for every term (term1 to termN).

Finally, if the sum of all scores is negative, the message is judged to be spam.
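
The calculation above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual sf implementation: the two tables are modelled as plain dictionaries, the names term_score and message_score are hypothetical, and a term that appears in neither table is assumed to score 0.0 (the README does not describe that case).

```python
def term_score(count, white, black, white_total, black_total):
    """Score one term: +0.5 is clearly non-spam, -0.5 is clearly spam."""
    white_score = white * count / white_total if white_total else 0.0
    black_score = black * count / black_total if black_total else 0.0
    if white_score + black_score == 0:
        # Assumption: a term unknown to both tables is neutral.
        return 0.0
    return white_score / (white_score + black_score) - 0.5


def message_score(term_counts, t_white, t_black):
    """Sum the scores of all terms; a negative sum means spam."""
    white_total = sum(t_white.values())
    black_total = sum(t_black.values())
    return sum(
        term_score(n, t_white.get(term, 0), t_black.get(term, 0),
                   white_total, black_total)
        for term, n in term_counts.items()
    )


# Hypothetical training data, for illustration only.
t_white = {"meeting": 8, "pills": 2}
t_black = {"pills": 9, "meeting": 1}

spam_like = message_score({"pills": 2, "cheap": 1}, t_white, t_black)
ham_like = message_score({"meeting": 1}, t_white, t_black)
print("spam" if spam_like < 0 else "non-spam", spam_like)
print("spam" if ham_like < 0 else "non-spam", ham_like)
```

With empty tables every term is neutral, so the sum is 0.0.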

How to use


$ sf_init.sh -h
Usage : sf_init.sh
Initialize database.
It initializes the database for spam data. It must be invoked only once, before you first use sf-0.6.


$ sf_add.sh -h
Usage : sf_add.sh [-w|--white|-b|--black] [-v|--vacuum] [file ...]
Add data to database.
  -w, --white  add data to white database.
  -b, --black  add data to black database.
  -v, --vacuum vacuum after add.
It makes sf-0.6 learn from messages by adding their data to the database.


$ sf_del.sh -h
Usage : sf_del.sh [-w|--white|-b|--black] [-v|--vacuum] [file ...]
Del data from database.
  -w, --white  del data from white database.
  -b, --black  del data from black database.
  -v, --vacuum vacuum after del.
It makes sf-0.6 unlearn messages by deleting their data from the database.
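
What adding and deleting might do to the tables can be sketched as below. Everything here is an assumption made for illustration: the README does not name the database engine (SQLite is used because Python ships with it), the function learn is hypothetical, and treating "add" as incrementing term counts and "del" as decrementing them is a guess at the behaviour, not a description of the real scripts.

```python
import sqlite3


def learn(db, table, term_counts, sign=+1):
    """Add (sign=+1) or remove (sign=-1) term counts in t_white or t_black."""
    if table not in ("t_white", "t_black"):
        raise ValueError("unknown table: " + table)
    for term, n in term_counts.items():
        # Make sure the term has a row, then adjust its count.
        db.execute(f"insert or ignore into {table} (term, count) values (?, 0)",
                   (term,))
        db.execute(f"update {table} set count = count + ? where term = ?",
                   (sign * n, term))
    # Assumption: terms whose count drops to zero or below are removed.
    db.execute(f"delete from {table} where count <= 0")
    db.commit()


# In-memory database with the two tables described in this README.
db = sqlite3.connect(":memory:")
for name in ("t_white", "t_black"):
    db.execute(f"create table {name} (term text primary key, count long long int)")

learn(db, "t_black", {"pills": 3, "cheap": 1})   # like sf_add.sh -b
learn(db, "t_black", {"cheap": 1}, sign=-1)      # like sf_del.sh -b
rows = db.execute("select term, count from t_black order by term").fetchall()
print(rows)
```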


$ sf_check.sh -h
Usage : sf_check.sh [-w|--white|-b|--black] [file ...]
Check file.
  -w, --white  check white?
  -b, --black  check black?
return 0 when check is true.
return 1 when check is false.
It judges whether a file (or stdin) is spam or non-spam. It shows the score (negative means spam) and returns an exit status (0 means TRUE, 1 means FALSE). If the database has not been trained at all, it always shows 0.0.

With fetchmail and procmail

sf-0.6 works well together with fetchmail and procmail.


poll pop.anywhere.org
proto pop3
no keep
no fetchall
mda "/usr/bin/procmail -f %F"
If the `mda "/usr/bin/procmail -f %F"' line is not specified, fetchmail passes messages from the POP server to the local sendmail. sendmail can accept messages in parallel, so if many messages are waiting on the POP server, the local load average rises rapidly.

If the `mda "/usr/bin/procmail -f %F"' line is specified, fetchmail calls procmail to deliver the messages, so the local load average is less likely to rise rapidly.


:0 HB
* ? sf_check.sh -b