sf-0.6 -- spam filter for UNIX-like systems

These programs are free software; you can redistribute them and/or modify them under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

These programs are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with these programs; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Mail suggestions and bug reports for these programs to
"Masahiko Ito" <m-ito@myh.no-ip.org>

History

2006/10/10 Ver. 0.1 released (1st)
2006/10/13 Ver. 0.2 released
- sqlite3 is now called less often from sf_check.sh.
- Changed long long int to integer in create table.
2006/10/19 Ver. 0.3 released
- Added information about using nkf to README.
- Added information about fetchmail to README.
- Fixed a bug where sf_check.sh frequently returned score 0 when a message contained a wrong character code.
2006/10/24 Ver. 0.4 released
- Optimized the select in sf_check.sh.
- Fixed a bug in sf_add.sh, sf_del.sh and sf_check.sh (the wrong term was operated on).
2007/02/21 Ver. 0.5 released
- Fixed the incorrect explanation of sf_del.sh in README.
- Changed sf_add.sh, sf_del.sh and sf_check.sh to follow a change in the output of `uniq -c' (the separator apparently changed from TAB to SPACE).
2012/12/06 Ver. 0.6 released
- Added information about using sf with Gnus, by Mr. Kosaka.
- Changed character encoding from euc-jp to utf-8.

What's this?

This is a set of scripts for filtering spam on the client side.

Nowadays, the Bayesian method is often used for filtering spam. But it's a little hard to understand for stupid me :P. I thought there must be a simpler yet sufficiently effective way to filter spam, and I have worked to make it true :)

Download

sf-0.5.tar.gz (2007/02/21 LATEST)
sf-0.4.tar.gz (2006/10/24)
sf-0.3.tar.gz (2006/10/19)
sf-0.2.tar.gz (2006/10/13)
sf-0.1.tar.gz (2006/10/10)

Contribution

How to use sf with Gnus (in Japanese)
Mr. Kosaka developed some scripts for using sf with Gnus. You can get more information from 00readme.txt in sf+.tar.gz (but it is in Japanese :<).
Thanks a lot, Mr. Kosaka. (2007/02/27)

Preinstall

You must install the following before installing sf-0.5.

Install

tar xvzf sf-0.5.tar.gz
cd sf-0.5
cp sf_*.sh /anywhere/bin/
mkdir ~/.sf

Algorithm

The non-spam table (t_white) consists of terms and their counts in normal messages, and the spam table (t_black) consists of terms and their counts in spam messages.

create table t_white (
  term    text primary key,
  count   integer
);
create table t_black (
  term    text primary key,
  count   integer
);

An incoming message is divided into terms (term1 to termN), and the occurrences of each term are counted (count1 to countN).
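
The term-counting step can be sketched as follows. This is only an illustration, not the code of the sf scripts themselves: count_terms is a hypothetical helper, and the tokenization rule (split on anything that is not alphanumeric, then lowercase) is an assumption that would not suit Japanese text. The final awk stage normalizes the padding of `uniq -c' so that it does not matter whether the separator is TAB or SPACE.

```shell
#!/bin/sh
# count_terms (hypothetical helper): read a message on stdin and
# print one "count term" pair per line.
# Assumption: terms are runs of alphanumeric characters, case-insensitive.
count_terms() {
    tr -cs '[:alnum:]' '\n' |        # each run of non-alphanumerics becomes one newline
    tr '[:upper:]' '[:lower:]' |     # fold case (an assumption, not taken from sf)
    grep -v '^$' |                   # drop an empty first line, if any
    LC_ALL=C sort |                  # uniq needs sorted input
    uniq -c |                        # prefix each distinct term with its count
    awk '{ print $1, $2 }'           # normalize the separator between count and term
}

printf 'Buy cheap pills. Cheap pills now!\n' | count_terms
```

For the sample line this prints "1 buy", "2 cheap", "1 now" and "2 pills", one pair per line.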

Look up term1 in t_white and calculate white_score.

              (count in t_white) x (count1)
white_score = -----------------------------
              (sum of all count in t_white)

Look up term1 in t_black and calculate black_score.

              (count in t_black) x (count1)
black_score = -----------------------------
              (sum of all count in t_black)

Calculate score from white_score and black_score.

               white_score 
score = ------------------------- - 0.5
        white_score + black_score

The score ranges from -0.5 to +0.5. Negative means spam; positive means non-spam.

Calculate the scores of all terms (term1 to termN) in the same way.

Finally, if the sum of all scores is negative, the message is judged to be spam.
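
As a worked example of the formulas above, here is a minimal sketch in sh and awk. It is not the code of sf_check.sh: score_message is a hypothetical helper, the table counts are passed in as plain text instead of being read from the database, and it assumes that a term found in neither table contributes 0 to the sum.

```shell
#!/bin/sh
# score_message WHITE_TOTAL BLACK_TOTAL   (hypothetical helper)
# stdin : one line per term, "term count_in_message count_in_t_white count_in_t_black"
# stdout: the sum of the per-term scores; negative means spam.
score_message() {
    awk -v wt="$1" -v bt="$2" '
    {
        ws = (wt > 0) ? $3 * $2 / wt : 0    # white_score for this term
        bs = (bt > 0) ? $4 * $2 / bt : 0    # black_score for this term
        # Assumption: a term unknown to both tables is skipped (score 0).
        if (ws + bs > 0)
            total += ws / (ws + bs) - 0.5
    }
    END { printf "%.4f\n", total }'
}

# Example with made-up counts: sum of t_white = 100, sum of t_black = 200.
printf 'pills 2 0 40\nmeeting 1 10 0\ncheap 2 5 30\n' | score_message 100 200
```

With these made-up counts, "pills" scores -0.5, "meeting" scores +0.5 and "cheap" scores 0.1 / (0.1 + 0.3) - 0.5 = -0.25, so the sum is -0.2500 and the message would be judged to be spam.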

How to use

sf_init.sh

$ sf_init.sh -h
Usage : sf_init.sh
Initialize database.

It initializes the database for spam data. It must be invoked only once, before you start using sf-0.5.

sf_add.sh

$ sf_add.sh -h
Usage : sf_add.sh [-w|--white|-b|--black] [-v|--vacuum] [file ...]
Add data to database.
  -w, --white  add data to white database.
  -b, --black  add data to black database.
  -v, --vacuum vacuum after add.

It trains sf-0.5 by adding message data to the database.

sf_del.sh

$ sf_del.sh -h
Usage : sf_del.sh [-w|--white|-b|--black] [-v|--vacuum] [file ...]
Del data from database.
  -w, --white  del data from white database.
  -b, --black  del data from black database.
  -v, --vacuum vacuum after del.

It trains sf-0.5 by deleting message data from the database.

sf_check.sh

$ sf_check.sh -h
Usage : sf_check.sh [-w|--white|-b|--black] [file ...]
Check file.
  -w, --white  check white?
  -b, --black  check black?
return 0 when check is true.
return 1 when check is false.

It judges whether a file (or stdin) is spam or non-spam. It shows the score (negative means spam) and returns an exit status (0 means TRUE, 1 means FALSE). If the database has not been trained at all, it always shows 0.0.
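
The exit status can be used directly in a shell script. The sketch below is an illustration only: classify_file is a hypothetical helper, and it assumes that the score is printed to stdout (discarded here) and that only the exit status of the `-b' check matters. The checker command is passed as an argument so that the helper can be tried without a trained database.

```shell
#!/bin/sh
# classify_file CHECKER FILE   (hypothetical helper)
# Prints "spam" if "CHECKER -b FILE" exits with status 0, else "non-spam".
classify_file() {
    if "$1" -b "$2" >/dev/null; then
        echo spam
    else
        echo non-spam
    fi
}

# Usage (assumes sf_check.sh is on PATH and the database is trained):
#   classify_file sf_check.sh /home/foo/Mail/inbox/1
```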

With fetchmail and procmail

sf-0.5 works well together with fetchmail and procmail.

.fetchmailrc

poll pop.anywhere.org
proto pop3
user POP_ACCOUNT_NAME
password POP_PASSWORD
is LOCAL_USERNAME
no keep
flush
no fetchall
mda "/usr/bin/procmail -f %F"

If the `mda "/usr/bin/procmail -f %F"' line is not specified, fetchmail passes messages from the POP server to the local sendmail. sendmail can accept messages in parallel, so if many messages are waiting on the POP server, the local load average rises rapidly.

If the `mda "/usr/bin/procmail -f %F"' line is specified, fetchmail calls procmail to deliver the messages, so the local load average is less likely to rise rapidly.

.procmailrc

:0 HB
* ? sf_check.sh -b
/home/foo/Mail/spam/.

BUGS

If a message contains several character encodings, the sf_*.sh scripts may get confused.

m-ito@myh.no-ip.org
