PURPOSE
dbacl is a command line program which can be used to categorize
several types of text documents. Each document category is
constructed as a maximum entropy language model, with respect to
a reference measure based on digrams (character pairs).
Before recognition can take place, a number of text corpora must
be "learned". For example, an English category could be based on
a text file containing the collected works of Shakespeare. The
Gutenberg project (http://promo.net/pg/) makes freely available
many public domain works in electronic form.
After learning, any number of text files can be compared, in terms
of Bayesian posterior probabilities, with up to 128 learned categories.
This maximum is a build-time setting (see BUILDING); once it is raised,
the number of categories is limited only by available memory.
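
The learn-then-classify workflow described above might look as follows.
This is a minimal sketch under stated assumptions: the category names
twain and adams are made up, run is a hypothetical helper that only prints
the commands unless DRY_RUN=0 is set, and the -l (learn), -c (compare with
a category) and -v options are given as recalled from the manpage, so
check them with "man dbacl" before relying on them.

```shell
# Hypothetical helper: with DRY_RUN=1 (the default in this sketch) each
# command is printed instead of being executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Learn one category per author from the bundled samples
# (category names are illustrative only).
run dbacl -l twain sample1.txt
run dbacl -l adams sample2.txt

# Compare another extract with both learned categories.
# Assumption: -v makes dbacl print the name of the best matching category.
run dbacl -c twain -c adams sample3.txt -v
```

Setting DRY_RUN=0 in the environment executes the commands for real.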
dbacl is bundled with a few other utilities:
- bayesol is a postprocessor which takes the dbacl output and computes
an optimal decision based on costs of misclassification. Together with
dbacl, this allows the construction of sophisticated multilingual
classification scripts, provided you are comfortable with some shell scripting.
- mailcross performs cross-validation of email classification. It can be used
to assess the performance of custom email classification scripts based on
dbacl and bayesol.
- mailinspect reads an mbox style mail folder and displays the emails in sorted
order, based on similarity to any given category.
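
How dbacl and bayesol could be chained in a script is sketched below.
Everything named here is an assumption: decide is a hypothetical helper,
the categories one and two are presumed to have been learned already,
my.risk stands for a risk specification you have written yourself (its
format is described in "man bayesol"), and the -vna and -c options are
given as recalled from the tutorial, so verify them against the manpages.

```shell
# Hypothetical helper: score a document against two categories and let
# bayesol choose the decision with the lowest expected cost.
# $1: risk specification file, $2: document to classify.
decide() {
    risk_file=$1
    document=$2
    # Assumption: -vna makes dbacl emit scores in the form bayesol reads.
    dbacl -c one -c two "$document" -vna | bayesol -c "$risk_file" -v
}

# Only run when both programs and the risk file are actually available.
if command -v dbacl >/dev/null 2>&1 &&
   command -v bayesol >/dev/null 2>&1 &&
   [ -f my.risk ]; then
    decide my.risk sample3.txt
else
    echo "skipping: dbacl, bayesol or my.risk not found"
fi
```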
DOCUMENTATION
See the bundled manpages (dbacl.1, bayesol.1, mailcross.1, mailinspect.1).
Generic installation instructions can be found in the file INSTALL.
A tutorial is in the file tutorial.html, and an exposition of
the algorithms is in dbacl.ps.
LICENSE
DBACL is distributed under the terms of the GNU General Public License (GPL)
which can be found in the file COPYING. The hash function code in the
file jenkins.c, by Bob Jenkins, is in the public domain.
BUILDING
There are several configuration options you can change in the file dbacl.h,
if you want to increase the maximum number of categories or optimize
hash table overhead.
To build and install the program, run the following commands from
within the dbacl source directory:
    ./configure
    make
    make install
The last step should be run with superuser privileges for a system-wide
installation. Alternatively,
    ./configure --prefix=/home/xyzzy
    make
    make install
builds and installs in user xyzzy's home directory, without the need for
root privileges. In this case, the following environment variables
should be set permanently (e.g. in the file .profile):
    PATH=$PATH:/home/xyzzy/bin
    MANPATH=$MANPATH:/home/xyzzy/man
    export PATH MANPATH
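
The two assignments above append unconditionally, so a .profile that is
sourced more than once keeps lengthening both variables. A possible
variant that adds each directory only once is sketched here; append_once
is a hypothetical helper, not something shipped with dbacl, and
/home/xyzzy is the example prefix from the text.

```shell
# Hypothetical helper: print the colon-separated list $1 with directory $2
# appended, unless $2 is already one of its entries.
append_once() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;      # already present, unchanged
        *)        printf '%s\n' "$1:$2" ;;   # same result as VAR=$VAR:dir
    esac
}

prefix=/home/xyzzy
PATH=$(append_once "$PATH" "$prefix/bin")
MANPATH=$(append_once "${MANPATH:-}" "$prefix/man")
export PATH MANPATH
```

After starting a new login shell, "command -v dbacl" should then print
the path of the installed program.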
INTERNATIONALIZATION
dbacl uses the current locale for processing. 8-bit clean multibyte
character sets (such as UTF-8) are supported in the default mode, while
arbitrary multibyte character sets require the -i command line option.
If you intend to use the -i option together with regular expressions,
you must build with a wide character POSIX regex library: ensure that
the Boost library is present on the system and type
    ./configure WIDE_REGEX=1
    make
    make install
Warning: there is a large performance penalty if you build dbacl this way,
which shows up whenever you use regular expressions. Only build this way if
you need correct regular expressions in a multibyte environment which isn't
8-bit clean.
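
To judge whether the -i option is likely to matter for your locale, a
rough check such as the following may help. It is only a sketch:
needs_i_option is a hypothetical helper, and the list of character sets is
an assumption (encodings in which bytes from the ASCII range can occur
inside a multibyte character), not something taken from dbacl itself.
Unknown names are treated as not needing -i.

```shell
# Hypothetical helper: succeed if the given charmap name is one of the
# encodings assumed here NOT to be 8-bit clean.
needs_i_option() {
    case $(printf '%s' "$1" | tr 'a-z' 'A-Z') in
        SHIFT_JIS|SJIS|BIG5*|GBK|GB18030|ISO-2022-*) return 0 ;;
        *) return 1 ;;   # e.g. UTF-8, EUC-JP, ISO-8859-1, or unknown
    esac
}

# "locale charmap" reports the character set of the current locale.
charmap=$(locale charmap 2>/dev/null || echo unknown)
if needs_i_option "$charmap"; then
    echo "$charmap: consider running dbacl with -i"
else
    echo "$charmap: the default mode is probably sufficient"
fi
```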
OTHER DEPENDENCIES
The main filter programs dbacl and bayesol have no special dependencies, and
can always be compiled.
mailinspect uses the readline and slang libraries for screen management in
interactive mode. The configure script will check for these libraries and
if it can't find them, mailinspect will be compiled without interactive support.
mailcross is a bash shell script which calls awk and formail at various
points. It will test for the existence of these programs in your path and
refuse to run if it can't find them.
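
The same kind of check can be made by hand before running mailcross. The
sketch below is not mailcross's own code; missing_tools is a hypothetical
helper that lists which of the named programs are absent from your path.

```shell
# Hypothetical helper: print the name of every argument that cannot be
# found as a command in the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools awk formail)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names share one line.
    echo "mailcross prerequisites missing:" $missing
else
    echo "awk and formail found"
fi
```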
RUNNING
There is a tutorial which you can read with any web browser: point it to the
file tutorial.html. For command line options and usage examples, type the
following after installation:
    man dbacl
    man bayesol
    man mailcross
    man mailinspect
You can also find a technical description of the algorithms and statistics
in the PostScript file dbacl.ps.
TUTORIAL SAMPLES
The tutorial.html document comes with several sample text files:
- sample1.txt and sample4.txt are extracts from Mark Twain, Huckleberry Finn
- sample2.txt, sample3.txt, sample5.txt are extracts from Douglas Adams,
The Hitchhiker's Guide to the Galaxy
AUTHOR
Laird A. Breyer <laird@lbreyer.com>
ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`
`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø
Latest update of this package can be found at http://amiga.sourceforge.net/
ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`ø°`
`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø`°ø
ARCHIVE CONTENTS
LhA Freeware Version 2.2
Copyright © 1991-94 by Stefan Boberg.
Copyright © 1998-2000 by Jim Cooper and David Tritscher.
Listing of archive 'dbacl-1.3.lha':
Original  Packed Ratio Date      Time     Name
-------- ------- ----- --------- -------- -------------
   77560   37453 51.7% 15-Jan-03 21:19:12 +bayesol.040
  162288   60196 62.9% 15-Jan-03 21:17:22 +dbacl.040
    1566    1023 34.6% 05-Dec-02 01:34:44 +japanese.txt
    5977    2189 63.3% 15-Dec-02 06:51:14 +mailcross
    5757    2368 58.8% 29-Dec-02 01:32:38 +mailcross.1
  228180   86699 62.0% 15-Jan-03 21:07:48 +mailinspect
  226292   85822 62.0% 15-Jan-03 21:27:36 +mailinspect.040
    6199    2593 58.1% 29-Dec-02 01:36:30 +mailinspect.1
       0       0  0.0% 21-Nov-02 11:36:26 +NEWS
     168     124 26.1% 07-Dec-02 13:08:36 +prop.pl
    4525    2150 52.4% 29-Dec-02 01:25:44 +README
    3318    1674 49.5% 06-Dec-02 00:05:42 +sample1.txt
    2605    1318 49.4% 06-Dec-02 00:02:34 +sample2.txt
    3073    1535 50.0% 05-Dec-02 23:57:12 +sample3.txt
    3283    1653 49.6% 06-Dec-02 00:53:20 +sample4.txt
    3757    1851 50.7% 08-Dec-02 08:38:06 +sample5.txt
    4055    1869 53.9% 08-Dec-02 05:15:14 +sample6.txt
     136      96 29.4% 06-Dec-02 10:52:08 +toy.risk
   29582   10326 65.0% 15-Dec-02 07:01:20 +tutorial.html
    3274    1580 51.7% 12-Aug-02 04:16:10 +ylwrap
      31      31  0.0% 17-Oct-02 13:41:40 +AUTHORS
   77816   37613 51.6% 15-Jan-03 21:02:54 +bayesol
    4202    1844 56.1% 29-Dec-02 01:32:06 +bayesol.1
    1267     655 48.3% 29-Dec-02 01:26:08 +ChangeLog
   17992    7014 61.0% 12-Aug-02 04:16:10 +COPYING
  161124   59441 63.1% 15-Jan-03 20:59:38 +dbacl
   14851    5694 61.6% 29-Dec-02 01:31:36 +dbacl.1
  435463  182427 58.1% 29-Nov-02 01:13:30 +dbacl.ps
     318     166 47.7% 08-Dec-02 09:06:46 +example1.risk
     452     236 47.7% 08-Dec-02 09:06:46 +example2.risk
     492     258 47.5% 08-Dec-02 09:06:46 +example3.risk
-------- ------- ----- --------- --------
 1485603  597898 59.7% Operation successful.
==============================================================================
>»>»>»>»> Some additional info about this archive:
Source: http://prdownloads.sf.net/amiga/dbacl-1.3.lha?download
FileSize: 599252 Bytes
CRC: EBCC5E0C
MD5: 7D06389E578478190ECF577E3B6F7F1E
SHA: 17F53C8D799561B112250241692A392E72135851
==============================================================================
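
A downloaded copy of the archive can be compared with the MD5 value listed
above along these lines. This is a sketch only: verify_md5 is a
hypothetical helper, the md5sum program (GNU coreutils) is assumed to be
installed, and the archive is assumed to sit in the current directory
under the name dbacl-1.3.lha.

```shell
# Hypothetical helper: succeed if the MD5 digest of file $1 equals the
# hexadecimal value $2, ignoring letter case.
verify_md5() {
    actual=$(md5sum "$1" | awk '{ print $1 }')
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ "$actual" = "$expected" ]
}

archive=dbacl-1.3.lha
if [ -f "$archive" ]; then
    if verify_md5 "$archive" 7D06389E578478190ECF577E3B6F7F1E; then
        echo "$archive: MD5 matches"
    else
        echo "$archive: MD5 MISMATCH"
    fi
else
    echo "$archive not found; nothing to verify"
fi
```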