Reuters-21578 subset: a dataset example
The reuters.subset.tgz archive contains a subset of the Reuters-21578 dataset often used for text categorization experiments. This subset contains 466 documents over 4 categories.
|       | all categories | earn | acq | crude | corn |
| train | 377            | 154  | 114 | 76    | 38   |
| test  | 89             | 42   | 26  | 15    | 10   |
The idea of this example is to provide a dataset similar to the one used for the experiments in the first part of: Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini and Chris Watkins. Text Classification using String Kernels, Journal of Machine Learning Research 2:419-444, 2002. The size of the dataset and the splits are similar; however:
- The two datasets do not contain the same documents.
- In Lodhi et al., the authors also performed some text normalization (removing stop words, punctuation, etc.) on the documents.
Hence, due to the lack of text normalization, we expect the performance of a given kernel on this dataset to be slightly worse than the results reported in Lodhi et al.
In the following, we assume that your PATH and LD_LIBRARY_PATH environment variables are set as suggested in the quick tour (PATH should contain far, kernel/bin and libsvm-2.82; LD_LIBRARY_PATH should contain far, kernel/lib and kernel/plugin).
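For instance, if the OpenFst, OpenKernel and LIBSVM builds all live under a single root directory, the variables could be set as follows (the ROOT value below is hypothetical; substitute your own install locations):

$ ROOT=/path/to/install    # hypothetical install root
$ export PATH=$ROOT/far:$ROOT/kernel/bin:$ROOT/libsvm-2.82:$PATH
$ export LD_LIBRARY_PATH=$ROOT/far:$ROOT/kernel/lib:$ROOT/kernel/plugin:$LD_LIBRARY_PATH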
The documents are in the data subdirectory, each document being a plain text file. The data.list file contains the list of text files that defines our dataset.
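The file is expected to contain one document path per line; for instance, it could look like the following (the file names here are hypothetical, the actual names are those shipped in the archive):

data/doc0001.txt
data/doc0002.txt
data/doc0003.txt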
Following Lodhi et al., we are going to evaluate n-gram kernels at the character level. The first step is to convert each text file into a linear automaton in which each transition represents an (ASCII) character. This is done using the farcompilestrings utility, as shown below; the result is a far file containing a collection of FSTs (in the OpenFst library binary format) appearing in the same order as in the data.list file.
$ farcompilestrings --arc_type=log --entry_type=file --token_type=byte --generate_keys=3 --file_list_input data.list data.far
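If you want to sanity-check the resulting archive, the far tools distributed with OpenFst can be used; for instance, farinfo summarizes the contents of the archive (number of FSTs, arc type, key range):

$ farinfo data.far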
A normalized 4-gram kernel for this dataset can be generated using the command:
$ klngram -order=4 -sigma=256 data.far > 4-gram.kar
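For reference (our notation, following Lodhi et al.; this formula is not printed by the tool itself), the n-gram kernel counts the n-grams two documents have in common, and the normalized variant divides out the document norms:

k_n(x, y) = \sum_{|u|=n} c_x(u) c_y(u),    \hat{k}_n(x, y) = k_n(x, y) / \sqrt{k_n(x, x) k_n(y, y)},

where c_x(u) is the number of occurrences of the n-gram u in document x. We take -sigma=256 to specify the alphabet size (one symbol per byte value); consult the klngram documentation to confirm.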
To evaluate the performance of this kernel for classifying the acq category (one vs. others), we first train an SVM model:
$ svm-train -k openkernel -K 4-gram.kar acq.train acq.train.4-gram.model
open kernel successfully loaded
*
optimization finished, #iter = 362
nu = 0.339642
obj = -74.288867, rho = -0.368477
nSV = 217, nBSV = 60
Total nSV = 217
openkernel: 82563 kernel computations
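Optionally, LIBSVM's standard -v option can be used to obtain an n-fold cross-validation accuracy estimate on the training set (a sketch, assuming the OpenKernel patch leaves LIBSVM's standard options untouched):

$ svm-train -k openkernel -K 4-gram.kar -v 5 acq.train

Note that with -v, LIBSVM only reports the cross-validation accuracy and does not write a model file.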
The generated model can then be used for prediction:
$ svm-predict acq.test acq.train.4-gram.model acq.test.4-gram.pred
Loading open kernel
open kernel: 4-gram.kar
open kernel successfully loaded
Accuracy = 89.8876% (80/89) (classification)
Mean squared error = 0.404494 (regression)
Squared correlation coefficient = 0.566988 (regression)
Finally, this prediction can be scored using the score.sh utility:
$ ./score.sh acq.test.4-gram.pred acq.test
true positive = 21
true negative = 59
false positive = 4
false negative = 5
---
accuracy = 0.898876
precision = 0.84
recall = 0.807692
F1 = 0.823529
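These values follow directly from the confusion matrix above; as a quick check of the arithmetic:

precision = tp / (tp + fp) = 21/25 = 0.84
recall = tp / (tp + fn) = 21/26 ≈ 0.807692
F1 = 2 · precision · recall / (precision + recall) ≈ 0.823529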
This is comparable to the F1 of 0.873 reported by Lodhi et al.
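The same pipeline can be repeated for the other categories; a minimal sketch, assuming the archive provides earn, crude and corn train/test files following the same naming scheme as acq:

$ for cat in earn acq crude corn; do
>   svm-train -k openkernel -K 4-gram.kar $cat.train $cat.train.4-gram.model
>   svm-predict $cat.test $cat.train.4-gram.model $cat.test.4-gram.pred
>   ./score.sh $cat.test.4-gram.pred $cat.test
> done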
--
CyrilAllauzen - 30 Oct 2007