OpenKernel Quick Tour
Under construction.
Using the library
In this quick tour, we will focus on the command-line utilities and LIBSVM plugin.
The command-line utilities are available in the
kernel/bin
sub-directory.
Preparing your data
In order to use the library, you need to represent each point in your dataset as an
fst
,
i.e., a weighted transducer (or automaton) represented in the binary format used
by the
OpenFst library. The
OpenFst quick tour
contains the relevant information for accomplishing this.
A dataset is then represented by an
Fst archive (
far
file) or by a text file containing a list of
fst
files (each specified using an
absolute path). The
i -th entry in the
far
archive or in the
fst
list is the
fst
representing the
i -th point
in the dataset.
This dataset should contain both your training and testing data.
An example dataset,
a subset of Reuters-21578, is provided with the library
and can be used to become familiar with its usage.
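As a minimal sketch of the list-based representation described above, the following Python helper writes one absolute path per line in a fixed, sorted order, so that line i always refers to the same point. The helper name `write_fst_list`, the `*.fst` file extension and the file name `fst.list` are assumptions made for illustration. The demo uses empty placeholder files, not real OpenFst binaries.

```python
from pathlib import Path
import tempfile


def write_fst_list(fst_dir, list_path):
    """Write one absolute path per line, in sorted (stable) order.

    Line i of the list is point i of the dataset, so the order must not
    change between runs.
    """
    paths = sorted(Path(fst_dir).resolve().glob("*.fst"))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


# Demo on placeholder files (empty stand-ins, not real OpenFst binaries).
demo_dir = Path(tempfile.mkdtemp())
for name in ("doc2.fst", "doc0.fst", "doc1.fst"):
    (demo_dir / name).touch()

listed = write_fst_list(demo_dir, demo_dir / "fst.list")
lines = (demo_dir / "fst.list").read_text().splitlines()
```

Sorting by file name is only one possible convention. What matters is that the same order is used whenever the list is regenerated, because the training and test files refer to points by their position.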
Creating an n -gram kernel
The
klngram
utility can be used to generate an n-gram kernel. The
-order
option
specifies the n-gram order and the
-sigma
option the size of the alphabet (
i.e.
the maximum label id). The last argument, either a
far
file or an
fst
list such as
fst.list
, specifies the dataset the kernel operates
on. The output of
klngram
is a
kar
file (for kernel archive)
that contains both the kernel function and the dataset it is defined on.
$ klngram -order=3 -sigma=2 data.far > 3-gram.kar
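To give an intuition for what such a kernel measures, here is a small Python sketch of one common definition of an n-gram kernel on plain label sequences: the sum, over all n-grams of a given length, of the product of their counts in the two sequences. This illustrates the idea only and is not the library's implementation. The library operates on weighted automata, and whether klngram counts only n-grams of length exactly the given order is not stated in this tour. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(seq, order):
    """Count every contiguous n-gram of exactly `order` symbols in seq."""
    return Counter(tuple(seq[i:i + order]) for i in range(len(seq) - order + 1))


def ngram_kernel(x, y, order):
    """Sum over shared n-grams of (count in x) * (count in y)."""
    cx, cy = ngram_counts(x, order), ngram_counts(y, order)
    return sum(count * cy[gram] for gram, count in cx.items())


# Two sequences over the label alphabet {1, 2} (maximum label id 2).
x = [1, 2, 1, 2, 1]
y = [2, 1, 2, 2]

k_xy = ngram_kernel(x, y, 3)  # only the trigram (2, 1, 2) is shared
k_xx = ngram_kernel(x, x, 3)  # (1, 2, 1) occurs twice, (2, 1, 2) once
```

Here `k_xy` is 1 and `k_xx` is 2*2 + 1*1 = 5.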
In addition to n-gram kernels, the library provides tools for the creation of gappy n-gram kernels (
klngram
),
mismatch kernels (
klmismatch
) and arbitrary rational kernels (
klrational
).
Kernels can also be combined by taking their sum (
klsum
) or product (
klproduct
) or
can be composed with a polynomial (
klpolynomial
), a gaussian (
klgaussian
) or
a sigmoid (
klsigmoid
).
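The effect of these combinations can be pictured on small kernel matrices. The Python sketch below uses the textbook definitions: the sum and product of two kernels act entry by entry on their matrices, and a gaussian is built from the squared distance K(x,x) + K(y,y) - 2K(x,y). The exact parameterization used by klgaussian and the other utilities is not described in this tour, so the `width` parameter and the function names are assumptions.

```python
import math


def k_sum(a, b):
    """Entry-wise sum of two kernel matrices of the same size."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def k_product(a, b):
    """Entry-wise product of two kernel matrices of the same size."""
    return [[u * v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def k_gaussian(k, width):
    """exp(-d2 / (2 * width^2)) with d2 = K[i][i] + K[j][j] - 2 * K[i][j]."""
    n = len(k)
    return [
        [math.exp(-(k[i][i] + k[j][j] - 2 * k[i][j]) / (2 * width ** 2))
         for j in range(n)]
        for i in range(n)
    ]


k1 = [[5.0, 1.0], [1.0, 2.0]]
k2 = [[1.0, 0.5], [0.5, 1.0]]

summed = k_sum(k1, k2)
multiplied = k_product(k1, k2)
smoothed = k_gaussian(k1, 1.0)
```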
Generating a kernel matrix
The kernel matrix corresponding to the evaluation of the kernel on the specified dataset
can be computed using the
kleval
utility as shown here:
$ kleval 3-gram.kar > 3-gram.matrix
Assuming the size of the dataset is
n, the result will be a text file with
n lines and
n floats on each line. The
j -th value on the
i -th line corresponds to the value
of the kernel for the
i -th and
j -th points in the dataset.
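A file in this layout can be loaded with a few lines of Python. The sketch below assumes the floats on each line are separated by whitespace, which this tour does not state explicitly. The function name `parse_kernel_matrix` is hypothetical.

```python
def parse_kernel_matrix(text):
    """Parse n lines of n whitespace-separated floats into a list of rows."""
    rows = [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]
    n = len(rows)
    if any(len(row) != n for row in rows):
        raise ValueError("expected a square matrix: n lines of n floats")
    return rows


# Small stand-in for the contents of a file such as 3-gram.matrix.
sample = "5 1 0\n1 2 3\n0 3 4\n"
matrix = parse_kernel_matrix(sample)

# matrix[i][j] is the kernel value for the i-th and j-th points (0-based here).
value_0_1 = matrix[0][1]
```

For a real file, the text would be read with something like `open("3-gram.matrix").read()`.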
The kernel matrix can be partially computed by restricting the
set of values to be evaluated using the
-xmin
,
-xmax
,
-ymin
and
-ymax
flags. Assuming the lines and columns are indexed from 0 to
n - 1, the following
command computes only the (
i,
j) values with 10 ≤
i,
j < 20:
$ kleval -xmin=10 -ymin=10 -xmax=20 -ymax=20 3-gram.kar
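When a large matrix is to be computed block by block, the flag values can be generated rather than typed by hand. The following Python sketch only builds the command strings. It assumes, as the example above suggests, that the `-xmax` and `-ymax` bounds are exclusive. The helper name `block_commands` is hypothetical, and how the partial outputs are laid out is not covered here.

```python
def block_commands(n, block, kar_path):
    """Return one kleval command per (block x block) tile of an n x n matrix."""
    commands = []
    for xmin in range(0, n, block):
        for ymin in range(0, n, block):
            xmax = min(xmin + block, n)  # exclusive upper bound (assumption)
            ymax = min(ymin + block, n)
            commands.append(
                f"kleval -xmin={xmin} -ymin={ymin} "
                f"-xmax={xmax} -ymax={ymax} {kar_path}"
            )
    return commands


# A dataset of 25 points split into tiles of at most 10 x 10 values.
commands = block_commands(25, 10, "3-gram.kar")
```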
Using the
-libsvm
option will generate a file in the format used by LIBSVM to specify
precomputed kernels. LIBSVM users are however encouraged to use the LIBSVM plugin
as described below.
Finally, the
-kar
option allows the kernel matrix to be stored in a kar file in addition to
the kernel function and dataset.
$ kleval -kar 3-gram.kar > 3-gram.matrix.kar
Using the LIBSVM plugin
The OpenKernel library package includes a modified version of
LIBSVM that allows the definition
of arbitrary plugins to handle the kernel computations. This version of LIBSVM is available
in the
libsvm
sub-directory. A specific plugin to allow the use of the OpenKernel library
with libsvm is provided in the
kernel/plugin
sub-directory. In order to use this plugin, you
need to add the path to the
kernel/plugin
sub-directory to your dynamic loader path (
LD_LIBRARY_PATH
on Linux,
DYLD_LIBRARY_PATH
on MacOS X).
The training and test datasets need to be specified in the usual LIBSVM format (if you are not familiar with
LIBSVM, check out the
official website or
the
README
file in the
libsvm
directory).
For instance, a text file
train
such as:
1 1:1.0
-1 2:1.0
1 4:1.0
specifies that the 1st, 2nd and 4th points of the dataset are in the training set with labels 1, -1 and 1.
And a text file
test
such as:
-1 3:1.0
1 5:1.0
specifies that the 3rd and 5th points of the dataset are in the test set (the labels are optional here and
will only be used for scoring).
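Files of this kind are easy to generate from a table of labels. The Python sketch below reproduces the two examples above. The function name `libsvm_index_lines` and the `labels` dictionary are hypothetical. Point indices are 1-based, as in the examples.

```python
def libsvm_index_lines(labels, indices):
    """Format one 'label index:1.0' line per selected point.

    labels maps a 1-based point index to its label.
    """
    return "".join(f"{labels[i]} {i}:1.0\n" for i in indices)


# Labels for a dataset of five points (illustrative values).
labels = {1: 1, 2: -1, 3: -1, 4: 1, 5: 1}

train_text = libsvm_index_lines(labels, [1, 2, 4])
test_text = libsvm_index_lines(labels, [3, 5])
```

Writing `train_text` and `test_text` to files named train and test gives the inputs expected by the LIBSVM utilities.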
The
svm-train
utility needs to be called with two additional options. The
-k
option
specifies the type of kernel and should be
openkernel
when using the OpenKernel library.
The
-K
option specifies the
kar
file defining the kernel and dataset to be used.
All the other
svm-train
options are still available.
For example:
$ svm-train -k openkernel -K 3-gram.kar train 3-gram.model
The
svm-predict
utility does not require any additional options. The kernel information
is included in the
model
file:
$ svm-predict test 3-gram.model 3-gram.pred
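The predictions can then be compared with the labels given in the test file. The Python sketch below assumes the standard LIBSVM classification output, with one predicted label per line in the prediction file and no probability estimates. The function name `accuracy` is hypothetical.

```python
def accuracy(test_text, pred_text):
    """Fraction of test points whose predicted label equals the given label."""
    gold = [line.split()[0] for line in test_text.splitlines() if line.strip()]
    pred = [line.split()[0] for line in pred_text.splitlines() if line.strip()]
    if not gold or len(gold) != len(pred):
        raise ValueError("test and prediction files must have the same, "
                         "non-zero number of lines")
    hits = sum(float(g) == float(p) for g, p in zip(gold, pred))
    return hits / len(gold)


# Stand-ins for the contents of the test file and of 3-gram.pred.
test_text = "-1 3:1.0\n1 5:1.0\n"
pred_text = "-1\n-1\n"

score = accuracy(test_text, pred_text)  # one of the two labels matches
```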
When using the LIBSVM plugin, the kernel values are computed "on the fly" as requested by the LIBSVM utilities.
To avoid unnecessary computations when performing several experiments using the same kernel (on the same dataset),
it is recommended to
first compute the (partial) kernel matrix using
kleval -kar
and use the resulting
kar
file as a parameter
to the LIBSVM utilities.
--
CyrilAllauzen - 08 Oct 2007