Baum-Welch documentation
Training
Training is performed using the `Train` function. It takes as arguments a FAR or WFST representing the plaintext, a channel model, and a FAR representing the ciphertext.
Normalization strategies
Training takes a template argument representing the desired expectation table type. Two types are available. The state expectation table normalizes so that all arcs leaving a state sum to semiring one. The state/input-label expectation table normalizes so that the weights of all arcs leaving a state with the same input label sum to semiring one; this is arguably the most intuitive normalization strategy and so it is used as the default.
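Concretely, if $$E[q]$$ denotes the set of arcs leaving state $$q$$, and $$w(e)$$ and $$i(e)$$ denote an arc's weight and input label, the two strategies enforce, respectively,
$$\bigoplus_{e \in E[q]} w(e) = \bar{1}$$
and
$$\bigoplus_{e \in E[q] : i(e) = x} w(e) = \bar{1} \quad \textrm{for each input label } x.$$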
Semirings and count collection strategies
In a semiring without the _path property_, like the log semiring, the E-step sums over all state sequences and thus performs true Baum-Welch training. If the input is an FST language model with backoffs, attach a $$\phi$$-matcher (`fstspecial --fst_type=phi`) to ensure proper interpretation.
In a semiring with the path property, like the tropical semiring, the E-step sums over only the most likely state sequence, an approximation known as Viterbi training (Brown et al. 1993:293). If the input is an FST language model with backoffs, the backoffs can be encoded exactly using a $$\phi$$-matcher or approximated by $$\epsilon$$ using the default matcher. In practice, Baum-Welch training is slightly slower than Viterbi training, though somewhat more accurate (Jansche 2003).
Options
`Train` takes a struct representing options for training. These define the size of the batch, the maximum number of iterations for training, the learning rate $$\alpha$$, and the comparison/quantization $$\delta$$ used to detect convergence.
Since expectation maximization training is only locally optimal, _random (re)starts_ can be used to avoid local minima. For particularly difficult problems, it may be beneficial to perform a very large number of random starts (e.g., Berg-Kirkpatrick and Klein 2013). In this library, we randomly initialize weights by sampling from the uniform distribution over $$(\texttt{kDelta}, 1]$$ in the real numbers and then mapping these values onto the appropriate values in the desired semiring. `Randomize` can be used to generate random restarts.
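As an illustration of this initialization (not the library's `Randomize` implementation itself), a probability $$p$$ drawn from $$(\texttt{kDelta}, 1]$$ maps to the weight $$-\log p$$ in the log and tropical semirings; a minimal sketch:

```cpp
#include <cmath>
#include <random>

#include <fst/float-weight.h>

// Draws a probability p from (approximately) the interval (kDelta, 1] and
// maps it to the corresponding log-semiring weight -log(p). Endpoint
// handling is glossed over; this only illustrates the mapping, not the
// library's Randomize.
fst::LogWeight RandomLogWeight(std::mt19937 *rng) {
  std::uniform_real_distribution<double> dist(fst::kDelta, 1.0);
  const double p = dist(*rng);
  return fst::LogWeight(-std::log(p));
}
```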
Stepwise interpolation
The actual training method uses a form of stepwise interpolation due to Liang and Klein (2009). At time $$k$$ a weight $$W_k$$ is given by
$$W_k = (1 - \nu_k) W_{k - 1} + \nu_k M_k$$
where $$\nu_k = (k + 2)^{-\alpha}$$, $$\alpha$$ is a fixed hyperparameter controlling the decay of $$\nu_k$$, and $$M_k$$ is the maximized expectation at time $$k$$.
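A sketch of this update over real-valued parameters (the library applies the equivalent update to the weights of the expectation table in the chosen semiring):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One stepwise-interpolation update, stated over real-valued parameters
// for clarity.
void StepwiseUpdate(const std::vector<double> &maximized,  // M_k
                    double alpha,                          // decay exponent
                    int k,                                 // timestep
                    std::vector<double> *weights) {        // W_{k-1} in, W_k out
  const double nu = std::pow(k + 2, -alpha);
  for (std::size_t i = 0; i < weights->size(); ++i) {
    (*weights)[i] = (1.0 - nu) * (*weights)[i] + nu * maximized[i];
  }
}
```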
Decoding
`Decode` is the primary driver for decoding. Decoding with the trained
model is essentially a process of constructing a weighted lattice followed by a
shortest path computation.
In a semiring with the path property, decoding is simple:
- Compose the input, channel model, and output, and materialize the cascade.
- Compute the shortest path using the Viterbi algorithm.
- Input-project (for decipherment only), then remove epsilons and weights.
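A minimal OpenFst sketch of these steps for the decipherment case, assuming `lm`, `channel`, and `ciphertext` are compatible tropical-semiring FSTs (FAR handling, symbol tables, and error checking omitted; this is not the library's `Decode` itself):

```cpp
#include <fst/fstlib.h>

// Viterbi decoding in a path-property semiring: compose, shortest path,
// input-project (decipherment case), then strip epsilons and weights.
fst::StdVectorFst DecodeTropical(const fst::StdVectorFst &lm,
                                 fst::StdVectorFst channel,
                                 fst::StdVectorFst ciphertext) {
  // Composition requires appropriately arc-sorted operands.
  fst::ArcSort(&channel, fst::ILabelCompare<fst::StdArc>());
  fst::ArcSort(&ciphertext, fst::ILabelCompare<fst::StdArc>());
  // Compose the plaintext model, channel model, and ciphertext.
  fst::StdVectorFst tmp, lattice;
  fst::Compose(lm, channel, &tmp);
  fst::Compose(tmp, ciphertext, &lattice);
  // Viterbi shortest path.
  fst::StdVectorFst best;
  fst::ShortestPath(lattice, &best);
  // Keep the plaintext side, then remove epsilons and weights.
  fst::Project(&best, fst::ProjectType::INPUT);  // PROJECT_INPUT pre-1.8.
  fst::RmEpsilon(&best);
  fst::ArcMap(&best, fst::RmWeightMapper<fst::StdArc>());
  return best;
}
```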
In a semiring without the path property, exact decoding is much more challenging, and in fact we do not have an algorithm for the pair case, since we normally need to preserve the input-output correspondences. In the decipherment case, exact decoding is possible over a determinized (DFA) lattice, but determinizing the entire lattice is rarely feasible for interesting problems. Therefore we use the following strategy:
- Compose the plaintext model, channel model, and ciphertext, and materialize the cascade, then input-project and remove epsilons to produce an acyclic, epsilon-free non-deterministic weighted acceptor (henceforth, the NFA).
- Compute $$\beta_n$$, the costs to the future in the NFA.
- Compute, on the fly, a deterministic acceptor (henceforth, the DFA); expansion of the DFA converts the NFA future costs $$\beta_n$$ to the DFA future costs $$\beta_d$$.
- Compute, on the fly, a mapping of the DFA to a semiring with the path property by converting the weights to tropical.
- Compute the shortest path of the on-the-fly tropical-semiring DFA using A* search (Hart et al. 1968) with $$\beta_d$$ as a heuristic, halting immediately after the shortest path is found.
- Convert the shortest path back to the input semiring and remove weights.
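A simplified OpenFst sketch of this strategy, with the future-cost computation and the A* queue omitted (so the lazily determinized DFA may be expanded more than the library's `Decode` would):

```cpp
#include <fst/fstlib.h>

// Simplified log-semiring decoding: build the epsilon-free NFA, lazily
// determinize it, lazily convert weights to the tropical semiring, and
// take the shortest path.
fst::VectorFst<fst::StdArc> DecodeLog(const fst::VectorFst<fst::LogArc> &lm,
                                      fst::VectorFst<fst::LogArc> channel,
                                      fst::VectorFst<fst::LogArc> ciphertext) {
  using fst::LogArc;
  using fst::StdArc;
  fst::ArcSort(&channel, fst::ILabelCompare<LogArc>());
  fst::ArcSort(&ciphertext, fst::ILabelCompare<LogArc>());
  // The NFA: compose, input-project, remove epsilons.
  fst::VectorFst<LogArc> tmp, nfa;
  fst::Compose(lm, channel, &tmp);
  fst::Compose(tmp, ciphertext, &nfa);
  fst::Project(&nfa, fst::ProjectType::INPUT);  // PROJECT_INPUT pre-1.8.
  fst::RmEpsilon(&nfa);
  // On-the-fly DFA, with weights mapped on the fly into the tropical
  // semiring so that shortest-path search is well-defined.
  fst::DeterminizeFst<LogArc> dfa(nfa);
  using Converter = fst::WeightConvertMapper<LogArc, StdArc>;
  fst::ArcMapFst<LogArc, StdArc, Converter> tropical_dfa(dfa, Converter());
  fst::VectorFst<StdArc> best;
  fst::ShortestPath(tropical_dfa, &best);
  // The library converts the path back to the input semiring; here we
  // simply strip the weights.
  fst::ArcMap(&best, fst::RmWeightMapper<StdArc>());
  return best;
}
```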
Problems
Pair modeling
In the pair scenario, we observe a set of paired input/output strings $$(i, o)$$
where $$i \in {I}^{*}$$ and $$o \in {O}^{*}$$ and wish to construct a
conditional model. This is useful for many monotonic string-to-string
transduction problems, such as grapheme-to-phoneme conversion or abbreviation
expansion.
Probabilistic formulation
We would like to estimate a conditional probability model over input-output
pairs, henceforth $$\textrm{P}(o \mid i)$$.
Finite-state formulation
We first construct FARs containing all the input strings $$i$$ and all the output strings $$o$$, respectively. We then construct a model of the alignment $$\Gamma \subseteq {I}^{*} \times {O}^{*}$$. This serves as the initial (usually uniform) model for $$\textrm{P}(o \mid i)$$; we henceforth refer to it as the _channel model_.
Given an estimate for the channel model, $$\hat{\Gamma}$$, we can compute the
best alignment by decoding
$$\mathrm{ShortestPath}\left[i \circ \hat{\Gamma} \circ o\right]$$
where $$\textrm{ShortestPath}$$ represents the Viterbi algorithm.
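For a single pair, this decode could be sketched as follows, assuming `i_fst`, `o_fst`, and the channel estimate `gamma` share compatible symbol tables (FAR iteration omitted):

```cpp
#include <fst/fstlib.h>

// Best alignment for one (i, o) pair: compose i, the current channel
// estimate, and o, then take the Viterbi shortest path.
fst::StdVectorFst BestAlignment(const fst::StdVectorFst &i_fst,
                                fst::StdVectorFst gamma,
                                fst::StdVectorFst o_fst) {
  fst::ArcSort(&gamma, fst::ILabelCompare<fst::StdArc>());
  fst::ArcSort(&o_fst, fst::ILabelCompare<fst::StdArc>());
  fst::StdVectorFst tmp, lattice, best;
  fst::Compose(i_fst, gamma, &tmp);
  fst::Compose(tmp, o_fst, &lattice);
  fst::ShortestPath(lattice, &best);
  return best;
}
```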
Pair language modeling
Once the alignment model $$\hat{\Gamma}$$ is estimated, we can use it to construct a so-called _pair language model_ (Novak et al. 2012, 2016). This is
much like a classic language model but the actual symbols represent pairs in
$${I} \times {O}$$; unlike $$\hat{\Gamma}$$, it is a joint model. To construct
such a model:
- Compute the best $$i, o$$ alignments by decoding with $$\hat{\Gamma}$$.
- Rewrite the FSTs in the decoded alignments FAR as acceptors over a pair symbol vocabulary $${I} \times {O}$$.
- Compute a higher-order language model (e.g., using the NGram tools), with standard smoothing and shrinking options, producing a weighted finite-state acceptor.
- Convert the weighted finite-state acceptor to a finite-state transducer by rewriting the "acceptor" $${I} \times {O}$$ arcs with the corresponding "transducer" $${I} : {O}$$ arcs.
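The conversion in the last step could be sketched as follows; the pair-symbol spelling (here assumed to be of the form "x:y") and the symbol-table arguments are assumptions for illustration:

```cpp
#include <string>

#include <fst/fstlib.h>

// Rewrites an acceptor over pair symbols (assumed here to be spelled
// "x:y") as a transducer with input label x and output label y.
void PairsToTransducer(const fst::SymbolTable &pair_syms,
                       const fst::SymbolTable &isyms,
                       const fst::SymbolTable &osyms,
                       fst::StdVectorFst *fst) {
  for (fst::StateIterator<fst::StdVectorFst> siter(*fst); !siter.Done();
       siter.Next()) {
    for (fst::MutableArcIterator<fst::StdVectorFst> aiter(fst, siter.Value());
         !aiter.Done(); aiter.Next()) {
      auto arc = aiter.Value();
      if (arc.ilabel == 0) continue;  // Leave epsilons alone.
      const std::string pair = pair_syms.Find(arc.ilabel);
      const auto pos = pair.find(':');
      arc.ilabel = isyms.Find(pair.substr(0, pos));
      arc.olabel = osyms.Find(pair.substr(pos + 1));
      aiter.SetValue(arc);
    }
  }
  fst->SetInputSymbols(&isyms);
  fst->SetOutputSymbols(&osyms);
}
```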
Then the pair language model can be applied using composition and decoded using
the Viterbi algorithm.
Decipherment
In the decipherment scenario, we observe a corpus of ciphertext $$c \in
{C}^{*}$$. We imagine that there exists a corpus of plaintext $$p \in {P}^{*}$$
which was used to generate the ciphertext $$c$$. Our goal is to _decode_ the ciphertext; i.e., to uncover the corresponding plaintext. We assume that the ciphertext $$c$$ has been generated (i.e., encoded) by $$\pi_o\left(p \circ \gamma\right)$$ where $$\pi_o$$ is the output projection operator and $$\gamma$$ is the encoder (i.e., inverse key). We further assume that the plaintext can be recovered (i.e., decoded) using $$\pi_i\left(\gamma \circ c\right)$$ where $$\pi_i$$ is the input projection operator. However, we do not observe $$\gamma$$, and must instead estimate it from data.
Probabilistic formulation
We would like to estimate a conditional probability model which predicts the
plaintext $$p$$ given $$c$$; i.e., $$\textrm{P}(p \mid c)$$. By Bayes' rule
$$\textrm{P}(p \mid c) \propto \textrm{P}(p) \textrm{P}(c \mid p).$$
That is, the conditional probability of the plaintext given the observed ciphertext
is proportional to the product of the probability of the plaintext and the
conditional probability of the ciphertext given the plaintext; for any corpus of
ciphertext the denominator $$\textrm{P}(c)$$ is constant and thus can be
ignored.
We then express decoding as
$$\hat{p} = \arg\max_p \textrm{P}(p) \textrm{P}(c \mid p).$$
Finite-state formulation
We first construct a probabilistic model $$\Lambda \subseteq {P}^{*}$$, a superset of the true plaintext $$p$$ and a model of $$\textrm{P}(p)$$. We henceforth refer to this as the _language model_.
We then construct a model of the inverse keyspace, $$\Gamma \subseteq {P}^{*} \times {C}^{*}$$. This is a superset of the true encoder relation $$\gamma$$ and serves as an initial (usually uniform) model for $$\textrm{P}(c \mid p)$$; we henceforth refer to it as the _channel model_.
Given an estimate for the channel model, $$\hat{\Gamma}$$, we can express
decoding as
$$\mathrm{ShortestPath}\left[\pi_i\left(\Lambda \circ \hat{\Gamma}
\circ c\right)\right]$$
where $$\textrm{ShortestPath}$$ represents the Viterbi algorithm.
Sparsity penalties
Knight et al. (2006) suggest maximizing
$$\hat{p} = \arg\max_p \textrm{P}(p) \textrm{P}(c \mid p)^k$$
where $$k$$ is some real number $$> 1$$ during decoding. This strategy increases
the sparsity of the hypotheses generated by the trained model. This penalty can be applied using `fstmap --map_type=power --power=$K` before decoding, if desired.
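The same penalty could also be applied programmatically; a minimal sketch with a custom arc mapper, assuming the channel model's weights live in the log semiring (where raising a weight to the power $$k$$ multiplies its stored $$-\log$$ value by $$k$$):

```cpp
#include <cstdint>

#include <fst/fstlib.h>

// Arc mapper raising each log-semiring weight to the power k; in the log
// semiring this is multiplication of the stored value by k. Comparable in
// spirit to fstmap --map_type=power --power=$K.
struct LogPowerMapper {
  explicit LogPowerMapper(double k) : k(k) {}

  fst::LogArc operator()(const fst::LogArc &arc) const {
    return fst::LogArc(arc.ilabel, arc.olabel,
                       fst::LogWeight(arc.weight.Value() * k), arc.nextstate);
  }

  fst::MapFinalAction FinalAction() const { return fst::MAP_NO_SUPERFINAL; }
  fst::MapSymbolsAction InputSymbolsAction() const {
    return fst::MAP_COPY_SYMBOLS;
  }
  fst::MapSymbolsAction OutputSymbolsAction() const {
    return fst::MAP_COPY_SYMBOLS;
  }
  uint64_t Properties(uint64_t props) const {
    return props & fst::kWeightInvariantProperties;
  }

  double k;
};
```

It could then be applied to a log-semiring channel FST with `fst::ArcMap(&channel_fst, LogPowerMapper(k));` prior to decoding.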
References
Berg-Kirkpatrick, T., and Klein, D. 2013. Decipherment with a million random restarts. In
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 874-878.
Brown, P., Della Pietra, S. A., Della Pietra, V. J., and Mercer, R. L. 1993. The mathematics of statistical machine translation: parameter estimation.
Computational Linguistics 19(2): 263-311.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society, Series B, 39(1): 1-38.
Hart, P. E., Nilsson, N. J., Raphael, B. 1968. A formal basis for the heuristic determination of minimal cost paths.
IEEE Transactions on Systems Science and Cybernetics 4(2): 100-107.
Jansche, M. 2003. Inference of string mappings for language technology. Doctoral dissertation, Ohio State University.
Knight, K., Nair, A., Rathod, N., and Yamada, K. 2006. Unsupervised analysis for decipherment problems. In
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 499-506.
Liang, P., and Klein, D. 2009. Online EM for unsupervised models. In
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 611-619.
Novak, J. R., Minematsu, N., and Hirose, K. 2012. WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding. In
Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pages 45-49.
Novak, J. R., Minematsu, N., and Hirose, K. 2016. Phonetisaurus: exploring grapheme-to-phoneme conversion with joint n-gram models in the WFST framework.
Natural Language Engineering 22(6): 907-938.