NGramCount

Description

This utility counts n-grams from an input FST archive (FAR) and produces a count FST with the same topology as the eventual normalized model, including backoff transitions. The --order option sets the maximum n-gram order; the utility counts n-grams of every order up to and including that maximum. The --epsilon_as_backoff option makes the counter interpret <epsilon> in the input as a backoff transition while counting, which is appropriate only in very specialized circumstances (see Caveats below).
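To make "all orders up to the maximum" concrete, here is a minimal standalone sketch of that counting rule. It does not use OpenGrm or OpenFst. The names NGram and CountNGrams are hypothetical. It ignores sentence boundaries and backoff transitions, so it shows only which n-grams get counted, not the FST that ngramcount produces.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical illustration only; this is not the OpenGrm implementation.
using NGram = std::vector<std::string>;

// Counts every n-gram of length 1..max_order in one tokenized sentence.
// Sentence-boundary handling and backoff arcs are deliberately left out.
std::map<NGram, long> CountNGrams(const std::vector<std::string> &tokens,
                                  std::size_t max_order) {
  assert(max_order >= 1);
  std::map<NGram, long> counts;
  for (std::size_t start = 0; start < tokens.size(); ++start) {
    // Extend the n-gram beginning at `start` one token at a time.
    for (std::size_t len = 1;
         len <= max_order && start + len <= tokens.size(); ++len) {
      NGram ngram(tokens.begin() + start, tokens.begin() + start + len);
      ++counts[ngram];
    }
  }
  return counts;
}
```

For the token sequence a b a b with max_order 3, this sketch records six distinct n-grams: two unigrams, two bigrams and two trigrams. The bigram a b is counted twice.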

Usage

ngramcount [--options] [in.far [out.fst]]
  --order: type = int64, default = 3
  --epsilon_as_backoff: type = bool, default = false
 
 class NGramCounter(size_t order);
 

Examples

By default, ngramcount counts trigrams, bigrams and unigrams from an input corpus:

ngramcount earnest.far >earnest.3g.cnts


To count trigrams, bigrams and unigrams from a single FST using the library functions:

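// Counter for n-grams up to order 3 (trigrams, bigrams and unigrams).
NGramCounter<Log64Weight> ngram_counter(3);
// Read the input FST and accumulate its n-gram counts.
StdMutableFst *input = StdMutableFst::Read("in.fst", true);
ngram_counter.Count(*input);
delete input;
// Extract the count FST and write it out.
VectorFst<StdArc> counts;
ngram_counter.GetFst(&counts);
counts.Write("out.fst");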

Caveats

Backoff transitions, labeled with <epsilon>, have weight One() in the semiring. By default, count FSTs are in the tropical semiring, so each backoff transition has weight 0 and each n-gram transition has weight -log(count).
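The weight convention above can be checked with a few lines of arithmetic. The sketch below is standalone. The names kTropicalOne, CountToWeight and WeightToCount are hypothetical helpers, not part of the OpenGrm or OpenFst API, and plain double stands in for a tropical weight.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helpers for illustration; not OpenGrm or OpenFst API.

// One() in the tropical semiring is 0, the weight carried by backoff arcs.
constexpr double kTropicalOne = 0.0;

// Maps a raw n-gram count to its tropical weight, -log(count).
double CountToWeight(double count) {
  assert(count > 0.0);
  return -std::log(count);
}

// Recovers the raw count from a tropical weight.
double WeightToCount(double weight) { return std::exp(-weight); }
```

Under this convention a larger count gives a smaller (more negative) weight, and a count of 1 maps to weight 0, the same value as One().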

The --epsilon_as_backoff switch interprets <epsilon> in the input FST archive as a backoff transition. This is appropriate only when the corpus was randomly sampled from a model and records where backoff transitions were taken. It allows the presmoothed method in ngrammake to be used. This is not a typical scenario, so the option should be used with care.

-- MichaelRiley - 09 Dec 2011

Topic revision: r4 - 2011-12-14 - BrianRoark