I have two ngram models that I would like to merge. One of them has been pruned, and I am having trouble figuring out the correct way to perform that merge. Here is a minimal reproduction of the issue.
We have two corpora. The first (called corpus A) contains strings "hey there alice" and "hey alice". The second contains "hey siri". In each corpus, each string is repeated 10 times (this is just to avoid katz smoothing for counts <=5 to make the example clean). We build ngram models using ngramcount followed by ngrammake. I want to focus on ngrams following "hey". Here is a snippet of the ngram model for corpus A, produced using ngramprint:
<verbatim>
hey 0.2857143
hey <eps> 0.001750044
hey there 0.4995
hey alice 0.4995
...
</verbatim>
Now we want to prune corpus A, removing half-frequency and below entries (in particular, count pruning with count <= 5). This prunes all of the bigrams for "hey", including the epsilon backoff transition. The full pruned model is thus:
<verbatim>
hey 0.2857143
there 0.1428571
alice 0.2857143
alice <eps> 0.0014
alice </s> 0.999
</s> 0.2857143
<s> 0.2857143
<s> <eps> 0.0014
<s> hey 0.999
</verbatim>
Despite the pruning of the backoff transition, the model still works fine. If we use ngramperplexity to compute the perplexity of "hey there alice", it looks correct:
<verbatim>
ngram -logprob
N-gram probability found (base10)
p( hey | <s> ) = [2gram] 0.000434512
p( there | hey ...) = [1gram] 0.845098
p( alice | there ...) = [1gram] 0.544068
p( </s> | alice ...) = [2gram] 0.000434512
</verbatim>
ngramperplexity is correctly using an implicit free (weight=1, -logprob=0) backoff transition from "hey" to get back to the unigram probability for "there".
But what happens when we use ngrammerge to merge in the "hey siri" ngrams? We get this:
<verbatim>
hey 0.6190476
hey <eps> 0.0015
hey siri 0.999
</verbatim>
ngrammerge doesn't see the implicit backoff transition, and so we only get the low-likelihood backoff from the "hey siri" model. Thus, in the merged model, "hey there" and "hey alice" are extremely unlikely continuations for "hey".
It seems like I could fix this issue if I could manually re-introduce explicit backoff transitions into the pruned model. Is there an automatic way to do that? Alternatively, is there a better way to perform the merge operation to handle this situation?
Certainly it seems like it would be preferable to merge and then prune, but unfortunately this is not an option in my actual use case, because I do not have access to the unpruned model.
Hi Nick,
sorry for the delay, just noticed the question now -- need a better way of getting notified when there's a question! For your specific ask, there is no way to really re-introduce the backoff arc as you suggest. For the more general question, there are multiple model merging methods, and you might try --method=bayes_model_merge as an option, which does a better job of determining the probability of each model being in a particular state and adjusting the probabilities accordingly. (It's based on the methods in this paper: https://research.google.com/pubs/archive/37567.pdf)
You probably already dealt with this by now, but worth responding for posterity I suppose. Also, please do chime in if you found another workaround.
brian
No worries about the delayed reply! I did end up fixing this, with a somewhat hacky manual workaround. I added dummy [unk] transitions with a weight of -99 (1e-99) and accompanying free backoff transitions to the model before merging, so that what was once an implicit transition that ngrammerge would ignore is instead an explicit transition. The [unk] transitions were necessary because I did the surgery by dumping the model to text and then reading back in with ngramprint and ngramread. If you just add an explicit backoff with no alternative, ngramread will optimize it away and you'll be back where you started.
I'm currently developing a soft-keyboard spellchecking and next-word prediction feature for Uzbek. Because the language is agglutinative, I had to make a rule-based morphological analyzer plus a cascade of error models. Now I'm having issues with predicting the next word. Are there any use cases of using ngram FSTs in developing soft keyboards? Or are traditional methods, storing the language model in ARPA format, more feasible? Thanks in advance.
UPD: I have managed to predict the next word using an NGram FST. The answers given previously in this forum by Brian Roark, indicating that we need a special traversal algorithm for predicting the next word, helped a lot. The algorithm is simple: we run a DFS on the transducer, traversing the graph until we have consumed the last symbol (though the state we end up in may not be final). After that, we collect all outgoing transition symbols and return them to the user. An example of my training data (in Uzbek):
zohid <NN> <sp> sobiq <JJ> <sp> boshliq <NN> <Si> <ning> <sp> gap <NN> <lari> <ni>
Since Uzbek is an agglutinative language, I had to analyze each word using the morph analyzer and tagger. After that, I represented each root and meta-morpheme as equal tokens, with a special <sp> token for space.
So the model now predicts not only the next root, but may also predict the next affix. Using this in a soft keyboard is slow compared to other well-established keyboards, but it does its job. I wonder if there are any industry-proven finite-state techniques for modeling highly agglutinative languages, not only for next-word prediction, but also for spell and grammar checking.
Thank you for sharing your interesting experiences, and sorry for not replying at the time -- I missed your post. What you describe sounds like a good solution. As far as industry-proven techniques, I guess the one thing you might look at is sub-word modeling, where the sub-word 'pieces' are derived empirically from a large corpus. It does presuppose a large corpus, which you may not have, but it would have the virtue of being robust to any lack of coverage of the morph analyzer.
I'm having an issue piping input to ngrammerge, using a command like:
$ fstprint 116.mod | ngrammerge --normalize --v=10 - 103.mod test.merged
gives ERROR: ExpandedFst::Read: Can't open file: -
and if I remove the - as follows:
$ fstprint 116.mod | ngrammerge --normalize --v=10 103.mod test.merged
it will just print the usage information (likewise if I specify `-ofile=test.merged`). It looks like this behavior is based on the check here: https://www.opengrm.org/doxygen/ngram/html/ngrammerge-main_8cc_source.html#l00116.
Is there any way to have it parse stdin correctly?
Sorry, because of the way the merge algorithm works, it needs both files. Also, it takes the models in their binary format, so you don't need to use fstprint there. The command would be:
$ ngrammerge --normalize --v=10 116.mod 103.mod test.merged
brian
Generating a list of the most probable next words along with their probabilities
This is a question similar to the "ngramrandgen with a prefix" one, but with a different angle.
Given a trained model and the first few words in a sentence, I'd like to output an n-best list of the most probable next words along with their (log) probabilities. For example:
<verbatim>
$ echo 'IT IS SIMPLY A' | ngrambestlist earnest.mod
VERY 0.804171
INEXCUSABLE 0.357274
MANNER 0.1345083
</verbatim>
How can I implement ngrambestlist, either as a script or a C++ program? Thanks!
Hi,
sorry for the delay, just noticed this question. There are two steps to this: 1) determine the state of the FST model corresponding to the prefix; and 2) collecting the k-best continuations from that state. For the first step, one starts in the start state of the FST and moves to the next state of the arc labeled with the first word, and so on. If there is no such arc, then take the failure transition and try again. The destination state at the end of this process (repeated over the words in the prefix string), is where one collects the probabilities. All arcs leaving that state (other than the backoff arc) have the direct negative log probabilities of the words labeling them, and the backoff arc (labeled with index 0) has the backoff weight. One then follows the backoff arc to a new state and collects probabilities for words that haven't been collected yet, by adding the backoff weight and the negative log probabilities on the arcs labeling the words. Then again follow the backoff arc, adding the weight of that arc to the backoff weight and continuing as above. By recursively following this until there is a state with no backoff arc, one will collect all the probabilities of all of the words in the vocabulary. Then sort and output.
This would have to be written in C++, but as you can see, the algorithm is a relatively straightforward walking of the automaton structure. One must make sure to only collect backoff probabilities for unseen words, but that's the one real wrinkle that needs to be attended to.
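For what it's worth, here is a minimal, untested C++ sketch of that walk (the command-line interface and file names are made up for illustration; it assumes a standard OpenGrm model with the symbol table stored on the FST and backoff arcs labeled 0, and it prints negative log probabilities):
<verbatim>
// kbest_continuations.cc: sketch only, not part of the library.
// Usage: kbest_continuations model.fst k word1 word2 ...
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

#include <fst/fstlib.h>

using fst::StdArc;
using fst::StdVectorFst;

int main(int argc, char **argv) {
  StdVectorFst *lm = StdVectorFst::Read(argv[1]);
  if (!lm) return 1;
  const fst::SymbolTable *syms = lm->InputSymbols();
  const int k = atoi(argv[2]);

  // Step 1: find the state reached by the prefix, taking the backoff
  // (label 0) arc whenever the next word has no arc at the current state.
  StdArc::StateId s = lm->Start();
  for (int i = 3; i < argc; ++i) {
    const StdArc::Label label = syms->Find(argv[i]);
    bool advanced = false;
    while (!advanced) {
      StdArc::StateId backoff = fst::kNoStateId;
      for (fst::ArcIterator<StdVectorFst> aiter(*lm, s); !aiter.Done();
           aiter.Next()) {
        const StdArc &arc = aiter.Value();
        if (arc.ilabel == label) {
          s = arc.nextstate;
          advanced = true;
          break;
        }
        if (arc.ilabel == 0) backoff = arc.nextstate;
      }
      if (advanced) break;
      if (backoff == fst::kNoStateId) break;  // at the unigram state; give up
      s = backoff;
    }
  }

  // Step 2: collect -log P(w | prefix) for each word, keeping only the
  // first (highest-order) estimate seen and accumulating backoff weights
  // as we walk down the backoff chain.
  std::map<StdArc::Label, double> costs;
  double backoff_cost = 0.0;
  while (s != fst::kNoStateId) {
    StdArc::StateId next_state = fst::kNoStateId;
    double next_cost = 0.0;
    for (fst::ArcIterator<StdVectorFst> aiter(*lm, s); !aiter.Done();
         aiter.Next()) {
      const StdArc &arc = aiter.Value();
      if (arc.ilabel == 0) {  // backoff arc
        next_state = arc.nextstate;
        next_cost = arc.weight.Value();
      } else if (costs.count(arc.ilabel) == 0) {
        costs[arc.ilabel] = backoff_cost + arc.weight.Value();
      }
    }
    backoff_cost += next_cost;
    s = next_state;
  }

  // Step 3: sort by cost (negative log probability) and print the k best.
  std::vector<std::pair<double, StdArc::Label> > best;
  for (std::map<StdArc::Label, double>::const_iterator it = costs.begin();
       it != costs.end(); ++it)
    best.push_back(std::make_pair(it->second, it->first));
  std::sort(best.begin(), best.end());
  for (size_t i = 0; i < best.size() && static_cast<int>(i) < k; ++i)
    std::cout << syms->Find(best[i].second) << " " << best[i].first
              << std::endl;
  delete lm;
  return 0;
}
</verbatim>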
One could try intersecting the model with a lattice representing the prefix followed by all words and some suffix, but then the cost of the suffix would have to be accounted for, which can get messy with weight pushing during composition, so direct implementation is recommended.
brian
I am trying to build a transliteration WFST using this tool. For the pair language model, is it possible to eliminate/reduce epsilon symbols by merging multiple symbols on the input or the output?
Yes, you can augment your EM algorithm to have multiple characters on the input or output side of the pair symbols. That will reduce the overall number of epsilons, though at the cost of potentially reduced coverage, since there will be more atomic symbols and less opportunity to back off in the model. This may be a good or bad thing depending on the amount of training data. It will not, however, eliminate epsilons, since they are still used for backoff, unless you want to try an unsmoothed model, which again will impact coverage. But generally you should think about pair symbol merger in the training data that you give to OpenGrm to build the model, e.g., after EM has been used to provide symbol-by-symbol alignments. Hope that helps!
Thank you for your insights Brian.
> But generally you should think about pair symbol merger in the training data that you give to OpenGrm to build the model, e.g., after EM has been used to provide symbol-by-symbol alignments.
Not sure I understood correctly - are you suggesting the mergers should be done after the alignments (rather than during the alignment with EM?).
There are many ways to do such a symbol merger, but, yes, EM is the main way to do it. We often use EM in 2 stages: first performing single character to single character alignments, then restricting pair symbol vocabulary to only those that can arise from observed neighboring pair symbol merging (along with existing atomic pair symbols) before a second round of EM. If the key goal is epsilon reduction, then a more constrained second round (with fewer possible merged pairs) should yield a smaller symbol set. (And frankly, non-EM heuristic symbol merging methods would likely yield comparable results.)
There is no general utility to do this, but since the model is stored as an FST, you can compose with a transducer to create a model that is encoded with the byte or utf-8 indices. So, for example, the word "abc" consists of the bytes (or utf-8 characters) 97 98 99, and "xyz" is 120 121 122, so a transducer can be constructed as follows (standard text format):
0 1 abc 97
1 2 <epsilon> 98
2 0 <epsilon> 99
0 3 xyz 120
3 4 <epsilon> 121
4 0 <epsilon> 122
0
compiled using the LM symbol table on the input side and no symbol table on the output side. Create separate paths for every word in your LM lexicon, then compose with the LM, project to output side (using fstproject) and this is a model at the appropriate level. The one tricky part here, however, is word-boundary, which is implicit in the words. One way to handle that is to append whitespace (byte 32) at the end of each word, so for example "abc" would be 97 98 99 32. However that would then require whitespace at the end of the full string (sentence). Alternatively, you could create a special end-of-word token </w> that abstracts from the whitespace issue.
But, in any case, not something particularly straightforward to turn into a general utility, so you'll need to roll your own to do this.
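If you do roll your own in C++, here is a rough, untested sketch of the conversion described above (file names are illustrative; it assumes an OpenFst version that still provides SymbolTableIterator and the PROJECT_OUTPUT enum, skips special symbols with a crude check for a leading '<', and ignores the word-boundary issue discussed above):
<verbatim>
// word_lm_to_byte_lm.cc: sketch only.
// Usage: word_lm_to_byte_lm wordlm.fst bytelm.fst
#include <string>

#include <fst/fstlib.h>

using fst::StdArc;
using fst::StdVectorFst;

int main(int argc, char **argv) {
  StdVectorFst *lm = StdVectorFst::Read(argv[1]);
  if (!lm) return 1;
  const fst::SymbolTable *syms = lm->InputSymbols();

  // Build the word-to-byte transducer: one path per word in the LM
  // lexicon, with the word label on the input side of the first arc and
  // its bytes on the output side (cf. the text-format example above).
  StdVectorFst mapper;
  const StdArc::StateId root = mapper.AddState();
  mapper.SetStart(root);
  mapper.SetFinal(root, StdArc::Weight::One());
  for (fst::SymbolTableIterator siter(*syms); !siter.Done(); siter.Next()) {
    const std::string word = siter.Symbol();
    // Skip <epsilon> and special symbols such as <s>, </s>, <unk>.
    if (siter.Value() == 0 || word.empty() || word[0] == '<') continue;
    StdArc::StateId src = root;
    for (size_t i = 0; i < word.size(); ++i) {
      const StdArc::StateId dst =
          (i + 1 == word.size()) ? root : mapper.AddState();
      mapper.AddArc(src, StdArc(i == 0 ? siter.Value() : 0,
                                static_cast<unsigned char>(word[i]),
                                StdArc::Weight::One(), dst));
      src = dst;
    }
  }
  fst::ArcSort(&mapper, fst::ILabelCompare<StdArc>());

  // Compose the word-level model with the mapper and keep the byte side.
  StdVectorFst byte_lm;
  fst::Compose(*lm, mapper, &byte_lm);
  fst::Project(&byte_lm, fst::PROJECT_OUTPUT);  // fstproject equivalent
  byte_lm.Write(argv[2]);
  delete lm;
  return 0;
}
</verbatim>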
hope this helps.
brian
I use Python code in my program together with libfst.so (the C++ library).
I have a problem when I build a library which uses ngram-output.
The .so builds with no problem, but at runtime I get:
"undefined symbol: FLAGS_ngram_error_fatal"
The error points to this line:
#include <ngram/ngram-output.h>
There is no problem when I build without including ngram-output.h, which makes me think the problem is with ngram-output.h.
Thanks.
This sounds like what can happen if you use the header but don't link against the shared library: you need to link to libngram.so once you've built it. Beyond that, I can't say -- I haven't seen this problem before.
Estimating the perplexity of a language model with OOV
I am trying to compute the perplexity of a language model on a test set but I do not understand how I should handle OOV words.
Right now I am using the following set of commands:
<verbatim>
cat train.txt test.txt > tmp
ngramsymbols < tmp > voc.syms
farcompilestrings -symbols=voc.syms -keep_symbols=1 train.fr > train.far
ngramcount -order=3 train.far > train.cnt
ngrammake train.cnt > train.mod
farcompilestrings -symbols=voc.syms -keep_symbols=1 test.fr > test.far
../bin/ngramapply en3gram.mod test.far
</verbatim>
but as soon as the test corpus contains an OOV, the FST returned by the ngramapply command are empty.
What did I do wrong?
Thanks
Hi Guillaume, sorry for the slow turnaround, just getting notified about your question now. Even though you added the test words to the symbol table, the fact that they had no observations in the training set means that they have no ngrams in the resulting model automaton, hence the empty result. What you need to do instead is to build your symbol table from just the training text, and then when you use farcompilestrings on the test section, you need to tell it what the out-of-vocabulary symbol is, with the switch --unknown_symbol="<unk>" (or whatever you make the unknown symbol -- ngramsymbols creates <unk> by default).
Compiling 1.3.4 with openfst 1.6.9 in non-standard location
I am trying to upgrade to the latest versions but don't want to install in /usr/local right now. When I run ./configure --prefix=<mydir> for opengrm after configuring and installing openfst in that location, I get the following error when running make.
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -std=c++11 -MT ngram-context.lo -MD -MP -MF .deps/ngram-context.Tpo -c ngram-context.cc -fPIC -DPIC -o .libs/ngram-context.o
ngram-context.cc: In static member function 'static void ngram::NGramContext::ParseContextInterval(const string&, std::vector<int>*, std::vector<int>*)':
ngram-context.cc:128:3: error: 'SplitString' is not a member of 'fst'
fst::SplitString(line, ":", &contexts, true);
^
ngram-context.cc:131:3: error: 'SplitString' is not a member of 'fst'
fst::SplitString(contexts[0], " ", &labels1, true);
^
ngram-context.cc:132:3: error: 'SplitString' is not a member of 'fst'
fst::SplitString(contexts[1], " ", &labels2, true);
^
ngram-context.cc: In static member function 'static void ngram::NGramExtendedContext::ParseContextIntervals(const string&, int, std::vector<ngram::NGramContext>*)':
ngram-context.cc:234:3: error: 'SplitString' is not a member of 'fst'
fst::SplitString(line.get(), ",", &context_patterns, true);
^
Any idea how to solve this problem? It does seem to be pointing to the correct include directory.
Hi, for anybody still looking for an answer: you need to change the version of OpenFST to a version that has SplitString. You can check which version of OpenFST has SplitString by searching in the directory of each package available at http://www.openfst.org/twiki/bin/view/FST/FstDownload. AFAIK 1.6.6 and above have it. The version of ngram in install_opengrm.sh can remain the same, 1.3.7 (newer versions such as 1.3.10 will show other problems). Good luck
Is there an established way to re-score a plain old fst (say, lattice) with an n-gram model fst? Simple composition would not work as it does not handle back-off epsilon arcs - among other problems.
Yes, the binary command line ngramapply takes an input fst archive (far) and composes each fst in the archive with the given language model. By default, it uses a failure (phi) arc interpretation of backoffs (though you can also choose epsilon if you would like to compare the result). You can look at that code if you want to see how to 'apply' the n-gram model to your lattice in your own C++ code.
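For the "your own C++ code" route, here is a very rough, untested sketch of such a failure-arc composition (this is not the library's actual implementation -- ngram-output.h/.cc handle the details more carefully -- and the file names are illustrative):
<verbatim>
// lattice_rescore.cc: sketch only.
// Usage: lattice_rescore lattice.fst lm.fst out.fst
#include <utility>
#include <vector>

#include <fst/fstlib.h>

using fst::StdArc;
using fst::StdFst;
using fst::StdVectorFst;

int main(int argc, char **argv) {
  StdVectorFst *lattice = StdVectorFst::Read(argv[1]);
  StdVectorFst *lm = StdVectorFst::Read(argv[2]);
  if (!lattice || !lm) return 1;

  // In an OpenGrm model, every arc labeled 0 (<epsilon>) is a backoff arc.
  // Relabel those arcs to an otherwise unused label so they can be matched
  // as failure (phi) transitions during composition.
  const StdArc::Label phi_label = lm->InputSymbols()->AvailableKey();
  std::vector<std::pair<StdArc::Label, StdArc::Label> > pairs;
  pairs.push_back(std::make_pair(0, phi_label));
  fst::Relabel(lm, pairs, pairs);
  fst::ArcSort(lm, fst::ILabelCompare<StdArc>());

  // Compose with a phi matcher on the language-model side.
  typedef fst::PhiMatcher<fst::SortedMatcher<StdFst> > PM;
  fst::ComposeFstOptions<StdArc, PM> copts;
  copts.gc_limit = 0;
  copts.matcher1 = new PM(*lattice, fst::MATCH_NONE);
  copts.matcher2 = new PM(*lm, fst::MATCH_INPUT, phi_label);
  StdVectorFst rescored(fst::ComposeFst<StdArc>(*lattice, *lm, copts));
  rescored.Write(argv[3]);
  delete lattice;
  delete lm;
  return 0;
}
</verbatim>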
convert backoff model from Approximate Offline Representation to Exact Offline Representation
Hi,
The "GRM Library" from AT&T (http://www.openfst.org/twiki/bin/view/Contrib/GrmLibrary) used to support a tool called "grmconvert" to convert backoff model from Approximate Offline Representation to
Exact Offline Representation. That algorithm was described in the paper "Generalized Algorithms for Constructing Statistical Language Models".
Since openGRM is created by the same authors, I am wondering if that same functionality is supported in openGRM.
If not, is there a reason that we don't need that?
Thanks,
Min
Hi Min,
great memory! no, we haven't implemented that algorithm in OpenGrm, but now that you mention it, maybe we should! The resulting automaton would not conform to the 'canonical' format of ngram models in this part of the software library, so it may reside more properly in the SFst part of the library (http://openfst.cs.nyu.edu/twiki/bin/view/GRM/SFstLibrary) which is under development. I've always had a soft spot for that algorithm -- and I guess that makes two of us;-)
Brian
As far as whether or not we need it, the answer seems to be: not usually. If the use of the LM is off-line, e.g., for composition with a pronunciation lexicon and subsequent determinization and weight pushing, then usually the standard epsilon approximation is good enough that the cost of automaton growth required for the exact off-line is not offset with enough benefit in accuracy to warrant it. One common work-around -- if exact LM probabilities are required -- is to produce a lattice, strip off the scores, then apply the language model using failure transitions, which is on-line exact and can be used when composing with a lattice. So, yeah, nice algorithm, somewhat limited utility...
Hi Brian,
Thanks a lot for the explanation, and I do hope you can implement that function someday
The reason I asked about the algorithm is that we recently noticed that in our language model in production (a very big one, trained from a large amount of text data), at least 15% of the existing 4-grams have a higher cost than their corresponding backoff path.
You can argue that maybe our training data is too dirty, and that we were not using the best training parameters such as cutoffs, discounting methods, and pruning threshold. But in my opinion, the "approximate offline representation" will always have this kind of problem, as long as we are using backoff.
15% is a number that is big enough to raise a concern for its impact on accuracy, during the first pass, before lattice rescoring.
>> usually the standard epsilon approximation is good enough that the cost of automaton growth required for the exact off-line is not offset with enough benefit in accuracy to warrant it
It would be great if you could point me to some experimental reports showing the difference in accuracy between exact off-line and approximate off-line model
Anyway, it looks like the "approximate offline model" is the mainstream approach right now, based on all the recognizers I know.
Thanks,
Min
I see. yeah, possibly you have a scenario where it makes a difference. I don't recall any paper offhand that demonstrates the conventional wisdom explicitly, but for a related topic (the use of a lexicographic semiring to ensure the Viterbi best with epsilon is identical to that using failure transitions: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00198), we did show (in section 2.3) that the Viterbi best in a realistic (for the time) speech task did differ between approximate offline and exact online relatively frequently, though this was just a lattice composition experiment. I do not believe the use of this changed WER at all, and this was simply to demonstrate equivalence of the lexicographic semiring encoding and the failure arc encoding wrt. Viterbi. Anyhow, hope that helps.
Error when trying to read ngram FST using python extension
Hi,
I am receiving the following errors when trying to read an FST of type ngram using the python extension:
<verbatim>
ERROR: GenericRegister::GetEntry: /usr/local/lib/fst/ngram-fst.so: undefined symbol: FLAGS_fst_error_fatal
ERROR: Fst::Read: Unknown FST type ngram (arc type = standard): mdl
</verbatim>
I have added /usr/local/lib/fst to LD_LIBRARY_PATH.
Any help would be greatly appreciated!
Thanks in advance,
Arvid
When you installed openfst, did you enable the ngram FST extension? (Note that this is not really an opengrm question, but an openfst question, since the ngram fst type is a compact format that is an extension of openfst, independent of the ngram library.) To do this, when you are installing openfst, you should run configure with the --enable-ngram-fsts switch. That will place ngram-fst.so in your /usr/local/lib/fst, and that should do it. You may need to re-install openfst if you have installed it without that extension and need that FST type.
Sorry, I should have posted this in the openfst forum. Feel free to move this post.
Yes, I ran <verbatim>$ ./configure --enable-ngram-fsts --enable-python --enable-far</verbatim> and I have /usr/local/lib/fst/ngram-fst.so.
And I am able to run <verbatim>fstconvert --fst_type=ngram mdl mdl_ngram</verbatim> without any problems. It only seems to be the python extension that is complaining.
Thanks,
Arvid
This isn't tested behavior, and if it's not working from within Python, I'm not sure how to make it work quickly, short of statically linking ngram-fst.so with pywrapfst.so. Sorry, I'll look into it, but I'm not sure where to begin.
Hi there,
I am trying to remove low-frequency unigrams (cnt == 1) and affected bi-,tri-grams from my 3-gram model. I thought the following command would do that:
But it does nothing.
The "count_pattern" is documented only in the source code, implying it works on decimal counts. Which is a bit confusing, as ngramcount is log-based. A "--method=count_prune" example would be of great help in the documentation.
On a sidenote, my log says:
INFO: State ID: 0; 164 arcs; -log(sum(P)) = -5.58725, should be 0
Well, ngramcount produces raw count FSTs, so I wonder if the count_pattern argument for ngramshrink is recognized at all...
For some reason, the interface is translating your angle brackets into html code, hence it doesn't recognize the <verbatim> </verbatim> command. I edited your comment to get the real angle brackets in there, but don't have a good answer for you moving forward on that.
On the count pruning side, I do have a good answer for you. There are a couple of considerations here. You called the count pruning with the --count_pattern set to 1:2, i.e., prune unigrams with count less than two. There are two problems with this. First, ngramshrink does not prune unigrams, only higher order n-grams. If you want fewer unigrams, you should build a vocabulary and compile your training corpus to map everything else to <UNK>. Second, even if we did prune unigrams, none of them would be pruned in your trigram model, because we do not prune an n-gram if it is either a prefix or a suffix of another n-gram which is left in the model. Since you are not pruning any bigrams or trigrams (your count pattern just specifies unigrams) then all unigrams would be a prefix or suffix of some other n-gram being left in the model. So that's a second reason why nothing would be pruned.
If you want to prune higher order singletons from the model, use a --count_pattern=2+:2 instead. That will prune all bigrams and higher with count less than 2. If you just want to map low frequency unigrams to <UNK>, then I would recommend 1) counting unigrams from your corpus.far; 2) using ngramprint to print the counts; 3) selecting only those words that you want to keep from those counts, and putting those words into a file; 4) using ngramsymbols to generate a symbol table for that subset of words; and 5) recompiling a new corpus.far with that symbol table, so that everything else is mapped to <UNK>. yeah, sort of convoluted, but that should do the trick for you.
brian
Hi Brian,
Thanks for the great answer! Also for editing my comment.
I was kind of hoping that ngramshrink would take care of higher order n-grams.
The problem with --count_pattern=2+:2 is that it does not remove any unigrams, plus in my case it removes 70% of bigrams and 80% of trigrams -- a lot more than originally wanted. The <UNK> approach is closer to my current needs; I may try that.
In the meantime I used pywrapfst to 1) list the unigrams to remove, 2) delete all arcs from the FST with unwanted labels (including higher orders), and 3) ''connect'' the resulting FST. This also seems to work... not thoroughly tested though.
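For reference, the same arc-filtering idea as a small untested C++ sketch (it assumes the label ids of the unwanted unigrams have already been collected; note that the result is no longer a properly normalized n-gram model):
<verbatim>
#include <set>
#include <vector>

#include <fst/fstlib.h>

using fst::StdArc;
using fst::StdVectorFst;

// Delete every arc whose input label is in 'unwanted', then trim the FST.
void RemoveLabels(StdVectorFst *model,
                  const std::set<StdArc::Label> &unwanted) {
  for (fst::StateIterator<StdVectorFst> siter(*model); !siter.Done();
       siter.Next()) {
    const StdArc::StateId s = siter.Value();
    std::vector<StdArc> kept;
    for (fst::ArcIterator<StdVectorFst> aiter(*model, s); !aiter.Done();
         aiter.Next()) {
      if (unwanted.count(aiter.Value().ilabel) == 0)
        kept.push_back(aiter.Value());
    }
    model->DeleteArcs(s);
    for (size_t i = 0; i < kept.size(); ++i) model->AddArc(s, kept[i]);
  }
  fst::Connect(model);  // remove states no longer on a successful path
}
</verbatim>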
gbr
Hi,
I have existing ngrams with their counts, for example use:
ngram.txt:
a 12
a b 22
a b c 34
b 23
b c 10
c 4
I would like to create a language model using opengrm, so I am reading it using ngramread:
ngramread ngram.txt > ngram.mod
however, now, if I do:
ngraminfo ngram.mod
I get:
FATAL: NGramModel: Empty automaton
What am I doing wrong?
Thanks,
Shlomi
Hi Shlomi,
For an FST-based language model, we need to have end-of-string (</s>) counts, which you are missing. By default, the start-of-string and end-of-string symbols are assumed to be <s> and </s> but you can change that with the --start_symbol and --end_symbol flags. For example, in your counts, perhaps "a" is the start-of-string symbol and "c" is end-of-string. Then ngramread --start_symbol="a" --end_symbol="c" ngram.txt >ngram.mod would yield a non-empty automaton. However, since start-of-string is represented in the automaton by a state (the start state) and end-of-string is a final cost, your compiled model will lose the information that "a" and "c" represent those tokens. Those are merely used as a printing convenience, and are not represented explicitly as symbols in the automaton.
As to why the end-of-string counts or probabilities are required to compile the FST, note that the FST itself is an acceptor of strings in the language, and with no end-of-string count, there is no probability mass for ending strings, i.e., finite length strings have zero probability and such an automaton accepts no strings in the language.
To see what the counts should look like, try building a model on the single string "a b c" and then using ngramprint. Hope this helps!
Thanks Brian! Great explanation, I managed to get it to work
I have another question for you: I don't really have the counts of the start/end of these "sentences" (it's sort of irrelevant in our context... or at least we think so). So I am making synthetic <s> and </s> grams in order to build this FST. Now, say I have "a b c" with all of their sub-gram counts, and I am turning it into "<s> a b c </s>". Should I give the synthetic "<s> a" and "c </s>" grams a zero count, or should I treat them as the "a" and "c" counts respectively? I hope I am making sense...
Not sure I have much guidance to provide here. One thing to keep in mind is that the resulting language model automaton is an FST that can be manipulated with openfst. So, you could write the code to open the automaton and modify the costs as needed after it has been built, however best suits your needs. However, probably the method that would introduce the least changes to your idea of what you'd like your model to look like would be to create just a single end-of-string unigram </s> with low count -- not zero, but maybe some epsilon, e.g., 0.001. If you include <s> and </s> as you have in your example, you'll impact the model in many places, whereas a single epsilon count unigram won't change the model much at all.
Because the models are FSTs, we can build an FST to restrict the LM to only strings with the particular prefix. Here's what I did, using a model I had built called earnest.3g.mod.fst, built from our example corpus earnest.txt. For standard ngramrandgen, here's what we get:
ngramrandgen -remove_epsilon --max_sents=3 earnest.3g.mod.fst rg.norestrict.far
farprintstrings rg.norestrict.far
I DON T ALLOW I FEEL THAT MY DEAR UNCLE JACK DOES NOT MR BUNBURY FOR YOUR SHALL PROBABLY NEVER AN AUNT BEING AT THE ARISTOCRACY
YES ANYTHING CAST HATE PEOPLE
THEY SO GRIEVED YES SIR
Now suppose I want to restrict to strings prefixed with the string "THE ROOM". First I will build an FST that accepts all strings from my vocabulary (including with <epsilon>) prefixed by "THE ROOM". Here's a bash/awk sequence of commands to build such a restriction FST:
echo "THE ROOM" |\
while read i; do echo "$i" | wc |\
while read a st c; do echo "$i" |\
awk '{for (i = 1; i <= NF; i++) {printf("%d\t%d\t%s\t%s\n",s,s+1,$i,$i); s++}}' \
>restrict.txt;
cat earnest.syms |\
awk -v ST="${st}" '{printf("%d\t%d\t%s %s\n",ST,ST,$1,$1)}' >>restrict.txt;
echo "$st" >>restrict.txt; done; done
fstcompile \
-isymbols=earnest.syms -keep_isymbols \
-osymbols=earnest.syms -keep_osymbols restrict.txt restrict.fst
If you look at restrict.txt, you'll see it looks something like this:
0 1 THE THE
1 2 ROOM ROOM
2 2 <epsilon> <epsilon>
2 2 MORNING MORNING
2 2 ROOM ROOM
2 2 IN IN
2 2 ALGERNON ALGERNON
...
2
i.e., a loop of words at state 2, which is the final state. This automaton accepts all strings made up of words from the vocabulary, prefixed by "THE ROOM".
Now we can compose our restriction FST restrict.fst with our model, to produce a new model:
fstcompose --compose_filter=null restrict.fst earnest.3g.mod.fst >earnest.restrict.3g.mod.fst
We use compose_filter null so that it treats <epsilon> just like another symbol. Now we can randomly generate from this, and we get:
ngramrandgen -remove_epsilon --max_sents=3 earnest.restrict.3g.mod.fst rg.restrict.far
farprintstrings rg.restrict.far
THE ROOM YOU ARE YOU ARE AS ANY OUT
THE ROOM IS MARRIED TO I I AM THINK MARRIED LADY BLOXHAM THROUGH MOMENT A FIRST CONFESSED TO YOU HAVE WORTHING I MERELY CAME BACK
THE ROOM NEXT WEEK COUNTIES MONEY
So, yeah, no 'easy' command to provide such a prefix, but with some FST processing, you can get what you want.
logp values in NGramUnsmoothed with backoff=true versus false
I am trying to redo ngrammake --method=="unsmoothed" using ngram::NGramUnsmoothed , but I got sightly different results depending on the backoff argument for the constructor.
<verbatim>
ngram::NGramUnsmoothed* ngram = new ngram::NGramUnsmoothed(
&counts,
backoff, /* true | false */
prefix_norm, /* false */
backoff_label,
norm_eps, /* 0.001 */
check_consistency);
</verbatim>
Provided there is a single "a bc" sequence in my toy language, with backoff=true
<verbatim>
p(bc|a)=-0
</verbatim>
but with backoff=false I have
<verbatim>
p(bc|a)=4.260033e-45.
</verbatim>
fstequal returns true, as practically there are no differences between the two models. I may be missing something obvious, but I am puzzled about what backoff means for an unsmoothed ngram, and how it marginally affects ngram probabilities. Is it just an implementation detail I can ignore?
Hi. Each smoothing method can be either backoff (--backoff=true) or interpolated (--backoff=false). If it is backoff, it relies upon the discounted and normalized value of the observed n-grams alone. If it is interpolated, it mixes the higher order estimates with the appropriate mixing factor that ensures normalization after discounting. For each smoothing method, these will result in somewhat different models. For unsmoothed, there is approximately zero probability for the backoff transitions; however, since we don't use infinite cost arcs in these transducers, we use an epsilon probability value instead. For that reason, you get slightly less than probability 1 (i.e., slightly more than cost 0.0, which is the negative log probability) for your observed toy n-gram. Hope that helps!
Dear Brian, very clear explanation, thank you! So --backoff=false means interpolated, interpolated "mixes in" lower order grams by default, and the probability for arcs reaching lower order ngrams happens to be a tad different from zero. All clear.
I have a problem using ngrammake (OpenGRM v1.2.2 with OpenFST v1.5.1). Many of the smoothing methods produce errors, including the unsmoothed method. They all report a NegLogDiff fatal error which doesn't appear to be due to floating point precision, sometimes after some intermediate warnings or errors. My ngram counts are in the range [1, 1803827]. Does anyone have any idea what might be causing these problems?
<verbatim>
ngrammake --method=katz -v=100 counts.grm model.grm
INFO: FstImpl::ReadHeader: source: counts.grm, fst_type: vector, arc_type: standard, version: 2, flags: 3
INFO: Histograms violating Good-Turing assumptions
INFO: Histograms violating Good-Turing assumptions
INFO: Histograms violating Good-Turing assumptions
INFO: Histograms violating Good-Turing assumptions
Count bin Katz discounts Counts (1-grams/2-grams/3-grams)
Count = 1 0.321904/0.338472/0.226025
Count = 2 0.999/0.586256/0.501859
Count = 3 0.999/0.710776/0.654064
Count = 4 0.999/0.776459/0.737903
Count = 5 0.999/0.823323/0.795443
Count > 5 0.999/0.999/0.999
FATAL: NegLogDiff: undefined inf -0.996128
ngrammake --method=witten_bell -v=100 counts.grm model.grm
INFO: FstImpl::ReadHeader: source: counts.grm, fst_type: vector, arc_type: standard, version: 2, flags: 3
INFO: lower order sum less than zero: 21 -0.314675
FATAL: NegLogDiff: undefined 0 -3.14155
ngrammake --method=unsmoothed -v=100 counts.grm model.grm
INFO: FstImpl::ReadHeader: source: counts.grm, fst_type: vector, arc_type: standard, version: 2, flags: 3
INFO: lower order sum less than zero: 3 -inf
INFO: new lower order sum: 3 nan
INFO: lower order sum less than zero: 4 -inf
INFO: new lower order sum: 4 nan
INFO: lower order sum less than zero: 5 -inf
INFO: new lower order sum: 5 nan
<snip: more of similar>
INFO: new lower order sum: 80 nan
INFO: lower order sum less than zero: 88 -inf
INFO: new lower order sum: 88 nan
INFO: lower order sum less than zero: 90 -inf
INFO: new lower order sum: 90 nan
INFO: lower order sum less than zero: 92 -inf
FATAL: NegLogDiff: undefined 0 -inf
ngrammake --method=absolute -v=100 counts.grm model.grm
INFO: FstImpl::ReadHeader: source: counts.grm, fst_type: vector, arc_type: standard, version: 2, flags: 3
Count bin Absolute discounts Counts (1-grams/2-grams/3-grams)
Count = 1 0.374582/0.697101/0.786205
Count > 1 0.374582/0.697101/0.786205
FATAL: NegLogDiff: undefined inf -0.885087
</verbatim>
This problem is occurring only when printing the counts to a text format and reading them back in (following the answer to the "Can ngramcount ignore OOVs?" question); the problem occurs even if <unk>s are not removed.
Given counts1.grm, an FST produced by ngramcount, I would have thought "ngramprint counts1.grm | ngramread - counts2.grm" would result in counts2.grm being identical to counts1.grm, but this isn't true in general. With the earnest.cnts example the new version is only slightly different in terms of file size. For part of my own corpus the difference in file size is much more substantial. Could this difference be due to symbol table changes only, or could the procedure change the FST in other ways?
Without seeing the specifics of your model, I would speculate that when you are filtering your n-grams, there are some prefixes or suffixes of included n-grams that have been pruned. For example, if you leave in the n-gram xyz, then the FST topology needs xy in the model (to be able to reach the history state from the unigram state) and yz in the model (to back off to). These n-grams can be re-introduced into the topology if they are missing, but the count of the n-gram will be zero. Kneser-Ney modifies the counts of lower order n-grams based on the number of higher order n-grams with that as its suffix, so the missing counts are repaired in that method, avoiding the error. I would have thought that your method for filtering would result in prefixes and suffixes being retained, but their absence would explain the behavior you are seeing. It also explains the size differences. There can be small file size difference in some cases, with your round trip, because some operations on the model (pruning, for example) will remove 'useless' states (histories that only backoff), while the reading of n-grams will build those states in the course of building the model topology and will not have removed them. In your case, though, the large differences seem to indicate that many 'needed' ngrams (prefixes and suffixes) are being added. You might try the following round trip to debug: ngramprint counts1.grm >counts1.ngrams.txt; ngramread counts1.ngrams.txt | ngramprint - >counts2.ngrams.txt. Then compare the ngrams that result, which will include the costs. This should show you what ngrams that ngramread had to 'hallucinate' to achieve a canonical ngram topology. Hope that helps, and thanks for bringing up these issues!
The issue occurs even when I don't filter the ngram counts. Here's a demo using nothing but the The Importance of Being Earnest sample. In an empty directory, using OpenGRM 1.2.2 and OpenFST 1.5.1...
<verbatim>
wget http://www.openfst.org/twiki/pub/GRM/NGramQuickTour/earnest.txt
ngramsymbols earnest.txt earnest.syms
farcompilestrings -symbols=earnest.syms -keep_symbols=1 earnest.txt earnest.far
ngramcount -order=5 earnest.far earnest1.cnts
ngramprint earnest1.cnts earnest.cnts.txt
ngramread earnest.cnts.txt earnest2.cnts
ngrammake --method=witten_bell earnest1.cnts earnest1.mod
ngrammake --method=witten_bell earnest2.cnts earnest2.mod
</verbatim>
This all works fine until the final command which reports the error "FATAL: NegLogDiff: undefined 0 -0.606982". Note that earnest.cnts.txt was not altered in any way, we simply ngramprint'ed earnest1.cnts and then ngramread it straight back in. This appears to have the same symptoms as the problem I see with my own corpus. It doesn't look like a numeric precision issue.
I've discovered the problem does not arise if I print the counts in ARPA format:
<verbatim>
wget http://www.openfst.org/twiki/pub/GRM/NGramQuickTour/earnest.txt
ngramsymbols earnest.txt earnest.syms
farcompilestrings -symbols=earnest.syms -keep_symbols=1 earnest.txt earnest.far
ngramcount -order=5 earnest.far earnest1.cnts
ngrammake --method=witten_bell earnest1.cnts earnest1.mod
ngramprint --ARPA earnest1.cnts earnest.cnts.arpa
ngramread --ARPA earnest.cnts.arpa earnest2.cnts
ngrammake --method=witten_bell earnest2.cnts earnest2.mod
ngramprint earnest1.cnts earnest.cnts.txt
ngramread earnest.cnts.txt earnest3.cnts
ngrammake --method=witten_bell earnest3.cnts earnest3.mod
</verbatim>
In the above, earnest1.mod and earnest2.mod are created successfully but the ngrammake for earnest3.mod fails with an error like "FATAL: NegLogDiff: undefined 0 -1.38214".
Stripping <unk>s out of the ARPA format is not as easy but it does work so I have a workaround.
This problem appears to be fixed in the latest version of OpenGRM. Running the reproduction steps above while using OpenGRM v1.3.1 and OpenFST v1.5.3 does not produce an error.
Thanks!
Can ngramcount be used in a way that produces counts only for ngrams that do not contain OOVs/the unknown symbol? Or can ngrams containing OOVs/the unknown symbol be stripped out afterwards?
There is no single option to do this, but it is absolutely possible by: 1) producing counts which contain the <UNK> symbol (see response just below); 2) removing ngrams with <UNK>; and 3) making the model from the resulting counts. To make this happen, you can print your counts to a text file, remove n-grams with <UNK> using grep, recompile to counts, then proceed. I did this whole process as follows:
head -1000 earnest.syms >earnest-1k.syms
echo -e "<UNK>\t1000" >>earnest-1k.syms
farcompilestrings -symbols=earnest-1k.syms -unknown_symbol="<UNK>" -keep_symbols=1 earnest.txt >earnest.far
ngramcount --order=3 earnest.far >ngcounts.3g.fst
ngramprint --integers ngcounts.3g.fst | grep -v "<UNK>" >ngcounts.3g-nounk.txt
ngramread ngcounts.3g-nounk.txt >ngcounts.3g-nounk.fst
ngrammake ngcounts.3g-nounk.fst >nglm.3g-nounk.fst
hope that helps.
IRSTLM supports LM filtering, i.e. restricting the vocabulary of an LM to a target word list. For example <verbatim>compile-lm train.lm filtered.lm --filter=list</verbatim>. What is the best way to achieve this using OpenGRM?
You can absolutely limit the vocabulary by specifying the FST format symbol table using the --symbols option in farcompilestrings. If the vocabulary does not cover every symbol in the corpus, you'll have to specify an unknown symbol (via the --unknown_symbol command line option), e.g,. <UNK>, which every out-of-vocabulary item will be mapped to. Once the strings in the corpus have been mapped into the FAR file using farcompilestrings, OpenGrm just treats <UNK> as another word.
Thanks Brian. I was actually thinking about filtering an existing LM. IRSTLM allows one to estimate an LM on a corpus with large vocabulary X then later filter the large LM to smaller vocabulary Y. This avoids needing to recompute ngram counts.
Hey Daniel, oh, ok, I misunderstood. Since the LM is an FST, you can 'filter' the LM by composing it with a single-state automaton that includes a looping arc for each symbol you want to retain. You can specify this in text as follows:
0 0 just just
0 0 these these
0 0 words words
0
Then use fstcompile to produce your filtering automaton. If you compose this with your LM FST and then use fstconnect (to prune out any paths that don't lead to a final state), you'll have an LM restricted to that vocabulary. It will not, however, renormalize the model; it will simply give the original probabilities. If you need to renormalize, then I would suggest using the above method of printing the counts, stripping out what you don't want and recompiling. Hope that helps!
Just a question: the output of the composition is a plain FST, not an ngram model (i.e. the input/output matchers are missing). Is there a way on the command line to convert it to an ngram model? ngramread results in a segfault.
Hi,
In my training corpus, I have paragraphs, with many sentences following each other. I would like to be able to keep the context across these sentences.
It would be equivalent to computing an ngram model on sentences formed with this pattern: <s> (word+ <s>). And here I would have only one symbol for start/end of sentence (so no </s>).
Is it possible to do this with OpenGrm?
Best regards.
Basically this unique start/end of sentence symbol should behave like any other word, except that it should also be a start and end state of the fst I think.
I thought about having a different symbol explicitly in my text like <period>, but then <s> and </s> would not get the right ngram counts.
Hi,
Haven't really done this before, but should be possible, though you'll need to do some FST modification after training, so it requires a bit of FST know-how. Yes, having a single long string with a sentence delimiter token is the right way to go, something like <period>. Pad the beginning and end of your corpus with k <period> tokens, where k is the ngram order of your model. Once the n-gram model is built with OpenGrm, you'll want to make the bigram <period> state the start state, either by explicitly doing this or by replacing all arcs leaving the start state (including backoff) with a single (free) epsilon arc to the bigram <period> state. Note that the final cost at that state will be quite high, since you only really end the string at the end of the corpus, which makes the model a bit weird, so maybe pad any string you are scoring with multiple <period> tokens to get the right exit probability. As for replacing the arc, a quick way to do this is to use fstprint to create a textual representation of the ngram model. The first printed arcs will be leaving the start state. The epsilon arc from there points to the unigram state, and the <period> arc from the unigram state will point to the bigram state that you want to be your new real start state. Then just get rid of all the start state arcs and add just one with epsilons to the desired target state, all in the textual representation. Then use fstcompile to compile the model. Quite a bit of a workaround, but should give you the functionality you want.
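For anyone attempting this, here is a minimal, untested C++ sketch of that start-state surgery done directly on the binary model (file names are illustrative, the <period> symbol is from this example, and it assumes the standard OpenGrm topology where the backoff arc out of the start state, labeled 0, leads to the unigram state):
<verbatim>
// period_start.cc: sketch only.
// Usage: period_start in.mod out.mod
#include <fst/fstlib.h>

using fst::StdArc;
using fst::StdVectorFst;

int main(int argc, char **argv) {
  StdVectorFst *lm = StdVectorFst::Read(argv[1]);
  if (!lm) return 1;
  const StdArc::Label period = lm->InputSymbols()->Find("<period>");

  // Follow the backoff (label 0) arc from the start state to the unigram
  // state, then the <period> arc from there to the bigram <period> state.
  StdArc::StateId unigram = fst::kNoStateId;
  StdArc::StateId target = fst::kNoStateId;
  for (fst::ArcIterator<StdVectorFst> aiter(*lm, lm->Start()); !aiter.Done();
       aiter.Next())
    if (aiter.Value().ilabel == 0) unigram = aiter.Value().nextstate;
  if (unigram == fst::kNoStateId) return 1;
  for (fst::ArcIterator<StdVectorFst> aiter(*lm, unigram); !aiter.Done();
       aiter.Next())
    if (aiter.Value().ilabel == period) target = aiter.Value().nextstate;
  if (target == fst::kNoStateId) return 1;

  // Replace all arcs leaving the old start state with one free epsilon arc
  // to the bigram <period> state.
  lm->DeleteArcs(lm->Start());
  lm->AddArc(lm->Start(), StdArc(0, 0, StdArc::Weight::One(), target));
  lm->Write(argv[2]);
  delete lm;
  return 0;
}
</verbatim>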
Thanks. It's a bit convoluted, I hoped for something simpler! So there is no way of specifying the sos/eos words when building the ngram model? If I could set both of them to be <period> that would solve my problem.
If I wanted to modify the source code to do that, where should I look at?
sos/eos are not words/symbols in the FST model. sos is represented by the start state, and eos is represented by the final cost. You can get closure (representing many strings in sequence) by using fstclosure, which will simply create epsilon arcs from final states back to the start state. But that won't capture cross sentence dependencies (i.e., P(the | end <period>)) which is what I think you want. In essence, these are no longer sos/eos, because the 'string' that you are modeling is the paragraph as a whole. Actually, now that I think of it, if it is the paragraph that you want to model, the sos/eos would just be start-of-paragraph and end-of-paragraph and you could simply concatenate the sentences together with a <period> delimiter and that would be fine. The probabilities at start-of-paragraph would not be the same as at the start of a sentence (which is fine, because there is no previous sentence in the paragraph to condition the probability on), and the probability of end-of-paragraph would be a separate prediction in addition to end of sentence. But that would be a proper 'paragraph' granularity model that should just work. If you want start of paragraph to be the same as the bigram start of sentence history, that's when all the FST hacking starts, but it isn't obvious to me that this is absolutely needed. As far as changing the way start/stop of string works in the library, the concept of start state encoding <s> and final cost encoding </s> is pretty central to the whole thing, might as well rewrite from scratch. hope that helps.
Thanks for your help, I see the issue.
The use would be to score an arbitrary number of sentences in a row without the concept of paragraph. So I'd like just to model the transition between these sentences, to have the right sos/eos context not only at the beginning and end of this batch, but also across the sentences.
For example the score of "<period> Hi <period> How are you <period>" should be higher than "<period> Bye <period> How are you <period>".
I used SRILM in the past, which allows you not to have counts for <s> and </s>, and I wanted to switch to something a bit more modern, but maybe OpenGrm is not adapted to my use.
I thought about this a bit more, and I think that I won't always have sentences starting and ending with <period> at test time, so I can't actually require <period> as sos/eos.
In SRILM there are flags -no-sos and -no-eos, which allow you not to generate counts for <s> and </s>. Then if they are in your vocabulary, they appear in the LM only as unigrams (with some smoothing probability), so when you score, all the words will use the same backoff prob for <s> and </s>, and it will be as if they are not there.
Do you think it would be easy to add such flags?
Not really. The library is based on OpenFst, so has a weighted finite-state transducer (WFST) mechanism underlying it. In a WFST, inference is made over sequences, which naturally include a start and stop to the sequence. This is unlike many LM uses, e.g., using it to query the probability of a specific n-gram, as is done in many machine translation scenarios. Here one doesn't query the probability of an n-gram, rather the probability of a complete string, including ending the string. You can employ some of the FST tricks that I mentioned earlier -- you could even have every state have 0 final cost, so that ending the string does not contribute anymore to the probability. And people do use such transformations on models once they've been produced in OpenFst format, and that's great. But at that point it's no longer a normalized model, so many of the checks we have in place throughout the library to ensure proper normalization would complain, etc. So not easy to add flags. But feel free to alter the FST structure of the model after estimation to fit your needs, that shouldn't be too much trouble.
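As a concrete example of the kind of post-hoc FST modification mentioned above, here is a tiny untested sketch that gives every state a zero final cost (after which, as noted, the model is no longer normalized; file names are illustrative):
<verbatim>
// zero_final_costs.cc: sketch only.
// Usage: zero_final_costs in.mod out.mod
#include <fst/fstlib.h>

using fst::StdArc;
using fst::StdVectorFst;

int main(int argc, char **argv) {
  StdVectorFst *lm = StdVectorFst::Read(argv[1]);
  if (!lm) return 1;
  // Make every state final with cost 0, so that ending the string at any
  // point adds nothing to the score.
  for (fst::StateIterator<StdVectorFst> siter(*lm); !siter.Done();
       siter.Next())
    lm->SetFinal(siter.Value(), StdArc::Weight::One());
  lm->Write(argv[2]);
  delete lm;
  return 0;
}
</verbatim>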
Hi, I should install OpenGRM Ngram as one of the dependencies for the gramophone tool (http://kaskade.dwds.de/~moocow/gramophone/README.txt) which I'd like to try out. I have installed the OpenFst tool which was also on the list of dependencies. Then I tried to install Ngram 1.1.0 (this version was tested with gramophone). I ran the configure command:
<verbatim>
$./configure CXX="clang++ -std=c++11 -stdlib=libstdc++"
</verbatim>
and it didn't yield any errors, so I continued with the make command and got the following error:
<verbatim>
/bin/sh ../../libtool --tag=CXX --mode=link clang++ -std=c++11 -stdlib=libstdc++ -g -O2 -L/usr/local/lib/fst -lfstfar -lfst -lm -ldl -o ngramapply ngramapply.o ../lib/libngram.la
libtool: link: clang++ -std=c++11 -stdlib=libstdc++ -g -O2 -o .libs/ngramapply ngramapply.o -Wl,-bind_at_load -L/usr/local/lib/fst -lfstfar -lfst -lm -ldl ../lib/.libs/libngram.dylib
ld: warning: directory not found for option '-L/usr/local/lib/fst'
ld: library not found for -lfstfar
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[3]: *** [ngramapply] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
</verbatim>
I have spent many hours trying to resolve this problem, but with no success. I hope someone here can help me. Thank you!
It looks like it can't find the FAR extension to OpenFst. When you install OpenFst, you need to configure with --enable-far so that this particular extension is enabled.
Thank you for your answer, Brian. I compiled the OpenFST 1.5.2 (read somewhere that python bindings are supported in this version) with this configuration:
./configure --enable-pdt --enable-bin --enable-ngram-fsts --enable-far=true --enable-python. I am trying to continue my Ngram compilation, indicating the include path with fst headers (otherwise they are not found at all): ./configure CPPFLAGS="-I../openfst-1.5.2/src/include/". Now the headers are kinda found but looks like they cannot be compiled...
checking fst/fst.h usability... no
checking fst/fst.h presence... yes
configure: WARNING: fst/fst.h: present but cannot be compiled
configure: WARNING: fst/fst.h: check for missing prerequisite headers?
configure: WARNING: fst/fst.h: see the Autoconf documentation
configure: WARNING: fst/fst.h: section "Present But Cannot Be Compiled"
configure: WARNING: fst/fst.h: proceeding with the compiler's result
configure: WARNING: ## ------------------------------------ ##
configure: WARNING: ## Report this to ngram@www.opengrm.org ##
configure: WARNING: ## ------------------------------------ ##
checking for fst/fst.h... no
configure: error: fst/fst.h header not found.
The autotools' page ( https://autotools.io/autoconf/finding.html ) says it has something to do with the incompatibility of the C dialects.. Could it be true? and if so, which C dialect should I use?
yeah, latest OpenGrm NGram library was released a couple of months ago, built to be compatible with OpenFst 1.5.1, and we haven't got the 1.5.2 compatible version out yet (working on it). If you are really interested in the ngram library, you'll have to work with 1.5.1 for the next week or two. We are planning a fairly large release quite soon, so keep an eye peeled for that. But in the meantime, you can access OpenFst version 1.5.1 on openfst.org.
I installed the 1.5.1 (at least it compiled without errors), but I am still getting the same "Present But Cannot Be Compiled" error for fst.h when running ./configure CPPFLAGS="-I../openfst-1.5.1/src/include/"
Sorry, I missed that you were installing NGram 1.1.0. The new openfst has been ported to C++11, so, yeah, this is the issue. you'll need to either (1) include a newer version of opengrm or (2) include an older version of openfst. Looks like openfst 1.3.4 was the last version that is not C++11.
Hi Brian, thank you for pointing out this version story )) I followed your first advice: I got the Ngram 1.2.2 and looks like it compiled just fine! Now I will struggle with the pyfst python binding.
Thank you for being so helpful and responsive.
Hi all,
I built G.fst in ngram format on x86 and ran it on x86; it works fine.
I built G.fst in ngram format on ARM and ran it on ARM; it throws a bad_alloc exception.
If I build G.fst in vector format on ARM and run it on ARM, it works fine.
How can I solve the problem of loading the ngram-format FST? Thank you.
I haven't played with that particular processor combination, but it is often the case that binary format FSTs built on one system architecture won't be readable in another. The easy workaround is to use fstprint (from the openfst library) to print the FST to a text representation on x86, then use fstcompile (also from the openfst library) on arm to get the binary format on that side. Keep in mind that you'll need the symbol table, too, so fstprint has flags to save_isymbols to a file; and fstcompile allows you to specify the isymbols and osymbols (both are the same in the case of the ngram models, what you saved from fstprint) and also that you want to keep_isymbols and keep_osymbols with the model. That would result in something that is in standard opengrm FST format.
Hi, I want to run my code (below) :
<verbatim>
using namespace ngram;
using namespace fst;
using namespace std;
int main()
{
cout<<"\n Hello Wordl" <<endl;
// ngramcount -order=5 in.fst > out.fst
NGramCounter<Log64Weight> ngram_counter(3);
StdMutableFst *fst = StdMutableFst::Read("in.fst", true);
ngram_counter.Count(*fst);
VectorFst<StdArc> ofst;
ngram_counter.GetFst(&ofst);
ofst.Write("out.fst");
return(0);
}
</verbatim>
but it gives me the error:
<verbatim>In function `fst::CompatProperties(unsigned long, unsigned long)':
...
undefined reference to `fst::PropertyNames' undefined reference to 'fst::PropertyNames'</verbatim>.
I think that it is a linking error, but I can't find a solution.
Thanks.
Not sure how much help we can be in these cases; this depends on the OS, the way in which you installed openfst, and many other factors local to your setup. It certainly isn't finding something, and you are likely correct that this is a linking error. I would suggest double checking that your library linking paths are correct, and perhaps reinstalling openfst and opengrm with as few divergences from the standard installation as possible. Not sure that helps. Good luck.
Hi is it possible to create class-based ngrams with opengrm? I have been creating them in SRILM and then converting them, but I was wondering if there is a method for doing this in opengrm.
yes, you can build class-based models with opengrm, though you have to build a second transducer that maps from words to classes. (See section 5 of this paper: http://www.aclweb.org/anthology/P/P03/P03-1006.pdf.) For example, if you are using POS-tags, then you would have a single state transducer T mapping from words to POS-tags, each arc having the word on the input side and the POS-tag on the output side, weighted with the cost -log P(word | tag). Then train a standard ngram model on POS strings, which gives you a WFST G encoding the sequential dependencies of the classes. Finally, compose T and G, and this is your class-based language model. The paper above contains information on more general cases, such as classes including multiple words. While there is no single command to run and produce a class-based language model, the WFST encoding of the models makes this quite straightforward by using composition. Hope that helps!
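For concreteness, a minimal untested sketch of that final composition step, assuming T.fst (words to classes, weighted with -log P(word | class)) and G.fst (the OpenGrm model trained on class strings) have already been built; file names are illustrative:
<verbatim>
// class_lm.cc: sketch only.
// Usage: class_lm T.fst G.fst classlm.fst
#include <fst/fstlib.h>

using fst::StdArc;
using fst::StdVectorFst;

int main(int argc, char **argv) {
  StdVectorFst *t = StdVectorFst::Read(argv[1]);  // words -> classes
  StdVectorFst *g = StdVectorFst::Read(argv[2]);  // ngram model over classes
  if (!t || !g) return 1;

  // Composition requires one side to be sorted on the shared (class) labels.
  fst::ArcSort(t, fst::OLabelCompare<StdArc>());
  StdVectorFst class_lm;
  fst::Compose(*t, *g, &class_lm);  // word-level, class-based language model
  class_lm.Write(argv[3]);
  delete t;
  delete g;
  return 0;
}
</verbatim>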
brian
Outputs of fstprint and of ngramprint | ngramread | fstprint differ
Why is it that the above two command sequences result in very different outputs for automata created using ngramcount | ngrammake? When using an n-gram order of 1, the first one has just one state (and numerous arcs), the second one has numerous states (half as many as arcs).
Is there a way to print and read back the n-gram model without changing it?
Thanks! Géza
Hi Géza,
Short answer: use the --ARPA flag to produce ARPA format models, which do a better job of that sort of round-trip conversion. The other format is more for inspection -- for one thing, the backoff information is elided unless explicitly asked for. But to answer your question about what's going on here, for the single state unigram model, it seems that ngramread is constructing the ngram automaton so that each unigram points to a bigram state, but since there are no bigrams, that bigram state backs off (with probability 1, cost 0) to the unigram state. So you get the unigram arc, a bigram state and the backoff arc (hence 2x the number of arcs as states). This is technically a correct topology for the model, but not the most compact that it could be. If you are only using a unigram, you could perform epsilon removal (fstrmepsilon) after ngramread and you would get what you want. For higher order ngram models, that's not going to work. There I would suggest the --ARPA flag to use that format.
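Concretely, the round trip would be (file names assumed):
<verbatim>
ngramprint --ARPA model.mod > model.arpa
ngramread --ARPA model.arpa > model.roundtrip.mod

# unigram models only: collapse the redundant bigram states after reading back
fstrmepsilon model.roundtrip.mod > model.compact.mod
</verbatim>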
Hi there! Is there a command to remove arcs that have the same input symbol as another arc in the WFST model, but a different output symbol and a higher cost? Thanks!
Background: I created a transducer from n-grams. But the resulting model contains multiple different outputs for some input symbol sequences, and so it cannot be determinized. Just removing input strings with multiple output sequences from the training data unfortunately doesn't solve this.
Hi Geza,
no, there is no command to do this, but it is related to some work we did using the lexicographic semiring to combine output labels with standard costs in a new, composite cost, which then allows determinization to remove higher cost arcs with the same input. (See http://www.mitpressjournals.org/doi/abs/10.1162/COLI_a_00198) Within a language model, however, such determinization may be costly, depending on the amount of ambiguity, given that it's a cyclic automaton -- in fact, under certain circumstances it may not be determinizable. If you perform the determinization after composing the model with inputs -- on a lattice of possible outputs -- then things should work out.
not sure if this helps.
brian
Hi Brian,
Thanks for your answer! I realize that my question was naive anyway, as of course you cannot do this on the level on n-grams, and you do need that ambiguity to find the best path.
Thanks for this paper! The approach you mention is available in C++ only, right? Is there perhaps a description available somewhere that focuses on how to use it (technical details)?
Thanks! Géza
Hello,
I am trying to install OpenGrm on Ubuntu while linking it to a nonstandard OpenFST installation location. I ran configure with the following options:
./configure LDFLAGS=-L/proj/speech/tools/phonetisaurus/3rdparty/openfst-1.3.2/installation/include CPPFLAGS=-I/proj/speech/tools/phonetisaurus/3rdparty/openfst-1.3.2/installation/include --prefix=/proj/speech/tools/phonetisaurus/3rdparty/opengrm-ngram-1.0.3/installation/
Then I noticed that the compilation was still looking for OpenFST stuff in the standard install location, so I saw that I had to change AM_LDFLAGS in src/bin/Makefile to the following:
AM_LDFLAGS = -L/proj/speech/tools/phonetisaurus/3rdparty/openfst-1.3.2/installation/lib/fst -lfstfar -lfst -lm -ldl
When I run make, I still get what appears to be a linking error. The output is very large so you can see it here:
http://pastebin.com/1ms42A2b
Am I missing any flags to link to OpenFST? Any pointers would be greatly appreciated.
One thing you might try is changing:
libngram_la_LDFLAGS = -version-info 1:0:0
in src/lib/Makefile.am to
libngram_la_LDFLAGS = -version-info 0:1:0 -lfst
prior to configure to make FST linking explicit. (h/t to Giulio Paci for this.)
Hi
I would like to write an Android app that does word prediction and completion as you type, using an n-gram language model.
I find a lot of references on google on how to build language models and evaluate their perplexity (srilm, mitlm, etc. and here, openfst and opengrm-ngram), but not a lot on how to apply them to a word prediction problem in practice. I am completely new to wfsts. Is there perhaps a standard recipe for word prediction using openfst and opengrm-ngram?
From what I can gather, it seems that you must build the language model fst, then compose it with an fst that is built from the seen word sequence up to the point in question, and finally do a shortest path search. Perhaps something along the lines of the following?
<verbatim>
fstcompose words.fsa lm.fst | fstshortestpath > words.fst
</verbatim>
where lm.fst is hopefully a language model fst that can be created by opengrm-ngram, as described in http://www.openfst.org/twiki/bin/view/GRM/NGramQuickTour?
How would words.fsa be created?
How would the resulting words.fst be processed to obtain the n-best predicted words? Perhaps using something like the following?
<verbatim>
fstproject --project_output
</verbatim>
Regarding Android, I see that they have packaged openfst for the platform. Is a language model fst created by opengrm-ngram completely "backwards-compatible" with openfst, so that I can load the model fst and do composition and shortest path search with only the openfst API in Android?
Thanks in advance!
Hi,
there are some wrinkles to using a language model (LM) FST in the way you describe. But as to your last question: the language models that are produced are simply FSTs in the OpenFst format, so you should be able to use them via the OpenFst API on Android. The problem with the fstcompose approach that you detail is that the opengrm LM FST provides probabilities over whole sequences, not just the next word. One approach to building words.fsa would be to make it represent abc(sigma), where abc is your history and sigma ranges over your possible next words. But the probability you would derive for each x would be for the string abcx, where x is the last word in the string; it does not include the probability of continuations such as abcxy. Instead, to get the right probability over all possible continuations, you want words.fsa to represent abc(sigma)*. That would give you the right probability mass, but unfortunately it would be expensive to compute, especially after each word is entered. The better option is to use the C++ library functionality to walk the model with your history, find the corresponding state in the model (the model is deterministic for each input when treating backoff arcs as failures), then collect the probabilities for all words from that state, following backoff arcs correctly to achieve appropriate smoothing. This requires being very aware of exactly how to walk the model and collect probabilities for all possible following words, then finding the most likely ones, etc., efficiently. It is possible, but non-trivial. As with other OpenFst functionality, it is there for you to create your own functions, but there isn't much in the way of hand-holding to get you there. Very interesting application, though; by the end you'd understand the FST n-gram model topology well and be able to build other interesting applications with it. Hope that answers at least some of your questions.
Brian
Is there a path similar to that in the quick tour (http://openfst.cs.nyu.edu/twiki/bin/view/GRM/NGramQuickTour#OpenGrm_NGram_Library_Quick_Tour) for character n-gram models?
I have a trivial corpus that works fine with the instructions described there for a word n-gram model, but I can't figure out how to use ngramcount to build a character model.
I'm trying (with a 4-line corpus file titled 'animals.txt'):
farcompilestrings -token_type=utf8 -keep_symbols=1 animals.txt >animals.char.far
ngramcount -order=5 animals.char.far >animals.cnt
The 'farcompilestrings' command appears to work, but ngramcount fails with the error "ERROR: None of the input FSTs had a symbol table". I've tried with '-keep_symbols=0' and without any '-keep_symbols' argument, and the results seem to be the same.
I tried creating a symbol file with the UTF-8 characters in my 'corpus', but farcompilestrings doesn't like the space symbol in that file (it reports 'ERROR: SymbolTable::ReadText: Bad number of columns (1), file = animals.char.syms, line = 5:<>').
It seems like there must be an option to tell ngramcount to use bytes or UTF-8 characters as symbols (analogous to farcompilestrings '-token_type=utf8'), but I haven't found it.
Thanks in advance for any suggestions.
Hi Aaron,
ngramcount wants an explicit symbol table, which is why the utf8 token_type for farcompilestrings isn't working. And, correct, farcompilestrings uses whitespace as the symbol delimiter, so representing spaces is a problem. Agreed, it would be nice to have that option in ngramcount, perhaps in a subsequent version... In the meantime, we generally use underscore as a proxy for space for these sorts of LMs. So, convert whitespace to underscore, then whitespace delimit and run as with a standard corpus. Then you just have the chore of converting to/from underscore when using the model. As an aside, you might find Witten-Bell to be a good smoothing method for character-based LMs, or any scenario with a relatively small vocabulary and a large number of observations. You can set the witten_bell_k switch to be above 10, and that should give you better regularization. Hope that helps.
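A rough sketch of that recipe (untested; file names are placeholders and the character-splitting sed step assumes a UTF-8 locale):
<verbatim>
# replace spaces with "_", then put a space between every pair of characters
sed -e 's/ /_/g' -e 's/./& /g' animals.txt > animals.chars.txt

ngramsymbols animals.chars.txt > animals.syms
farcompilestrings --symbols=animals.syms --keep_symbols=1 animals.chars.txt > animals.char.far
ngramcount --order=5 animals.char.far > animals.cnt
ngrammake --method=witten_bell --witten_bell_k=15 animals.cnt > animals.mod
</verbatim>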
brian
Hi Aaron,
For step 3, remember to include 0 as the integer label for epsilon (e.g. <epsilon>) and to choose something other than an actual tab character for the symbol corresponding to the UTF-8 integer value for tab.
Cyril
I get the following error when I execute the make command for the OpenGrm installation:
/usr/local/include/fst/test-properties.h:236: error: undefined reference to 'FLAGS_fst_verify_properties'
/usr/local/include/fst/test-properties.h:236: error: undefined reference to 'FLAGS_fst_verify_properties'
/usr/local/include/fst/test-properties.h:236: error: undefined reference to 'FLAGS_fst_verify_properties'
collect2: ld returned 1 exit status
make[3]: * [ngramapply] Error 1
make[3]: Leaving directory `/home/abhiraj/NLP/opengrm-ngram-1.0.3/src/bin'
make[2]: * [all-recursive] Error 1
make[2]: Leaving directory `/home/abhiraj/NLP/opengrm-ngram-1.0.3/src'
make[1]: * [all-recursive] Error 1
make[1]: Leaving directory `/home/abhiraj/NLP/opengrm-ngram-1.0.3'
make: * [all] Error 2
I have reinstalled OpenFst multiple times and still can't figure out what needs to be done.
I'd really appreciate some help
It looks like ld is not finding the OpenFst library shared objects. It would be great if you could show us the ld command line that make was trying to run at that point. Also, could you tell us which configure/make commands and parameters you used to install OpenFst and to compile OpenGrm? Finally, which OS are you using?
Negative weights associated with epsilon transitions
I understand that the transition weights in the ngram model are negative log probabilities (correct me if I am wrong).
But I can see that the epsilon transitions also take negative weights. What do they actually represent?
epsilon transitions are for backoff, and the weights on those arcs are negative log alpha, where alpha is the backoff constant that ensures normalization over all words leaving that state. Ideally, those backoff transitions are interpreted as failure transitions, i.e., only traversed if the symbol does not label a transition leaving the state. See the ngramapply command for an example of the use of a phi matcher to handle failure transitions.
Brian
Constraining the minimum length of sentences in random generation
While sampling sentences using ngramrandgen, is there an option to constrain the minimum length of the sentences generated, just as we have for the maximum length?
Sorry, no, we don't have that option. The solution is to oversample, then throw away strings with length less than your minimum using grep or something like that.
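Something along these lines, for example (flag names from memory, so check --help; the model name is a placeholder):
<verbatim>
# oversample, print the sampled strings, keep only those with at least 5 words
ngramrandgen --max_sents=1000 earnest.mod samples.far
farprintstrings samples.far | awk 'NF >= 5' > samples.min5.txt
</verbatim>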
Hi,
To build an automaton that has valid paths, you need to specify some kind of termination cost, which is what the final symbol allows for. So, you can't really do without having an entry for end-of-string in your raw n-gram counts; otherwise, it will represent this as an automaton that accepts no strings (empty) since no string can ever terminate. If you have a large set of counts without start or stop symbols, you should just add a unigram of the end-of-string symbol with some count (maybe just 1). The start symbol you can do without (though it is often an important context to include). Hope that helps.
brian
I have been using the ngramapply and ngramperplexity tools. I am wondering if there is an option to specify that the final (end-of-string) state should not be used in the calculation of the logprob.
Hi,
if you run it in verbose mode (ngramperplexity --v=1 as shown in the quick tour) then you can see the -log probability of each item in the string. You can then read this text into your own code to calculate perplexity in whatever way you wish. However, typically the end of string marker is included in standard perplexity calculations, and should be included if you are comparing with other results. No options to exclude it in the library. Hope that helps.
brian
Hi,
Thank you for the hint. Could you tell me why I get different perplexity values when running the ngramapply and ngramperplexity tools (in verbose mode) with the same data/parameters?
ngramapply only applies the LM weights to your input token sequence; it does not compute perplexity.
The total weight output by ngramapply for each string should be equal to the logprob (base 10) generated in the verbose output of ngramperplexity (but note that the ngramapply value is a negative log using base e).
If you want to convince yourself you can try this:
#Apply the model to your test data
$ ngramapply toy.mod test.far test.scored.far
#Extract the individual results
$ farextract --filename_prefix=tst test.scored.far
#Print out one of the sentences, which now have individual arc weights
$ fstprint tsttest.sent-1
0 1 a a 0.750776291
1 2 b b 1.20206916
2 3 a a 1.43548453
3 4 a a 0.864784181
4 5 a a 0.864784181
5 1.27910674
#Add all the values up and compute the perplexity in the normal manner
$ python
>>> import math
>>> -1.*(0.750776291+1.20206916+1.43548453+0.864784181+0.864784181+1.27910674)/math.log(10)
-2.778184008253953
>>> math.pow(10, ((0.750776291+1.20206916+1.43548453+0.864784181+0.864784181+1.27910674)/math.log(10))/(5+1))
2.9042277315216354
Those values should match whatever you get out of ngramperplexity (again careful about the log base) in verbose mode for each individual sentence.
In my toy example case above:
$ ngramperplexity --v=1 toy.mod test.far
WARNING: OOV probability = 0; OOVs will be ignored in perplexity calculation
a b a a a
ngram -logprob
N-gram probability found (base10)
p( a | <s> ) = [2gram] 0.326058
p( b | a ...) = [2gram] 0.522052
p( a | b ...) = [1gram] 0.623423
p( a | a ...) = [2gram] 0.375571
p( a | a ...) = [2gram] 0.375571
p( </s> | a ...) = [2gram] 0.555509
1 sentences, 5 words, 0 OOVs
logprob(base 10)= -2.77818; perplexity = 2.90423
I have been experimenting a bit with the new ngrammarginalize tool - very cool. I noticed that in some cases it can get stuck in the do{}while loop in the StationaryStateProbs function.
I trained a model using a small amount of data and witten_bell smoothing + pruning. It may be worth noting that the data is very noisy.
I have been digging around in the paper and source code a bit to figure out what might be the issue. The issue seems to be here:
if (fabs((*probs)[st] - last_probs[st]) > converge_eps * last_probs[st] )
where last_probs[st] can occasionally be a very small negative value. If I am following the paper correctly, this corresponds to the epsilon value mentioned at the end of Section 4.2. (I'm not entirely convinced I followed correctly to this point though so please correct me if I'm wrong).
Anyway, if last_probs[st] turns out to be negative, for whatever reason, then there is a tendency to get stuck in this block. This also seems to be affected by the fact that the comparison delta is computed relative to last_probs[st] , so if last_probs[st] happens to get smaller with each iteration, then the comparison also becomes more 'sensitive', in the sense that a smaller absolute difference will still evaluate to 'true'. So as we iterate, it seems that some values become more sensitive to noise (maybe this is the numerical error you mention?) and the process gets stuck.
I noticed that if I set the comparison to an absolute:
if (fabs( fabs((*probs)[st]) - fabs(last_probs[st])) > converge_eps )
then it always terminates, and does so quickly for each iteration, even for very small values of converge_eps.
I have not convinced myself that this is a theoretically acceptable solution, nor have I validated the resulting models, but it does seem reasonable at a superficial level.
I would definitely appreciate some feedback if available.
EDIT: this second issue may be ignored. It was my fault.
Also, I notice that I am still getting 'unplugged' holes in some of my pruned models when I run ngramread after pruning with SRILM.
Specifically, I prune with SRILM, then run ngramread, and it terminates without issue. If I run the info command I get:
$ ngraminfo l2p5k.train.p5.mod
FATAL: NGramModel: bad ngram model topology
and if I run the marginalize command (without any of the modifications I mentioned above) I get:
$ ngrammarginalize --v=3 --iterations=2 --norm_eps=0.001 l2p5k.train.p5.mod l2p5k.train.p5.marg.mod
INFO: Iteration #1
INFO: FstImpl::ReadHeader: source: l2p5k.train.p5.mod, fst_type: vector, arc_type: standard, version: 2, flags: 3
INFO: CheckTopology: bad backoff arc from: 557 with order: 4 to state: 0 with order: 1
FATAL: NGramModel: bad ngram model topology
Maybe there is a flag I am missing? Or I should just pre-process and plug them manually prior to conversion?
Hi Josef,
For the second issue, I would be interested in seeing the original ARPA format file and follow this through to see what the potential issue might be. So please email or point me to that if you can. For the first issue, I have seen some problems with the convergence of the steady state probs for Witten-Bell models in particular, which is due, I believe, to under-regularization (as we say in the paper). Your tracing this to the value of last_probs[st] is something I hadn't observed, so that could be very useful. In answer to your question, I wonder if there's not a way to maintain the original convergence criterion, but control for the probability falling below zero (presumably due to floating point issues). One way to do this would be to set a very small minimum probability for the state and set a state's probability to that minimum if the calculated value falls below it. Since last_probs[st] gets its value from (*probs)[st], then it would never fall below 0 either. I'll look into this, too, thanks.
brian
Hi, The 2nd issue was my own embarrassing mistake. I was setting up a test distribution to share, and realized that I still had an old, duplicate version of ngramread on my $PATH. This was the source of the conversion issue. When I switched it things worked as expected for all models.
I will share an example of the other via email.
Hi, I am new to this library. I have created a small language model using a corpus. Is there a way to apply the model to user input that may contain an unknown word?
How can I handle unknown words?
Hi,
when you apply the model to new text, you are typically using farcompilestrings to compile each string into FSTs. It has a switch (--unknown_symbol) to map out of vocabulary (OOV) items to a particular symbol. The language model then needs to include that symbol, typically as some small probability at the unigram state. There are a couple of ways to do this -- see the discussion topic below entitled: Model Format: OOV arcs.
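For example (a sketch; file names are placeholders, and the model's symbol table is assumed to already contain <unk>):
<verbatim>
farcompilestrings --symbols=lm.syms --keep_symbols=1 --unknown_symbol="<unk>" input.txt > input.far
ngramperplexity --v=1 lm.mod input.far
</verbatim>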
brian
Hello,
that is certainly possible, and we'll add it to the list of features for a future release. Likely not in the soon-to-be-released latest version, but perhaps in the next update after that.
brian
Brian,
thank you for your answer. I am no expert in FSTs, but I guess it should be possible to obtain the same result by somehow filtering the complete FST, through composition (?) with another FST. After all, it should look like an FST where some paths have already been walked through. Please excuse my vagueness.
yes, this is the way to think about it. The complication comes from the random sampling algorithm, and the way that it interprets the backoff arcs in the model. The algorithm assumes a particular model topology during the sampling procedure, which will not be preserved in the composed machine that you propose. But, yes, essentially that is what would be done to modify the algorithm.
brian
Hi Geza,
Skip k-grams generally require a model structure that is trickier to represent compactly in a WFST than standard n-gram models. This is because there are generally more than one state to backoff to. For example, in a trigram state, if the specific trigram 'abc' doesn't exist, a standard backoff n-gram model will backoff to the bigram state and look for 'bc'. With a skip-gram, there is also the skip-1 bigram 'a_c' to look for, i.e., there are two backoff directions. So, to answer your question, it is something we've thought about and are thinking about, but nothing imminent.
brian
calculating perplexity for >10 utterances using example command
I am a newbie who wants to calculate perplexity for a text file consisting of more than 10 lines. The examples provided for this only work for fewer than 10, and it is not obvious to me how to get past that. (Yes, I know I could loop over the utterances; I just hope I do not have to.)
Using stdin with farcompilestrings results in an error message:
FATAL: STListWriter::Add: key disorder: 10
FATAL: STListReader: error reading file: stdin
And specifying a text file as the first parameter for farcompilestrings does not work either:
FATAL: FarCompileStrings: compiling string number 1 in file test.txt failed with token_type = symbol and entry_type = line
FATAL: STListReader::STListReader: wrong file type:
I could not find related usage info. I'd appreciate some help.
Thanks!
The example command would be this:
echo -e "A HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nA HAND BAG\nBAG HAND A" | farcompilestrings -generate_keys=1 -symbols=earnest.syms --keep_symbols=1 | ngramperplexity --v=1 earnest.mod -
Hi Geza,
the -generate_keys=N switch for farcompilestrings creates numeric keys for each of the FSTs in the FAR file using N digits per FST. (One FST for each string in your case.) So, with N=1 you can index 0-9 strings, with N=2 you can index 0-99 strings, etc. So, for your example, you just need to up your -generate_keys argument with the number of digits in the total corpus count.
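For example, with a test file of up to 9,999 lines (model and symbol file names as in the quick tour example above):
<verbatim>
farcompilestrings --generate_keys=4 --symbols=earnest.syms --keep_symbols=1 test.txt |
  ngramperplexity --v=1 earnest.mod -
</verbatim>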
brian
Thanks, Brian!
One more question, just so that I can see more clearly how ngramperplexity works.
When I specify the "--OOV_probability=0.00001 --v=1" arguments, I get "[OOV] 9" for "ngram -logprob" in the output when using unigrams, and somewhat higher values for longer n-grams. What is the reason for this? (I guess I do not yet see how exactly the perplexity calculation works.)
Thanks, Géza
When I load a pre-trained language model into c++ and iterate over arcs, I see that there are no arcs labeled with the given OOV symbol, although it is contained in the symbol table.
Are OOV words handled somewhere other than the FST itself, or is the absence of these arcs likely due to a quirk in my particular language model?
Any insight or ideas would be greatly appreciated. Thanks!
Hi,
The model will only have probability mass for the OOV symbol if there are counts for it in the training corpus. ngramperplexity does have a utility for including an OOV probability, but this is done on the fly, not in the model structure itself. If you want to provide probability mass at the unigram state for the OOV symbol, you could create a corpus consisting of just that symbol, train a unigram model, then use ngrammerge to mix (either counts or model) with your main model. Then there would be explicit probability mass allocated to that symbol. You can use merge parameters to dictate how much probability that symbol should have. Hope that helps.
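A sketch of that recipe (file names, the <unk> symbol, and the mixture weights are all placeholders; <unk> must already be present in lm.syms):
<verbatim>
# a one-word "corpus" containing only the OOV symbol
echo "<unk>" > oov.txt
farcompilestrings --symbols=lm.syms --keep_symbols=1 oov.txt > oov.far
ngramcount --order=1 oov.far > oov.cnt
ngrammake oov.cnt > oov.mod

# mix it into the main model, giving <unk> a small share of the probability mass
ngrammerge --alpha=0.99 --beta=0.01 --normalize lm.mod oov.mod lm_oov.mod
</verbatim>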
brian
A related question: would it be sensible to replace some (or all) singletons in the training corpus with the OOV symbol, thus learning OOV probabilities in context instead of solely at the unigram level? Is that approach likely to improve LM performance on unseen data?
Hi,
I built a 3-gram language model on a few English words.
My C++ program receives streaming characters one by one.
I would like to use the 3-gram model to score each upcoming character given the history context. What example can I start with? Is it possible to avoid converting to FAR strings each time?
Hi,
This is one of the benefits of having the open-source library interface in C++: you can write functions of your own. We chose to score strings (when calculating perplexity, for example) by encoding the strings as FARs, but you could perform a similar function in your own C++ code. I would look at the code for ngramperplexity as a starting point, and learn how to use the arc iterators. Once you understand the structure of the model, you should be able to make that work. Alternatively, print out the model using fstprint and read it into your own data structures. Good luck!
brian
The .cc files for the command line binaries are (relatively) simple wrappers around the c++ library functions. So, for example, ngrammake.cc (called ngrammake_main.cc in the latest version) shows how to load an fst, initialize the various smoothing classes, make the model, etc. My suggestion would be to use those as your examples. hope that helps.
I tried order-1 ngram count with this simple text
Goose is hehe
Goose is hehe
Goose is
Goose
But I don't understand the resulting count fst.
0 -1.3863
0 0 Goose Goose -1.3863
0 0 is is -1.0986
0 0 hehe hehe -0.69315
The documentation says transitions and final costs are weighted with the negative log count of the associated n-gram, but I can't make sense of these numbers. Can someone help me out? Thanks!
Hi,
The counts are stored as negative natural log (base e), so -0.69315 is -log(2), -1.0986 is -log(3) and -1.3863 is -log(4). The count of each word is kept on arcs in a single state machine (since this is order 1) and the final cost is for the end of string (which occurred four times in your example). You printed this using fstprint, but you can also try ngramprint which in this case yields:
<s> 4
Goose 4
is 3
hehe 2
</s> 4
where <s> and </s> are the implicit begin of string and end of string events. These are implicit because we don't actually use the symbols to encode them in the fst.
Hope that clears it up for you. If not, try the link in the 'Model Format' section of the quick tour, to the page on 'precise details'.
best,
brian
Hi,
I tried to merge two LMs with the following command line:
./tools/opengrm-ngram-1.0.3/bin/ngrammerge --alpha=0.2 --beta=0.8 --normalize --use_smoothing A.fst B.fst AB.mrg.fst
But I get an error:
FATAL: NGramModel: input not deterministic
A.fst is a normal FST LM trained with SRILM. B.fst is a class-expanded LM produced with the fstreplace command.
I also tried to convert the FST LM into an ARPA LM with:
./tools/opengrm-ngram-1.0.3/bin/ngramprint --ARPA B.fst > B.arpa
But I got the same error:
Hi,
you've introduced non-determinism into the ngram models via your replace class modification. The ngrammerge (and ngramprint) commands are simple operations that expect a standard n-gram topology, hence the error messages. For more complex model topologies of the sort you have, you'll have to write your own model merge function that does the right thing when presented with non-determinism. The base library functions don't handle these complex examples, but the code should give you some indication of how to approach such a model mixture. Such is the benefit of open source!
brian
Error when converting an LM generated by HTK into FST format
Hi,
I tried to convert an ARPA LM generated by the HTK tool into FST format. The command is:
./tools/opengrm-ngram-1.0.3/bin/ngramread --ARPA test.arpa > test.lm.fst
But I get an error:
Hi,
it appears that you have n-grams ending in your stop symbol (probably </s>) that have backoff weights, i.e., the ARPA format has an n-gram that looks like:
-1.583633 XYZ </s> -0.30103
But </s> means end-of-string, which we encode as final cost, not an arc leading to a new state. Hence there is no state where that backoff cost would be used. (Think of it this way: what's the next word you predict after </s>? In the standard semantics of </s>, it is the last term predicted, so nothing comes afterwards.) Do you also have n-grams that start with </s>?
So, one fix on your ARPA format is just to remove the backoff weight after n-grams that end in </s>.
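A rough one-liner for that fix (untested; note that awk will rewrite the matched lines with space-separated fields, which the ARPA reader should accept):
<verbatim>
# drop the trailing backoff weight on any n-gram line whose last token is </s>
awk '{ if (NF > 2 && $(NF-1) == "</s>") NF = NF - 1; print }' test.arpa > test.fixed.arpa
ngramread --ARPA test.fixed.arpa > test.lm.fst
</verbatim>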
hope that helps,
brian
Hi, Brian
Thank you! I get it.
Another case: there are some n-grams that start with </s> in my HTK LM. I think it is a bug in the HTK tool, but it is my only choice for training a class-based LM with automatic class clustering from large plain-text data. How do I fix it? Is it reasonable to simply remove these n-grams?
Thanks,
Huanliang Wang
yes, you might try just removing those n-grams. In the ARPA format, you'll have to adjust the n-gram counts at the top of the file to match the number you have at each order.
Hi, Brian
Thank you! I got it.
Could you give me an example of how to use fstreplace to replace a nonterminal label in an FST with another FST?
Thanks,
Huanliang Wang
Hi,
I'm currently playing around with a test example, and I noticed that after ngrammake, if I call fstinfo (not ngraminfo) on the resulting language model, fstinfo complains about the model being ill-formed. This is due to transitions (typically on epsilons) that have "Infinity" weight, which does not seem to be supported by OpenFst. Is that working as intended? The problem is that later, if I call fstshortestpath to get e.g. the n most likely sentences from the model, the result contains not only "Infinity" weights but also "BadNumber", which might be a result of the infinite values.
Thanks,
Roland
Hi Roland,
yes, under certain circumstances, some states in the model end up with infinite backoff cost, i.e., zero probability of backoff. In many cases this is, in fact, the correct weight to assign to backoff. For example, with a very small vocabulary and many observations, you might have a bigram state that has observations for every symbol in the vocabulary, hence no probability mass should be given to backoff. Still, this does cause some problems with OpenFst. In the next version (due to be released in the next month or so) we will by default have a minimum backoff probability of some very small epsilon (i.e., very large negative log probability). As a workaround in the meantime, I would suggest using fstprint to print the model to text, then use sed or perl or whatever to replace Infinity with some very large cost -- I think SRILM uses 99 in such cases, which would work fine.
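Concretely, the workaround might look like this (file names are placeholders):
<verbatim>
fstprint --save_isymbols=lm.syms lm.mod > lm.txt
sed 's/Infinity/99/g' lm.txt |
  fstcompile --isymbols=lm.syms --osymbols=lm.syms --keep_isymbols --keep_osymbols > lm.noinf.mod
</verbatim>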
hope that helps,
brian
If I may add another quick question: when running fstshortestpath on the ngram count FST (i.e. after ngramcount but before ngrammake), I was expecting to get the most frequent n-gram, but instead the algorithm never seems to terminate. Any idea why that is? I thought that shortestpath over the tropical semiring should always terminate anyway.
Thanks,
Roland
The ngram count Fst contains arcs with negative log counts. Since the counts can be greater than one, the negative log counts can be less than zero. Hence the shortest path is an infinite string repeating the most frequent symbol. Each symbol emission shortens the path, hence non-termination.
brian
Hi. I maintain several voice-recognition-related packages, including openfst, for the Fedora Linux distribution. I am working on an OpenGrm NGram package. My first attempt at building version 1.0.3 (with GCC 4.7.2 and glibc 2.15) failed:
In file included from ngramrandgen.cc:32:0:
./../include/ngram/ngram-randgen.h:55:48: error: there are no arguments to 'getpid' that depend on a template parameter, so a declaration of 'getpid' must be available [-fpermissive]
./../include/ngram/ngram-randgen.h:55:48: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
ngramrandgen.cc:39:1: error: 'getpid' was not declared in this scope
ngramrandgen.cc:39:1: error: 'getpid' was not declared in this scope
It appears that an explicit #include <unistd.h> (which declares getpid) is needed in ngram-randgen.h. That header was probably pulled in through some other header in previous versions of either gcc or glibc.
I was wondering what the expected result is when feeding a lattice, rather than a string/sentence, to the ngramperplexity utility? Is this supported? It seems to report the perplexity of an arbitrary path through the lattice.
Hi Josef,
ngramperplexity reports the perplexity of the path through the lattice that you get by taking the first arc out of each state that you reach. (Note that this is what you want for strings encoded as single-path automata.) Not sure what the preferred functionality should be for general lattices. Could make sense to show a warning or an error there; but at this point the onus is on the user to ensure that what is being scored is the same as what you get from farcompilestrings - unweighted, single-path automata. If you have an idea of what preferred functionality would be for non-string lattices, email me.
brian
Hi,
I do not want to print my FST, run ngramapply in bash, and then read the new FST back into C++.
Is there a way to use NGramApply directly in C++?
Thanks
Hi Markus,
there is no single method; rather there are several ways to perform composition with the model, depending on how you want to interpret the backoff arcs. The most straightforward way to do this in your own code is to look at src/bin/ngramapply.cc and use the composition method for the particular kind of backoff arc, e.g., ngram.FailLMCompose() when interpreting the backoff as a failure transition. In other words, write your own ngramapply method based on inspection of the ngramapply code.
Hope that helps,
brian
Hi,
thanks, I think that should work.
I am using failure arcs and my LM FST is already created, so I do not need to build an LM FST from strings or an ARPA LM.
I first just need to read the FST LM from disk:
#include <ngram/ngram.h>
fst::StdMutableFst *fstforNGram;
fstforNGram->Read($MYNGRAMFST);
ngram::NGramModel ngram(fstforNGram);
// that seems not to work, as: undefined reference to `ngram::NGramModel::InitModel()'
If I read the LM, I could then just add:
ngram.FailLMCompose(*lattice, &cfst, kSpecialLabel);
and the composed FST should be ready, right?
Thanks for helping
Yes, but I have a problem reading the FST LM in C++:
fst::StdMutableFst *fstforNGram;
fstforNGram->Read($MYNGRAMFST);
Up to that point it works.
ngram::NGramModel ngram(fstforNGram);
That seems not to work; I get: undefined reference to `ngram::NGramModel::InitModel()'
Thanks
Hi, I have been using OpenGrm with my Grapheme-to-Phoneme conversion tools for a while now and recently added some functionality to output weighted alignment lattices in .far format.
It is my understanding that these weighted lattices can only currently be utilized with Witten-Bell smoothing; is this correct?
Is there any plan to support fractional counts with Kneser-Ney smoothing, for instance along the lines of "Correlated Bigram LSA for Unsupervised Language Model Adaptation", Tam and Schultz, or would I be best advised to implement this myself?
Hi Josef,
Witten-Bell generalizes straightforwardly to fractional counts, as you point out. No immediate plans for new versions of other smoothing methods along those lines, so if that's something that you need urgently, you would need to implement it.
brian
Hi Luke,
this is basically a floating point precision issue: the system is trying to subtract two approximately equal numbers (while calculating backoff weights). The new version of the library, coming out in a month or so, has much improved floating point precision, which will help. In the meantime, you can get this to work by modifying a constant value in src/include/ngram/ngram-model.h, which will allow these two numbers to be judged to be approximately equal. Look for:
static const double kNormEps = 0.000001;
near the top of that file. Change to 0.0001, then recompile.
This sort of problem usually comes up when you train a model with a relatively small vocabulary (like a phone or POS-tag model) and a relatively large corpus. The n-gram counts end up not following Good-Turing assumptions about what the distribution should look like (hence the odd discount values). In those cases, you're probably better off with Witten-Bell smoothing with the --witten_bell_k=15 or something like that. Or even trying an unsmoothed model.
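For example (file names assumed; see ngrammake --help for the available --method values):
<verbatim>
ngrammake --method=witten_bell --witten_bell_k=15 phones.cnt > phones.mod
# or, with a small vocabulary and lots of data, try no smoothing at all
ngrammake --method=unsmoothed phones.cnt > phones.unsmoothed.mod
</verbatim>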
And stay tuned for the next release, which deals more gracefully with some of these small vocabulary scenarios.
Brian
I generated an ngram model from a .arpa file with the following command:
ngramread --ARPA lm.arpa > lm.model
ngramread does not complain, but ngraminfo and trying to load the model from C++ code generate the following error:
FATAL: NGramModel: bad ngram model topology
How can I troubleshoot the problem?
Hi,
that error is coming from a sanity check that verifies that every state in the language model (other than the start and unigram states) is reached by exactly one 'ascending' arc, that goes from a lower order to a higher order state. ARPA format models can diverge from this, by, for example, having 'holes' (e.g., bigrams pruned but trigrams with that bigram as a suffix retained). But ngramread should plug all of those. maybe duplication? I'll email you about this.
Benoit found a case where certain 'holes' from a pruned ARPA model were not being filled appropriately in the conversion. The sanity checks run when loading the model ensured that this anomaly was caught (causing the errors he mentioned), and we were able to find the cases where this was occurring and update the code. The updated conversion functions will be in the forthcoming update release of the library, within the next month or two. In the meantime, if anyone encounters this problem, let me know and I can provide a workaround.