You can use the formatting commands described in TextFormattingRules in your comment.
If you want to post some code, surround it with <verbatim> and </verbatim> tags.
Auto-linking of WikiWords is now disabled in comments, so you can type VectorFst and it won't result in a broken link.
You now need to use <br> to force new lines in your comment (unless inside verbatim tags). However, a blank line will automatically create a new paragraph.
Hello, just a quick question. Forgive me if it sounds naive. Is it possible to use backreferences to captured groups in Thrax? Imagine you have a string such as:
'12-10-1492' (the 12th of October 1492) and you want to write a Thrax grammar that swaps the positions of the month and day ('10-12-1492'). With a normal regex you would just match the whole string, capture the day and month separately, and rewrite the whole thing by swapping the first two groups:
(\d{2})-(\d{2})-(\d{4}) ---> $2-$1-$3
Is it possible to do something similar in Thrax?
Not as such. By "normal regex" you mean PCREs, but note that those have more than regular power precisely because of these copy operations.
In Thrax there are multi-pushdown transducers (MPDTs --- see under Multi-Pushdown Transducers in https://www.openfst.org/twiki/bin/view/GRM/ThraxQuickTour), which give you the same power, but they are not particularly easy to set up, and not generally very efficient.
For the particular case you mention it would be much simpler just to do the brute-force thing and simply transduce between each MM-DD and its equivalent DD-MM. Ugly, but it works.
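To make the brute-force idea concrete, here is a sketch of it in Pynini; the Thrax version would be an analogous union of rewrites. The day/month ranges here are assumptions for illustration only.
<verbatim>
import pynini
from pynini.lib import byte

days = [f"{d:02d}" for d in range(1, 32)]
months = [f"{m:02d}" for m in range(1, 13)]

# Brute force: enumerate every DD-MM to MM-DD swap; the year passes through.
swap = pynini.union(
    *(pynini.cross(f"{d}-{m}", f"{m}-{d}") for d in days for m in months))
date = swap + "-" + pynini.closure(byte.DIGIT, 4, 4)

# "12-10-1492" -> "10-12-1492"
print(pynini.shortestpath(pynini.compose("12-10-1492", date)).string())
</verbatim>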
I see. Thanks. Unfortunately, for my case the ugly solution won't be effective. The date thing was just an example; in reality I have to operate on groups that are not limited to just the digits 00-24. Thanks anyway for the answer.
Hi, yes, that's very nice of you. So, imagine I have a list of currencies such as:
currencies = "euros" | "dollars" | "pounds" | "pesos" | "yens";
now imagine you have a string that you need to transform into a fully formatted thing:
5 euros and 37 cents ----> 5,37 euros
obviously I can write something like:
CDRewrite["cents" : "euros", digit+ " euros and " digit+ " ", rb, star]
Though, if I want a similar effect for the other currencies, I would have to create a similar CDRewrite for each of them. What I want instead is a function that automatically creates a CDRewrite like that for each currency in my list: detecting any currency name, remembering which one it is, and then moving it to the end of the string, taking the place of the cents token.
I don't have Thrax currently installed (oddly enough) so the following is in Pynini. But you can do something very similar in Thrax. There is no pretty solution, but one way is to do something along the following lines:
<verbatim>
import pynini as py
from pynini.lib import byte
from pynini.lib import pynutil
from pynini.lib import rewrite


def I(x):
  return pynutil.insert(x)


def D(x):
  return pynutil.delete(x)


MARKERS = py.union("$", "£", "€")
SIGSTAR = py.closure(byte.BYTE)
NON_CURRENCY = (SIGSTAR - (SIGSTAR + MARKERS + SIGSTAR)).optimize()

# Replace the singular and plural of the major currencies with a symbol that
# remembers which one you had.
MAJOR_INS = (
    py.cross("dollar", "$") |
    py.cross("dollars", "$$") |
    py.cross("pound", "£") |
    py.cross("pounds", "££") |
    py.cross("euro", "€") |
    py.cross("euros", "€€")
)

# Major currencies are just the input projection of the above.
MAJOR = py.project(MAJOR_INS, "input")
MINOR = py.union("penny", "pence", "cent", "cents")

# Perform the replacement.
REPLACE_MAJOR = py.cdrewrite(
    MAJOR_INS,
    "",
    " ",
    SIGSTAR,
)

# Insert one or two markers at the end.
INS_MARKER = SIGSTAR + I(MARKERS) + I(MARKERS).ques

# Delete the currency terms. Make sure they are followed by a space, a
# marker, or end of string so that it's greedy.
DEL_CURRENCY = py.cdrewrite(
    D(MAJOR | MINOR),
    "",
    py.union(" ", "[EOS]", MARKERS),
    SIGSTAR,
)

# Filter to make sure the first and second marker are the same.
FILT = (
    NON_CURRENCY + "$" + NON_CURRENCY + "$" |
    NON_CURRENCY + "$$" + NON_CURRENCY + "$$" |
    NON_CURRENCY + "£" + NON_CURRENCY + "£" |
    NON_CURRENCY + "££" + NON_CURRENCY + "££" |
    NON_CURRENCY + "€" + NON_CURRENCY + "€" |
    NON_CURRENCY + "€€" + NON_CURRENCY + "€€"
)

# Replace the final marker with the appropriate currency term.
REPLACE_MARKER = py.cdrewrite(
    py.invert(MAJOR_INS),
    "",
    "[EOS]",
    SIGSTAR,
)

# Get rid of the other marker and clean up to replace it with ",".
DEL_MARKER = py.cdrewrite(
    py.cross(" " + MARKERS + MARKERS.ques + " and ", ","),
    "",
    "",
    SIGSTAR,
)

RULES = (
    REPLACE_MAJOR @
    INS_MARKER @
    DEL_CURRENCY @
    FILT @
    REPLACE_MARKER @
    DEL_MARKER
).optimize()

inp1 = "25 dollars and 33 cents"
inp2 = "25 pounds and 22 pence"
inp3 = "1 dollar and 33 cents"
inp4 = "1 pound and 22 pence"
inp5 = "1 euro and 33 cents"
for inp in [inp1, inp2, inp3, inp4, inp5]:
  print(f"{inp} --> {rewrite.one_top_rewrite(inp, RULES)}")
</verbatim>
Output:
<verbatim>
25 dollars and 33 cents --> 25,33 dollars
25 pounds and 22 pence --> 25,22 pounds
1 dollar and 33 cents --> 1,33 dollar
1 pound and 22 pence --> 1,22 pound
1 euro and 33 cents --> 1,33 euro
</verbatim>
Hi - my NLP students are unable to do an assignment that worked fine last year because of a segfault.
Using Thrax 1.3.5 with the tiny example at https://www.openfst.org/twiki/bin/view/GRM/ThraxQuickTour#:~:text=foo.sym , I get
<verbatim>
$ thraxcompiler --save_symbols --input_grammar=foo.grm --output_far=foo.far
Evaluating rule: foo_syms
Evaluating rule: foo
Segmentation fault (core dumped)
</verbatim>
I'm told that downgrading to Thrax 1.3.3 fixes this problem, but introduces other problems elsewhere in the homework (segfault in thraxrewrite-tester).
What is the best way to insert a long string before a rule?
I want to insert a long string before relu to mark that relu belongs to a specific type, but compilation takes about n times longer (n is roughly the length of the string divided by 5; the exact factor is uncertain, but it does seem to grow exponentially) when I use the following method:
<verbatim>
insert_types = (
  ("" : " token { types { detail_types: \"")
  relus
);
graph_insert = Optimize[insert_types];
export INSERT = CDRewrite[graph_insert, "", "", b.kBytes*];
</verbatim>
In addition, I tried to use CDRewrite directly:
<verbatim>
export INSERT = CDRewrite[("" : " token { types { detail_types: \""), "", relus, b.kBytes*];
</verbatim>
But there is still no improvement in the results.
Therefore, I want to know if there is a smarter way to achieve this. The main goal is to reduce the compilation time.
NOTE: the long string is about 35 English letters, and my encoding method is byte*.
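One trick that may help here (a sketch only, written in Pynini rather than Thrax, and not tested against your grammar): have the context-dependent rule insert a single one-byte marker, then expand the marker into the long string with a context-free replacement. That keeps the long string out of the CDRewrite construction entirely. The relus definition below is a hypothetical stand-in.
<verbatim>
import pynini
from pynini.lib import byte, pynutil

LONG = " token { types { detail_types: \""
MARK = "\x01"  # assumes this byte never occurs in real input

relus = pynini.accep("relu")  # hypothetical stand-in for your relus
sigma_star = pynini.closure(byte.BYTE)

# Cheap contextual step: insert only the one-byte marker before relus.
mark = pynini.cdrewrite(pynutil.insert(MARK), "", relus, sigma_star)

# Context-free step: expand each marker into the long string.
expand = pynini.closure(
    pynini.union(byte.BYTE - MARK, pynini.cross(MARK, LONG)))

INSERT = (mark @ expand).optimize()
</verbatim>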
src/grammars/byte.grm exports kBytes, which is the sigma for byte input. But what about UTF8 input?
(Granted, every valid UTF8-encoded code point is made up of bytes, but I don't know about just using kBytes as the UTF8 sigma.)
Is there a way to have OpenFst generate the sigma acceptor FST via the command line? Via code?
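For Pynini users, there is a helper module that can serve this purpose; a sketch (assuming pynini.lib.utf8 is available in your version):
<verbatim>
import pynini
from pynini.lib import utf8

# Acceptor for any single valid UTF-8-encoded code point; its closure
# can serve as the sigma-star for UTF-8 input.
sigma = utf8.VALID_UTF8_CHAR
sigma_star = pynini.closure(sigma)
</verbatim>
I don't know of a command-line flag in OpenFst that emits this directly.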
Hi, I have compiled OpenFst 1.8.1 and Thrax 1.3.6 for Android, to be integrated into an app via the NDK and JNI. I use only static libraries for linking and basically the following configure arguments:
OpenFst:
<verbatim>
--enable-far --enable-grm --enable-fsts --disable-shared --enable-static CFLAGS="-fPIC" CXXFLAGS="-fPIC"
</verbatim>
Thrax:
<verbatim>
CFLAGS="-fPIC" CXXFLAGS="--fPIC" --disable-shared
</verbatim>
My code basically imitates what RewriteTester is doing, so I followed the initialization process from the original code quite rigidly.
Trouble is, some important FST types are not registered, and when opening my .far file via grm_.LoadArchive(), it bails out with these messages on stdout:
<verbatim>
ERROR: GenericRegister::GetEntry: dlopen failed: library "vector-fst.so" not found
ERROR: Fst::Read: Unknown FST type vector (arc type = standard): <unspecified>
ERROR: Unable to open FAR: /storage/emulated/0/Android/data/com.grammatek.simaromur/files/g2p/g2p.far
FATAL: Check failed: "grm_.LoadArchive(m_farFile)" file: /Users/dschnell/AndroidStudioProjects/simaromur/app/src/main/cpp/g2p/G2P.cpp line: 58
</verbatim>
The file is definitely there, so I guess reading & interpreting it causes the trouble. I have tried to dig into the problem and also compiled my code with
<verbatim>
#define FST_NO_DYNAMIC_LINKING 1
</verbatim>
to get rid of the dlopen error, but I get effectively the same error.
There is a lot of static initialization going on in OpenFst and Thrax, e.g. with Flags, which I basically don't need, because Android apps don't have a command line and accordingly no argc/argv are fed into main; maybe the problem is related to this. But I have created at least an empty argv vector to fulfill the interface requirements.
Is there a possibility to initialize everything from scratch directly so that I can continue, or how else can I convince Thrax/OpenFst to know about the ingredients of my .far file?
I solved my problems: when integrating libThrax, libOpenFst, and other static libs into a shared object, one needs to link those via -Wl,--whole-archive / -Wl,--no-whole-archive. Otherwise most globals of OpenFst & Thrax will be omitted from the shared object.
Anyway, I think Thrax & OpenFst are doing too many things globally. There should be explicit initializers instead of relying on the runtime system to call everything in the right order.
Segmentation fault when splitting string with symbols
Hi,
I've been trying for a while to segment a string with symbols read from a SymbolTable. Everything looks fine, but running 'make' throws a segmentation fault. This is my symbol file:
<verbatim>
<eps> 0
AH0 1
T 2
R 3
IH2 4
P 5
L 6
EY1 7
AA1 8
B 9
...
</verbatim>
And this is my grammar:
<verbatim>
arpasyms = SymbolTable['arpabet.sym'];
export Consonant = Optimize[
    "B".arpasyms
  | "CH".arpasyms
  | "D".arpasyms
  | "DH".arpasyms
  | "F".arpasyms
  | "G".arpasyms
  | "HH".arpasyms
  ...
</verbatim>
The command I executed:
<verbatim>
thraxmakedep -s arpabet.grm
make
</verbatim>
The output:
<verbatim>
thraxcompiler --input_grammar=arpabet.grm --output_far=arpabet.far
Evaluating rule: arpasyms
Evaluating rule: Consonant
Makefile:2: recipe for target 'arpabet.far' failed
make: *** [arpabet.far] Segmentation fault (core dumped)
</verbatim>
There was some strange case where strings parsed using symbol tables generated segfaults, but only on certain platforms (macOS with Xcode, and certain GCCs on Linux). While it's not something we can easily test (unless one of you wants to send me a new M1 MacBook Air...), we believe that release 1.3.6 has eliminated it. Please try that and report back when you can...
What is the best method to deal with a large entity list?
Hello, everyone. I've got a tricky problem when using the Thrax compiler. I have a large list of person names (over 1,000,000 words) and I want to use it in my rules. So I load it just like the tutorial recommends:
person = StringFile['person.txt'];
Then many rules reference this entity, for example:
rule1 = "who is " person;
rule2 = person " is a good man";
...
export RULE = Optimize[rule1 | rule2 | ... | rule500];
These rules may be more complicated. When I use Thrax to compile these rules, it causes severe memory consumption and long compilation times, which can lead to compilation failures. Is there a problem? What is the best way to use it in this case?
Thank you.
Hello, Richard. Thanks for your suggestion. I tried reducing the size of the entity list (to 100,000 words), but this also fails with 500 rules and many references. I used the fstdraw tool to draw the FST structure and found that Thrax makes a copy of the entity FST at each reference location. This causes the number of states and edges of the entire large FST to explode, and makes optimization take too long and memory explode when calling the Optimize function. Is that right? If so, how can I use large entity lists with Thrax?
Can you post your grammar and the string files you are trying to use somewhere? I can take a look. It is possible you'll have to split up the problem in some other way, but I need to see first.
Hi all! First time posting a question here, apologies if this is a newbie question.
I have an FST trying to match a large list of song names, and there's a song called "sigma" and, of course, there's also a song called "epsilon". This was causing the FST to match the empty string and was a real head-scratcher for a bit. I've removed these reserved keywords from the large song list for now and am no longer matching the empty string.
However, I'd like to know how we can match these words without them being interpreted by the FST as the reserved words. In the documentation, "\" only escapes the letter following it. Is there a way to escape an entire token?
Any help would be much appreciated!
Thank you, Danielle
Hey all,
I am trying to compile a large grm file (~5k patterns) but am getting the following error:
<verbatim>
Parse Failed: memory exhausted
****************************************************************************
Line 4929
</verbatim>
Is there a way to give the compiler more memory? I don't see anything super relevant in the thraxcompiler help message but perhaps I'm missing something. Or maybe I'm just exceeding the limits of what the compiler can handle and need to break down the file?
Using Thrax version 1.2.3 and OpenFst version 1.6.3, running this command:
thraxcompiler --save_symbols --input_grammar=<path to input> --output_far=<output path>
Thanks,
Dylan
Can you make what you are doing available so I can have a look? Most likely there is another way to factor things so this doesn't happen, but I'd have to see what you are doing first.
Basically I'm auto-generating a grm file from a catalog of music entities (albums, tracks, and artists) to be able to match known entity names in a string. It takes the form of:
<verbatim>
track_foo = "[foo]";
track_bar = "[bar]";
...
album_foobar = "[foobar]";
album_blah = "[blah]";
...
export TRACK = (track_foo | track_bar | ...);
export ALBUM = (album_foobar | album_blah | ...);
</verbatim>
This would then be imported by another grm file for use. This works fine for a relatively small number of entities but when I try to expand to more than ~4k it breaks down. Thanks!
Hi Richard, I've been playing around with StringFile but it seems to generate a state for each character instead of each token. Is there a way to tell it to generate states delimited on whitespace? Thanks!
This is the grammar file I'm using:
<verbatim>
track_begin = "" : "[<track>]";
track_end = "" : "[</track>]";
TRACK_CATALOG = "[test]" "[one]" | "[test]" "[two]";
export catalog_grammar = (track_begin (TRACK_CATALOG) track_end);
export model = ArcSort[Optimize[catalog_grammar], 'input'];
</verbatim>
which outputs this FST:
<verbatim>
➜ fstprint catalog_grammar_model.bin
0 1 <epsilon> <track>
1 2 t t
2 3 e e
3 4 s s
4 5 t t
5 6 0x20 0x20
6 9 o o
6 7 t t
7 8 w w
8 11 o o
9 10 n n
10 11 e e
11 12 <epsilon> </track>
12
</verbatim>
but I want something like this:
<verbatim>
➜ fstprint catalog_grammar_model.bin
0 1 <epsilon> <track>
1 2 test test
2 3 one one
2 3 two two
3 4 <epsilon> </track>
</verbatim>
my string file is just this:
<verbatim>
test one
test two
</verbatim>
What's the size of your vocabulary, defined as the number of distinct tokens in your list of songs, or whatever it is? Presumably it's a bit smaller than the list of songs. In that case you could take your StringFile and then compose it with something that maps from the individual string tokens to your generated symbols:
<verbatim>
map =
  ("test" : "[test]") |
  ("one" : "[one]") |
  ...
;
mapper = map ((" " : "[<spc>]") map)*;
</verbatim>
or something like that
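In Pynini the same idea could be sketched with string_map (hypothetical token list; the bracketed outputs become generated labels):
<verbatim>
import pynini

# Map each whitespace-delimited token to its generated symbol.
token_map = pynini.string_map([
    ("test", "[test]"),
    ("one", "[one]"),
    ("two", "[two]"),
])
mapper = token_map + (pynini.cross(" ", "[<spc>]") + token_map).star
</verbatim>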
Most efficient implementation of an OpenFst FST for a RESTful website
I've been out of Thrax/OpenFST for a while, so please excuse me if this has already been discussed. Please be gentle and point me to any previous discussion or examples.
Background:
Imagine that I have a website/webservice, based on XRX or whatever, that receives a word, or set of words, submitted by a user, and feeds each word to a morphological analyzer (say for Spanish, Aymara, Klingon, or Hopi). The morphological analyzer is implemented as an OpenFST FST, built with Thrax. The site would then return the results to the user.
Questions:
1. What is the most efficient way to implement the FST for such an interactive service?
2. Can the "final" FST applied to input actually be a delayed/lazy FST created with something like ComposeFst(&fst1, &fst2) to prevent a possible explosion in size, even if that means it is less efficient to apply to input?
3. Is there some friendly mind-tuning documentation available on the possible and recommended uses of delayed/lazy FSTs in OpenFst?
Thanks, Ken
Hi Ken:
Before we worry too much about efficiency, do we know that doing the obvious thing is not efficient enough?
If not, then I can think of a few ways you could do this. If you want to do this as part of a web service I'd strongly recommend looking into Pynini, which gives you a full Python interface to all of this, and would presumably also allow you to interface easily to a Python web server library. In that case it presumably would be easiest to just break your cascade of FSTs up into convenient sized chunks and apply them serially to the input. I would guess that would be fast enough. If not then we can worry about getting fancier. There is no way to store a lazy implementation as such AFAIK, but something could be created on the fly in Pynini, I believe. It may be worth roping Kyle Gorman into the discussion here.
--R
Thanks, Richard. I'm just fishing around for ideas right now. Thanks for the suggestion of Pynini. As any users would be accessing the service over the Internet, which has its own delays, the efficiency of applying the FST(s) on the server is very unlikely to be critical. Explosions in FST size, however, have been known to occur in morphological analyzers, and I'll have fun getting some cascade of separate FSTs to operate serially, as you suggest.
I anticipate typical users entering a word, or maybe a few words, at a time, much as they might look up words manually in a paper dictionary. The web service, though, should be able to analyze arbitrarily inflected words, as long as the root is recognized.
I'd love to hear from Kyle Gorman if he has experience in this kind of application.
As for the third question, is there some documentation somewhere about the correct and desirable uses of delayed operations in OpenFST? Why were they provided? and how might they help me?
I hope that you and everyone on the list are well.
Ken
I was out of finite-state development for a year or more, and then I last used Thrax/OpenFst, in a modification of the Sparrowhawk tokenizer, in 2018, after which I was diverted again.
So it looks like I missed the whole Pynini thing, which looks very promising. I'm reading up on it now. Thanks.
OK good. Meanwhile, unfortunately I don't think there's really anything of a tutorial nature with the delayed implementations. If it comes to that, let's discuss offline.
Cleaning out some old piles of files, I see now that I did notice Pynini when it came out. But I didn't have time for it then, and (at that time) it had been written for Python 2.7, reportedly not working as well with 3.X? Can I assume that Pynini now works well with 3.X?
I don't remember, that's a pretty old version, and you'd have to look back through the thrax releases to see which one matches.
Is there some reason you cannot just install the latest version?
Can you just upgrade your OpenFst dependency to the latest version? Other than that, I can only suggest you check the versions of Thrax here to see which version is listed for 1.3.3. As I say I no longer remember since that is quite old.
Hi there,
I've built a PDT that flips a string (A/B/C -> C/B/A).
I'd like to apply this PDT within some context (e.g. [BOS] [EOS]), using CDRewrite.
I've had no luck so far; the rewrite fails.
I might be missing something.
Do you happen to have an example of usage of PDT + CDRewrite?
Is it even possible? How would you invoke thraxrewrite-tester with such setting?
Thanks
Sorry I missed this query earlier.
I don't think that can work. The construction for the context-dependent rule would not respect the semantics of the brackets so I am guessing what you would end up with would be nonsensical. However I have to admit I haven't thought about that before.
One of my students was confused to get a segfault in thraxrewrite-tester.
She had not exported any FSTs from her Thrax file, so the FAR was only 24 bytes long.
Could there be a more graceful failure / error message in this case? Thanks!
thraxmakedep starts with
#!/usr/bin/env python
but it seems that it ought to specify python2. There may also be other scripts like this.
One of my students had trouble running thraxmakedep because "python" on his (anaconda) system defaults to python3.
Hello world!
I succeeded in compiling the packages OpenFst/NGram/Thrax with MinGW using Docker. You will find the Git repository here: https://github.com/wincentbalin/compile-static-openfst The resulting MinGW binaries are static, both for win32 and for win64. You will find them in the list of releases: https://github.com/wincentbalin/compile-static-openfst/releases
I had to create a couple of patches, which I then put into the repository above. Some of them have already become obsolete, and were hence deleted. I hope that we might incorporate some of them into the main source code. I suppose it is much more feasible than trying to fork and adapt every single version to MSVC only to abandon it later. There are far too many such repositories on GitHub.
I am looking forward to any questions or opinions!
P.S.: I posted the same message already in the OpenFst forum. Initially I did not intend to cross-post, but the thread in the other forum did not get any reaction.
Thanks for doing this. The main problem with us incorporating these changes into the source is that none of us work with Windows, so we would have no way to test that changes on our end would not break something for MinGW. Also, it's not clear where we would stop. Some people have built native Windows binaries, using e.g. Visual Studio, and we simply do not have the bandwidth (or the machines running the relevant software) to support all of the variants.
Most projects live and die with their maintainers :-> , I am aware of that. For now, I think that maintaining MinGW patches is not that difficult.
But, on the other hand, I would like the developers of OpenFst and OpenGrm to look at the patches, especially for Thrax. Maybe there are some common problems therein, which could be eliminated altogether.
Time is the main issue. I did look briefly at your patches. I don't see an obvious problem, but then I know nothing about MinGW, and I don't know if some change we may make down the road will need special treatment. Again we have no means to support this. I think the maintainers of OpenFST and the rest of OpenGRM will tell you the same thing.
Compilation problems under Debian 9.2.1 -- possible solution
Hi,
When trying to compile Thrax 1.2.5 with OpenFST installed (1.6.7, with flag --enable-grm), I have encountered several problems with compilation:
1)
<verbatim>
util/utils.cc: In function 'size_t thrax::{anonymous}::GetResultSize(const std::vector<std::__cxx11::basic_string<char> >&, size_t)':
util/utils.cc:44:16: error: 'accumulate' is not a member of 'std'
   return (std::accumulate(elements.begin(), elements.end(), 0, lambda) +
</verbatim>
Putting the line
<verbatim>
#include <numeric>
</verbatim>
in the file src/lib/util/utils.cc solves the problem.
2)
<verbatim>
In file included from walker/loader.cc:27:0:
./../include/thrax/features.h: In member function 'thrax::DataType* thrax::function::FeatureVector<Arc>::Execute(const std::vector<thrax::DataType*>&)':
./../include/thrax/features.h:462:38: error: 'kNoSymbol' is not a member of 'fst::SymbolTable'
   if (label == fst::SymbolTable::kNoSymbol) {
</verbatim>
I have checked the file <fst/symbol-table.h>, and I think it should be fst::kNoSymbol instead of fst::SymbolTable::kNoSymbol. I have changed line 462 of features.h to:
<verbatim>
if (label == fst::kNoSymbol) {
</verbatim>
And then compilation went OK.
Could you please investigate and try to reproduce that?
I have managed to reproduce it on another computer with the Manjaro distro.
Patches below:
features.h
Apologies for not making it clear.
I just wanted to install Thrax on my VM with Debian 9.2.1 to use it for some NLP work within a project. These problems occurred at the 'make' stage of the installation process.
I thought I'd share the info about the installation problem and (maybe) a possible solution. I'd be grateful to know whether that fix is correct, or whether it should be done another way to get the installation working.
Env: Debian 9.2.1, g++ 6.3.0
Best Regards,
Karol M.
Ok. Thanks.
Turns out this is a known problem: we just discovered this ourselves when our Ubuntu machines were upgraded.
A fix is in the works and we will update with a new version of Thrax soon.
OK we believe that version 1.2.6, now available on the download page, should fix this problem.
It also adds a couple of bits of functionality: lenient composition and optional weights in string files.
Hi,
These corrections worked for me, but I've run into another problem. After configuration I noticed one mistake in libtool. During the make stage (for both OpenFst and Thrax) I got:
libtool: unexpected EOF while looking for matching `"'
Changing:
<verbatim>
eval sys_lib_search_path=\"$sys_lib_search_path_spec\"
eval sys_lib_dlsearch_path=\"$sys_lib_dlsearch_path_spec\"
</verbatim>
to:
<verbatim>
eval sys_lib_search_path='\$sys_lib_search_path_spec'
eval sys_lib_dlsearch_path='\$sys_lib_dlsearch_path_spec'
</verbatim>
solves the problem. I'm also not sure whether it's your issue or the libtool maintainers'.
Ubuntu 16.04.4 LTS 64-bit, g++ 6.3.0
Best Regards, Sandra Ambroziak.
Hi,
RichardSproat:
I can confirm that 1.2.6 compiles without any problem both on Debian 9.2.1 and Manjaro.
Thanks for new release!
SandraAmbroziak:
I cannot reproduce the libtool problem on either Debian 9.2.1 or Arch/Manjaro while compiling/installing Thrax 1.2.6, so I bet it isn't a Thrax source issue.
Best Regards,
Karol Mazurek
Great, thanks for confirming.
I've never seen that libtool problem before. That configuration has not changed, as far as I recall, through multiple releases. Almost looks like a shell problem.
Can you describe a use case for where you would want this in the grammar compiler itself?
For example, if I define a cascade of rules as an FST, and I then take the shortest path of that FST, I don't see what that would be useful for. So maybe give me an idea of what you would want this for?
And then what would you do with that? Save it out in a far? If you just want to test that something has a particular output then the Assertions that are already available already do the shortest path and test that it's what you expect it to be.
If you want to save it out then no, the ShortestPath isn't provided right now, though if you invoke your rule with the input in the rewrite tester then the shortest path is what you will get.
Could some kind soul(s) please point me to any available information about or examples using Thrax to define a transducer that tokenizes a natural-language input text, e.g. to take a running English text and insert word-boundary symbols, with possible ambiguous outputs. Thanking you in anticipation...
Hi,
How can you use [BOS] and [EOS] in CDRewrite when using a user-defined alphabet? Suppose "minus", "point", "zero", and "one" are our symbols, and we want to convert "minus point one" to "minus zero point one".
<verbatim>
s = SymbolTable['test.sym'];
remove_minus = CDRewrite["minus".s : "".s, "[BOS]".s, "".s, bytes.kBytes*];
implied_zero = CDRewrite[( "point".s : "zero point".s ), ( "[BOS]".s | "[BOS]minus".s ), "".s, bytes.kBytes*];
</verbatim>
If you change "[BOS]".s to "[BOS]" then you get an error message about mismatched symbol tables for tau and lambda. But leaving it as "[BOS]".s means that I now have to include [BOS] and [BOS]minus as symbols in the alphabet (otherwise you get the error "Failed to compile chunk", though I don't think "[BOS]" should have to be in the symbol table). And after doing that, CDRewrite stops working correctly.
Am I misunderstanding something?
Is there a reason why underscores seem to not be allowed in certain formats? I have a user-defined alphabet that includes the symbols "p_h", "k_h", "t_h", "h_v", and "l_g", which do not seem to work. However, other symbols with underscores like "j_0", "b_c", and "n_(" all do. E.g.
<verbatim>
Input string: p_h
Rewrite failed.
Input string: t_h
Rewrite failed.
Input string: k_h
Rewrite failed.
Input string: h_v
Rewrite failed.
Input string: w_0
Output string: w_0
Input string: w_0*
Output string: w_0*
Input string: t_(
Output string: t_(
Input string: n_(
Output string: n_(
</verbatim>
This is within a dummy grammar that has our full alphabet set, but only one rule, so any single character should be passing through unaltered.
<verbatim>
regroup_aspiration_voiceless_stops0 = ( " p " : " p_h *1 " );
aspiration_voiceless_stops0 = CDRewrite[regroup_aspiration_voiceless_stops0 , ( "." | ";" ) , "" , phones_star , 'ltr' , 'obl' ];
aspiration_voiceless_stops_stage = Optimize[aspiration_voiceless_stops0];
export PHONFST = Optimize[aspiration_voiceless_stops_stage];
</verbatim>
Thanks for any tips!
Suppose I have a simple words_to_numbers.grm that, given a spelled-out number string, will return multiple possible interpretations for it:
<verbatim>
Input String: six twenty two
Output String: 622 <cost: 0.2>
Output String: 6 22 <cost: 0.4>
Output String: 620 2 <cost: 0.4>
</verbatim>
What I would like is to be able to map the output tokens to the input tokens. An example would be something like this:
<verbatim>
Output String: 622<"six twenty two"> <cost: 0.2>
Output String: 6<"six"> 22<"twenty two"> <cost: 0.4>
Output String: 620<"six twenty"> 2<"two"> <cost: 0.4>
</verbatim>
(or just provide the character positions of each new token, or anything else that could possibly help you do the mapping at a later stage)
You can't do this post-rewrite; it's impossible to know whether "(six) (twenty two)" transduced to "6 22", or "(six twenty) two".
I don't believe this is possible to do with `thraxrewrite-tester`, or just trying to add the markup in grammar rules. I've also looked at both thrax and open-fst code and tried to see what it takes to carry over the input states forward through rewrites but haven't had any success yet.
The grammars I'm working on are much more complicated than this example (400k nodes and millions of arcs for a very sophisticated NLU module) and being able to provide some sort of mapping between input and output is essential to be able to integrate thrax into the rest of the application.
Thank you very much for this incredibly useful tool, and any help or hints are greatly appreciated!
If you literally want the words in the output alongside the numbers that's a tad difficult since it involves copying at some level. You could use an MPDT for that, but there would be a big efficiency hit.
The best I can suggest is to write your own function that walks the paths in the resulting transducer. If you are careful in how you wrote your rules, then the transducer should contain the alignment between the input and the output words so that you could pick off the inputs and outputs and be confident that they align.
If you don't want to do it in C++ you might check out Pynini, which would allow you to do it in Python.
Just one more question: if I'm understanding correctly, the function that would walk the transducer path would require changing the Thrax code as opposed to OpenFst, is that correct? I.e., it would be something similar to rewrite-tester-utils.cc in nature, which, in addition to replacing the words, keeps track of their alignment.
Also, would you expect this to be simpler to do with Pynini as opposed to C++ and thrax? (as in, would Pynini's implementation make it more suitable for this purpose).
Thank you!
I wouldn't change the Thrax code per se. Just use the rule, Compose it with your input (converted to a trivial single-path acceptor) and then walk the resulting FST.
Yes, Pynini makes this a lot easier for you, unless you love C++.
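A minimal sketch of that recipe in Pynini, with a toy stand-in for a real rule FST: compose the input acceptor with the rule, take a single path, and read the aligned input/output labels off its arcs (the labels are byte values; 0 is epsilon on that side).
<verbatim>
import pynini

rule = pynini.cross("six twenty two", "622")  # toy stand-in for a grammar

lattice = pynini.compose(pynini.accep("six twenty two"), rule)
best = pynini.shortestpath(lattice)

state = best.start()
while state != -1:  # -1 is kNoStateId
    next_state = -1
    for arc in best.arcs(state):
        print(arc.ilabel, arc.olabel)  # aligned input/output byte labels
        next_state = arc.nextstate
    state = next_state
</verbatim>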
I came across this interesting paper that uses a different approach for preserving alignments during transformation: http://www.aclweb.org/anthology/N10-1023
In short (sections 3.2 and 3.3), it works by modifying the FST semiring to encode the start and end character positions of states, and preserving them during transformation. If my understanding is correct, introducing such a change would require modifying OpenFst, changing the arcs to capture those positions, and modifying the walker/matcher to preserve those positions during transformation (though I have no idea where that logic is yet). Is that correct? Or would it require other changes?
Thank you again!
IIRC Masha implemented that stuff internally, so yes it would presumably require some additional code.
Not clear to me why it would be a better solution to your problem than the one I suggested, however.
Thank you. The first approach (converting the input into a single-path acceptor and composing it with the grammar FST) works beautifully! It would require some changes to the grammar as you suggested, but that is easily done. I had some problems with C++, but it was very easy to do with the Python extensions of OpenFst.
I figured it out! I tried the following:
<verbatim>
$ thraxcompiler --input_grammar=test.grm --output_far=test.far
Evaluating rule: rule1
Evaluating rule: rule2
$ python
Python 2.7.13 (default, Mar 13 2017, 20:56:15)
[GCC 5.4.0] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pynini
>>> my_far = pynini.Far("test.far")
>>> print my_far.find("boo")
False
>>> print my_far.find("rule1")
True
>>> rule1=my_far.get_fst()
>>> print rule1
<snipped output of entire fst>
</verbatim>
Sorry for the post! But hopefully it will help somebody.
Is Optimize[p] equivalent to
<verbatim>
Minimize[Determinize[RmEpsilon[Minimize[Determinize[p]]]]]
</verbatim>
That is, does it first determinize and minimize, treating epsilon as a normal symbol, and then remove the epsilons, and determinize and minimize again?
OK. I've looked at the optimize.h code, and I see that the first thing it does is epsilon-removal (and then it performs summing of arc weights, determinization and minimization, encoding and decoding as necessary). I approached Cyril Allauzen ages ago, asking about optimization, and he pointed out that determinization and minimization treat epsilons as normal characters. If I recall correctly, he advocated doing determinization and minimization WITHOUT FIRST DOING EPSILON REMOVAL, then doing an epsilon removal, and then REdoing determinization and minimization. What's your take on when epsilon removal should be done?
We haven't experimented with what Cyril suggests. I suppose it might improve things in some cases. Are there pathological cases where you think this might help?
Not that I know of. Cyril's plan will certainly slow things down if optimization is performed by default (as in my Kleene language). Kleene's current $^optimize(...) function uses Cyril's plan. Perhaps I should rename it something like $^superOptimize() or $^cyrilOptimize() and reimplement the $^optimize() function to be more like Thrax's Optimize[].
Another issue in Optimize[]. I see that it performs StateMap(fst, ArcSumMapper<Arc>(*fst)) ; to sum arc weights. If I'm not mistaken, Determinize() by itself does such summing. Does summing the arc weights before determinization somehow speed things up?
Again, I don't know. But frankly I think we are down in the weeds here. First of all demonstrate that this makes a noticeable difference with a live example. Then we can discuss how to tweak it. We have had many ideas on this or that improvement that might help. Sometimes, as with the implicit grouping of cascaded rules within an Optimize[], it makes a huge difference: without that, for a long chain of compositions if one wrote
Optimize[rule1 @ rule2 @ .... @ rulen]
the result could be disastrously slow. So what the compiler does is group those in a binary right-branching tree. That made it massively more efficient at compile time. Could we do better? Probably, if we know something about the individual rule FSTs and then cleverly combine them in an order that optimizes the process: if for example I know that the intersection of range of rule_k and the domain of rule_k+1 filters things down to a much smaller set, then it would be good to combine those first. But in practice the binary branching tree seems to get you good enough results nearly all of the time. Most of the time when things break down it is because people are trying to do things that are inherently very bad anyway.
Thanks for the response. At Xerox too we found that composing a cascade of rules could be not only inefficient but could also easily explode in size. We found that if the rules were to be composed with an FST encoding a lexicon, it often helped to group the compositions in a left-branching tree ( ( ( ( lexicon @ rule1 ) @ rule2 ) @ rule3 ) @ rule4 ) etc. The lexicon effectively acted as a filter that often avoided the explosion.
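Both grouping strategies are easy to sketch in Pynini, for the curious (the lists of rule FSTs here are hypothetical):
<verbatim>
import functools
import pynini

def compose_right(rules):
    # Right-branching: r1 @ (r2 @ (... @ rn)), as the Thrax compiler groups.
    return functools.reduce(
        lambda acc, r: pynini.compose(r, acc).optimize(), reversed(rules))

def compose_left(lexicon, rules):
    # Left-branching: (((lexicon @ r1) @ r2) @ ...), with the lexicon
    # acting as a filter at each step.
    return functools.reduce(
        lambda acc, r: pynini.compose(acc, r).optimize(), rules, lexicon)
</verbatim>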
In src/include/thrax/algo/optimize.h, the Optimize function has two arguments:
void Optimize(MutableFst<Arc> *fst, bool compute_props = false)
In a number of places in the optimization code, properties of the fst are queried with the fst->Properties(mask, compute_props) syntax. I understand that if compute_props is false (the default value), then the Properties (or the specific property?) are not recalculated. The question is this: Under what conditions should compute_props be set to true so that the Properties are recalculated?
Generally speaking, this will just allow you to avoid calls to Determinize and/or RmEpsilon when the FSTs are already deterministic and/or epsilon-free but they weren't made so by virtue of previous calls to Determinize and/or RmEpsilon (which set the properties bit). For instance: it doesn't necessarily know that string FSTs are deterministic and epsilon-free.
I think of it not as something to make it go faster so much as something to give it tighter optimization bounds. For instance if it doesn't know that the transducer is weighted-cycles free, but compute_props is true, it will test for that property before deciding how to encode the FST during determinization. So if you have to have the smallest possible FST it's a good choice.
Regarding your question about arc-sum mapping, it's only true that determinize does that during Optimize if it does the determinization using an unencoded FST. If the FST labels and/or weights are encoded (as they often are---depends on properties bits) you don't necessarily get arc-sum mapping as the result of determinization.
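To make the stored-vs-recomputed distinction concrete, the same knob is visible through the properties() call in Pynini/pywrapfst (a sketch; assumes Pynini re-exports the pywrapfst property mask constants, as recent versions do):
<verbatim>
import pynini

f = pynini.accep("abc")
# test=False: only consult the stored property bits; anything never
# computed shows up as unknown (0 under the mask).
stored = f.properties(pynini.I_DETERMINISTIC, False)
# test=True: actually examine the FST, analogous to compute_props=true.
computed = f.properties(pynini.I_DETERMINISTIC, True)
print(bool(stored), bool(computed))
</verbatim>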
I've been trying to write CDRewrite rules such as
CDRewrite[ "c":"d" | "a":"o" | "t":"g", "" , "" , sigma_star]
to map any and all 'c's to 'd's, 'a's to 'o's, and 't's to 'g's, including mapping "cat" to "dog", but it appears to be syntactically impossible in Thrax. Similarly,
CDRewrite[ "a":"b" | "b":"a", "", "", sigma_star]
should semantically, I think, map "abba" to "baab".
Is the restriction just syntactic, or also semantic? I can see parallel rules working in another system that allows alternation rules expressed with an FST.
AFAICT it works fine, assuming you put parens around the replace operations:
CDRewrite[ ("c":"d") | ("a":"o") | ("t":"g"), "" , "" , sigma_star];
<verbatim>
rws-macbookair3:tmp rws$ thraxrewrite-tester --far=foo.far --rules=RULE --noutput=10
Input string: cat
Output string: dog
Input string: tacocat
Output string: gododog
</verbatim>
Note --noutput=10, which would show any other output options, if there were any.
By default "abc" is interpreted as a byte string, which can be overridden by specifying "abc".utf8. Is it possible to specify somehow that strings are, by default, to be interpreted as utf8? E.g. some kind of declaration like
default_string_parse_mode utf8 ;
Is there documentation somewhere that specifies the precedence of the Thrax operators? In question are
<verbatim>
the unary * + ? and {n, m} postfixed operators
- (for subtraction)
| (denoting union)
: (cross product)
@ (composition)
concatenation (no operator, shown by simple juxtaposition)
</verbatim>
A special case might be weights, e.g., <1> and <2>. Do they attach with the same precedence as normal concatenation?
I've noodled away for a few hours, testing precedence, and here's the list as best I can judge right now (from high to low precedence):
the unary postfix operators: * + ? {n,m}
concatenation (shown by juxtaposition)
- (minus)
@ (composition)
| (union)
: (cross-product)
The <...> weight syntax seems to have a special status. It can appear only at the "end" of a regular expression, i.e. at the very end, or at the end of a regular expression enclosed in parentheses.
Corrections would be welcome.
Hi,
I am working on a set of grammars, and when trying to add some consistency checks I get an error: "Undefined function identifier: AssertNull". The other asserts in the grammar are working just fine; any hint on what could be the issue?
Thank you very much!
I would need to see your grammar to know whether it's a bug in the grammar or a bug in Thrax itself. Can you send them to me? You can use my Google address, rws@google.com
Thanks for finding this, and mea maxima culpa for pushing this out with that bug. As a temporary fix please replace src/lib/walker/loader.cc with the attached loader.cc at the end of this page (i.e. http://openfst.cs.nyu.edu/twiki/pub/Forum/GrmThraxForum/loader.cc) and reinstall.
I will push out a fixed version of the distribution as soon as I can.
I want to generate an automaton to recognize inputs which consist of a sequence of uint64_t integers. I know how to use Thrax to recognize byte (0~255) strings, but I do not know how to deal with this problem.
Hope you can help me! Thanks very much!
I am trying to convert Arabic numbers from text to digits. The problem is with numbers which combine decades and units ((21,...,29), (31,...,39), ..., (91,...,99)), as we pronounce them in the reverse order to how we write them as digits.
For example: twenty-one in Arabic is "one twenty" but is still written 21. So the output of the grammar would be 12 instead of 21.
How can I make the 12 become 21? Help!
You can't easily do that with FSTs in any general way unfortunately: unbounded string reversal is not a regular operation. The best you can do is handle cases up to a fixed length, which is equivalent to enumerating the cases you want to reverse.
However, the PDT extension would allow you to do this more generally. See
http://openfst.org/twiki/bin/view/GRM/ThraxQuickTour
under the Pushdown Transducers section. That gives an example of a^n b^n which is similar to your problem, which is also similar to w w-reverse. In your case you would need to define 10 bracket pairs (one for each digit) rather than just one, and rather than just accepting strings of the form w w-reverse, you need to make sure symbols in w are deleted and the appropriate comparable symbols in w-reverse are inserted.
Thanks a lot for your help and support.
I have solved it as you suggested, without using a PDT. I wrote 9 rules for the digits (1-9), as follows:
<verbatim>
(one : "") decades ("" : "1") | ... | (nine : "") decades ("" : "9")
</verbatim>
Regards.
I just discovered a bug in the underlying FAR reader code that causes a problem if one of your grammars only has functions and no exports.
It is perfectly legal in Thrax to have a grammar that only has functions, but if you try to import that grammar into another grammar the compiler will dump core due to an error apparently with the STTableFarReader.
This will get fixed hopefully soon, but in the meantime the workaround is to include a trivial export such as
export FOO = "a";
in your function file.
Is there a simple tool available for transforming standard input into standard output with a FAR (within OpenFst or Thrax or somewhere else)? I mean something like
<verbatim>
process-with-far transducer.far < in.txt > out.txt
</verbatim>
Of course, you've got thraxrewrite-tester, but it pollutes the output with "Input/Output string:" and it was not written with efficiency in mind. It wouldn't be difficult to hack it, but I am wondering whether anything else is available.
Not exactly, since you presumably want to select which FSTs to use from the FAR.
I will be releasing a new version of Thrax at some point (when I can get around to it) that will have various changes, and I could add that as a feature to the rewrite-tester. But that may not be soon enough for you.
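In the meantime, a few lines of Pynini can stand in for such a tool; a sketch (the FAR and rule names here are hypothetical):
<verbatim>
import sys

import pynini
from pynini.lib import rewrite

far = pynini.Far("transducer.far", "r")
far.find("RULE")  # hypothetical rule name
rule = far.get_fst()

# Rewrite each line of standard input to standard output.
for line in sys.stdin:
    print(rewrite.one_top_rewrite(line.rstrip("\n"), rule))
</verbatim>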
Hi, I am using Thrax (1.0.2) and OpenFst (1.3.4) with gcc 4.4.7 (I need to make it work on this version). I have built OpenFst with --enable-far=yes --enable-pdt=yes, but I still get the following errors:
<verbatim>
//usr/local/lib/libthrax.so: undefined reference to `fst::IsSTList(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
//usr/local/lib/libthrax.so: undefined reference to `fst::IsSTTable(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
</verbatim>
I use " g++ -g -O2 -std=c++0x -o nersuite nersuite-main.o nersuite-nersuite.o nersuite-FExtor.o nersuite-crfsuite2.o ../nersuite_common/libnersuite_common.a -lcrfsuite -llbfgs -lm -ldl -lfst -lthrax -Wl -lboost_unit_test_framework"
to compile my code. I also tried using the command in ldconfig.
Any help would be appreciated. Thank you
The latest version of Thrax is 1.1.0. Have you tried that version? It requires OpenFst 1.4.0, but that should work with your compiler. I would strongly recommend using that route. You also get more features in Thrax that way.
Richard, 1.4.0 does not build with gcc 4.4.7 [default Red Hat servers running RHEL 6.x].
<verbatim>
./../include/fst/union.h:140: instantiated from ‘fst::UnionFst<A>::UnionFst(const fst::Fst<A>&, const fst::Fst<A>&) [with A = fst::ArcTpl<fst::LogWeightTpl<float> >]’
stl_pair.h:90: error: invalid conversion from ‘int’ to ‘const fst::Fst<fst::ArcTpl<fst::LogWeightTpl<float> > >*’
</verbatim>
I see. I misread your version number.
Unfortunately in general it's a little hard to support older versions, with compilers changing and so forth. For me to reproduce your error would require me to replicate your set of conditions, which would include the out-of-date compiler you are using.
So I have two suggestions for you.
1) Upgrade your compiler to 4.7. Then you'll get the benefit of the latest version of OpenFst and the latest version of Thrax.
Or if you cannot do that, then:
2) Read further down on this page where you will find that someone reported what looks like the exact same error about a year and a half ago. See my reply dated 12 Jan 2014 - 13:40. See if my suggestion works.
In fact one of the reasons for forums like this is to archive these sorts of problems, so it's good to check if someone else has reported the same or a similar issue before posting.
On the issue of configuration options, the example above shows --enable-far=yes --enable-pdt=yes
Is that correct?
An example on http://www.cslu.ogi.edu/~sproatr/Courses/textNorm/tutorial.html shows something different: --enable-far=true
Should --enable-far all by itself work?
For OpenFst, ./configure --help lists the optional features --enable-far and --enable-pdt without any suggestion that they need or take =yes or =true or =anything.
Thanks Ken for pointing these out. They were relics from an earlier version. I have corrected the error in the quick tour. I will correct errors in the config and other places when I do a release of a new version sometime soon.
Suppose I have a transducer that turns numbers into their spoken representation, e.g. 23 -> twenty-three. Now I want to handle US currency, so $23 becomes "twenty-three dollars". Obviously for "$1" it is "one dollar". To implement this in Thrax I might just add the whole string as an alternative path with a lower weight, but as I have many different units ("2m" -> "two meters", but "1m" -> "one meter") I wonder what would be the idiomatic way to implement pluralization. I feel like I should use Features and Paradigms, but I lack good examples of their application. Thank you.
You could use the features functionality, though for English this might be a bit of overkill. For simple cases like English I would just have two StringFiles, one for the singulars and one for the plurals, then define singular_nouns to use the first and plural_nouns the second, then just do the obvious combination with "1" versus all the other numbers.
If you wanted to use the features/paradigms functionality, there's an example for a more complex case in the distribution. See: src/grammars/paradigms_and_features.grm
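A sketch of that two-list approach in Pynini (inline maps standing in for the two StringFiles):
<verbatim>
import pynini
from pynini.lib import byte, pynutil

# Stand-ins for the singular and plural StringFiles.
singular = pynini.string_map([("m", "meter"), ("s", "second")])
plural = pynini.string_map([("m", "meters"), ("s", "seconds")])

digits = pynini.closure(byte.DIGIT, 1)
# "1" takes the singular; every other number takes the plural.
number_unit = ((pynini.accep("1") + pynutil.insert(" ") + singular) |
               ((digits - "1") + pynutil.insert(" ") + plural))
</verbatim>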
Is it possible to do lookahead in the Thrax grm files? For example, require at least one digit, one lowercase, and one uppercase as in regex below:
( (?=.*\d) (?=.*[a-z]) (?=.*[A-Z]) .{6,20} )
Thanks
I'm not sure what you are trying to do, but you may just want to use a CDRewrite rule, which allows you to change one regexp to another in the context of two other regexps that are not considered part of the first two regexps.
Regex lookahead is not something that is implemented per se. But CDRewrite implements all of the functionality that one uses regexp lookahead in PCRE's for, as far as I can tell. If you want to detect a regular expression in the context of another regular expression and know that you have detected it, an easy way is to write a CDRewrite rule that inserts some marker after (or before) the first regular expression if it occurs in the context of the second regexp. This gives you all the functionality that the PCRE lookahead would give you.
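For instance, a sketch of that marker trick in Pynini: "detect" a digit only when a lowercase letter follows, by inserting a marker between them. The context is checked but not consumed, which is what lookahead gives you.
<verbatim>
import pynini
from pynini.lib import byte, pynutil

sigma_star = pynini.closure(byte.BYTE)

# Insert "^" after any digit that is followed by a lowercase letter.
mark = pynini.cdrewrite(pynutil.insert("^"), byte.DIGIT, byte.LOWER,
                        sigma_star)
</verbatim>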
Hi all,
I am new to Thrax and OpenFst and I would appreciate it a lot if you could help me with the following issue. I need to use my own symbol table with a PDT, or to be able to extract the symbol table in a non-binary format. So far I have not been able to do so, as the FST extracted from my FAR has an empty symbol table.
Let me show you how I worked:
1. I created my grammar that will cover digits one to nine and I got the symbol table I use let's say with another fst.
numbers_en_US.grm
<verbatim>
# Numbers simple grammar for en-US.
# Covers numbers 0 to 9.
my_symbol_table = SymbolTable['numbers.txt'];

export PARENS = ("[<s>]" : "[</s>]");

space = " ";

units = Optimize[
    ("zero".my_symbol_table)
  | ("one".my_symbol_table)
  | ("two".my_symbol_table)
  | ("three".my_symbol_table)
  | ("four".my_symbol_table)
  | ("five".my_symbol_table)
  | ("six".my_symbol_table)
  | ("seven".my_symbol_table)
  | ("eight".my_symbol_table)
  | ("nine".my_symbol_table)
];

export NUMBERS = ("[<s>]" (units space)* units "[</s>]")* ;
</verbatim>
numbers.txt
<verbatim>
eight 0
extra1 1
extra2 2
<eps> 3
five 4
four 5
nine 6
one 7
</s> 8
<s> 9
seven 10
six 11
three 12
two 13
zero 14
</verbatim>
2. Then I compiled my grammar, extracted the fst from the far and checked the fst info:
<verbatim>
$ fstinfo NUMBERS
fst type vector
arc type standard
input symbol table none
output symbol table none
# of states 12
# of arcs 32
initial state 11
...
</verbatim>
3. So as the symbol table is empty, when I test, it is impossible to get rewrites:
<verbatim>
$ thraxrewrite-tester --far=numbers_en_US.far --rules=NUMBERS\$PARENS --output_mode=numbers.txt
Input string: one
Rewrite failed.
$ thraxrewrite-tester --far=numbers_en_US.far --rules=NUMBERS\$PARENS
Input string: one
Rewrite failed.
</verbatim>
So, any ideas on how to use my symbol table? Or even how to get the internal symbol table in a non-binary format?
Thanks,
Sofia
The symbols generated for the PARENS will be in the FST named *StringFstSymbolTable, which you will see if you do a farextract on the far.
But it looks as if you are assuming two symbol tables here, one being your own, the other being the one that will be generated for those extended labels. I think what you want to do is something like this:
export PARENS = ("<s>".my_symbol_table : "</s>".my_symbol_table);
Then you need to run the compiler with the --save_symbols flag. Finally you will need to use the --input_mode and probably the --output_mode flags to thraxrewrite-tester with the argument being your symbol table.
If that still doesn't work, can you send me (rws@google.com) the complete set of files needed to build your target, and I will have a look.
--R
Hi Richard, I followed your advice, but the .far I get with my symbol table is completely different from the one without it. That is expected, but "initial state 0" worries me, for example. I will send you my set of files so you can get an idea.
Hi,
I downloaded OpenFst 1.4.1 and opengrm-ngram 1.2.1, but the latter won't compile on openSUSE 13.1.
./configure says "configure: error: fst/extensions/far/far.h header not found";
however, I find this file at /home/roger/sphinx/openfst-1.4.1/src/include/fst/extensions/far/far.h
Compilation & installation of OpenFst was successful (as far as I can tell).
Do I need to add this path/header file somewhere?
Thanks
Roger
Hi all, nice to meet you!
Let me introduce myself, as I am new here. My name is Alexis and I am a computational linguist and software developer. I was very excited by the discovery of the Thrax framework, and after a short investigation I decided this was my thing. I immediately started digging into it, but unfortunately I was not able to find "real-world" examples of usage, which would have simplified my task.
However, I just kept going on. I have been working for Yandex and developing a rule-based system for generating Russian phonetic transcriptions (in the context of speech synthesis). My company has been very generous and allowed me to open source the rules I wrote.
Probably I do not even use half of the power of Thrax, but I managed to write a working rule-based system just sticking to the basics. I thought this could be useful for someone else (as it would have been for myself at the beginning). That is why I thought I should post here about it. Please take into account that this was my first try with Thrax and that I probably could have written the rules in a much better way if I had more knowledge.
In case someone is interested, you will find them here: https://github.com/wilpert/RusPhonetizer/tree/master/grammars
Thrax was a wonderfully powerful and easy-to-use framework for my work, something I had not experienced before. I am utterly thankful to the authors for their amazing achievement, and to Yandex for allowing me to share my work.
Thanks to you all and be happy
Alexis
Hi Alexis:
Glad it has proved useful to you. Yeah there are various toy examples around, but not much "real world" examples that I know of that are public, at least not yet.
I'll be happy to take a look sometime at your grammars and send along suggestions if I have any.
Richard Sproat
Hi Richard,
yes, it would be great if you would find any time to have a look at my grammars, any feedback would be terribly appreciated!
Thanks again for the software,
Alexis
I am trying to compile Thrax in an Ubuntu VM using VirtualBox. I have gcc 4.8.2 installed and compiled OpenFst with far and pdt enabled and in shared mode. I have 1 GB of RAM dedicated to the VM. If I try ./configure --enable-shared, it fails because I run out of memory. If I try just ./configure and then make, everything seems to compile OK until I get an internal compilation error:
<verbatim>
/bin/bash ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -std=c++0x -MT loader.lo -MD -MP -MF .deps/loader.Tpo -c -o loader.lo `test -f 'walker/loader.cc' || echo './'`walker/loader.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -std=c++0x -MT loader.lo -MD -MP -MF .deps/loader.Tpo -c walker/loader.cc -fPIC -DPIC -o .libs/loader.o
g++: internal compiler error: Killed (program cc1plus)
</verbatim>
Try commenting out the lines that refer to Log64Arc in src/include/thrax/function.h, viz
<verbatim>
function.h:70: extern Registry<Function<fst::Log64Arc>* > kLog64ArcRegistry;
function.h:87: typedef name<fst::LogArc> Log64Arc ## name;
function.h:88: REGISTER_LOGARC_FUNCTION(Log64Arc ## name)
</verbatim>
(Obviously be careful in that #define REGISTER_GRM_FUNCTION to leave the continuation "\"s all happy.)
The downside is you won't get log64 arcs. The upside is it should be smaller. The fact that it's running out of memory in compiling the loader makes me suspect that may be the problem because for each of the different arc types, all of the templated classes have to be expanded. This should reduce the size, therefore. If that still doesn't work, remove log arcs too. You won't likely be using them. Indeed, for precisely these sorts of issues I have been thinking of disabling those in future versions.
I did that and also had to comment out similar lines in src/lib/walker/evaluator-specialization.cc (lines 35 and 49-53).
I also tried taking out LogArc and all its mentions in function.h and evaluator-specialization.cc, but I still get an internal compilation error.
Ok thanks.
So the question is why you aren't getting that by inheritance. This is the first time I've seen this problem and I have no idea where it has suddenly broken.
Hi Richard (etc.), using Thrax 1.1.0 (and with OpenFst 1.3.4 already installed), compilation fails while making the file `ast/identifier-node.cc` due to an issue in the `include/thrax/compat/utils.h` header. Here's the error:
/bin/sh ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT identifier-node.lo -MD -MP -MF .deps/identifier-node.Tpo -c -o identifier-node.lo `test -f 'ast/identifier-node.cc' || echo './'`ast/identifier-node.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT identifier-node.lo -MD -MP -MF .deps/identifier-node.Tpo -c ast/identifier-node.cc -fno-common -DPIC -o .libs/identifier-node.o
In file included from ast/identifier-node.cc:22:
./../include/thrax/compat/utils.h:119:8: error: field has incomplete type
'char []'
char buf[];
^
I presume this is because buf[] doesn't have a length defined (nor is it initialized with a string), and when I change the line to
char buf[1024];
compilation goes through. (I'm not sure this is a sensible default; I spent no time trying to understand what this code is doing.)
I'd include a patch but it's one line.
Kyle
Just remove that line: that variable is not used. Apparently it's a holdover from some earlier implementation, and I just forgot to update it. I'll fix this in the next release.
Hi,
I am currently using Thrax to extend some features of an alignment tool I wrote for my g2p system.
The basic idea is that the user can specify some alignment correspondence rules and optional default penalties, and then these can be incorporated into the EM training process.
At present I have kind of hacked the functionality of the thraxcompiler command-line tool to read in the grammar and then return the desired FST + symbol table to the alignment program.
EDIT: Maybe it makes more sense to just provide a couple of snippets:
GetFstFromGrammar
sy = SymbolTable['simple.syms'];
zero = "0".sy : "zero".sy;
units = ( "these're".sy : ( "these're".sy | "[these]" | "[these]" "are".sy ) );
split = ( "[these]" "are".sy : "these're".sy );
sigma = "<sigma>".sy : "<sigma>".sy;
abc = ( "a".sy "b c".sy : "a b b".sy );
export RULES = Optimize[ sigma* ( units | zero | abc ) sigma* ];
Here the 'sigma' is used in combination with a specialized 1-state alignment transducer that relies on RHO and SIGMA matchers.
Is there an alternative or recommended way to do this? It would be great if I could either specify the symbol table just once at the beginning, or automatically infer/generate the whole symbol table and return it, or, even better, modify the grammar from my C++ application to simplify what the user is responsible for doing.
I went through the FAQ but did not notice any answers to these questions.
Thanks for your time.
UPDATE:
I solved this by creating some bindings with pybindgen and then writing a generator that interprets a simplified version of the Thrax grammar, then expands it to the verbose version with the extra quotes and symfile suffixes, etc.
I'd like to run this FST model for an inverse text normalization (ITN) task. It runs in the shell with:
$ thraxrewrite-tester --far=main.far --rules=ITN < text.txt
I need to use this in C++, so I converted the grm file to an FST file as below:
fstcompile --isymbols=$byte_sym --osymbols=$byte_sym ${fst}.fst.txt | fstarcsort --sort_type=olabel - > ./${ODir}/${fst}.fst
So now I have an FST file to load.
But how can I call this FST model in C++, so that I can feed it a sequence of strings as ITN input and get ITN output back?
And please share how the symbol table is handled as well, just for reference.
Thanks!
The best way to do that would be to link with the library and use GrmManager to load the far, and then you can specify whatever rules you want to apply. If you follow the example in the rewrite-tester that should give you an idea of how to do it.
Thanks Richard for the reply!
By the rewrite-tester example, do you mean the thrax-1.2.2/src/grammars files, right?
I did go through them all and built the rewriter FAR and FST files.
What I want is to load these files in my other C/C++ program.
Thanks in advance!
No, that is not what I meant.
Look in src/bin at the code for rewrite tester. Then look and see what it does. Then figure out how to write similar code that uses the GrmManager in the same way to do what you want.
Hopefully that is clearer.
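To make that concrete, here is a minimal sketch (not code from the Thrax distribution) of what such a program might look like. It assumes the FAR is called main.far and the exported rule is ITN, as in the question above; the method names (LoadArchive, RewriteBytes) follow src/include/thrax/grm-manager.h in recent Thrax versions, so verify them against your installed headers:
<verbatim>
// Minimal sketch: load a compiled grammar archive and apply one rule
// to each line of standard input.  Assumes Thrax's GrmManager API
// (LoadArchive/RewriteBytes); check against your grm-manager.h.
#include <iostream>
#include <string>

#include <thrax/grm-manager.h>

int main() {
  thrax::GrmManager grm;               // GrmManagerSpec<fst::StdArc>
  if (!grm.LoadArchive("main.far")) {  // FAR built by thraxcompiler
    std::cerr << "Failed to load main.far" << std::endl;
    return 1;
  }
  std::string line, output;
  while (std::getline(std::cin, line)) {
    // "ITN" is the exported rule name from the grammar.
    if (grm.RewriteBytes("ITN", line, &output)) {
      std::cout << output << std::endl;
    } else {
      std::cerr << "No rewrite for: " << line << std::endl;
    }
  }
  return 0;
}
</verbatim>
You would then link against the Thrax and OpenFst libraries (e.g. -lthrax -lfst plus the far extension), in the same spirit as the linking line discussed further down this page.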
Yes (OpenFst 1.3.4, compiled with --enable-far and some other enable options); Thrax compiled successfully, but compilation fails while making the file `batch_test.c` (extracted from export.tgz). Can you give me some advice?
I'd like to but first I need to understand what is going on. I can't reproduce your error (apparently) and I don't know what batch_test.c is since it's not part of the Thrax distribution. Is this your own code? If so then I need to see EXACTLY what you are doing, including probably your sending me a directory with all of the additional code.
If this is part of the Thrax distribution then please tell me where it is because I can't find it (nor do I remember such a file).
thank you for your reply.
on this page:
http://openfst.cs.nyu.edu/twiki/bin/view/Contrib/ThraxContrib,
you can see
Projects using the OpenGrm Thrax tools:
export.tgz: Grammars and software developed as part of a text normalization class taught at the Center for Spoken Language Understanding, Fall 2011. URL for the course: http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/
I downloaded "export.tgz".
There is a file called batch_tester.cc in the batch_tester directory (extracted from export.tgz).
Ok that helps. Yes, I did write that, but it wasn't obvious from your query that this is what you were referring to. Please in future give all necessary information when reporting a bug.
In the meantime I will have a look. I do not know off the top of my head what the problem is.
Ok it's the usual nonsense about ordering of shared object libraries. If you do things in this order it should work:
g++ -g -O2 -o batch_tester batch_tester.o -L/usr/local/lib/fst -lm -ldl -lfst -lthrax -Wl,--rpath -Wl,/usr/local/lib/fst -Wl,--rpath -Wl,/usr/local/lib/fst -Wl,--rpath -Wl,/usr/local/lib /usr/local/lib/fst/libfstfar.so
Evidently there is a bug in the configuration of the distribution that was not causing problems before, but is now. I will look into that, but in the meantime, please try linking manually as above.
So far I find Thrax a very neat piece of software, but I have two questions.
Can I somehow use the probability (real) semiring as weights? It seems Thrax only allows specifying the log and tropical semirings. What about the other ones? Or should I somehow postprocess the generated FAR file?
Another question: I tried to use "fstdraw" on a far file, but got: ERROR: FstHeader::Read: Bad FST header: example.far
Is this a version mismatch?
Sorry, I missed the earlier comment -- for some reason I didn't get email about it.
Unfortunately the restriction to Log and Tropical is due to a similar restriction in the fst library: the real semiring does not come predefined. The best suggestion would be to use Tropical and then just do the obvious e^-cost conversion.
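Concretely, the conversion is just p = e^(-cost). A minimal sketch in C++, assuming you read the weights back out of the compiled FST with the OpenFst library:
<verbatim>
// Minimal sketch: map a tropical-semiring cost back to a probability.
#include <cmath>

#include <fst/fstlib.h>

double ToProbability(const fst::TropicalWeight &w) {
  return std::exp(-w.Value());  // cost 0 -> p = 1; cost Infinity -> p = 0
}
</verbatim>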