OpenGrm Thrax Forum


Backreference to captured group in Thrax

MinchiusMaximus - 2024-10-02 - 09:13

Hello, just a quick question. Forgive me if it sounds naive. Is it possible to use backreferences to captured groups in Thrax? Imagine you have a string such as '12-10-1492' (the 12th of October 1492) and you want to write a Thrax grammar that swaps the positions of the day and the month ('10-12-1492'). With a normal regex you would just match the whole string, capture the day and the month separately, and rewrite the whole thing by swapping the first two groups: (\d{2})-(\d{2})-(\d{4}) ---> $2-$1-$3

Is it possible to do something similar in Thrax?
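
For reference, the regex behaviour described in the question can be reproduced with Python's standard `re` module. This is only a sketch of the backreference-style substitution; the function name `swap_day_month` is made up for illustration, and none of this is Thrax code.

```python
import re

# Two-digit day, two-digit month, four-digit year, each captured as a group.
DATE = re.compile(r"(\d{2})-(\d{2})-(\d{4})")


def swap_day_month(text: str) -> str:
    """Swap the first two captured groups of every DD-MM-YYYY date in text."""
    return DATE.sub(r"\2-\1-\3", text)


print(swap_day_month("12-10-1492"))  # prints 10-12-1492
```

The replacement template `\2-\1-\3` is the copy operation in question: the output reuses whatever the groups happened to match.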

RichardSproat - 2024-10-03 - 06:11

Not as such. By "normal regex" you mean PCRE's, but note that those have greater than regular power precisely because of these copy operations.

In Thrax there are multi-pushdown transducers (MPDTs; see under Multi-Pushdown Transducers in https://www.openfst.org/twiki/bin/view/GRM/ThraxQuickTour), which give you the same power, but they are not particularly easy to set up and are generally not very efficient.

For the particular case you mention it would be much simpler just to do the brute force thing and simply transduce between each MM-DD and its equivalent DD-MM. Ugly, but it works.
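
A minimal sketch of the brute-force approach, assuming the grammar text is generated by a script rather than typed by hand. The rule name `SWAP` and both helper names are hypothetical, and the enumeration does not check how many days each month has. It only covers the DD-MM part; the year would pass through unchanged in a larger grammar.

```python
def swap_pairs():
    """Yield (input, output) string pairs such as ("12-10", "10-12")."""
    for day in range(1, 32):
        for month in range(1, 13):
            yield f"{day:02d}-{month:02d}", f"{month:02d}-{day:02d}"


def thrax_union(pairs):
    """Format the pairs as a union of Thrax string-to-string mappings."""
    return " | ".join(f'("{src}" : "{dst}")' for src, dst in pairs)


pairs = list(swap_pairs())
grammar = f"export SWAP = {thrax_union(pairs)};"
print(len(pairs))  # prints 372
print(grammar[:60])
```

The result is a finite list of 372 pairs, which is why this stays within regular power: nothing is copied at run time, every possible swap is spelled out in advance.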

MinchiusMaximus - 2024-10-07 - 10:16

I see. Thanks. Unfortunately, for my case the ugly solution won't be effective. The date thing was just an example; in reality I have to operate on groups which are not limited to just 00-24 digits. Thanks anyway for the answer.

RichardSproat - 2024-10-08 - 06:02

Ah OK.

If you can share some more specifics about what you would like to do, I might be able to make a suggestion.

MinchiusMaximus - 2024-10-30 - 06:43

Hi, yes, that's very nice of you. So, imagine I have a list of currencies such as:

currencies = "euros" |"dollars"|"pounds"|"pesos"|"yens";

now imagine you have a string that you need to transform into a fully formatted thing: 5 euros and 37 cents ----> 5,37 euros

obviously I can write something like: CDRewrite["cents" : "euros", digit+ " euros and " digit+ " ", rb, star]

However, if I want the same effect for the other currencies, I would have to write a similar CDRewrite for each of them. What I want instead is a function that automatically creates a CDRewrite like that for each currency in my list: detect any currency name, remember which one it is, and then move it to the end of the string, where it takes the place of the cents token.
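
Stated as an ordinary regex substitution, the behaviour being asked for looks roughly like this. It is plain Python rather than Thrax, the helper name `format_amount` is illustrative, and it only handles the "N currency and M cents" shape from the example.

```python
import re

CURRENCIES = ["euros", "dollars", "pounds", "pesos", "yens"]

# (\d+) <currency> and (\d+) cents, with the currency name captured as group 2.
PATTERN = re.compile(
    r"(\d+) (" + "|".join(map(re.escape, CURRENCIES)) + r") and (\d+) cents"
)


def format_amount(text: str) -> str:
    """Rewrite "5 euros and 37 cents" as "5,37 euros"."""
    return PATTERN.sub(r"\1,\3 \2", text)


print(format_amount("5 euros and 37 cents"))  # prints 5,37 euros
```

Group 2 is what "remembers" the currency; reproducing that memory with finite-state operations is the hard part of the question.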

RichardSproat - 2024-10-31 - 18:10

I don't have Thrax currently installed (oddly enough) so the following is in Pynini. But you can do something very similar in Thrax. There is no pretty solution, but one way is to do something along the following lines:

import pynini as py
from pynini.lib import byte
from pynini.lib import pynutil
from pynini.lib import rewrite


def I(x):
  return pynutil.insert(x)


def D(x):
  return pynutil.delete(x)


MARKERS = py.union("$", "£", "€")
SIGSTAR = py.closure(byte.BYTE)
NON_CURRENCY = (SIGSTAR - (SIGSTAR + MARKERS + SIGSTAR)).optimize()

# Replace the singular and plural of the major currencies with a symbol that
# remembers which one you had.
MAJOR_INS = (
    py.cross("dollar", "$")
    | py.cross("dollars", "$$")
    | py.cross("pound", "£")
    | py.cross("pounds", "££")
    | py.cross("euro", "€")
    | py.cross("euros", "€€")
)

# Major currencies are just the input of the above
MAJOR = py.project(MAJOR_INS, "input")

MINOR = py.union("penny", "pence", "cent", "cents")

# Perform the replacement
REPLACE_MAJOR = py.cdrewrite(
    MAJOR_INS,
    "",
    " ",
    SIGSTAR,
)

# Insert all markers at the end
INS_MARKER = SIGSTAR + I(MARKERS) + I(MARKERS).ques

# Delete the currency terms. Make sure they are followed by space or
# end of string so that it's greedy.
DEL_CURRENCY = py.cdrewrite(
    D(MAJOR | MINOR),
    "",
    py.union(" ", "[EOS]", MARKERS),
    SIGSTAR,
)

# Filter to make sure the first and second marker are the same.
FILT = (
    NON_CURRENCY + "$" + NON_CURRENCY + "$"
    | NON_CURRENCY + "$$" + NON_CURRENCY + "$$"
    | NON_CURRENCY + "£" + NON_CURRENCY + "£"
    | NON_CURRENCY + "££" + NON_CURRENCY + "££"
    | NON_CURRENCY + "€" + NON_CURRENCY + "€"
    | NON_CURRENCY + "€€" + NON_CURRENCY + "€€"
)

# Replace the final marker with the appropriate currency term.
REPLACE_MARKER = py.cdrewrite(
    py.invert(MAJOR_INS),
    "",
    "[EOS]",
    SIGSTAR,
)

# Get rid of the other one and clean up to replace with ","
DEL_MARKER = py.cdrewrite(
    py.cross(" " + MARKERS + MARKERS.ques + " and ", ","),
    "",
    "",
    SIGSTAR
)

RULES = (
    REPLACE_MAJOR @ INS_MARKER @ DEL_CURRENCY @ FILT @ REPLACE_MARKER @ DEL_MARKER
).optimize()

inp1 = "25 dollars and 33 cents"
inp2 = "25 pounds and 22 pence"
inp3 = "1 dollar and 33 cents"
inp4 = "1 pound and 22 pence"
inp5 = "1 euro and 33 cents"

for inp in [inp1, inp2, inp3, inp4, inp5]:
  print(f"{inp} --> {rewrite.one_top_rewrite(inp, RULES)}")

Output:

25 dollars and 33 cents --> 25,33 dollars
25 pounds and 22 pence --> 25,22 pounds
1 dollar and 33 cents --> 1,33 dollar
1 pound and 22 pence --> 1,22 pound
1 euro and 33 cents --> 1,33 euro

RichardSproat - 2024-10-31 - 18:11

Bad formatting above due to the Wiki but hopefully you get the idea.


StringFile is segfaulting (even on example in documentation)

JasonEisner - 2022-12-02 - 13:13

Hi - my NLP students are unable to do an assignment that worked fine last year because of a segfault.

Using Thrax 1.3.5 with the tiny example at https://www.openfst.org/twiki/bin/view/GRM/ThraxQuickTour#:~:text=foo.sym , I get
$ thraxcompiler --save_symbols --input_grammar=foo.grm --output_far=foo.far
Evaluating rule: foo_syms
Evaluating rule: foo
Segmentation fault (core dumped)

I'm told that downgrading to Thrax 1.3.3 fixes this problem, but introduces other problems elsewhere in the homework (segfault in thraxrewrite-tester).


What is the best way to insert a long string before a rule?

DabingSun - 2022-04-23 - 20:35

I want to insert a long string before a rule to mark that the rule belongs to a specific type, but it takes about n times as long (n is about the length of the string divided by 5; the figure is uncertain, but it does rise exponentially) when I use the following method:
insert_types = ( ("":" token { types { detail_types: \"") relus );
graph_insert = Optimize[insert_types];
export INSERT = CDRewrite[graph_insert, "", "", b.kBytes*];

In addition, I tried to use CDRewrite directly, but there is still no improvement in the results:
export INSERT = CDRewrite[("":" token { types { detail_types: \""), "", relus, b.kBytes*];

Therefore, I want to know if there is a smarter way to implement this. The main goal is to reduce the time taken.

NOTE:

The long string is about 35 English letters. My encoding is byte.

DabingSun - 2022-04-23 - 20:40

By time-consuming I mean the time taken when running the FST. I compose two FSTs: the input string FST and the rule conversion FST.

Sigma for UTF8

DonovanVoss - 2021-12-06 - 17:07

src/grammars/byte.grm exports kBytes which is the sigma for byte input. But what about UTF8 input?

(Granted, every valid UTF8-encoded code point is made up of bytes, but I don't know about just using kBytes as the UTF8 sigma.)

Is there a way to have OpenFST generate the sigma acceptor FST via command-line? Via code?
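
On the parenthetical point: a byte-level sigma also admits byte strings that are not valid UTF-8, so a UTF-8 sigma would be the subset of one-to-four-byte sequences that encode a single code point. As a sketch of what that subset is, the pattern below follows the well-formed byte ranges listed in the Unicode Standard. The Python names are illustrative, and nothing here is a Thrax or OpenFst API.

```python
import re

# Well-formed UTF-8 byte sequences, one alternative per row of the
# Unicode Standard's table of well-formed sequences.
UTF8_CHAR = (
    rb"[\x00-\x7F]"                          # 1 byte: ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"              # 2 bytes
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"          # 3 bytes, excluding overlong forms
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"   # 3 bytes
    rb"|\xED[\x80-\x9F][\x80-\xBF]"          # 3 bytes, excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"       # 4 bytes, excluding overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"           # 4 bytes
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"       # 4 bytes, up to U+10FFFF
)
UTF8_SIGMA_STAR = re.compile(rb"(?:" + UTF8_CHAR + rb")*")


def is_utf8(data: bytes) -> bool:
    """True if data is a concatenation of well-formed UTF-8 sequences."""
    return UTF8_SIGMA_STAR.fullmatch(data) is not None


print(is_utf8("naïve €5".encode("utf-8")))  # prints True
print(is_utf8(b"\xc0\x80"))                 # prints False (overlong encoding)
```

An acceptor built from the same eight alternatives would play the role of sigma for UTF-8 input while still operating on byte labels.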

DabingSun - 2022-04-18 - 23:49

I wonder if you have solved this problem?


Troubles when running on Android

DanielSchnell - 2021-03-17 - 11:32

Hi, I have compiled OpenFST 1.8.1 and Thrax 1.3.6 for Android to integrate it per NDK into an app via JNI. I use only static libraries for linking and use basically the following configure arguments:

OpenFst: <verbatim> --enable-far --enable-grm --enable-fsts --disable-shared --enable-static CFLAGS="-fPIC" CXXFLAGS="-fPIC" </verbatim>

Thrax: <verbatim> CFLAGS="-fPIC" CXXFLAGS="--fPIC" --disable-shared </verbatim>

My code basically imitates what RewriteTester is doing, so I followed the initialization process from the original code quite rigidly.

Trouble is, some important FST types are not registered, and when opening my .far file via grm_.LoadArchive(), it fails with these messages on stdout:

<verbatim>
ERROR: GenericRegister::GetEntry: dlopen failed: library "vector-fst.so" not found
ERROR: Fst::Read: Unknown FST type vector (arc type = standard): <unspecified>
ERROR: Unable to open FAR: /storage/emulated/0/Android/data/com.grammatek.simaromur/files/g2p/g2p.far
FATAL: Check failed: "grm_.LoadArchive(m_farFile)" file: /Users/dschnell/AndroidStudioProjects/simaromur/app/src/main/cpp/g2p/G2P.cpp line: 58
</verbatim>

The file is definitely there, so I guess reading & interpreting it causes the trouble. I have tried to dig down the problem and compiled my code also with

<verbatim> #define FST_NO_DYNAMIC_LINKING 1 </verbatim>

to get rid of the dlopen error. But I get effectively the same error.

There is a lot of static initialization going on in OpenFST and Thrax with e.g. Flags, which I basically don't need because Android apps don't have a command line and accordingly no argv, argc are fed into main, so maybe the problem is related to this. But I have created at least an argv vector with nothing in it to fulfill the interface requirements.

Is there a possibility to initialize everything from scratch directly, so that I can continue or how can I convince Thrax/OpenFst to know about ingredients of my .far file in another way ?

DanielSchnell - 2021-03-18 - 08:54

I solved my problems: when integrating libThrax, libOpenFST and other static libs into a shared object, one needs to link those via -Wl,--whole-archive / -Wl,--no-whole-archive. Otherwise most globals of openfst & thrax will be omitted from the shared object.

Anyway, I think thrax & openfst are doing too many things globally. There should be implicit static initializers instead of relying on the runtime system to call everything in the right order.


Segmentation fault when splitting string with symbols

AnderleichC - 2021-01-29 - 03:19

Hi, I've been trying for a while to segment a string with symbols read from a SymbolTable. Everything looks fine, but running 'make' produces a segmentation fault. This is my symbol file:

<eps> 0 AH0 1 T 2 R 3 IH2 4 P 5 L 6 EY1 7 AA1 8 B 9 ...

And this is my grammar:

arpasyms = SymbolTable['arpabet.sym'];

export Consonant = Optimize[ "B".arpasyms | "CH".arpasyms | "D".arpasyms | "DH".arpasyms | "F".arpasyms | "G".arpasyms | "HH".arpasyms ...

The command I executed:
thraxmakedep -s arpabet.grm
make

The output:
thraxcompiler --input_grammar=arpabet.grm --output_far=arpabet.far
Evaluating rule: arpasyms
Evaluating rule: Consonant
Makefile:2: recipe for target 'arpabet.far' failed
make: *** [arpabet.far] Segmentation fault (core dumped)

AnderleichC - 2021-01-29 - 03:22

Sorry for the formatting.

The symbol file:

<eps> 0

AH0 1

T 2

R 3

IH2 4

P 5

L 6

EY1 7

AA1 8

B 9 ...

The command executed:

thraxmakedep -s arpabet.grm

make

The output:

thraxcompiler --input_grammar=arpabet.grm --output_far=arpabet.far

Evaluating rule: arpasyms

Evaluating rule: Consonant

Makefile:2: recipe for target 'arpabet.far' failed

make: *** [arpabet.far] Segmentation fault (core dumped)

KyleGorman - 2021-03-09 - 12:40

There was a strange case where strings parsed using symbol tables were generating segfaults, but only on certain platforms (macOS with Xcode and certain GCCs on Linux). While it's not something we can easily test (unless one of you wants to send me a new M1 MacBook Air...), we believe that release 1.3.6 has eliminated it. Please try that and report back when you can...

What is the best method to deal with large entity list?

JackPan - 2020-06-17 - 03:04

Hello, everyone. I've run into a tricky problem when using the Thrax compiler. I have a large list of person names (over 1,000,000 words) that I want to use in my rules, so I load it just as the tutorial recommends:
person = StringFile['person.txt'];
Many rules then reference this entity, for example:
rule1 = "who is " person;
rule2 = person " is a good man";
...
export RULE = Optimize[rule1 | rule2 | ... | rule500];
The real rules may be more complicated. When I compile these rules with Thrax, memory consumption is severe and compilation takes a long time, which can lead to compilation failures. Is there a problem with this approach? What is the best way to handle this case? Thank you.

RichardSproat - 2020-06-17 - 06:09

Try breaking into smaller chunks and then unioning the pieces together, optimizing after each union.
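
One possible reading of this suggestion, sketched as a small generator script: split the name list into several files and emit grammar text that unions them one at a time, optimizing after each union. The file names, rule names, and chunk size are all hypothetical, and the thread does not establish whether this is enough when hundreds of rules reference the list.

```python
def chunks(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def chunked_grammar(n_chunks, stem="person"):
    """Emit Thrax source that unions the chunk files one at a time,
    wrapping every intermediate union in Optimize[...]."""
    lines = [f"part0 = Optimize[StringFile['{stem}_0.txt']];"]
    previous = "part0"
    for i in range(1, n_chunks):
        lines.append(f"part{i} = Optimize[StringFile['{stem}_{i}.txt']];")
        lines.append(f"union{i} = Optimize[{previous} | part{i}];")
        previous = f"union{i}"
    lines.append(f"export {stem.upper()} = {previous};")
    return "\n".join(lines)


names = ["ann", "bob", "cy", "dee", "eve"]
print(chunks(names, 2))  # prints [['ann', 'bob'], ['cy', 'dee'], ['eve']]
print(chunked_grammar(3))
```

Each list returned by `chunks` would be written to its own `person_N.txt` file before compiling the generated grammar.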

JackPan - 2020-06-18 - 07:40

Hello, Richard. Thanks for your suggestion. I tried reducing the size of the entity list (to 100,000 words), but compilation still fails with 500 rules and many references. I used the fstdraw tool to draw the FST structure and found that Thrax makes a copy of the entity FST at each reference location. This causes the number of states and edges of the entire FST to explode, so that the `Optimize` call takes too long and uses too much memory. Is that right? If so, how can I use large entity lists with Thrax?

RichardSproat - 2020-06-19 - 06:03

Can you post your grammar and the string files you are trying to use somewhere? I can take a look. It is possible you'll have to split up the problem in some other way, but I need to see first.


How to match Reserved Keywords?

DanielleBerry - 2020-06-12 - 15:08

Hi all! First time posting a question here, apologies if this is a newbie question.

I have an FST trying to match a large list of song names, and there's a song called "sigma" and, of course, there's also a song called "epsilon". This was causing the FST to match the empty string and was a real head-scratcher for a bit. I've removed these reserved keywords from the large song list for now and am no longer matching the empty string.

However, I'd like to know how we can match these words without them being interpreted by the FST as reserved words. In the documentation, "\" only escapes the letter following it. Is there a way to escape an entire token?

Any help would be much appreciated!

Thank you, Danielle

RichardSproat - 2020-06-13 - 06:03

I'd have to see exactly what you are doing. There is nothing in there that has those as reserved symbols.


thraxcompiler fails with 'Parse Failed: memory exhausted' when attempting to compile large grm file

DylanBannon - 2020-05-12 - 14:56

Hey all, I am trying to compile a large grm file (~5k patterns) but am getting the following error:

Parse Failed: memory exhausted ************************************** ************************************** Line 4929

Is there a way to give the compiler more memory? I don't see anything super relevant in the thraxcompiler help message but perhaps I'm missing something. Or maybe I'm just exceeding the limits of what the compiler can handle and need to break down the file?

I'm using Thrax version 1.2.3 and OpenFST version 1.6.3, running this command:
thraxcompiler --save_symbols --input_grammar=<path to input> --output_far=<output path>

Thanks, Dylan

RichardSproat - 2020-05-13 - 06:06

Can you make what you are doing available so I can have a look? Most likely there is another way to factor things so this doesn't happen, but I'd have to see what you are doing first.

DylanBannon - 2020-05-13 - 09:44

Basically I'm auto-generating a grm file from a catalog of music entities (albums, tracks, and artists) to be able to match known entity names in a string. It takes the form of:

<verbatim>
track_foo = "[foo]";
track_bar = "[bar]";
...

album_foobar = "[foobar]";
album_blah = "[blah]";
...

export TRACK = (track_foo | track_bar | ...);
export ALBUM = (album_foobar | album_blah | ...);
</verbatim>

This would then be imported by another grm file for use. This works fine for a relatively small number of entities but when I try to expand to more than ~4k it breaks down. Thanks!

RichardSproat - 2020-05-14 - 06:18

Have you looked at the StringFile option? If it's ultimately just a large list of strings, then that should work.

DylanBannon - 2020-05-20 - 10:48

Hi Richard, I've been playing around with StringFile but it seems to generate a state for each character instead of each token. Is there a way to tell it to generate states delimited on whitespace? Thanks!

This is the grammar file I'm using:

track_begin = "" : "[<track>]";

track_end = "" : "[</track>]";

TRACK_CATALOG = "[test]" "[one]" | "[test]" "[two]";

export catalog_grammar = (track_begin (TRACK_CATALOG) track_end);

export model = ArcSort[Optimize[catalog_grammar], 'input'];

which outputs this FST:

➜ fstprint catalog_grammar_model.bin

0 1 <epsilon> <track>

1 2 t t

2 3 e e

3 4 s s

4 5 t t

5 6 0x20 0x20

6 9 o o

6 7 t t

7 8 w w

8 11 o o

9 10 n n

10 11 e e

11 12 <epsilon> </track>

12

but I want something like this:

➜ fstprint catalog_grammar_model.bin

0 1 <epsilon> <track>

1 2 test test

2 3 one one

2 3 two two

3 4 <epsilon> </track>

my string file is just this:

test one

test two

RichardSproat - 2020-05-21 - 06:13

What's the size of your vocabulary, defined as the number of distinct tokens in your list of songs, or whatever it is? Presumably it's a bit smaller than the list of songs. In that case you could take your StringFile and then compose it with something that maps from the individual string tokens to your generated symbols:

map = ("test" : "[test]") | ("one" : "[one]") | ... ;

mapper = map ((" " : "[<spc>]") map)*;

or something like that
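
The `map` union above could be generated from the string file rather than written by hand. A minimal sketch, assuming whitespace-separated tokens and token text that needs no escaping inside a Thrax string (quotes, brackets, and backslashes are not handled); the helper names are illustrative.

```python
def vocabulary(lines):
    """Return the sorted set of whitespace-separated tokens in the lines."""
    return sorted({token for line in lines for token in line.split()})


def thrax_map(tokens):
    """Format tokens as a union of ("token" : "[token]") mappings."""
    return " | ".join(f'("{t}" : "[{t}]")' for t in tokens)


songs = ["test one", "test two"]
vocab = vocabulary(songs)
print(vocab)  # prints ['one', 'test', 'two']
print(f"map = {thrax_map(vocab)};")
# prints map = ("one" : "[one]") | ("test" : "[test]") | ("two" : "[two]");
```

The size of `vocab` is the vocabulary size asked about above: three distinct tokens for two songs in this toy case.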


Most efficient implementation of an OpenFst FST for a RESTful website

KennethRBeesley - 2020-04-20 - 14:01

I've been out of Thrax/OpenFST for a while, so please excuse me if this has already been discussed. Please be gentle and point me to any previous discussion or examples.

Background: Imagine that I have a website/webservice, based on XRX or whatever, that receives a word, or set of words, submitted by a user, and feeds each word to a morphological analyzer (say for Spanish, Aymara, Klingon, or Hopi). The morphological analyzer is implemented as an OpenFST FST, built with Thrax. The site would then return the results to the user.

Questions:

1. What is the most efficient way to implement the FST for such an interactive service?

2. Can the "final" FST applied to input actually be a delayed/lazy FST created with something like ComposeFst( &fst1, &fst2) to prevent a possible explosion in size? even if that means that it is less efficient to apply to input?

3. Is there some friendly mind-tuning documentation available on the possible and recommended uses of delayed/lazy FSTs in OpenFst?

Thanks, Ken

RichardSproat - 2020-04-21 - 06:07

Hi Ken:

Before we worry too much about efficiency, do we know that doing the obvious thing is not efficient enough?

If not, then I can think of a few ways you could do this. If you want to do this as part of a web service I'd strongly recommend looking into Pynini, which gives you a full Python interface to all of this, and would presumably also allow you to interface easily to a Python web server library. In that case it presumably would be easiest to just break your cascade of FSTs up into convenient sized chunks and apply them serially to the input. I would guess that would be fast enough. If not then we can worry about getting fancier. There is no way to store a lazy implementation as such AFAIK, but something could be created on the fly in Pynini, I believe. It may be worth roping Kyle Gorman into the discussion here.

--R
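
The "apply them serially" idea can be sketched without any FST library. In the sketch below plain Python functions stand in for the FST chunks, so it illustrates only the control flow: every output of one stage is fed to the next, and no composed machine is ever built. The two toy stages and all names are hypothetical, not a real analyzer.

```python
from typing import Callable, Iterable, Set

Stage = Callable[[str], Set[str]]


def apply_serially(word: str, stages: Iterable[Stage]) -> Set[str]:
    """Feed every output of one stage into the next, stage by stage."""
    candidates = {word}
    for stage in stages:
        candidates = {out for cand in candidates for out in stage(cand)}
        if not candidates:  # no analysis survives; stop early
            break
    return candidates


# Toy stand-ins for FST chunks (hypothetical, not a real analyzer).
def strip_plural(word: str) -> Set[str]:
    return {word[:-1] + "+PL"} if word.endswith("s") else {word}


def tag_noun(analysis: str) -> Set[str]:
    return {analysis + "+N"}


print(apply_serially("gatos", [strip_plural, tag_noun]))  # prints {'gato+PL+N'}
```

Only the intermediate results for the current input are held in memory, which is the point of not precomposing the whole cascade.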

KennethRBeesley - 2020-04-21 - 15:23

Thanks, Richard. I'm just fishing around for ideas right now. Thanks for the suggestion of Pynini. As any users would be accessing the service over the Internet, which has its own delays, the efficiency of applying the FST(s) on the server is very unlikely to be critical. Explosions in FST size, however, have been known to occur in morphological analyzers, and I'll have fun getting some cascade of separate FSTs to operate serially, as you suggest.

I anticipate the typical user entering a word, or maybe a few words, at a time, much as they might look up words manually in a paper dictionary, though the web service should be able to analyze arbitrarily inflected words, as long as the root is recognized.

I'd love to hear from Kyle Gorman if he has experience in this kind of application.

As for the third question, is there some documentation somewhere about the correct and desirable uses of delayed operations in OpenFST? Why were they provided? and how might they help me?

I hope that you and everyone on the list are well.

Ken

KennethRBeesley - 2020-04-21 - 22:37

I was out of finite-state development for a year or more, and then I last used Thrax/OpenFst, in a modification of the Sparrowhawk tokenizer, in 2018, after which I was diverted again. So it looks like I missed the whole Pynini Thing, which looks very promising. I'm reading up on it now. Thanks.

RichardSproat - 2020-04-22 - 02:47

OK good. Meanwhile, unfortunately I don't think there's really anything of a tutorial nature with the delayed implementations. If it comes to that, let's discuss offline.

KennethRBeesley - 2020-04-29 - 09:41

Cleaning out some old piles of files, I see now that I did notice Pynini when it came out. But I didn't have time for it then, and (at that time) it had been written for Python 2.7, reportedly not working as well with 3.X? Can I assume that Pynini now works well with 3.X?

RichardSproat - 2020-04-30 - 06:36

Yes, it's all Python 3.X compatible.


Thrax version compatible with openfst 1.3.3

NidhiHooda - 2020-02-28 - 17:26

Which version of thrax is compatible with openfst 1.3.3?

RichardSproat - 2020-02-29 - 06:15

I don't remember, that's a pretty old version, and you'd have to look back through the thrax releases to see which one matches.

Is there some reason you cannot just install the latest version?

NidhiHooda - 2020-03-03 - 13:04

There is an old openfst (1.3.3) dependency in my package which isn't letting me use the latest version of thrax.

RichardSproat - 2020-03-04 - 06:25

Can you just upgrade your OpenFst dependency to the latest version? Other than that, I can only suggest you check the versions of Thrax here to see which version is listed for 1.3.3. As I say I no longer remember since that is quite old.

NidhiHooda - 2020-03-05 - 17:12

Neither http://www.openfst.org/twiki/bin/view/FST/FstDownload nor http://www.openfst.org/twiki/bin/view/GRM/ThraxDownload , says which version of Thrax is compatible with which version of OpenFST. What do you mean by - I can only suggest you check the versions of Thrax here to see which version is listed for 1.3.3?

RichardSproat - 2020-03-06 - 06:03

http://www.openfst.org/twiki/pub/GRM/ThraxDownload/NEWS

NidhiHooda - 2020-03-09 - 16:15

Thanks Richard.

Apply PDT substitution within CDRewrite

ThraxUser - 2019-11-06 - 05:23

Hi there, I've built a PDT that flips a string (A/B/C -> C/B/A). I'd like to apply this PDT within some context (e.g. [BOS] [EOS]), using CDRewrite.

I've had no luck so far; the rewrite fails. I might be missing something.

Do you happen to have an example of using a PDT with CDRewrite? Is it even possible? How would you invoke thraxrewrite-tester with such a setting?

Thanks

RichardSproat - 2020-02-29 - 06:14

Sorry I missed this query earlier.

I don't think that can work. The construction for the context-dependent rule would not respect the semantics of the brackets so I am guessing what you would end up with would be nonsensical. However I have to admit I haven't thought about that before.


Patch for batch_tester.cc from export3.tgz

WincentBalin - 2019-05-06 - 23:06

Should you wish to compile batch_tester from the export3.tgz archive at the Thrax contributed projects page http://www.openfst.org/twiki/bin/view/Contrib/ThraxContrib, you might use this patch: https://gist.github.com/wincentbalin/4a14a831c1373995b92826df8178b47d

WincentBalin - 2019-05-06 - 23:07

Of course, you should use the patch only if your compiler (in my case: gcc) throws errors about unknown namespaces.

segfault on empty FAR

JasonEisner - 2018-12-05 - 15:09

One of my students was confused to get a segfault in thraxrewrite-tester. She had not exported any FSTs from her Thrax file, so the FAR was only 24 bytes long. Could there be a more graceful failure / error message in this case? Thanks!


python2 vs. python3

JasonEisner - 2018-11-26 - 16:11

thraxmakedep starts with #!/usr/bin/env python but it seems that it ought to specify python2. There may also be other scripts like this.

One of my students had trouble running thraxmakedep because "python" on his (anaconda) system defaults to python3.


Successful cross-compilation of OpenFST, OpenGRM NGram and OpenGRM Thrax using MinGW

WincentBalin - 2018-03-21 - 17:27

Hello world!

I succeeded in compiling the OpenFST/NGram/Thrax packages with MinGW using Docker. You will find the Git repository here: https://github.com/wincentbalin/compile-static-openfst The resulting MinGW binaries are static, both for win32 and for win64. You will find them in the list of releases: https://github.com/wincentbalin/compile-static-openfst/releases

I had to create a couple of patches, which I then put into the repository above. Some of them have already become obsolete and were therefore deleted. I hope that we might incorporate some of them into the main source code. I suppose that is much more feasible than trying to fork and adapt every single version to MSVC, only to abandon it later. There are far too many such repositories on GitHub.

I am looking forward to any question or opinion!

P.S.: I posted the same message already in the OpenFST forum. Initially I did not intend to cross-post, but the thread in another forum did not get any reaction.

RichardSproat - 2018-03-22 - 09:08

Thanks for doing this. The main problem with us incorporating these changes into the source is that none of us work with Windows, so we would have no way to test that changes on our end would not break something for MinGW. Also, it's not clear where we would stop. Some people have built native Windows binaries, using e.g. Visual Studio, and we simply do not have the bandwidth (or the machines running the relevant software) to support all of the variants.

WincentBalin - 2018-03-22 - 17:27

Most projects live and die with their maintainers :-> , I am aware of that. For now, I think that maintaining the MinGW patches is not that difficult.

But, on the other hand, I would like the developers of OpenFST and OpenGRM to look at the patches, especially those for Thrax. Maybe there are some common problems therein which could be eliminated altogether.

RichardSproat - 2018-03-23 - 09:10

Time is the main issue. I did look briefly at your patches. I don't see an obvious problem, but then I know nothing about MinGW, and I don't know if some change we may make down the road will need special treatment. Again we have no means to support this. I think the maintainers of OpenFST and the rest of OpenGRM will tell you the same thing.


Compilation problems under Debian 9.2.1 -- possible solution

KarolMazurek - 2018-03-12 - 16:47

Hi,

When trying to compile Thrax 1.2.5 with OpenFST installed (1.6.7, with flag --enable-grm), I have encountered several problems with compilation:

1)

util/utils.cc: In function 'size_t thrax::{anonymous}::GetResultSize(const std::vector<std::__cxx11::basic_string<char> >&, size_t)':
util/utils.cc:44:16: error: 'accumulate' is not a member of 'std'
   return (std::accumulate(elements.begin(), elements.end(), 0, lambda) +

putting line

#include <numeric>
in file src/lib/util/utils.cc

solves the problem

2)

In file included from walker/loader.cc:27:0:
./../include/thrax/features.h: In member function 'thrax::DataType* thrax::function::FeatureVector<Arc>::Execute(const std::vector<thrax::DataType*>&)':
./../include/thrax/features.h:462:38: error: 'kNoSymbol' is not a member of 'fst::SymbolTable'
       if (label == fst::SymbolTable::kNoSymbol) {

I have checked the file <fst/symbol-table.h>, and I think it should be fst::kNoSymbol instead of fst::SymbolTable::kNoSymbol.

I have changed line 462 of features.h to:

if (label == fst::kNoSymbol) {
And then compilation went OK.

Could you please investigate and try to reproduce that?

I have managed to reproduce it on another computer running the Manjaro distro.

Patches below: features.h

--- thrax-1.2.5/src/include/thrax/features.h   2018-01-28 18:37:28.000000000 +0100
+++ thrax-fixed/src/include/thrax/features.h   2018-03-12 21:30:07.236917900 +0100
@@ -459,7 +459,7 @@
         return nullptr;
       }
       int64 label = generated_symbols->Find(featval);
-      if (label == fst::SymbolTable::kNoSymbol) {
+      if (label == fst::kNoSymbol) {
         std::cout << "Feature/value pair " << featval << " is not defined."
                   << std::endl;
         delete generated_symbols;

utils.cc

--- thrax-1.2.5/src/lib/util/utils.cc   2018-01-18 01:22:59.000000000 +0100
+++ thrax-fixed/src/lib/util/utils.cc   2018-03-12 21:29:17.702287413 +0100
@@ -23,7 +23,7 @@
 #include <fstream>
 #include <string>
 #include <vector>
-
+#include <numeric>
 // For Cygwin and other installations that do not define ACCESSPERMS (thanks to
 // Damir Cavar).
 #ifndef ACCESSPERMS

SHA1 sums of downloaded archives were:

cdc35ed2b25413d3a56c7dda67667f7e58412cf4  thrax-1.2.5.tar.gz
b6c2771c8deee6879a2c98a0c975b078f59c7dd7  openfst-1.6.7.tar.gz

RichardSproat - 2018-03-13 - 09:02

What are you compiling on?

KarolMazurek - 2018-03-13 - 15:27

Apologies for not making it clear. I wanted just to install Thrax on my VM with Debian 9.2.1 to use it for some NLP work within a project. These problems were on the 'make' stage of installation process. I thought I'll share the info about the problem with installation and (maybe) a possible solution. I'd be grateful to know if that fix is correct, or maybe it should be done other way to get installation working. Env: Debian 9.2.1, g++ 6.3.0

Best Regards,

Karol M.

RichardSproat - 2018-03-13 - 19:01

Ok. Thanks.

Turns out this is a known problem: we just discovered this ourselves when our Ubuntu machines were upgraded.

A fix is in the works and we will update with a new version of Thrax soon.

RichardSproat - 2018-03-14 - 11:56

OK we believe that version 1.2.6, now available on the download page, should fix this problem.

It also adds a couple of bits of functionality: lenient composition and optional weights in string files.

SandraAmbroziak - 2018-03-14 - 14:08

Hi, these corrections worked for me, but I have run into another problem. After configuration I noticed a mistake in libtool. During the make stage (for both OpenFST and Thrax) I got:
libtool: unexpected EOF while looking for matching `"'

Changing:
eval sys_lib_search_path=\"$sys_lib_search_path_spec\"
eval sys_lib_dlsearch_path=\"$sys_lib_dlsearch_path_spec\"
to:
eval sys_lib_search_path='\$sys_lib_search_path_spec'
eval sys_lib_dlsearch_path='\$sys_lib_dlsearch_path_spec'
solves the problem. I'm also not sure whether this is your issue or the libtool maintainers'.

Ubuntu 16.04.4 LTS 64-bit, g++ 6.3.0
Best Regards, Sandra Ambroziak.

KarolMazurek - 2018-03-14 - 16:51

Hi,

RichardSproat:

I can confirm that 1.2.6 compiles without any problem both on Debian 9.2.1 and Manjaro. Thanks for new release!

SandraAmbroziak: I cannot reproduce the libtool problem on either Debian 9.2.1 or Arch/Manjaro while compiling/installing Thrax 1.2.6. So I bet it isn't an issue in the Thrax source.

Best Regards, Karol Mazurek

RichardSproat - 2018-03-14 - 18:08

Great, thanks for confirming.

I've never seen that libtool problem before. That configuration has not changed, as far as I recall, through multiple releases. Almost looks like a shell problem.


Public Examples of Thrax with a UTF8 Alphabet?

KennethRBeesley - 2018-01-11 - 14:23

Are there any publicly available examples of Thrax including a UTF8 alphabet? I.e., not limited to byte?


Expose ShortestPath() and RandGen() in Thrax?

KennethRBeesley - 2018-01-10 - 19:09

Before I dive into C++ (not my native language), has anyone already written C++ functions to add OpenFst's ShortestPath() and RandGen() to Thrax?

RichardSproat - 2018-01-11 - 09:04

Can you describe a use case of where you would want this in the grammar compiler itself?

For example if I define a cascade of rules as an FST, and if I then take the shortest path of that FST, I don't see what that would be useful for. So maybe if I had an idea of what you would want this for?

KennethRBeesley - 2018-01-11 - 14:21

E.g.: First create a cascade of Rules as an FST, then an Input as an FST, then compute ShortestPath[ Project[(Input @ Rules), 'output'] ]

RichardSproat - 2018-01-12 - 10:13

And then what would you do with that? Save it out in a far? If you just want to test that something has a particular output then the Assertions that are already available already do the shortest path and test that it's what you expect it to be.

If you want to save it out then no, the ShortestPath isn't provided right now, though if you invoke your rule with the input in the rewrite tester then the shortest path is what you will get.


Text tokenization using Thrax

KennethRBeesley - 2018-01-02 - 14:36

Could some kind soul(s) please point me to any available information about or examples using Thrax to define a transducer that tokenizes a natural-language input text, e.g. to take a running English text and insert word-boundary symbols, with possible ambiguous outputs. Thanking you in anticipation...

RichardSproat - 2018-01-03 - 09:21

Perhaps something along the following lines: https://github.com/google/sparrowhawk and grammars contained therein.

KennethRBeesley - 2018-01-03 - 12:53

Thanks. I found Sparrowhawk yesterday, and it appears very promising. But, as far as I can see, it seems limited to English. Am I missing something?

RichardSproat - 2018-01-04 - 09:07

The provided example grammar is limited to English. Nothing else about it is. Anyway, your question had English as an example :-)

[BOS] and [EOS] in user-defined alphabet

PooriaAzimi - 2017-11-30 - 00:25

Hi,

How can you use [BOS] and [EOS] in CDRewrite when using a user-defined alphabet? Suppose "minus", "point", "zero", and "one" are our symbols, and we want to convert "minus point one" to "minus zero point one".

s = SymbolTable['test.sym'];

remove_minus = CDRewrite["minus".s : "".s, "[BOS]".s, "".s, bytes.kBytes*];

implied_zero = CDRewrite[( "point".s : "zero point".s ), ( "[BOS]".s | "[BOS]minus".s ), "".s, bytes.kBytes*];

If you change "[BOS]".s to "[BOS]" then you get an error message about mismatched symbol tables for tau and lambda. But leaving it as "[BOS]".s means that I now have to include [BOS] and [BOS]minus as symbols in the alphabet (otherwise you get an error: "Failed to compile chunk", but I don't think those "[BOS]" should be in the symbol table), but after doing that, CDRewrite stops working correctly.

Am I misunderstanding something?


Underscore in user-defined alphabet

KevinCrooks - 2017-11-28 - 17:41

Is there a reason why underscores seem not to be allowed in certain formats? I have a user-defined alphabet that includes the symbols "p_h", "k_h", "t_h", "h_v", and "l_g", which do not seem to work. However, other symbols with underscores like "j_0", "b_c", and "n_(" all do. E.g.

Input string: p_h Rewrite failed. Input string: t_h Rewrite failed. Input string: k_h Rewrite failed. Input string: h_v Rewrite failed. Input string: w_0 Output string: w_0 Input string: w_0* Output string: w_0* Input string: t_( Output string: t_( Input string: n_( Output string: n_(

This is within a dummy grammar that has our full alphabet set, but only one rule, so any single character should be passing through unaltered.

regroup_aspiration_voiceless_stops0 = ( " p " : " p_h *1 " ); aspiration_voiceless_stops0 = CDRewrite[regroup_aspiration_voiceless_stops0 , ( "." | ";" ) , "" , phones_star , 'ltr' , 'obl' ]; aspiration_voiceless_stops_stage = Optimize[aspiration_voiceless_stops0]; export PHONFST = Optimize[aspiration_voiceless_stops_stage];

Thanks for any tips!

KevinCrooks - 2017-11-28 - 18:01

Sorry about the poor formatting:

Input string: p_h
Rewrite failed.
Input string: t_h
Rewrite failed.
Input string: k_h
Rewrite failed.
Input string: h_v
Rewrite failed.
Input string: w_0
Output string: w_0
Input string: w_0*
Output string: w_0*
Input string: t_(
Output string: t_(
Input string: n_(
Output string: n_(

regroup_aspiration_voiceless_stops0 = ( " p " : " p_h *1 " );

aspiration_voiceless_stops0 = CDRewrite[regroup_aspiration_voiceless_stops0 , ( "." | ";" ) , "" , phones_star , 'ltr' , 'obl' ];
aspiration_voiceless_stops_stage = Optimize[aspiration_voiceless_stops0];

export PHONFST = Optimize[aspiration_voiceless_stops_stage];

RichardSproat - 2017-11-29 - 09:12

"p_h" is not going to be a user-defined symbol unless you do this: "[p_h]".

Mapping between input and output tokens

PooriaAzimi - 2017-10-20 - 18:58

Suppose I have a simple words_to_numbers.grm that, given a spelled-out number string, will return multiple possible interpretations for it:

<verbatim>
Input String: six twenty two

Output String: 622 <cost: 0.2>
Output String: 6 22 <cost: 0.4>
Output String: 620 2 <cost: 0.4>
</verbatim>

What I would like is to be able to map the output tokens to the input tokens. An example would be something like this:

<verbatim>
Output String: 622<"six twenty two"> <cost: 0.2>
Output String: 6<"six"> 22<"twenty two"> <cost: 0.4>
Output String: 620<"six twenty"> 2<"two"> <cost: 0.4>
</verbatim>

(or just provide the character positions of each new token, or anything else that could possibly help you do the mapping at a later stage)

You can't do this post-rewrite; it's impossible to know whether "(six) (twenty two)" transduced to "6 22", or "(six twenty) two".

I don't believe this is possible to do with `thraxrewrite-tester`, or just trying to add the markup in grammar rules. I've also looked at both thrax and open-fst code and tried to see what it takes to carry over the input states forward through rewrites but haven't had any success yet.

The grammars I'm working on are much more complicated than this example (400k nodes and millions of arcs for a very sophisticated NLU module) and being able to provide some sort of mapping between input and output is essential to be able to integrate thrax into the rest of the application.

Thank you very much for this incredibly useful tool, and any help or hints are greatly appreciated!

PooriaAzimi - 2017-10-20 - 19:02

^ the formatting seems to be off; here's a slightly better formatted version of the post: https://gist.github.com/anonymous/522156df4ce78f2592805c8f417c5687

RichardSproat - 2017-10-21 - 13:24

If you literally want the words in the output alongside the numbers that's a tad difficult since it involves copying at some level. You could use an MPDT for that, but there would be a big efficiency hit.

The best I can suggest is to write your own function that walks the paths in the resulting transducer. If you are careful in how you wrote your rules, then the transducer should contain the alignment between the input and the output words so that you could pick off the inputs and outputs and be confident that they align.

If you don't want to do it in C++ you might check out Pynini, which would allow you to do it in Python.

PooriaAzimi - 2017-10-21 - 20:55

OK, that's very helpful. Thank you!

PooriaAzimi - 2017-10-21 - 21:10

Just one more question: if I'm understanding correctly, the function that walks the transducer path would require changing the Thrax code as opposed to OpenFst, is that correct? I.e., it would be something similar to `rewrite-tester-utils.cc` in nature which, in addition to replacing the words, keeps track of their alignment.

Also, would you expect this to be simpler to do with Pynini as opposed to C++ and thrax? (as in, would Pynini's implementation make it more suitable for this purpose).

Thank you!

RichardSproat - 2017-10-22 - 09:22

I wouldn't change the Thrax code per se. Just use the rule, Compose it with your input (converted to a trivial single-path acceptor) and then walk the resulting FST.

Yes, Pynini makes this a lot easier for you unless you love C++ :-)
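
As a rough illustration of the "walk the resulting FST" idea, here is a minimal pure-Python sketch. It does not use OpenFst or Pynini: a path is simply assumed to be a list of (input label, output label) pairs, with "" standing for epsilon, and the word-level labels and the `align` helper are hypothetical. A real path from a byte-based grammar would carry one byte per arc, but the grouping logic would be the same: start a new group whenever the output side emits a separator.

```python
from typing import List, Tuple

Arc = Tuple[str, str]  # (input label, output label); "" stands for epsilon


def align(path: List[Arc], sep: str = " ") -> List[Tuple[str, str]]:
    """Group the arcs of one path into (output token, input span) pairs.

    A group is closed whenever an arc emits the separator on the output side.
    """
    groups = []
    ins, outs = [], []
    for ilabel, olabel in path:
        if olabel == sep:
            # Boundary arc: close the current group and drop the boundary itself.
            groups.append(("".join(outs), "".join(ins).strip()))
            ins, outs = [], []
            continue
        ins.append(ilabel)
        outs.append(olabel)
    if ins or outs:
        groups.append(("".join(outs), "".join(ins).strip()))
    return groups


# Two hypothetical paths for the input "six twenty two".
PATH_6_22 = [("six", "6"), (" ", " "), ("twenty", "2"), (" ", ""), ("two", "2")]
PATH_620_2 = [("six", "6"), (" ", ""), ("twenty", "20"), (" ", " "), ("two", "2")]

print(align(PATH_6_22))
print(align(PATH_620_2))
```

The two paths produce different groupings for the same input, which is exactly the information that is lost once only the output string is kept.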

PooriaAzimi - 2017-10-30 - 16:38

I came across this interesting paper that uses a different approach for preserving alignments during transformation: http://www.aclweb.org/anthology/N10-1023

In short (section 3.2 and 3.3), by modifying the FST semiring to encode start and end character positions of states, and preserving them during transformation. If my understanding is correct, introducing such a change would require modifying open-fst and changing the arcs to capture those positions, and modifying the walker/matcher to preserve those positions during transformation (though I have no idea where that logic is yet). Is that correct? Or would it require other changes?

Thank you again!

RichardSproat - 2017-10-31 - 12:00

IIRC Masha implemented that stuff internally, so yes it would presumably require some additional code.

Not clear to me why it would be a better solution to your problem than the one I suggested, however.

PooriaAzimi - 2017-11-29 - 23:58

Thank you. The first approach (converting the input into a single-path acceptor and composing it with the grammar FST) works beautifully! It would require some changes to the grammar as you suggested, but that is easily done. I had some problems with C++, but it was very easy to do with the Python extensions of OpenFst.

Using Thrax compiled grammars with Pynini

ButteredGroove - 2017-07-06 - 18:29

Is there a way to use Thrax output, such as a FAR from thraxcompiler, as input into pynini?

ButteredGroove - 2017-07-06 - 20:21

I figured it out! I tried the following:

$ thraxcompiler --input_grammar=test.grm --output_far=test.far
Evaluating rule: rule1
Evaluating rule: rule2

$ python
Python 2.7.13 (default, Mar 13 2017, 20:56:15)
[GCC 5.4.0] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pynini
>>> my_far = pynini.Far("test.far")
>>> print my_far.find("boo")
False
>>> print my_far.find("rule1")
True
>>> rule1=my_far.get_fst()
>>> print rule1
<snipped output of entire fst>

Sorry for the post! But hopefully it will help somebody.

RichardSproat - 2017-07-07 - 09:01

Yep, you got it.


What does Optimize[] do

KennethRBeesley - 2017-06-12 - 22:06

Is Optimize[p] equivalent to Minimize[Determinize[RmEpsilon[Minimize[Determinize[p]]]]]? That is, does it first determinize and minimize, treating epsilon as a normal symbol, then remove the epsilons, and then determinize and minimize again?

RichardSproat - 2017-06-13 - 09:03

May be easiest if you look at src/include/thrax/algo/optimize.h, to see what it does.

KennethRBeesley - 2017-11-12 - 11:30

OK. I've looked at the optimize.h code, and I see that the first thing it does is epsilon-removal (and then it performs summing of arc weights, determinization and minimization, encoding and decoding as necessary). I approached Cyril Allauzen ages ago, asking about optimization, and he pointed out that determinization and minimization treat epsilons as normal characters. If I recall correctly, he advocated doing determinization and minimization WITHOUT FIRST DOING EPSILON REMOVAL, then doing an epsilon removal, and then REdoing determinization and minimization. What's your take on when epsilon removal should be done?

RichardSproat - 2017-11-13 - 09:27

We haven't experimented with what Cyril suggests. I suppose it might improve things in some cases. Are there pathological cases where you think this might help?

KennethRBeesley - 2017-11-13 - 10:32

Not that I know of. Cyril's plan will certainly slow things down if optimization is performed by default (as in my Kleene language). Kleene's current $^optimize(...) function uses Cyril's plan. Perhaps I should rename it something like $^superOptimize() or $^cyrilOptimize() and reimplement the $^optimize() function to be more like Thrax's Optimize[].

KennethRBeesley - 2017-11-13 - 10:38

Another issue in Optimize[]. I see that it performs StateMap(fst, ArcSumMapper<Arc>(*fst)) ; to sum arc weights. If I'm not mistaken, Determinize() by itself does such summing. Does summing the arc weights before determinization somehow speed things up?

RichardSproat - 2017-11-14 - 09:12

Again, I don't know. But frankly I think we are down in the weeds here. First of all demonstrate that this makes a noticeable difference with a live example. Then we can discuss how to tweak it. We have had many ideas on this or that improvement that might help. Sometimes, as with the implicit grouping of cascaded rules within an Optimize[], it makes a huge difference: without that, for a long chain of compositions if one wrote

Optimize[rule1 @ rule2 @ .... @ rulen]

the result could be disastrously slow. So what the compiler does is group those in a binary right-branching tree. That made it massively more efficient at compile time. Could we do better? Probably, if we know something about the individual rule FSTs and then cleverly combine them in an order that optimizes the process: if for example I know that the intersection of range of rule_k and the domain of rule_k+1 filters things down to a much smaller set, then it would be good to combine those first. But in practice the binary branching tree seems to get you good enough results nearly all of the time. Most of the time when things break down it is because people are trying to do things that are inherently very bad anyway.

KennethRBeesley - 2017-11-16 - 10:39

Thanks for the response. At Xerox too we found that composing a cascade of rules could be not only inefficient but could also easily explode in size. We found that if the rules were to be composed with an FST encoding a lexicon, it often helped to group the compositions in a left-branching tree ( ( ( ( lexicon @ rule1 ) @ rule2 ) @ rule3 ) @ rule4 ) etc. The lexicon effectively acted as a filter that often avoided the explosion.
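
To make the grouping effect concrete, here is a toy sketch in plain Python. It is not OpenFst: relations are modeled as finite sets of (input, output) string pairs, and `compose`, `rule1`, `rule2`, and `lexicon` are hypothetical stand-ins. Composition is associative, so both groupings give the same final relation; only the size of the intermediate result differs.

```python
from itertools import product
from typing import Set, Tuple

Relation = Set[Tuple[str, str]]


def compose(r1: Relation, r2: Relation) -> Relation:
    """Relational composition: pair a with c whenever r1 maps a to b and r2 maps b to c."""
    return {(a, c) for a, b in r1 for b2, c in r2 if b == b2}


# All strings over {a, b} of length 1 to 6: 126 strings in total.
universe = ["".join(p) for n in range(1, 7) for p in product("ab", repeat=n)]

rule1 = {(s, s.upper()) for s in universe}                # uppercase everything
rule2 = {(s.upper(), s.upper() + "!") for s in universe}  # append "!"
lexicon = {("ab", "ab"), ("ba", "ba")}                    # tiny identity relation

# Rules first: the intermediate relation covers the whole universe.
rules_first = compose(rule1, rule2)
result_a = compose(lexicon, rules_first)

# Lexicon first (left-branching): the lexicon filters before the rules apply.
lexicon_first = compose(lexicon, rule1)
result_b = compose(lexicon_first, rule2)

print(len(rules_first), len(lexicon_first))
print(result_a == result_b)
```

With these toy sizes the rules-first intermediate holds 126 pairs while the lexicon-first intermediate holds 2, which is the filtering effect described above in miniature.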

RichardSproat - 2017-11-17 - 09:11

Yes, of course, we know that too, and most of the time people developing grammars know enough to do that by hand.

KennethRBeesley - 2018-01-03 - 13:22

In src/include/thrax/algo/optimize.h, the Optimize function has two arguments:

void Optimize(MutableFst<Arc> *fst, bool compute_props = false)

In a number of places in the optimization code, properties of the fst are queried with the fst->Properties(mask, compute_props) syntax. I understand that if compute_props is false (the default value), then the Properties (or the specific property?) are not recalculated. The question is this: Under what conditions should compute_props be set to true so that the Properties are recalculated?

KyleGorman - 2018-03-13 - 11:38

Generally speaking, this will just allow you to avoid calls to Determinize and/or RmEpsilon when the FSTs are already deterministic and/or epsilon-free but they weren't made so by virtue of previous calls to Determinize and/or RmEpsilon (which set the properties bit). For instance: it doesn't necessarily know that string FSTs are deterministic and epsilon-free.

I think of it not as something to make it go faster so much as something to give it tighter optimization bounds. For instance if it doesn't know that the transducer is weighted-cycles free, but compute_props is true, it will test for that property before deciding how to encode the FST during determinization. So if you have to have the smallest possible FST it's a good choice.

KyleGorman - 2018-03-13 - 11:41

Regarding your question about arc-sum mapping, it's only true that determinize does that during Optimize if it does the determinization using an unencoded FST. If the FST labels and/or weights are encoded (as they often are---depends on properties bits) you don't necessarily get arc-sum mapping as the result of determinization.

CDRewrite with a unioned FST expression

KennethRBeesley - 2017-05-29 - 20:59

I've been trying to write CDRewrite rules such as

CDRewrite[ "c":"d" | "a":"o" | "t":"g", "" , "" , sigma_star]

to map any and all 'c's to 'd's, 'a's to 'o's, and 't's to 'g's, including mapping "cat" to "dog", but it appears to be syntactically impossible in Thrax. Similarly,

CDRewrite[ "a":"b" | "b":"a", "", "", sigma_star]

should semantically, I think, map "abba" to "baab".

Is the restriction just syntactic? or also semantic? I can see parallel rules working in another system that allows alternation rules expressed with an FST.

RichardSproat - 2017-05-30 - 09:06

AFAICT It works fine, assuming you put the parens around the replace operations

CDRewrite[ ("c":"d") | ("a":"o") | ("t":"g"), "" , "" , sigma_star];

rws-macbookair3:tmp rws$ thraxrewrite-tester --far=foo.far --rules=RULE --noutput=10
Input string: cat
Output string: dog
Input string: tacocat
Output string: gododog

Note --noutput=10, which would show any other output options, if there were any.
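
For intuition only, the behaviour of this particular rule (single-character replacements, empty contexts, applied obligatorily) can be mimicked in plain Python with a simultaneous character mapping. This is a sketch of the expected input/output behaviour, not of how CDRewrite is implemented; `RULE`, `SWAP`, and `rewrite` are hypothetical names.

```python
# Simultaneous mapping: every position is rewritten once, and the outputs
# are never fed back through the mapping.
RULE = str.maketrans({"c": "d", "a": "o", "t": "g"})
SWAP = str.maketrans({"a": "b", "b": "a"})


def rewrite(text: str, table=RULE) -> str:
    """Apply a parallel single-character replacement to the whole string."""
    return text.translate(table)


print(rewrite("cat"))
print(rewrite("tacocat"))
print(rewrite("abba", SWAP))
```

Because the replacements happen in parallel, the a/b swap does not collapse everything to one letter, as applying "a" to "b" and then "b" to "a" in sequence would.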

KennethRBeesley - 2017-05-30 - 13:00

Thanks. I'm still figuring out, and getting used to, the precedence of the operators.

Default direction of rewrite for CDRewrite?

KennethRBeesley - 2017-05-29 - 19:35

The fifth argument to CDRewrite can be 'ltr', 'rtl' or 'sim'. As far as I can tell, the default is 'ltr'.

Corrections would be welcome.


Default parsing of string literals as UTF-8?

KennethRBeesley - 2017-05-29 - 12:51

By default "abc" is interpreted as a byte string, which can be overridden by specifying "abc".utf8. Is it possible to specify somehow that strings are, by default, to be interpreted as utf8? E.g. some kind of declaration like

default_string_parse_mode utf8 ;


Precedence of Thrax operators

KennethRBeesley - 2017-05-29 - 12:45

Is there documentation somewhere that specifies the precedence of the Thrax operators? In question are

<verbatim>
the unary * + ? and {n, m} postfixed operators
- (for subtraction)
| (denoting union)
: (cross product)
@ (composition)
concatenation (no operator, shown by simple juxtaposition)
</verbatim>

A special case might be weights, e.g., <1> and <2>. Do they attach with the same precedence as normal concatenation?

KennethRBeesley - 2017-05-29 - 19:13

I've noodled away for a few hours, testing precedence, and here's the list as best I can judge right now (from High to Low precedence)

the unary postfix operators: * + ? {n,m}

concatenation (shown by juxtaposition)

- (minus)

@ (composition)

| (union)

: (cross-product)

The <...> weight syntax seems to have a special status. It can appear only at the "end" of a regular expression, i.e. at the very end, or at the end of a regular expression enclosed in parentheses.

Corrections would be welcome.


Using Thrax with Java

RubaJ - 2017-02-26 - 06:23

Is there a way to import OpenGrm thrax (call thraxrewrite-tester) within Java?

RichardSproat - 2017-02-26 - 09:07

You'd have to write something to import the C++ library into Java. That is certainly doable but I am not an expert on Java.

AssertNull

CarloDiFerrante - 2016-12-22 - 08:18

Hi, I am working on a set of grammars and when trying to add some consistency checks I get an error for "Undefined function identifier: AssertNull". The other assert in the grammar are working just fine, any hint on what could be the issue?

Thank you very much!

RichardSproat - 2016-12-22 - 09:05

I would need to see your grammar to know whether it's a bug in the grammar or a bug in Thrax itself. Can you send them to me? You can use my Google address, rws@google.com

CarloDiFerrante - 2016-12-22 - 09:58

Thank you very much for getting back to me. I sent the grammar to your Google address.

RichardSproat - 2016-12-22 - 10:57

Thanks for finding this, and mea maxima culpa for pushing this out with that bug. As a temporary fix please replace src/lib/walker/loader.cc with the attached loader.cc at the end of this page (i.e. http://openfst.cs.nyu.edu/twiki/pub/Forum/GrmThraxForum/loader.cc) and reinstall.

I will push out a fixed version of the distribution as soon as I can.

RichardSproat - 2017-01-10 - 08:38

Just an update on this: the new version (Thrax 1.2.3) fixes this bug.


How can I use uint64_t type sequence as input?

WuAraleii - 2016-11-04 - 02:44

I want to generate an automaton that recognizes inputs consisting of a sequence of uint64_t integers. I know how to use Thrax to recognize byte (0~255) strings, but I do not know how to deal with this problem.

Hope you can help me! Thanks very much!

WuAraleii - 2016-11-04 - 03:20

Oh, I think I can use symbol table to solve this problem....

RichardSproat - 2016-11-07 - 08:23

Ok good. The question was rather unclear so I am glad you solved the problem.


Flip a 2-Digit Number

RubaJ - 2016-10-24 - 02:47

I am trying to convert Arabic numbers from text to digits. The problem is with numbers that combine decades and units ((21, ..., 29), (31, ..., 39), ..., (91, ..., 99)), as we pronounce them in the reverse order to how we write them as digits. For example, "twenty one" in Arabic is "one twenty", but it is still written 21. So the output of the grammar would be 12 instead of 21. How can I make the 12 become 21? Help!

RichardSproat - 2016-11-07 - 09:19

You can't easily do that with FSTs in any general way unfortunately: unbounded string reversal is not a regular operation. The best you can do is handle cases up to a fixed length, which is equivalent to enumerating the cases you want to reverse.

However, the PDT extension would allow you to do this more generally. See

http://openfst.org/twiki/bin/view/GRM/ThraxQuickTour

under the Pushdown Transducers section. That gives an example of a^n b^n which is similar to your problem, which is also similar to w w-reverse. In your case you would need to define 10 bracket pairs (one for each digit) rather than just one, and rather than just accepting strings of the form w w-reverse, you need to make sure symbols in w are deleted and the appropriate comparable symbols in w-reverse are inserted.

RubaJ - 2016-11-09 - 10:19

Thanks a lot for your help and support.

I have solved it as you suggested, without using a PDT. I wrote 9 rules for the digits (1-9) as follows:

(one : "") decades ("" : "1") | ... | (nine : "") decades ("" : "9")

Regards.

RichardSproat - 2016-11-10 - 09:03

Sure, well for limited length a PDT is not necessary. Glad you solved it.
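
The enumeration approach can be pictured with a small Python sketch. English words stand in for the Arabic ones, and `UNITS`, `DECADES`, `TABLE`, and `spoken_to_digits` are hypothetical names. The table is the finite relation that the nine rules above spell out: the unit word is consumed first, the decade contributes the first digit, and the unit digit is emitted last.

```python
# Hypothetical word lists; English stands in for the Arabic number words.
UNITS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
DECADES = {
    "twenty": "2", "thirty": "3", "forty": "4", "fifty": "5",
    "sixty": "6", "seventy": "7", "eighty": "8", "ninety": "9",
}

# Enumerate every "unit decade" phrase with its digit string (decade digit first).
TABLE = {f"{u} {d}": DECADES[d] + UNITS[u] for u in UNITS for d in DECADES}


def spoken_to_digits(phrase: str) -> str:
    """Look up a spoken 'unit decade' phrase in the enumerated table."""
    return TABLE[phrase]


print(spoken_to_digits("one twenty"))
print(len(TABLE))
```

The table stays small because the reversal is bounded at two digits; the size would grow quickly if longer reversed spans had to be enumerated the same way.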

Issue with GRM file that only contains function definitions.

RichardSproat - 2016-02-05 - 13:04

I just discovered a bug in the underlying FAR reader code that causes a problem if one of your grammars only has functions and no exports.

It is perfectly legal in Thrax to have a grammar that only has functions, but if you try to import that grammar into another grammar the compiler will dump core due to an error apparently with the STTableFarReader.

This will get fixed hopefully soon, but in the meantime the workaround is to include a trivial export such as

export FOO = "a";

in your function file.


Simple tool for running FARs?

FilipG - 2015-09-12 - 02:06

Is there a simple tool available for transforming standard input into standard output with a FAR (within OpenFst or Thrax or somewhere else)? I mean something like

<verbatim> process-with-far transducer.far < in.txt > out.txt </verbatim>

Of course, you've got thraxrewrite-tester, but it pollutes the output with "Input/Output string:" and it was not written with efficiency in mind. It wouldn't be difficult to hack it, but I am wondering whether anything else is available.

RichardSproat - 2015-09-12 - 09:29

Not exactly, since you presumably want to choose which FSTs to use from the far.

I will be releasing a new version of Thrax at some point (when I can get around to it) that will have various changes, and I could add that as a feature to the rewrite-tester. But that may not be soon enough for you.
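
In the meantime, one possible stopgap is to post-process the tester's output. The sketch below assumes the output looks like the transcripts quoted elsewhere on this page ("Input string:" prompts, "Output string: ..." results, and "Rewrite failed." on failure); the exact layout when input is piped is an assumption, and `extract_outputs` is a hypothetical helper, not part of Thrax.

```python
import re

# Capture whatever follows "Output string: " up to the next prompt or end of line.
OUTPUT_RE = re.compile(r"Output string: (.*?)\s*(?=Input string:|$)", re.MULTILINE)


def extract_outputs(transcript: str) -> list:
    """Return only the rewritten strings from a thraxrewrite-tester transcript."""
    return OUTPUT_RE.findall(transcript)


# Hypothetical captured output; failed rewrites simply contribute nothing.
SAMPLE = (
    "Input string: Output string: dog\n"
    "Input string: Output string: gododog\n"
    "Input string: Rewrite failed.\n"
    "Input string: "
)

print(extract_outputs(SAMPLE))
```

This only cleans up the text; it does nothing about the efficiency concern raised above.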


Undefined error while compiling thrax

PrashantGupta - 2015-08-31 - 06:27

Hi, I am using Thrax (1.0.2) and OpenFst (1.3.4) with gcc version 4.4.7 (I need to make it work on this version). I have built OpenFst with --enable-far=yes --enable-pdt=yes, but I still get the following errors:

//usr/local/lib/libthrax.so: undefined reference to `fst::IsSTList(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
//usr/local/lib/libthrax.so: undefined reference to `fst::IsSTTable(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'

I use " g++ -g -O2 -std=c++0x -o nersuite nersuite-main.o nersuite-nersuite.o nersuite-FExtor.o nersuite-crfsuite2.o ../nersuite_common/libnersuite_common.a -lcrfsuite -llbfgs -lm -ldl -lfst -lthrax -Wl -lboost_unit_test_framework" to compile my code. I also tried using the command in ldconfig.

Any help would be appreciated. Thank you

RichardSproat - 2015-08-31 - 09:05

The latest version of Thrax is 1.1.0. Have you tried that version? It requires OpenFst 1.4.0, but that should work with your compiler. I would strongly recommend using that route. You also get more features in Thrax that way.

PrateekBaranwal - 2015-09-01 - 16:33

Richard, 1.4.0 does not build with gcc 4.4.7 (the default on Red Hat servers running RHEL 6.x).

./../include/fst/union.h:140: instantiated from ‘fst::UnionFst<A>::UnionFst(const fst::Fst<A>&, const fst::Fst<A>&) [with A = fst::ArcTpl<fst::LogWeightTpl<float> >]’
stl_pair.h:90: error: invalid conversion from ‘int’ to ‘const fst::Fst<fst::ArcTpl<fst::LogWeightTpl<float> > >*’

RichardSproat - 2015-09-02 - 10:15

I see. I misread your version number.

Unfortunately in general it's a little hard to support older versions, with compilers changing and so forth. For me to reproduce your error would require me to replicate your set of conditions, which would include the out-of-date compiler you are using.

So I have two suggestions for you.

1) Upgrade your compiler to 4.7. Then you'll get the benefit of the latest version of OpenFst and the latest version of Thrax.

Or if you cannot do that, then:

2) Read further down on this page where you will find that someone reported what looks like the exact same error about a year and a half ago. See my reply dated 12 Jan 2014 - 13:40. See if my suggestion works.

In fact one of the reasons for forums like this is to archive these sorts of problems, so it's good to check if someone else has reported the same or a similar issue before posting.

KennethRBeesley - 2015-12-02 - 13:25

On the issue of configuration options, the example above shows --enable-far=yes --enable-pdt=yes. Is that correct? An example on http://www.cslu.ogi.edu/~sproatr/Courses/textNorm/tutorial.html shows something different: --enable-far=true. Should --enable-far all by itself work?

KennethRBeesley - 2015-12-02 - 15:45

For OpenFst, ./configure --help lists the optional features --enable-far and --enable-pdt without any suggestion that they need or take =yes or =true or =anything.

RichardSproat - 2015-12-03 - 09:03

Thanks Ken for pointing these out. They were relics from an earlier version. I have corrected the error in the quick tour. I will correct errors in the config and other places when I do a release of a new version sometime soon.

Pluralization

AlexanderSolovets - 2015-08-15 - 18:13

Suppose I have a transducer that turns numbers into their spoken representation, e.g. 23 -> twenty-three. Now I want to handle US currency, so $23 becomes "twenty-three dollars". Obviously "$1" should be "one dollar". To implement it in Thrax I might just add the whole string as an alternative path with a lower weight, but as I have many different units ("2m" -> "two meters", but "1m" -> "one meter"), I wonder what the idiomatic way to implement pluralization would be. I feel like I should use Features and Paradigms, but I lack good examples of their application. Thank you.

RichardSproat - 2015-08-16 - 09:16

You could use the features functionality, though for English this might be a bit of overkill. For simple cases like English I would just have two StringFiles, one for the singulars and one for the plurals, then define singular_nouns to use the first and plural_nouns the second, then just do the obvious combination with "1" versus all the other numbers.
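
The shape of that two-table approach can be sketched in Python. The dictionaries below are hypothetical stand-ins for the two StringFiles and for a number verbalizer; nothing here is Thrax syntax, and a real grammar would express the same choice as a union of two concatenations.

```python
# Hypothetical stand-ins for the singular and plural StringFiles.
SINGULAR = {"$": "dollar", "m": "meter"}
PLURAL = {"$": "dollars", "m": "meters"}

# Tiny hypothetical number verbalizer; a real grammar would cover all numbers.
NUMBERS = {"1": "one", "2": "two", "23": "twenty-three"}


def verbalize(number: str, unit: str) -> str:
    """Pair "1" with the singular unit name and every other number with the plural."""
    names = SINGULAR if number == "1" else PLURAL
    return f"{NUMBERS[number]} {names[unit]}"


print(verbalize("23", "$"))
print(verbalize("1", "m"))
```

Adding a new unit then means adding one line to each table rather than touching the number grammar.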

If you wanted to use the features/paradigms functionality, there's an example for a more complex case in the distribution. See: src/grammars/paradigms_and_features.grm


regex lookahead in Thrax

BernardR - 2015-06-29 - 10:39

Is it possible to do lookahead in the Thrax grm files? For example, require at least one digit, one lowercase, and one uppercase as in regex below:

( (?=.*\d) (?=.*[a-z]) (?=.*[A-Z]) .{6,20} )

Thanks

RichardSproat - 2015-07-06 - 18:01

I'm not sure what you are trying to do, but you may just want to use a CDRewrite rule, which allows you to change one regexp to another in the context of two other regexps that are not considered part of the first two regexps.

BernardR - 2015-07-08 - 15:26

So there is no simple way to use regex lookahead? So Thrax does not support this? Would like to create FSA to detect the pattern described. Thanks.

RichardSproat - 2015-07-09 - 09:04

Regex lookahead is not something that is implemented per se. But CDRewrite implements all of the functionality that one uses regexp lookahead in PCRE's for, as far as I can tell. If you want to detect a regular expression in the context of another regular expression and know that you have detected it, an easy way is to write a CDRewrite rule that inserts some marker after (or before) the first regular expression if it occurs in the context of the second regexp. This gives you all the functionality that the PCRE lookahead would give you.
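
As a side note for the pure detection case asked about above: lookaheads anchored at the start of the pattern amount to intersecting several regular languages, and regular languages are closed under intersection, so the pattern stays within finite-state power. The Python sketch below only checks that equivalence with the standard `re` module; how best to build the corresponding acceptor in a Thrax grammar is left open here, and the function names are hypothetical.

```python
import re

# The original pattern, with lookaheads.
LOOKAHEAD = re.compile(r"(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{6,20}")

# The same language as an intersection of four lookahead-free patterns.
PARTS = [
    re.compile(p)
    for p in (r".*\d.*", r".*[a-z].*", r".*[A-Z].*", r".{6,20}")
]


def by_lookahead(s: str) -> bool:
    """Match the whole string against the lookahead pattern."""
    return LOOKAHEAD.fullmatch(s) is not None


def by_intersection(s: str) -> bool:
    """Accept only strings that every component pattern accepts in full."""
    return all(p.fullmatch(s) is not None for p in PARTS)


for sample in ["Abc123", "abcdef", "ABC123", "Ab1", "A" * 10 + "b" * 10 + "1"]:
    print(sample, by_lookahead(sample), by_intersection(sample))
```

Both functions agree on every sample, including the 21-character one that fails only the length constraint.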


User defined symbol tables on PDTs

SofiaK - 2014-12-18 - 09:41

Hi all,

I am new to Thrax and OpenFst and I would appreciate it a lot if you could help me with the following issue. I need to use my own symbol table with a PDT or to be able to extract the symbol table in a non-binary format. So far I was not able to do so as the fst extracted from my far has an empty symbol table.

Let me show you how I worked:

1. I created my grammar that will cover digits one to nine and I got the symbol table I use let's say with another fst.

numbers_en_US.grm

# Numbers simple grammar for en-US.
# Covers numbers 0 to 9

my_symbol_table=SymbolTable['numbers.txt'];

export PARENS = ("[<s>]" : "[</s>]");

space = " " ;

units = Optimize [ ("zero".my_symbol_table) | ("one".my_symbol_table) | ("two".my_symbol_table) | ("three".my_symbol_table) | ("four".my_symbol_table) | ("five".my_symbol_table) | ("six".my_symbol_table) | ("seven".my_symbol_table) | ("eight".my_symbol_table) | ("nine".my_symbol_table) ];

export NUMBERS = ("[<s>]" (units space)* units "[</s>]")* ;

numbers.txt

eight 0

extra1 1

extra2 2

<eps> 3

five 4

four 5

nine 6

one 7

</s> 8

<s> 9

seven 10

six 11

three 12

two 13

zero 14

2. Then I compiled my grammar, extracted the fst from the far and checked the fst info:

$ fstinfo NUMBERS

fst type vector

arc type standard

input symbol table none

output symbol table none

# of states 12

# of arcs 32

initial state 11

...

3. So as the symbol table is empty, when I test, it is impossible to get rewrites:

$ thraxrewrite-tester --far=numbers_en_US.far --rules=NUMBERS\$PARENS --output_mode=numbers.txt

Input string: one

Rewrite failed.

$ thraxrewrite-tester --far=numbers_en_US.far --rules=NUMBERS\$PARENS

Input string: one

Rewrite failed.

So, any ideas on how to use my symbol table? Or even how to get the internal symbol table in a non-binary format?

Thanks, Sofia

RichardSproat - 2014-12-19 - 10:39

RichardSproat - 2014-12-19 - 10:45

The symbols generated for the PARENS will be in the FST named *StringFstSymbolTable, which you will see if you do a farextract on the far.

But it looks as if you are assuming two symbol tables here, one being your own, the other being the one that will be generated for those extended labels. I think what you want to do is something like this:

export PARENS = ("<s>".my_symbol_table : "</s>".my_symbol_table);

Then you need to run the compiler with the --save_symbols flag. Finally you will need to use the --input_mode and probably the --output_mode flags to thraxrewrite-tester with the argument being your symbol table.

If that still doesn't work, can you send me (rws@google.com) the complete set of files needed to build your target, and I will have a look.

--R

SofiaK - 2014-12-24 - 05:12

Hi Richard, I followed your advice, but the .far I get with my symbol table is completely different from the one without it. That is expected, but "initial state 0" worries me, for example. I will send you my set of files so you can get an idea.

compile error on openSuse 13.1

RogerB - 2014-11-19 - 14:38

Hi, I downloaded openfst 1.4.1 and opengrm-ngram 1.2.1 but the latter won't compile on openSuse 13.1.

./configure says "configure: error: fst/extensions/far/far.h header not found"

However, I find this file at /home/roger/sphinx/openfst-1.4.1/src/include/fst/extensions/far/far.h

Compilation and installation of openfst was successful (as far as I can tell so far).

Do I need to add this path/header file somewhere?

Thanks Roger

RogerB - 2014-11-19 - 16:19

oh, i found out openfst must be 'built' with ./configure --enable-far=true

RichardSproat - 2014-11-20 - 09:01

Right, glad you found it.


Russian phonetic transcription rules

AlexisWilpert - 2014-08-22 - 09:40

Hi all, nice to meet you!

Let me introduce myself, as I am new here. My name is Alexis and I am a computational linguist and software developer. I was very excited by the discovery of the Thrax framework and after a short investigation I decided this was my thing :) I immediately started digging into it, but unfortunately I was not able to find "real-world" examples of usage, which would have simplified my task.

However, I just kept going on. I have been working for Yandex and developing a rule-based system for generating Russian phonetic transcriptions (in the context of speech synthesis). My company has been very generous and allowed me to open source the rules I wrote.

Probably I do not even use half of the power of Thrax, but I managed to write a working rule-based system just sticking to the basics :) I thought this could be useful for someone else (as it would have been for myself at the beginning). That is why I thought I should post here about them. Please take into account that this was my first try with Thrax and that I probably could have written the rules in a much better way if I had more knowledge.

In case someone is interested, you will find them here: https://github.com/wilpert/RusPhonetizer/tree/master/grammars

Thrax was a wonderfully powerful and easy to use framework for my work, something I did not experience before. I am utterly thankful to the authors for their amazing achievement. And to Yandex for allowing me to share my work.

Thanks to you all and be happy :)

Alexis

RichardSproat - 2014-11-17 - 09:11

RichardSproat - 2014-11-17 - 09:15

Hi Alexis:

Glad it has proved useful to you. Yeah, there are various toy examples around, but not many "real world" examples that I know of that are public, at least not yet.

I'll be happy to take a look sometime at your grammars and send along suggestions if I have any.

Richard Sproat

AlexisWilpert - 2014-11-29 - 12:57

Hi Richard,

yes, it would be great if you would find any time to have a look at my grammars, any feedback would be terribly appreciated!

Thanks again for the software,

Alexis


Error compiling on Ubuntu VM

EstherJudd - 2014-06-19 - 13:02

I am trying to compile Thrax in an Ubuntu VM using VirtualBox. I have gcc 4.8.2 installed and compiled openfst with far and pdt enabled and in shared mode. I have 1 GB of RAM dedicated to the VM. If I try ./configure --enable-shared, it fails because I run out of memory. If I try just ./configure and then make, everything seems to compile OK until I get an internal compiler error:

/bin/bash ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -std=c++0x -MT loader.lo -MD -MP -MF .deps/loader.Tpo -c -o loader.lo `test -f 'walker/loader.cc' || echo './'`walker/loader.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -std=c++0x -MT loader.lo -MD -MP -MF .deps/loader.Tpo -c walker/loader.cc -fPIC -DPIC -o .libs/loader.o
g++: internal compiler error: Killed (program cc1plus)

RichardSproat - 2014-06-20 - 09:12

Try commenting out the lines that refer to Log64Arc in src/include/thrax/function.h, viz

function.h:70:extern Registry<Function<fst::Log64Arc>* > kLog64ArcRegistry;
function.h:87: typedef name<fst::LogArc> Log64Arc ## name;
function.h:88: REGISTER_LOGARC_FUNCTION(Log64Arc ## name)

(Obviously, be careful inside that #define REGISTER_GRM_FUNCTION to leave the line-continuation "\"s intact.)

The downside is you won't get log64 arcs. The upside is it should be smaller. The fact that it's running out of memory in compiling the loader makes me suspect that may be the problem because for each of the different arc types, all of the templated classes have to be expanded. This should reduce the size, therefore. If that still doesn't work, remove log arcs too. You won't likely be using them. Indeed, for precisely these sorts of issues I have been thinking of disabling those in future versions.

EstherJudd - 2014-06-20 - 12:22

I did that and also had to comment out similar lines in src/lib/walker/evaluator-specialization.cc (lines 35 and 49-53).

I also tried taking out LogArc and all its mentions in function.h and evaluator-specialization.cc. But I still get an internal compiler error.

LemOmogbai - 2014-11-16 - 11:42

Did you ever get this to work? I have the same problem compiling Thrax.

utils/utils.cc 'close' not declared?

StevenBedrick - 2014-02-17 - 18:01

Hello, Richard et al.-

While compiling Thrax 1.1 (against OpenFST 1.3.4 on an Ubuntu 13.10 system), I'm getting the following compilation error:

...
/bin/bash ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT utils.lo -MD -MP -MF .deps/utils.Tpo -c -o utils.lo `test -f 'util/utils.cc' || echo './'`util/utils.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT utils.lo -MD -MP -MF .deps/utils.Tpo -c util/utils.cc -fPIC -DPIC -o .libs/utils.o
util/utils.cc: In function 'bool thrax::Readable(const string&)':
util/utils.cc:139:13: error: 'close' was not declared in this scope
            close(fdes);
            ^
make[3]: *** [utils.lo] Error 1
make[3]: Leaving directory `/home/steven/thrax-1.1.0/src/lib'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/steven/thrax-1.1.0/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/steven/thrax-1.1.0'
make: *** [all] Error 2

Any ideas what might be going on here?

StevenBedrick - 2014-02-17 - 18:04

OK, this is ridiculous. Click here to see a Gist:

https://gist.github.com/stevenbedrick/809dbe2c921d745fbcc6

RichardSproat - 2014-02-18 - 09:07

I don't know. I will have to investigate.

RichardSproat - 2014-02-18 - 09:31

Does explicitly including unistd.h help?

StevenBedrick - 2014-02-23 - 23:01

Yup, adding that #include to util/utils.cc does the trick.
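For anyone hitting the same error, the fix amounts to the following pattern. This is a minimal sketch, not the actual Thrax source: the function name IsReadable is hypothetical, and the body is only an assumption about what a readability check built on open() and close() looks like. The point is the explicit include of unistd.h, which declares close().

```cpp
#include <fcntl.h>   // open(), O_RDONLY
#include <unistd.h>  // close(); include it directly instead of relying on
                     // some other header to pull it in.

#include <cassert>
#include <string>

// Hypothetical stand-in for a readability check; not the Thrax implementation.
bool IsReadable(const std::string& path) {
  const int fdes = open(path.c_str(), O_RDONLY);
  if (fdes == -1) return false;  // open() failed: missing file or no permission.
  assert(fdes >= 0);  // On success open() returns a non-negative descriptor.
  close(fdes);
  return true;
}
```

Without the unistd.h line, whether this compiles depends on which other headers happen to include it on a given system, which would explain why the error shows up only with some toolchains.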

RichardSproat - 2014-02-24 - 09:01

RichardSproat - 2014-02-24 - 09:09

Ok thanks.

So the question is why you aren't getting that by inheritance. This is the first time I've seen this problem and I have no idea where it has suddenly broken.


compilation fails

KyleGorman - 05 Nov 2013 - 14:53

Hi Richard (etc.), using Thrax 1.1.0 (and with OpenFst 1.3.4 already installed), compilation fails while making the file `ast/identifier-node.cc` due to an issue in the `include/thrax/compat/utils.h` header. Here's the error:

/bin/sh ../../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT identifier-node.lo -MD -MP -MF .deps/identifier-node.Tpo -c -o identifier-node.lo `test -f 'ast/identifier-node.cc' || echo './'`ast/identifier-node.cc
libtool: compile: g++ -DHAVE_CONFIG_H -I./../include -g -O2 -MT identifier-node.lo -MD -MP -MF .deps/identifier-node.Tpo -c ast/identifier-node.cc -fno-common -DPIC -o .libs/identifier-node.o
In file included from ast/identifier-node.cc:22:
./../include/thrax/compat/utils.h:119:8: error: field has incomplete type 'char []'
  char buf[];
       ^

I presume this is because buf[] doesn't have a length defined (nor is it initialized with a string), and when I change the line to

char buf[1024];

compilation goes through. (I'm not sure this is a sensible default; I spent no time trying to understand what this code is doing.)

I'd include a patch but it's one line.

Kyle

RichardSproat - 05 Nov 2013 - 16:38

Just remove that line: that variable is not used. Apparently it's a holdover from some earlier implementation, and I just forgot to update it. I'll fix this in the next release.

TEST

RichardSproat - 13 Sep 2013 - 12:16

This is a test. Please ignore.


Recommended way to obtain FST+symbols for use

JosefNovak - 10 Jun 2013 - 09:46

Hi,

I am currently using Thrax to extend some features of an alignment tool I wrote for my g2p system.

The basic idea is that the user can specify some alignment correspondence rules and optional default penalties, and then these can be incorporated into the EM training process.

At present I have kind of hacked the functionality of the thraxcompiler command tool to read in the grammar, and then return the desired FST+symbol table to the alignment program.

EDIT: Maybe it makes more sense to just provide a couple of snippets:

GetFstFromGrammar

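template <typename Arc>
VectorFst<Arc> GetFstFromGrammar(const string& input_grammar, const string& rules_name) {
  GrmCompilerSpec<Arc> grammar;
  // Stays empty if parsing, evaluation, or the rule lookup fails.
  VectorFst<Arc> rules;
  if (grammar.ParseFile(input_grammar) && grammar.EvaluateAst()) {
    const GrmManagerSpec<Arc>* manager = grammar.GetGrmManager();
    // FstMap (typedef not shown) maps exported rule names to FST pointers.
    FstMap fsts = manager->GetFstMap();
    for (typename FstMap::const_iterator it = fsts.begin();
         it != fsts.end(); ++it) {
      cout << "Echo: " << it->first << endl;
    }
    // find() avoids operator[] inserting a null entry for an unknown rule
    // name and then dereferencing it.
    typename FstMap::const_iterator found = fsts.find(rules_name);
    if (found != fsts.end()) rules = *found->second;
  }
  return rules;
}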

toy.grm

sy = SymbolTable['simple.syms'];

zero  = "0".sy : "zero".sy;
units = ( "these're".sy : ( "these're".sy | "[these]" | "[these]" "are".sy ) );
split = ( "[these]" "are".sy : "these're".sy );
sigma = "<sigma>".sy : "<sigma>".sy;
abc   = ( "a".sy "b c".sy : "a b b".sy );
export RULES = Optimize[ sigma* ( units | zero | abc ) sigma* ];

Here the 'sigma' is used in combination with a specialized 1-state alignment transducer that relies on RHO and SIGMA matchers.

Is there an alternative or recommended way to do this? It would be great if I could either specify the symbol table just once at the beginning, or automatically infer/generate the whole symbol table and return it, or, even better, modify the grammar from my C++ application to simplify what the user is responsible for doing.

I went through the FAQ but did not notice any answers to these questions.

Thanks for your time.

UPDATE: I solved this by creating some bindings with pybindgen and then writing a generator that interprets a simplified version of the Thrax grammar, then expands it to the verbose version with the extra quotes, symfile suffixes, etc.

JackRoh - 2016-06-22 - 02:18

Hi, great help! Could you share the pybindgen-side code as well, if you wish? :)

JackRoh - 2016-06-22 - 02:39

I'd like to run this FST model for an Inverse Text Normalization task. It runs from the shell with

$ thraxrewrite-tester --far=main.far --rules=ITN < text.txt

and I need to use this in C++, so I converted the grm file to an fst file with the command below:

fstcompile --isymbols=$byte_sym --osymbols=$byte_sym ${fst}.fst.txt | fstarcsort --sort_type=olabel - > ./${ODir}/${fst}.fst

So I have an fst file to load, but how can I call this FST model from C++ so that I can feed in a sequence of strings as ITN input and get the ITN output?

Please also share the symbol table, just for reference. Thanks!

RichardSproat - 2016-06-22 - 09:33

RichardSproat - 2016-06-22 - 09:34

The best way to do that would be to link with the library and use GrmManager to load the far, and then you can specify whatever rules you want to apply. If you follow the example in the rewrite-tester that should give you an idea of how to do it.

JackRoh - 2016-06-23 - 21:10

Thanks Richard for the reply!

By the rewrite-tester example you mean the thrax-1.2.2/src/grammars files, right? I went through all of them and built the rewriter far and fst files.

What I want is to load these files into my other C/C++ program.

Thanks in advance!

RichardSproat - 2016-06-24 - 09:06

No, that is not what I meant.

Look in src/bin at the code for rewrite tester. Then look and see what it does. Then figure out how to write similar code that uses the GrmManager in the same way to do what you want.

Hopefully that is clearer.


Need some help, New to "Thrax"

GoudjilKamel - 03 Jan 2013 - 17:29

Compiling under Ubuntu 12.04 LTS, I got the message below at linking:

libtool: link: g++ -g -O2 -o .libs/thraxcompiler compiler.o -L/usr/local/lib/fst -lm -ldl -lfst /usr/local/lib/fst/libfstfar.so ../lib/.libs/libthrax.so -Wl,-rpath -Wl,/usr/local/lib/fst -Wl,-rpath -Wl,/usr/local/lib
../lib/.libs/libthrax.so: undefined reference to `fst::IsSTList(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
../lib/.libs/libthrax.so: undefined reference to `fst::IsSTTable(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
collect2: ld returned 1 exit status

RichardSproat - 29 Aug 2013 - 11:47

Did you compile the fst library with the far extension?

DanXu - 08 Jan 2014 - 02:55

I have also encountered the same problem with v1.1.0 (compiling export/batch_test), and I compiled Thrax with far enabled.

RichardSproat - 08 Jan 2014 - 09:06

Yes, but did you also compile the fst library with far enabled?

DanXu - 09 Jan 2014 - 09:53

Yes (openfst 1.3.4 compiled with --enable-far and some other enable options). Thrax compiled successfully, but compilation fails while making the file `batch_test.c` (extracted from export.tgz). Can you give me some advice?

RichardSproat - 10 Jan 2014 - 09:11

I'd like to but first I need to understand what is going on. I can't reproduce your error (apparently) and I don't know what batch_test.c is since it's not part of the Thrax distribution. Is this your own code? If so then I need to see EXACTLY what you are doing, including probably your sending me a directory with all of the additional code.

If this is part of the Thrax distribution then please tell me where it is because I can't find it (nor do I remember such a file).

DanXu - 11 Jan 2014 - 09:18

thank you for your reply.

in this page:

http://openfst.cs.nyu.edu/twiki/bin/view/Contrib/ThraxContrib,

you can see

Projects using the OpenGrm Thrax tools: export.tgz: Grammars and software developed as part of a text normalization class taught at the Center for Spoken Language Understanding, Fall 2011. URL for the course: http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/

I downloaded "export.tgz". There is a file called batch_tester.cc in the batch_tester directory (extracted from export.tgz).

RichardSproat - 12 Jan 2014 - 09:08

Ok that helps. Yes, I did write that, but it wasn't obvious from your query that this is what you were referring to. Please in future give all necessary information when reporting a bug.

In the meantime I will have a look. I do not know off the top of my head what the problem is.

RichardSproat - 12 Jan 2014 - 13:40

Ok it's the usual nonsense about ordering of shared object libraries. If you do things in this order it should work:

g++ -g -O2 -o batch_tester batch_tester.o -L/usr/local/lib/fst -lm -ldl -lfst -lthrax -Wl,--rpath -Wl,/usr/local/lib/fst -Wl,--rpath -Wl,/usr/local/lib/fst -Wl,--rpath -Wl,/usr/local/lib /usr/local/lib/fst/libfstfar.so

Evidently there is a bug in the configuration of the distribution that was not causing problems before, but is now. I will look into that, but in the meantime, please try linking manually as above.

DanXu - 14 Jan 2014 - 03:47

It works using the command you wrote above, thanks!

RichardSproat - 14 Jan 2014 - 09:01

Ok good, I'll update the tar file. Not sure why it worked before and not now, but I won't think about that.

Weight semiring

LauriLyly - 21 Nov 2012 - 00:34

So far I find thrax a very neat piece of software but I have two questions...

Can I somehow use probability semiring as weights, because it seems Thrax only allows specifying log and tropical semirings? How about the other ones... Or should I somehow postprocess the generated far file?

Another question: I tried to use "fstdraw" on a far file, but got: ERROR: FstHeader::Read: Bad FST header: example.far

Is this a version mismatch?

LauriLyly - 29 Nov 2012 - 07:34

Sorry, obviously my bad, as it's a far and not an fst file :P Still not too familiar. But the weight question still applies ;)

RichardSproat - 29 Nov 2012 - 10:07

Sorry, I missed the earlier comment -- for some reason I didn't get email about it.

Unfortunately the restriction to Log and Tropical is due to a similar restriction in the fst library: the real semiring does not come predefined. The best suggestion would be to use Tropical and then just do the obvious e^-cost conversion.
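The e^-cost conversion can be sketched as follows. This is a minimal sketch, assuming the tropical weights have already been read off the FST as plain floating-point costs (negative log probabilities); the function names CostToProbability and ProbabilityToCost are hypothetical and not part of OpenFst or Thrax.

```cpp
#include <cassert>
#include <cmath>

// Converts a tropical-semiring cost (a negative log probability) to a
// probability in the real semiring: p = e^(-cost).
double CostToProbability(double cost) { return std::exp(-cost); }

// Inverse direction: cost = -ln(p). Only defined here for p > 0.
double ProbabilityToCost(double probability) {
  assert(probability > 0.0);
  return -std::log(probability);
}
```

Because costs along a path are added in the tropical semiring, converting the total path cost gives the product of the per-arc probabilities: CostToProbability(a + b) equals CostToProbability(a) * CostToProbability(b) up to rounding.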


-- CyrilAllauzen - 13 Aug 2012

Topic attachments
loader.cc (r1, 2.9 K, 2016-12-22 - 15:55, RichardSproat): Fixed version of loader.cc to address issue found by Carlo DiFerrante.
Topic revision: r169 - 2024-11-01 - RichardSproat