TWiki > GRM Web > NGramLibrary > NGramAdvancedUsage (2019-06-14, KyleGorman)
%TOC%

---+ <nop>OpenGrm Advanced Usage

Below are a variety of topics covered in greater depth, or of more specialized interest, than in the Quick Tour. Reading the [[NGramQuickTour][Quick Tour]] first is recommended.

#DistributedComputation
---++ Distributed Computation %ICON{"wip"}%

The C++ operations in !OpenGrm offer extensive distributed computation support. N-gram counting can readily be parallelized by _sharding_ the data and producing a [[NGramCount][count FST]] <i>M<sub>d</sub></i> for each data shard _d_. These can then be [[NGramMerge][count-merged]] to produce a single count FST. Alternatively, and with more parallelism, each <i>M<sub>d</sub></i> can be further split by _context_ _c_, which restricts each sub-model <i>M<sub>d,c</sub></i> to a specific range of n-grams. The <i>M<sub>d,c</sub></i> sharing the same context _c_ can then be count-merged to produce one model <i>M<sub>c</sub></i> per context. The <i>M<sub>c</sub></i> are constructed to be in proper [[NGramModelFormat][n-gram model format]], so they can be processed in parallel by the estimation and pruning operations; the pruned model components can then be [[NGramMerge][model-merged]] into a single model at the end of this pipeline.

We have implemented a complete distributed version of !OpenGrm NGram in [[http://dl.acm.org/citation.cfm?id=1806638][C++ Flume]]; however, this system is not currently released. Instead, we provide here some added functionality in our convenience script, [[NGramQuickTour#ConvenienceScript][ngramdisttrain.sh]]. While this script does not perform parallel computation, it can construct a pruned model as described above using data and context sharding. This allows processing corpora that would otherwise exceed available memory, given adequate disk space (under =$TMPDIR=) and computation time.
The script could also serve as a starting point for a fully distributed implementation by parallelizing the calls internal to the script, which should speed up the pipeline roughly linearly with the degree of parallelism. An implementation that, unlike =ngram.sh=, did not use the file system for all data sharing and transfer would also help greatly.

Multiple data shards are supported by specifying multiple input files to =ngramdisttrain.sh=, e.g., with =--ifile="in.txt[0-9]"=. Multiple contexts are supported by specifying a context file with =--contexts=context.txt=. The best way to create a context file with, say, ten contexts balanced in size is:

<pre>
ngramcontext --context=10 lm.fst context.txt
</pre>

where =lm.fst= is an n-gram LM built on a sample of the corpus (e.g., one small enough to build unsharded). Note that you must provide a context file (even if it contains only one context) if you want to use data sharding. If you wish the sharded context models to be merged when the pipeline finishes, provide the =--merge_contexts= flag; otherwise the component models are returned separately.

-- Main.MichaelRiley - 07 Aug 2013