|
1 | | -jopenfst |
| 1 | + |
| 2 | +JOpenFST |
2 | 3 | ======== |
3 | 4 |
|
4 | | -A partial Java port of the C++ [OpenFST library](http://www.openfst.org/twiki/bin/view/FST/WebHome) which provides a library to |
5 | | - build weighted finite state transducers and perform various common FST tasks such as: |
| 5 | +A _partial_ Java port of the C++ [OpenFST library](http://www.openfst.org/twiki/bin/view/FST/WebHome), which provides |
| 6 | +a library to build weighted finite state transducers (WFSTs) and perform various common FST/FSA tasks such as: |
6 | 7 |
|
7 | | -* Determinization |
| 8 | +* Determinization (for both acceptors and transducers) |
8 | 9 | * Union / Intersection |
9 | 10 | * Composition |
10 | 11 | * Shortest Path computation |
11 | | - |
12 | | -This project was originally forked from the CMU Sphinx project. This was originally work by John |
13 | | -Salatas as part of his GSOC 2012 project to port phonetisaurus over to java. Since then the code appears to be |
14 | | -abandoned and doesn't appear to have been integrated in to the final CMU Sphinx project trunk. I needed a decent |
15 | | -WFST library for some of my stuff, so I used his code as a starting point. I have cleaned up quite a bit |
16 | | - of the code, really changed the APIs, and updated unit tests. My JG2P project uses this, and thus I have |
17 | | - some confidence that the code is working accurately. However, I would still consider this *beta quality*. When I am comfortable with the stability, I will push a v1.0 to Maven Central Repo. |
| 12 | + |
| 13 | +OpenFST is a mature, elegant, and feature-rich library in C++. JOpenFST aims to implement many of the features of
| 14 | +OpenFST in a pure Java implementation, which can be useful if you are trying to use WFST operations within an existing |
| 15 | +pure Java architecture or service. Some environments are not easily suited to using JNI or creating a separate C++ |
| 16 | +service endpoint to do all of the WFST operations. In these circumstances, JOpenFST provides an alternative. |
| 17 | + |
| 18 | +OpenFST is designed quite elegantly and relies on sophisticated C++ template metaprogramming features to achieve |
| 19 | +top speed (and nice generality). Since Java offers no rich metaprogramming facility, JOpenFST differs significantly |
| 20 | +in its API and implementation. JOpenFST is probably more fairly described as a re-imagining of OpenFST within the |
| 21 | +constraints and idioms of Java/JVM. Because of that, JOpenFST lacks some features that are present in OpenFST, but |
| 22 | +hopefully that gap will close over time (PRs welcome!). |
| 23 | + |
| 24 | +Here are some of the most notable differences from OpenFST: |
| 25 | +* All WFST operations are eagerly executed; there is no deferred/lazy evaluation. There are, however, some optimizations
| 26 | + when doing operations on Immutable instances (see Compose) to avoid unnecessary copying. |
| 27 | +* JOpenFST can only import/export using the OpenFST/AT&T text format (as produced by `fstprint` and consumed by |
| 28 | + `fstcompile`); JOpenFST cannot currently import OpenFST binary models (as produced by `fstcompile`). |
| 29 | +* There are mutable and immutable types that mirror each other (MutableFst, ImmutableFst, MutableState, ImmutableState, etc.) |
| 30 | +* There is a Gallic weight and semiring but not a separate String semiring. |
| 31 | +* The Gallic Weights are either Gallic Restricted or Gallic Min; if you want General Gallic weights, you have to use the |
| 32 | + Union Semiring directly. |
| 33 | +* The following operations are implemented: |
| 34 | + * ArcSort |
| 35 | + * Compose |
| 36 | + * Connect |
| 37 | + * Determinize (for both acceptors and transducers; all modes: functional, non-functional, and disambiguate) |
| 38 | + * Shortest Paths |
| 39 | + * Project |
| 40 | + * Remove Epsilon |
| 41 | + * Reverse |
| 42 | +* The following operations are currently NOT implemented (PRs welcome): |
| 43 | + * Minimization (coming soon) |
| 44 | + * TopSort (coming soon) |
| 45 | + * Closure |
| 46 | + * Concat/Union |
| 47 | + * Encode/Decode |
| 48 | + * Difference/Intersect |
| 49 | + * Invert |
| 50 | + * Prune (as a separate operation) |
| 51 | + * Push |
| 52 | + * Synchronize |
| 53 | + |
| 54 | +This project originated as work in the CMU Sphinx project by John Salatas as part of his GSoC 2012 project, but since
| 55 | +then it has been mostly rewritten to bring in new enhancements, improve the API, and improve performance. |
18 | 56 |
|
19 | 57 | Current version: |
20 | 58 | ```xml |
21 | 59 | <dependency> |
22 | 60 | <groupId>com.github.steveash.jopenfst</groupId> |
23 | 61 | <artifactId>jopenfst</artifactId> |
24 | | - <version>0.1.1.ALPHA</version> |
| 62 | + <version>0.3.0</version> |
25 | 63 | </dependency> |
26 | 64 | ``` |
27 | 65 |
|
28 | 66 | Quick Start |
29 | 67 | ----------- |
30 | | -The API started out pretty close to OpenFST but is diverging over time. The basic abstractions of `Fst`, `State`, `Arc`, |
31 | | -and `SymbolTable` have conceptual analogs in OpenFST. In jopenfst there are *Mutable* and *Immutable* implementations |
| 68 | +The API started out pretty close to OpenFST but has diverged over time. The basic abstractions of `Fst`, `State`, `Arc`, |
| 69 | +and `SymbolTable` have conceptual analogs in OpenFST. In JOpenFST there are *Mutable* and *Immutable* implementations |
32 | 70 | of each. As you programmatically build up your WFSTs, you will use the Mutable API. If you want to de/serialize larger |
33 | | -models (large WFSTs built from training data that are used to construct lattices) and these models don't need to change, then you can convert the mutable instance into an immutable instance after you are done building it (`new ImmutableFst(myMutableFst)`. |
34 | | -ImmutableFsts are likely faster at some operations and also are smarter about reducing unnecessary copying of state. |
| 71 | +models (large WFSTs built from training data that are used to construct lattices) and these models don't need to |
| 72 | +change, then you can convert the mutable instance into an immutable instance after you are done building it |
| 73 | +(`new ImmutableFst(myMutableFst)`). ImmutableFsts are likely faster at some operations and also are smarter about
| 74 | +reducing unnecessary copying of state. |
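The build-then-freeze workflow described above can be illustrated with a minimal, self-contained sketch. Note that `MutableGraph` and `ImmutableGraph` are hypothetical stand-ins invented for this illustration; they are not JOpenFST classes, and the real `ImmutableFst` may well be implemented differently. The sketch only shows the general pattern of a copy constructor that takes a snapshot of a mutable instance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of "build mutable, then freeze"; not JOpenFST code.
class FreezeSketch {

    /** Stand-in for a mutable FST: arcs are kept as simple text records. */
    static final class MutableGraph {
        private final List<String> arcs = new ArrayList<>();

        void addArc(String from, String in, String out, String to, double weight) {
            arcs.add(from + " " + to + " " + in + " " + out + " " + weight);
        }

        List<String> arcs() {
            return arcs;
        }
    }

    /** Stand-in for an immutable FST: takes a defensive copy of the arcs once. */
    static final class ImmutableGraph {
        private final List<String> arcs;

        ImmutableGraph(MutableGraph source) {
            // Copy, then wrap, so later changes to the source are not visible here
            // and callers cannot modify the snapshot.
            this.arcs = Collections.unmodifiableList(new ArrayList<>(source.arcs()));
        }

        List<String> arcs() {
            return arcs;
        }
    }

    public static void main(String[] args) {
        MutableGraph building = new MutableGraph();
        building.addArc("state1", "inA", "outB", "state2", 1.0);
        ImmutableGraph frozen = new ImmutableGraph(building);

        // Further edits to the mutable instance do not affect the frozen snapshot.
        building.addArc("state1", "inC", "outD", "state3", 123.0);
        System.out.println("mutable arcs: " + building.arcs().size());
        System.out.println("frozen arcs: " + frozen.arcs().size());
    }
}
```

Because the snapshot can never change, it can be shared freely between callers without further copying, which is the kind of saving the paragraph above alludes to.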
35 | 75 |
|
36 | 76 | The MutableFst API is probably the best place to start. Here is a sample showing how to
37 | 77 | construct a WFST which shows the basic operations of fsts, states, arcs, and symbols. |
@@ -61,33 +101,34 @@ fst.addArc(startState, "inC", "outD", fst.getOrNewState("state3"), 123.0); |
61 | 101 | ``` |
62 | 102 | Input and Output |
63 | 103 | ---------------- |
64 | | -jOpenFst supports reading/writing the OpenFst text format and our own jopenfst binary serialization format (more compact than text). We cannot currently read/write openfsts binary serialization format, though pull requests with that functionality are very welcome. |
| 104 | +JOpenFST supports reading/writing the OpenFST text format and our own JOpenFST binary serialization format
| 105 | +(more compact than text). We cannot currently read/write OpenFST's binary serialization format.
65 | 106 |
|
66 | | -To read OpenFst text format, you need a `mymodel.fst.txt` file that describes all of the arcs and weights. If you are using labeled states, inputs, or outputs (e.g. for a transducer) then you also need files for those named `mymodel.input.syms`, `mymodel.output.syms`, and `mymodel.states.syms` respectively. An exmaple of these files is in the `src/test/resources/data/openfst` folder in the source. |
| 107 | +To read the OpenFST text format, you need a `mymodel.fst.txt` file that describes all of the arcs and weights. If you
| 108 | +are using labeled states, inputs, or outputs (e.g. for a transducer), then you also need files for those
| 109 | +named `mymodel.input.syms`, `mymodel.output.syms`, and `mymodel.states.syms` respectively. An example of these files
| 110 | +is in the `src/test/resources/data/openfst` folder in the source.
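To give a feel for what the text format contains, here is a minimal, self-contained sketch of a reader for the transducer flavor of the OpenFST/AT&T text format. It is not the JOpenFST importer; `AttTextSketch`, `ArcLine`, and `ParsedFst` are hypothetical names made up for this illustration. It assumes each arc line is `source target input output [weight]`, each final-state line is `state [weight]`, the first state mentioned is the start state, and an omitted weight is treated as `0.0` (the tropical semiring's one). Acceptor files, which carry only one label per arc, are out of scope for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reading the transducer text format; not JOpenFST code.
class AttTextSketch {

    /** One arc line: source state, target state, input/output labels, weight. */
    static final class ArcLine {
        final String from, to, input, output;
        final double weight;

        ArcLine(String from, String to, String input, String output, double weight) {
            this.from = from;
            this.to = to;
            this.input = input;
            this.output = output;
            this.weight = weight;
        }
    }

    /** Everything recovered from the file: start state, arcs, final weights. */
    static final class ParsedFst {
        String startState;
        final List<ArcLine> arcs = new ArrayList<>();
        final Map<String, Double> finalWeights = new LinkedHashMap<>();
    }

    static ParsedFst parse(List<String> lines) {
        ParsedFst fst = new ParsedFst();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] f = line.split("\\s+");
            if (fst.startState == null) {
                fst.startState = f[0]; // first state seen is the start state
            }
            switch (f.length) {
                case 1: // final state, weight omitted
                    fst.finalWeights.put(f[0], 0.0);
                    break;
                case 2: // final state with explicit weight
                    fst.finalWeights.put(f[0], Double.parseDouble(f[1]));
                    break;
                case 4: // arc, weight omitted
                    fst.arcs.add(new ArcLine(f[0], f[1], f[2], f[3], 0.0));
                    break;
                case 5: // arc with explicit weight
                    fst.arcs.add(new ArcLine(f[0], f[1], f[2], f[3], Double.parseDouble(f[4])));
                    break;
                default:
                    throw new IllegalArgumentException("Unrecognized line: " + raw);
            }
        }
        return fst;
    }

    public static void main(String[] args) {
        ParsedFst fst = parse(Arrays.asList(
                "0 1 inA outB 0.5",
                "1 2 inC outD",
                "2 1.5"));
        System.out.println("start=" + fst.startState
                + " arcs=" + fst.arcs.size()
                + " finals=" + fst.finalWeights.keySet());
    }
}
```

When the labels are symbolic rather than numeric, the accompanying `.syms` files supply the mapping between each symbol and its integer id; this sketch simply keeps the labels as strings.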
67 | 111 |
|
68 | | -To read/write the text format call methods `Convert.importFst(..)` and `Convert.export(..)`. Both of these return instances of `MutableFst` which can be converted into `ImmutableFst` via `new ImmutableFst(myMutableFst)`. There are importFst overloads for dealing with either Files or resources from the classpath. |
| 112 | +To read/write the text format, call the methods `Convert.importFst(..)` and `Convert.export(..)`. The import methods return
| 113 | +instances of `MutableFst`, which can be converted into `ImmutableFst` via `new ImmutableFst(myMutableFst)`.
| 114 | +There are `importFst` overloads for dealing with either Files or resources from the classpath.
69 | 115 |
|
70 | | -To read/write the binary format call methods `FstInputOutput.readFstFromBinaryFile` and `FstInputOutput.writeFstToBinaryFile` (there are overloads for dealing with streams/resources. Resources are useful if you want to package your serialized model in your jar and just read it from the classpath. |
| 116 | +To read/write the binary format, call the methods `FstInputOutput.readFstFromBinaryFile` and
| 117 | +`FstInputOutput.writeFstToBinaryFile` (there are overloads for dealing with streams/resources).
| 118 | +Resources are useful if you want to package your serialized model in your jar and just read it from the classpath.
71 | 119 |
|
72 | 120 | Resources |
73 | 121 | --------- |
74 | 122 |
|
75 | 123 | * [John Salatas' blog](http://jsalatas.ictpro.gr/tag/java-fst/) has some posts that describe some of his initial design |
76 | | -decisions. I imagine that as I work on this these blog posts will become less representative of jopenFST but for the |
77 | | -moment its pretty close |
| 124 | +decisions. The library has diverged pretty significantly from this original version, but this is still a reference. |
78 | 125 | * [C++ OpenFST library](http://www.openfst.org/twiki/bin/view/FST/WebHome) describes some of the FST algorithms implemented. |
79 | 126 |
|
80 | | -Changes: |
| 127 | +Release History |
81 | 128 | ------------ |
82 | | - |
83 | | -* Adding back edges (kind of) to dramatically optimize a number of the original implementations that had poor algorithmic complexity |
84 | | -** In my jg2p project this reduced runtime for my datasets by a factor of 30x |
85 | | -* The original Connect/Trim implementation was wrong; fixed now. |
86 | | -* Separated out interfaces for read-only/writeable elements (Arcs, States, Fsts) which allows |
87 | | -convenient things like "union" symbol tables (to do mutating things without copying the entire source symbl table) |
88 | | -* Clearly separated out algorithms by ones that mutate input args vs ones that produce new fsts |
89 | | -* Updated some IO routines, used exceptions instead of System.err logging, some cleanup, fixed unit tests |
90 | | -* Changed packages (although it can still deserialize FST models from the original repo) |
| 129 | +* 0.3.0 - Added the union semiring and the gallic semiring, and implemented Determinization for transducers with multiple modes
| 130 | +(functional, nonfunctional, and disambiguate) to match the behavior and options available in OpenFST.
| 131 | +* 0.2.0 - code covering unit tests, improvements to interoperability of the text format between JOpenFST and OpenFST. |
91 | 132 |
|
92 | 133 |
|
93 | 134 |
|