Commit a9381d9

updating to 0.2.0, applying bugfixes, doc updates, unit coverage
1 parent 850831f commit a9381d9

152 files changed: +2318 -1383 lines changed

.editorconfig

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+[*]
+charset=utf-8
+end_of_line=lf
+insert_final_newline=true
+indent_style=space
+indent_size=2
+

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -14,3 +14,5 @@
 # virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
 hs_err_pid*
 target
+build
+classes

README.md

Lines changed: 73 additions & 32 deletions
@@ -1,37 +1,77 @@
-jopenfst
+
+JOpenFST
 ========

-A partial Java port of the C++ [OpenFST library](http://www.openfst.org/twiki/bin/view/FST/WebHome) which provides a library to
-build weighted finite state transducers and perform various common FST tasks such as:
+A _partial_ Java port of the C++ [OpenFST library](http://www.openfst.org/twiki/bin/view/FST/WebHome), which provides
+a library to build weighted finite state transducers (WFSTs) and perform various common FST/FSA tasks such as:

-* Determinization
+* Determinization (for both acceptors and transducers)
 * Union / Intersection
 * Composition
 * Shortest Path computation
-
-This project was originally forked from the CMU Sphinx project. This was originally work by John
-Salatas as part of his GSOC 2012 project to port phonetisaurus over to java. Since then the code appears to be
-abandoned and doesn't appear to have been integrated in to the final CMU Sphinx project trunk. I needed a decent
-WFST library for some of my stuff, so I used his code as a starting point. I have cleaned up quite a bit
-of the code, really changed the APIs, and updated unit tests. My JG2P project uses this, and thus I have
-some confidence that the code is working accurately. However, I would still consider this *beta quality*. When I am comfortable with the stability, I will push a v1.0 to Maven Central Repo.
+
+OpenFST is a mature, elegant, and feature-rich library in C++. JOpenFST aims to implement many of the features of
+OpenFST in a pure Java implementation, which can be useful if you are trying to use WFST operations within an existing
+pure Java architecture or service. Some environments are not easily suited to using JNI or creating a separate C++
+service endpoint to do all of the WFST operations. In these circumstances, JOpenFST provides an alternative.
+
+OpenFST is designed quite elegantly and relies on sophisticated C++ template metaprogramming features to achieve
+top speed (and nice generality). Since Java offers no rich metaprogramming facility, JOpenFST differs significantly
+in its API and implementation. JOpenFST is probably more fairly described as a re-imagining of OpenFST within the
+constraints and idioms of Java/JVM. Because of that, JOpenFST lacks some features that are present in OpenFST, but
+hopefully that gap will close over time (PRs welcome!).
+
+Here are some of the most notable differences from OpenFST:
+* All WFST operations are eagerly executed; there is no deferred/lazy evaluation. There are, however, some optimizations
+when doing operations on Immutable instances (see Compose) to avoid unnecessary copying.
+* JOpenFST can only import/export using the OpenFST/AT&T text format (as produced by `fstprint` and consumed by
+`fstcompile`); JOpenFST cannot currently import OpenFST binary models (as produced by `fstcompile`).
+* There are mutable and immutable types that mirror each other (MutableFst, ImmutableFst, MutableState, ImmutableState, etc.)
+* There is a Gallic weight and semiring but not a separate String semiring.
+* The Gallic Weights are either Gallic Restricted or Gallic Min; if you want General Gallic weights, you have to use the
+Union Semiring directly.
+* The following operations are implemented:
+  * ArcSort
+  * Compose
+  * Connect
+  * Determinize (for both acceptors and transducers; all modes: functional, non-functional, and disambiguate)
+  * Shortest Paths
+  * Project
+  * Remove Epsilon
+  * Reverse
+* The following operations are currently NOT implemented (PRs welcome):
+  * Minimization (coming soon)
+  * TopSort (coming soon)
+  * Closure
+  * Concat/Union
+  * Encode/Decode
+  * Difference/Intersect
+  * Invert
+  * Prune (as a separate operation)
+  * Push
+  * Synchronize
+
+This project was originally work in the CMU Sphinx project by John Salatas as part of his GSOC 2012 project, but since
+then it has been mostly rewritten to bring in new enhancements, improve the API, and improve performance.

 Current version:
 ```xml
 <dependency>
   <groupId>com.github.steveash.jopenfst</groupId>
   <artifactId>jopenfst</artifactId>
-  <version>0.1.1.ALPHA</version>
+  <version>0.3.0</version>
 </dependency>
 ```

 Quick Start
 -----------
-The API started out pretty close to OpenFST but is diverging over time. The basic abstractions of `Fst`, `State`, `Arc`,
-and `SymbolTable` have conceptual analogs in OpenFST. In jopenfst there are *Mutable* and *Immutable* implementations
+The API started out pretty close to OpenFST but has diverged over time. The basic abstractions of `Fst`, `State`, `Arc`,
+and `SymbolTable` have conceptual analogs in OpenFST. In JOpenFST there are *Mutable* and *Immutable* implementations
 of each. As you programmatically build up your WFSTs, you will use the Mutable API. If you want to de/serialize larger
-models (large WFSTs built from training data that are used to construct lattices) and these models don't need to change, then you can convert the mutable instance into an immutable instance after you are done building it (`new ImmutableFst(myMutableFst)`.
-ImmutableFsts are likely faster at some operations and also are smarter about reducing unnecessary copying of state.
+models (large WFSTs built from training data that are used to construct lattices) and these models don't need to
+change, then you can convert the mutable instance into an immutable instance after you are done building it
+(`new ImmutableFst(myMutableFst)`). ImmutableFsts are likely faster at some operations and also are smarter about
+reducing unnecessary copying of state.

 The MutableFst API is probably the best place to start. Here is a sample showing how to
 construct a WFST which shows the basic operations of fsts, states, arcs, and symbols.
@@ -61,33 +101,34 @@ fst.addArc(startState, "inC", "outD", fst.getOrNewState("state3"), 123.0);
 ```
 Input and Output
 ----------------
-jOpenFst supports reading/writing the OpenFst text format and our own jopenfst binary serialization format (more compact than text). We cannot currently read/write openfsts binary serialization format, though pull requests with that functionality are very welcome.
+JOpenFST supports reading/writing the OpenFst text format and our own JOpenFST binary serialization format
+(more compact than text). We cannot currently read/write OpenFST's binary serialization format.

-To read OpenFst text format, you need a `mymodel.fst.txt` file that describes all of the arcs and weights. If you are using labeled states, inputs, or outputs (e.g. for a transducer) then you also need files for those named `mymodel.input.syms`, `mymodel.output.syms`, and `mymodel.states.syms` respectively. An exmaple of these files is in the `src/test/resources/data/openfst` folder in the source.
+To read OpenFst text format, you need a `mymodel.fst.txt` file that describes all of the arcs and weights. If you
+are using labeled states, inputs, or outputs (e.g. for a transducer) then you also need files for those
+named `mymodel.input.syms`, `mymodel.output.syms`, and `mymodel.states.syms` respectively. An example of these files
+is in the `src/test/resources/data/openfst` folder in the source.

-To read/write the text format call methods `Convert.importFst(..)` and `Convert.export(..)`. Both of these return instances of `MutableFst` which can be converted into `ImmutableFst` via `new ImmutableFst(myMutableFst)`. There are importFst overloads for dealing with either Files or resources from the classpath.
+To read/write the text format, call the methods `Convert.importFst(..)` and `Convert.export(..)`. Both of these return
+instances of `MutableFst` which can be converted into `ImmutableFst` via `new ImmutableFst(myMutableFst)`.
+There are importFst overloads for dealing with either Files or resources from the classpath.

-To read/write the binary format call methods `FstInputOutput.readFstFromBinaryFile` and `FstInputOutput.writeFstToBinaryFile` (there are overloads for dealing with streams/resources. Resources are useful if you want to package your serialized model in your jar and just read it from the classpath.
+To read/write the binary format, call the methods `FstInputOutput.readFstFromBinaryFile` and
+`FstInputOutput.writeFstToBinaryFile` (there are overloads for dealing with streams/resources).
+Resources are useful if you want to package your serialized model in your jar and just read it from the classpath.

 Resources
 ---------

 * [John Salatas' blog](http://jsalatas.ictpro.gr/tag/java-fst/) has some posts that describe some of his initial design
-decisions. I imagine that as I work on this these blog posts will become less representative of jopenFST but for the
-moment its pretty close
+decisions. The library has diverged pretty significantly from this original version, but this is still a reference.
 * [C++ OpenFST library](http://www.openfst.org/twiki/bin/view/FST/WebHome) describes some of the FST algorithms implemented.

-Changes:
+Release History
 ------------
-
-* Adding back edges (kind of) to dramatically optimize a number of the original implementations that had poor algorithmic complexity
-** In my jg2p project this reduced runtime for my datasets by a factor of 30x
-* The original Connect/Trim implementation was wrong; fixed now.
-* Separated out interfaces for read-only/writeable elements (Arcs, States, Fsts) which allows
-convenient things like "union" symbol tables (to do mutating things without copying the entire source symbl table)
-* Clearly separated out algorithms by ones that mutate input args vs ones that produce new fsts
-* Updated some IO routines, used exceptions instead of System.err logging, some cleanup, fixed unit tests
-* Changed packages (although it can still deserialize FST models from the original repo)
+* 0.3.0 - Added the union semiring and gallic semiring, and implemented Determinization for transducers with multiple modes
+(functional, nonfunctional, and disambiguate) to match the behavior and options available in OpenFST.
+* 0.2.0 - code covering unit tests, improvements to interoperability of the text format between JOpenFST and OpenFST.


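A note on the text format mentioned in the README's "Input and Output" section: in the OpenFST/AT&T text format, each transducer arc is one whitespace-separated line of the form `src dest ilabel olabel [weight]`, and a final state is a line of the form `state [weight]`. The following is only a minimal sketch of parsing such lines in plain Java. `TextFstSketch` and `ArcLine` are hypothetical names, the handling of a missing weight is an assumption, and this is not how JOpenFST's `Convert` class is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of reading transducer arcs from the OpenFST/AT&T text format.
// This is NOT JOpenFST's Convert implementation.
public class TextFstSketch {

  /** One arc line: "src dest ilabel olabel [weight]". */
  static final class ArcLine {
    final String src;
    final String dest;
    final String ilabel;
    final String olabel;
    final double weight;

    ArcLine(String src, String dest, String ilabel, String olabel, double weight) {
      this.src = src;
      this.dest = dest;
      this.ilabel = ilabel;
      this.olabel = olabel;
      this.weight = weight;
    }
  }

  /** Returns the arcs in the given lines; lines with fewer than four fields are skipped. */
  static List<ArcLine> parseArcs(List<String> lines) {
    List<ArcLine> arcs = new ArrayList<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      String[] fields = trimmed.split("\\s+");
      if (fields.length < 4) {
        continue; // assumed to be a final-state line ("state [weight]"), not an arc
      }
      // Assumption: an omitted weight means 0.0, the "one" of the tropical semiring.
      double weight = fields.length > 4 ? Double.parseDouble(fields[4]) : 0.0;
      arcs.add(new ArcLine(fields[0], fields[1], fields[2], fields[3], weight));
    }
    return arcs;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("0 1 inA outB 0.5", "1 2 inC outD", "2 1.0");
    for (ArcLine arc : parseArcs(lines)) {
      System.out.println(arc.src + " -> " + arc.dest + " " + arc.ilabel + ":" + arc.olabel + " / " + arc.weight);
    }
  }
}
```

The sketch assumes transducer files with both an input and an output label on every arc; acceptor files, which carry a single label per arc, would need a different field count check.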

pom.xml

Lines changed: 16 additions & 4 deletions
@@ -22,8 +22,8 @@
   <groupId>com.github.steveash.jopenfst</groupId>
   <artifactId>jopenfst</artifactId>
   <name>jopenfst</name>
-  <version>0.1.1.ALPHA</version>
-  <description>Java port of the OpenFST library; forked from the CMU Sphinx project</description>
+  <version>0.2.0</version>
+  <description>Partial Java port of the OpenFST library; forked from the CMU Sphinx project</description>
   <packaging>jar</packaging>

   <url>https://github.com/steveash/jopenfst</url>
@@ -124,8 +124,8 @@
         <artifactId>maven-compiler-plugin</artifactId>
         <version>3.1</version>
         <configuration>
-          <source>1.7</source>
-          <target>1.7</target>
+          <source>1.8</source>
+          <target>1.8</target>
         </configuration>
       </plugin>
       <plugin>
@@ -163,6 +163,9 @@
             <goals>
               <goal>jar</goal>
             </goals>
+            <configuration>
+              <additionalparam>${javadoc.opts}</additionalparam>
+            </configuration>
           </execution>
         </executions>
       </plugin>
@@ -173,6 +176,15 @@
     </plugins>
   </build>
   <profiles>
+    <profile>
+      <id>java8-doclint-disabled</id>
+      <activation>
+        <jdk>[1.8,)</jdk>
+      </activation>
+      <properties>
+        <javadoc.opts>-Xdoclint:none</javadoc.opts>
+      </properties>
+    </profile>

     <profile>
       <id>javadoc</id>

src/main/java/com/github/steveash/jopenfst/AbstractSymbolTable.java

Lines changed: 12 additions & 3 deletions
@@ -16,19 +16,23 @@

 package com.github.steveash.jopenfst;

-import com.google.common.base.Function;
-import com.google.common.collect.Iterables;
-
 import com.carrotsearch.hppc.IntObjectOpenHashMap;
 import com.carrotsearch.hppc.ObjectIntOpenHashMap;
 import com.carrotsearch.hppc.cursors.IntCursor;
 import com.carrotsearch.hppc.cursors.ObjectCursor;
 import com.carrotsearch.hppc.cursors.ObjectIntCursor;
 import com.github.steveash.jopenfst.utils.FstUtils;
+import com.google.common.base.Function;
+import com.google.common.collect.Iterables;

 import java.util.Iterator;

 /**
+ * The base abstract implementation which uses carrotsearch primitive maps for optimized mappings of
+ * int -> string and vice versa.
+ * This implementation is effectively thread safe if no mutating operations are performed after
+ * construction.
+ *
  * @author Steve Ash
  */
 public abstract class AbstractSymbolTable implements SymbolTable {
@@ -41,6 +45,11 @@ public String apply(ObjectCursor<String> input) {
     }
   };

+  /**
+   * Returns the current max id mapped in this symbol table or 0 if this has no mappings
+   * @param table
+   * @return
+   */
   public static int maxIdIn(SymbolTable table) {
     int max = 0;
     for (ObjectIntCursor<String> cursor : table) {
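The diff context ends inside `maxIdIn`, so the rest of the method body is not visible here. Purely to illustrate the contract stated in the new Javadoc (the largest mapped id, or 0 when there are no mappings), here is a sketch over a plain `java.util.Map`. `MaxIdSketch` and its `Map`-based `maxIdIn` are hypothetical stand-ins and do not reproduce the library code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: a symbol table modeled as a plain symbol -> id map.
public class MaxIdSketch {

  /** Returns the largest id in the table, or 0 if the table has no mappings. */
  static int maxIdIn(Map<String, Integer> table) {
    int max = 0; // also the result for an empty table
    for (int id : table.values()) {
      max = Math.max(max, id);
    }
    return max;
  }

  public static void main(String[] args) {
    Map<String, Integer> table = new HashMap<>();
    table.put("<eps>", 0);
    table.put("inA", 1);
    table.put("inC", 7);
    System.out.println(maxIdIn(table));
  }
}
```

One consequence of this contract is that an empty table and a table whose only id is 0 both yield 0, so a caller that needs to tell them apart has to check the table size separately.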

src/main/java/com/github/steveash/jopenfst/Arc.java

Lines changed: 22 additions & 0 deletions
@@ -17,15 +17,37 @@
 package com.github.steveash.jopenfst;

 /**
+ * Interface for the contract of an Arc in an FST
+ * @see ImmutableArc
+ * @see MutableArc
  * @author Steve Ash
  */
 public interface Arc {

+  /**
+   * Get the weight of this edge in the FST (range of values depends on the Semiring used for the FST)
+   * @see Fst#getSemiring()
+   * @return
+   */
   double getWeight();

+  /**
+   * Get the index of the input symbol for this edge of the fst
+   * @return
+   */
   int getIlabel();

+  /**
+   * Get the index of the output symbol for this edge of the fst
+   * @return
+   */
   int getOlabel();

+  /**
+   * Get the reference to the next state in the FST; note that you can call `getNextState().getId()` to get the
+   * FST state id for that state but some operations will be constructing new results and state ids will not be
+   * consistent across them (obviously). If you are using state symbols/labels then the labels will be consistent
+   * @return
+   */
   State getNextState();
 }
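The `getWeight()` Javadoc says the meaning of a weight depends on the semiring. As one concrete case, in the tropical semiring the weights along a path are combined by addition and alternative paths are combined by taking the minimum. The sketch below shows that arithmetic on plain doubles. `TropicalSketch` is a hypothetical helper written for this note, not one of JOpenFST's semiring classes.

```java
// Hypothetical helper illustrating tropical semiring arithmetic on arc weights.
public class TropicalSketch {

  static final double ZERO = Double.POSITIVE_INFINITY; // identity for plus (min)
  static final double ONE = 0.0;                       // identity for times (+)

  /** Combines weights along a single path. */
  static double times(double a, double b) {
    return a + b;
  }

  /** Combines the weights of alternative paths. */
  static double plus(double a, double b) {
    return Math.min(a, b);
  }

  /** Weight of one path given the weights of its arcs, in order. */
  static double pathWeight(double... arcWeights) {
    double weight = ONE;
    for (double w : arcWeights) {
      weight = times(weight, w);
    }
    return weight;
  }

  public static void main(String[] args) {
    double first = pathWeight(1.0, 2.5);  // 3.5
    double second = pathWeight(0.5, 4.0); // 4.5
    System.out.println(plus(first, second)); // the cheaper path wins: 3.5
  }
}
```

Under other semirings the same `double` would be interpreted differently, which is why the Javadoc points readers to `Fst#getSemiring()`.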

src/main/java/com/github/steveash/jopenfst/Edge.java

Lines changed: 0 additions & 74 deletions
This file was deleted.
