Commit fdf2d45

Initial commit.
1 parent 6b754e0 commit fdf2d45

751 files changed

+17843
-0
lines changed

LICENSE_DB.txt

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
1+
/*-
2+
* $Id: LICENSE_DB.txt,v 1.1 2009/03/16 20:12:29 msnover Exp $
3+
*/
4+
5+
The following is the license that applies to this copy of the Berkeley
6+
DB Java Edition software. For a license to use the Berkeley DB Java
7+
Edition software under conditions other than those described here, or
8+
to purchase support for this software, please contact Oracle at
9+
10+
11+
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
12+
/*
13+
* Copyright (c) 2002,2008 Oracle. All rights reserved.
14+
*
15+
* Redistribution and use in source and binary forms, with or without
16+
* modification, are permitted provided that the following conditions
17+
* are met:
18+
* 1. Redistributions of source code must retain the above copyright
19+
* notice, this list of conditions and the following disclaimer.
20+
* 2. Redistributions in binary form must reproduce the above copyright
21+
* notice, this list of conditions and the following disclaimer in the
22+
* documentation and/or other materials provided with the distribution.
23+
* 3. Redistributions in any form must be accompanied by information on
24+
* how to obtain complete source code for the DB software and any
25+
* accompanying software that uses the DB software. The source code
26+
* must either be included in the distribution or be available for no
27+
* more than the cost of distribution plus a nominal fee, and must be
28+
* freely redistributable under reasonable conditions. For an
29+
* executable file, complete source code means the source code for all
30+
* modules it contains. It does not include source code for modules or
31+
* files that typically accompany the major components of the operating
32+
* system on which the executable file runs.
33+
*
34+
* THIS SOFTWARE IS PROVIDED BY ORACLE ``AS IS'' AND ANY EXPRESS OR
35+
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
36+
* WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR
37+
* NON-INFRINGEMENT, ARE DISCLAIMED. IN NO EVENT SHALL ORACLE BE LIABLE
38+
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
39+
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
40+
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
41+
* BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
42+
* WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
43+
* OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
44+
* IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
45+
*/
46+
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
47+
/***
48+
* ASM: a very small and fast Java bytecode manipulation framework
49+
* Copyright (c) 2000-2005 INRIA, France Telecom
50+
* All rights reserved.
51+
*
52+
* Redistribution and use in source and binary forms, with or without
53+
* modification, are permitted provided that the following conditions
54+
* are met:
55+
* 1. Redistributions of source code must retain the above copyright
56+
* notice, this list of conditions and the following disclaimer.
57+
* 2. Redistributions in binary form must reproduce the above copyright
58+
* notice, this list of conditions and the following disclaimer in the
59+
* documentation and/or other materials provided with the distribution.
60+
* 3. Neither the name of the copyright holders nor the names of its
61+
* contributors may be used to endorse or promote products derived from
62+
* this software without specific prior written permission.
63+
*
64+
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
65+
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
66+
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
67+
* ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
68+
* LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
69+
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
70+
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
71+
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
72+
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
73+
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
74+
* THE POSSIBILITY OF SUCH DAMAGE.
75+
*/

README.md

Lines changed: 245 additions & 0 deletions
@@ -0,0 +1,245 @@
1+
## Introduction
2+
3+
TERp is an automatic evaluation metric for Machine Translation, which takes as input a set of reference translations and a set of machine translation outputs for the same data. It aligns the MT output to the reference translations, and measures the number of 'edits' needed to transform the MT output into the reference translation. TERp is an extension of TER (Translation Edit Rate) that utilizes phrasal substitutions (using automatically generated paraphrases), stemming, synonyms, relaxed shifting constraints and other improvements. TERp is named after the University of Maryland mascot, the Terrapin, so it's pronounced "terp".
4+
5+
For a technical description of TERp, please refer to `doc/terp_description.pdf`.
6+
7+
## Installation and Setup
8+
9+
These instructions are for use on a UNIX-like operating system.
10+
11+
1. TERp requires Java version 1.5.0 or higher.
12+
13+
2. Build TERp by running `ant clean; ant` in the root of the repository.
14+
15+
3. Download and install [WordNet version 3.0](http://wordnet.princeton.edu/wordnet/download/current-version/). (**Note**: if you are on OS X, and are using macports, you can simply do `sudo port install wordnet`.)
16+
17+
4. Download the compressed paraphrase table (`unfiltered_phrasetable.txt.gz`) from the GitHub releases page to the `data` directory and uncompress it.
18+
19+
5. Several shell scripts are provided to simplify the process of running TERp. To set up these scripts, run:
20+
21+
```bash
22+
bin/setup_bin.sh <PATH_TO_TERP> <PATH_TO_JAVA> <PATH_TO_WORDNET>
23+
```
24+
25+
where:
26+
27+
- `<PATH_TO_TERP>` points to the directory where you checked out this repository, such that `<PATH_TO_TERP>/bin/setup_bin.sh` exists.
28+
29+
- `<PATH_TO_JAVA>` points to the root of the Java 1.5.0+ directory such
30+
that `<PATH_TO_JAVA>/bin/java` exists.
31+
32+
- `<PATH_TO_WORDNET>` points to the root of the WordNet 3 installation such that `<PATH_TO_WORDNET>/dict` exists. (**Note**: if you are on OS X, and you installed wordnet using macports with default options, you can set this to `/opt/local/share/WordNet-3.0`).
33+
34+
Running this script will create the following additional wrapper scripts:
35+
36+
- `bin/terp`
37+
- `bin/terpa`
38+
- `bin/terp_ter`
39+
- `bin/tercom`
40+
- `bin/create_phrasedb`
41+
- `bin/optimize_terp`
42+
43+
and create the parameter file:
44+
45+
- `data/data_loc.param`
46+
47+
6. Generate a TERp compatible paraphrase table from the text-based paraphrase file you downloaded in Step 4 by running:
48+
49+
```bash
50+
bin/create_phrasedb data/unfiltered_phrasetable.txt data/phrases.db
51+
```
52+
53+
**IMPORTANT**:
54+
55+
This step could take a while and will require several gigabytes of disk space, as the text version of the phrase table is converted to a Berkeley style database. The conversion tool also expects to have 1-3 GB of memory available. This requirement can be reduced if necessary by editing the `MEM_PAR` setting in the `bin/create_phrasedb` script.
56+
57+
This step will generate a phrase table database in `data/phrases.db` and will only need to be run once. Running this step again will add to the existing database, not overwrite it.
58+
59+
The paraphrases used in this database were extracted using the pivot-based method (Bannard and Callison-Burch, 2005) with several additional filtering mechanisms to increase precision. The corpus used for extraction was an Arabic-English newswire bitext containing approximately 1 million sentences.
60+
61+
7. You can run some validation experiments to test the installation. From the root of the repository, run:
62+
63+
```bash
64+
mkdir -p test/output
65+
./bin/create_phrasedb test/sample.pt.txt test/sample.pt.db
66+
./bin/terpa test/sample.terp.param
67+
```
68+
69+
This will create a small phrase database from the file `test/sample.pt.txt` and store that database as `test/sample.pt.db`. We will use this sample database for our test since using the full database will be slower.
70+
71+
TERpA will then be run on the hypothesis and reference files in `test/` with the output placed in `test/output/` as specified in the `test/sample.terp.param` parameter file. The correct version of these output files is provided in `test/correct_output/`.
72+
73+
Running the three commands above should yield the following output (with appropriate substitutions for local file paths):
74+
75+
```bash
76+
$> mkdir -p test/output
77+
78+
$> ./bin/create_phrasedb test/sample.pt.txt test/sample.pt.db
79+
Converting Phrase Table from test/sample.pt.txt
80+
Storing Database in test/sample.pt.db
81+
Done adding phrases to test/sample.pt.db
82+
83+
$> ./bin/terpa test/sample.terp.param
84+
Loading parameters from /Users/nmadnani/work/terp/data/terpa.param
85+
Loading parameters from /Users/nmadnani/work/terp/data/data_loc.param
86+
Loading test/sample.terp.param as parameter file
87+
"test/sample.hyp.sgm" was successfully parsed as XML
88+
"test/sample.ref.sgm" was successfully parsed as XML
89+
Creating Segment Phrase Tables From DB
90+
Processing [ihned.cz/2008/09/29/36559][0001]
91+
Processing [ihned.cz/2008/09/29/36559][0002]
92+
Processing [ihned.cz/2008/09/29/36559][0003]
93+
Processing [ihned.cz/2008/09/30/36776][0001]
94+
Processing [ihned.cz/2008/09/30/36776][0002]
95+
Processing [ihned.cz/2008/09/30/36776][0003]
96+
Processing [ihned.cz/2008/09/30/36776][0004]
97+
Processing [ihned.cz/2008/09/30/36776][0005]
98+
Processing [ihned.cz/2008/09/30/36776][0006]
99+
Finished Calculating TERp
100+
Total TER: 0.48 (91.13 / 188.00)
101+
```
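Before running `bin/setup_bin.sh`, it can help to confirm that the three paths satisfy the conditions listed in Step 5. The following is a minimal sketch of such a check; the function name `check_terp_paths` is hypothetical and is not shipped with TERp.

```bash
#!/bin/bash
# Hypothetical pre-flight check (not part of TERp).
# Verifies the three conditions that the setup instructions place on the
# arguments to bin/setup_bin.sh.
check_terp_paths() {
    local terp_dir="$1" java_dir="$2" wordnet_dir="$3"
    local status=0

    # <PATH_TO_TERP>/bin/setup_bin.sh must exist.
    if [ ! -f "${terp_dir}/bin/setup_bin.sh" ]; then
        echo "missing: ${terp_dir}/bin/setup_bin.sh" >&2
        status=1
    fi

    # <PATH_TO_JAVA>/bin/java must exist and be executable.
    if [ ! -x "${java_dir}/bin/java" ]; then
        echo "missing or not executable: ${java_dir}/bin/java" >&2
        status=1
    fi

    # <PATH_TO_WORDNET>/dict must exist.
    if [ ! -d "${wordnet_dir}/dict" ]; then
        echo "missing: ${wordnet_dir}/dict" >&2
        status=1
    fi

    return "$status"
}
```

For example, `check_terp_paths <PATH_TO_TERP> <PATH_TO_JAVA> <PATH_TO_WORDNET> && bin/setup_bin.sh <PATH_TO_TERP> <PATH_TO_JAVA> <PATH_TO_WORDNET>` would run the setup only when all three checks pass.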
102+
103+
## Usage
104+
105+
The following scripts provide easy access to the TERp program, and
106+
serve as wrappers around java and the default parameter files.
107+
108+
- `bin/create_phrasedb` - This script converts a text format phrase
109+
table to a Berkeley style database that allows for fast searching of
110+
the phrase table at run time.
111+
112+
- `bin/optimize_terp` - This script allows the optimization of the edit
113+
costs of TERp to maximize correlation with reference judgments. See
114+
the online documentation for more details on its use.
115+
116+
- `bin/tercom` - This script runs the original, non-TERp, version of TER
117+
0.7.25. It is not supported as part of this codebase, and its usage
118+
is not documented here (although it is essentially the same as version
119+
0.7.25 of TERcom).
120+
121+
- `bin/terp` - This script is the most basic TERp wrapper and runs TERp
122+
with default parameters only.
123+
124+
- `bin/terp_ter` - This script runs TERp with the parameters of TER,
125+
turning off stemming, synonymy, phrase substitutions and using the
126+
edit costs from TER. Due to changes in the shift search order,
127+
results may differ from the TERcom program.
128+
129+
- `bin/terpa` - This script runs TERp with parameters that were tuned
130+
as part of the NIST Metrics MATR 2008 Challenge. This TERp-A metric
131+
was optimized for Adequacy on a subset of the MT06 dataset that was
132+
annotated and distributed by NIST as a development dataset as part of
133+
the Challenge.
134+
135+
### Basic Usage
136+
137+
1. `terp`, `terpa` and `terp_ter` can be run in an identical manner, so `terp` will be used for the following examples.
138+
139+
All three programs require at least a reference file and a hypothesis (the MT output) file to score. These files can be in SGML, XML or trans format. Both files should be in the same format.
140+
141+
To run with a given reference and hypothesis:
142+
143+
```bash
144+
bin/terp -r <reference> -h <hypothesis>
145+
```
146+
147+
Options to terp can be provided either at the command-line or using parameter files (or a combination of these). Due to the large number of options available when running TERp, many options can only be specified using parameter files. Parameter files contain a series of lines, each containing a parameter name and its value. Command-line options are overridden by parameter files, and options in later parameter files are used over those in earlier parameter files.
148+
149+
Any arguments given to TERp that are not command-line flags or their arguments will be treated as parameter filenames. The reference and hypothesis files can also be specified in parameter files, so that terp could be run as:
150+
151+
```bash
152+
bin/terp <param-file>
153+
```
154+
155+
where param-file is:
156+
157+
```text
158+
Reference File (filename): <reference>
159+
Hypothesis File (filename): <hypothesis>
160+
```
161+
162+
Running TERp in this manner provides minimal output. Running it with
163+
the following options will give additional scoring that may prove
164+
useful.
165+
166+
```bash
167+
bin/terp -r <reference> -h <hypothesis> -o sum,pra,nist,html,param
168+
```
169+
170+
This will cause TERp to output a summary file (`.sum`) listing the number of times each edit occurred in each segment, a human-readable text file (`.pra`) containing the TERp alignment for each segment, an HTML version of the alignment (`.html`), and NIST Metrics MATR output (`nist`) giving system, document and segment scores for each system being scored, with the scores stored in a series of `.scr` files. The options used in this run of TERp are also written to a parameter file (`.param`) to enable easy rerunning of scoring and logging of the parameters used. More details on output formats can be found in the online documentation.
171+
172+
Running TERp with no options (or incorrect options) will cause TERp to output its command-line usage.
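The parameter-file form of invocation described above can be scripted. The sketch below writes a two-line parameter file using the `Reference File (filename)` and `Hypothesis File (filename)` keys shown earlier; the helper name `write_terp_param` is hypothetical and not part of TERp.

```bash
#!/bin/bash
# Hypothetical helper (not part of TERp): write a minimal parameter file
# that names the reference and hypothesis files.
write_terp_param() {
    local ref="$1" hyp="$2" out="$3"
    {
        printf 'Reference File (filename): %s\n' "$ref"
        printf 'Hypothesis File (filename): %s\n' "$hyp"
    } > "$out"
}

# Example (assumes the sample files from the installation test exist):
#   write_terp_param test/sample.ref.sgm test/sample.hyp.sgm run.param
#   bin/terp run.param
```

Keeping the file names in a parameter file rather than on the command line makes a scoring run easier to repeat with the same inputs.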
173+
174+
2. `create_phrasedb` takes a text phrase table and inserts those phrases into a Berkeley style database.
175+
176+
```bash
177+
bin/create_phrasedb <TEXT_FILE> <DB_FILE>
178+
```
179+
180+
where `<DB_FILE>` is the directory that will contain the files of the database. Existing databases at that location will be added to, not overwritten. If the directory does not exist, the `create_phrasedb` script will create it.
181+
182+
The text format of the phrase table is shown below. Each line in the text file must be of one of the following forms (incorrect or blank entries are currently silently ignored):
183+
184+
```text
185+
COST <p>PHRASE_1</p> <p>PHRASE_2</p>
186+
or
187+
COST_1 COST_2 <p>PHRASE_1</p> <p>PHRASE_2</p>
188+
```
189+
190+
The first form indicates that `PHRASE_1` in the reference can be substituted, with an edit cost of `COST`, with `PHRASE_2` in the hypothesis. If phrase table adjustment functions are used (as is the case in TERp-A), then it can be desirable to have `COST` be the probability of `PHRASE_1` being a paraphrase of `PHRASE_2`. This paraphrase is only allowed in one direction: given the example line below, "car on fire" in the reference is not considered a paraphrase of "ablaze car".
191+
192+
For example, the following line:
193+
194+
```
195+
0.15 <p>ablaze car</p> <p>car on fire</p>
196+
```
197+
198+
indicates that "ablaze car" in the reference is a paraphrase of "car on fire" with cost or probability 0.15.
199+
200+
```
201+
COST_1 COST_2 <p>PHRASE_1</p> <p>PHRASE_2</p>
202+
```
203+
204+
is equivalent to the following two lines (and is thus just a notational shortcut):
205+
206+
```
207+
COST_1 <p>PHRASE_1</p> <p>PHRASE_2</p>
208+
COST_2 <p>PHRASE_2</p> <p>PHRASE_1</p>
209+
```
210+
211+
For example, the following line in the phrase table:
212+
213+
```
214+
0.15 0.6 <p>ablaze car</p> <p>car on fire</p>
215+
```
216+
217+
is the same as having the following two lines:
218+
219+
```
220+
0.15 <p>ablaze car</p> <p>car on fire</p>
221+
0.6 <p>car on fire</p> <p>ablaze car</p>
222+
```
223+
224+
If either phrase is blank (e.g., `<p> </p>`) or if the two phrases are identical, the paraphrase will not be inserted into the phrase table.
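The equivalence between the two-cost shorthand and the pair of single-cost lines can be sketched in bash. The helper `expand_phrase_line` below is hypothetical and not part of TERp; it assumes that costs contain no whitespace and that phrases contain no `<` character.

```bash
#!/bin/bash
# Hypothetical helper (not part of TERp): expand the two-cost shorthand
#   COST_1 COST_2 <p>PHRASE_1</p> <p>PHRASE_2</p>
# into the two equivalent single-cost lines. Any other line is printed
# unchanged.
expand_phrase_line() {
    local line="$1"
    local two_cost='^([^[:space:]<]+)[[:space:]]+([^[:space:]<]+)[[:space:]]+<p>([^<]*)</p>[[:space:]]+<p>([^<]*)</p>[[:space:]]*$'
    if [[ $line =~ $two_cost ]]; then
        # COST_1 applies to PHRASE_1 -> PHRASE_2.
        printf '%s <p>%s</p> <p>%s</p>\n' \
            "${BASH_REMATCH[1]}" "${BASH_REMATCH[3]}" "${BASH_REMATCH[4]}"
        # COST_2 applies to the reverse direction, PHRASE_2 -> PHRASE_1.
        printf '%s <p>%s</p> <p>%s</p>\n' \
            "${BASH_REMATCH[2]}" "${BASH_REMATCH[4]}" "${BASH_REMATCH[3]}"
    else
        printf '%s\n' "$line"
    fi
}
```

Calling `expand_phrase_line '0.15 0.6 <p>ablaze car</p> <p>car on fire</p>'` prints the two single-cost lines from the example above, while a line that already has a single cost is passed through as-is.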
225+
226+
## Citing TERp
227+
228+
References to TERp should cite:
229+
230+
```
231+
Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz,
232+
"Fluency, Adequacy, or HTER? Exploring Different Human Judgments
233+
with a Tunable MT Metric", Proceedings of the Fourth Workshop on
234+
Statistical Machine Translation at the 12th Meeting of the European
235+
Chapter of the Association for Computational Linguistics
236+
(EACL-2009), Athens, Greece, March, 2009.
237+
```
238+
239+
## License
240+
241+
TERp is distributed under the LGPL as described in `LICENSE.md`.
242+
243+
However, TERp uses Berkeley DB Java Edition version 3.3.75, which is distributed under a separate license (see `LICENSE_DB.txt`). While the core classes of the Berkeley DB are included in the TERp release, the source code is available [here](http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html).
244+
245+
TERp also uses Brett Spell's Java API for WordNet Searching (JAWS), available [here](http://lyle.smu.edu/~tspell/jaws).

bin/create_phrasedb.templ

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
#!/bin/bash
2+
3+
#####################################################################
4+
###
5+
### EDIT THESE LINES APPROPRIATELY
6+
###
7+
#####################################################################
8+
PATH_TO_JAVA="+CHANGE_WITH_PATH_TO_JAVA+/bin/java"
9+
PATH_TO_TER="+CHANGE_WITH_PATH_TO_TER+"
10+
MEM_PAR="-Xms1G -Xmx3G"
11+
12+
#####################################################################
13+
###
14+
### DO NOT EDIT BELOW HERE
15+
###
16+
#####################################################################
17+
18+
mkdir -p "$2"
19+
"${PATH_TO_JAVA}" ${MEM_PAR} -jar "${PATH_TO_TER}/dist/lib/create_phrasedb.jar" "$@"
20+
exit $?

bin/optimize_terp.templ

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
#!/bin/bash
2+
3+
#####################################################################
4+
###
5+
### EDIT THESE LINES APPROPRIATELY
6+
###
7+
#####################################################################
8+
PATH_TO_JAVA="+CHANGE_WITH_PATH_TO_JAVA+/bin/java"
9+
PATH_TO_TER="+CHANGE_WITH_PATH_TO_TER+"
10+
MEM_PAR="-Xms1G -Xmx3G"
11+
12+
#####################################################################
13+
###
14+
### DO NOT EDIT BELOW HERE
15+
###
16+
#####################################################################
17+
18+
"${PATH_TO_JAVA}" ${MEM_PAR} -jar "${PATH_TO_TER}/dist/lib/optimize_terp.jar" "$@"
19+
20+
exit $?

bin/setup_bin.sh

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
#!/bin/bash
2+
3+
if [ $# -eq 3 ] ; then
4+
INSTALL_PATH=$1
5+
JAVA_PATH=$2
6+
WORDNET_PATH=$3
7+
E1="s|[+]CHANGE_WITH_PATH_TO_TER[+]|${INSTALL_PATH}|g";
8+
E2="s|[+]CHANGE_WITH_PATH_TO_JAVA[+]|${JAVA_PATH}|g";
9+
E3="s|[+]CHANGE_WITH_PATH_TO_WORDNET[+]|${WORDNET_PATH}|g";
10+
for f in "bin/terp" "bin/terpa" "bin/create_phrasedb" "bin/optimize_terp" "bin/tercom" "bin/terp_ter"
11+
do
12+
echo Creating script $f from $f.templ
13+
sed -e "${E1}" -e "${E2}" -e "${E3}" "${INSTALL_PATH}/${f}.templ" > "${INSTALL_PATH}/${f}"
14+
chmod 755 "${INSTALL_PATH}/${f}"
15+
done
16+
17+
for f in "data/data_loc.param"
18+
do
19+
echo Creating parameter file $f from $f.templ
20+
sed -e "${E1}" -e "${E2}" -e "${E3}" "${INSTALL_PATH}/${f}.templ" > "${INSTALL_PATH}/${f}"
21+
done
22+
else
23+
echo "usage: setup_bin.sh PATH_TO_TERP PATH_TO_JAVA PATH_TO_WORDNET";
24+
echo " PATH_TO_TERP is the path to the TERp installation root directory";
25+
echo " PATH_TO_JAVA is the path to the java installation (so that PATH_TO_JAVA/bin/java exists)";
26+
echo " PATH_TO_WORDNET is the path to the WordNet 3.0 installation (so that PATH_TO_WORDNET/dict/ exists)";
27+
exit 1;
28+
fi
29+
