Commit af6a6e0

OPENNLP-216: Add Detokenizer API section (#388)
* OPENNLP-216: Add Detokenizer API section
* OPENNLP-216: Add Detokenizer API section (correct)
1 parent 52eb4cf commit af6a6e0

1 file changed

+66
-7
lines changed

Diff for: opennlp-docs/src/docbkx/tokenizer.xml

@@ -396,19 +396,78 @@ test -> NO_OPERATION
396396
<![CDATA[
397397
He said "This is a test".]]>
398398
</programlisting>
399-
TODO: Add documentation about the dictionary format and how to use the API. Contributions are welcome.
400399
</para>
401400
<section id="tools.tokenizer.detokenizing.api">
402401
<title>Detokenizing API</title>
403-
<para>TODO: Write documentation about the detokenizer api. Any contributions
404-
are very welcome. If you want to contribute please contact us on the mailing list
405-
or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-216">OPENNLP-216</ulink>.</para>
402+
<para>
403+
The Detokenizer turns an array of tokens back into a String.
404+
To instantiate the DictionaryDetokenizer (a rule-based detokenizer),
405+
a DetokenizationDictionary (the rule dictionary) must be created first.
406+
The following code sample shows how a rule dictionary can be loaded.
407+
<programlisting language="java">
408+
<![CDATA[
409+
InputStream dictIn = new FileInputStream("latin-detokenizer.xml");
410+
DetokenizationDictionary dict = new DetokenizationDictionary(dictIn);]]>
411+
</programlisting>
412+
After the rule dictionary is loaded, the DictionaryDetokenizer can be instantiated.
413+
<programlisting language="java">
414+
<![CDATA[
415+
Detokenizer detokenizer = new DictionaryDetokenizer(dict);]]>
416+
</programlisting>
417+
The detokenizer offers two detokenize methods. The first detokenizes the input tokens into a String.
418+
<programlisting language="java">
419+
<![CDATA[
420+
String[] tokens = new String[]{"A", "co", "-", "worker", "helped", "."};
421+
String sentence = detokenizer.detokenize(tokens, null);
422+
Assert.assertEquals("A co-worker helped.", sentence);]]>
423+
</programlisting>
424+
Tokens that are joined without a space in between can be separated by a split marker, which is passed as the second argument.
425+
<programlisting language="java">
426+
<![CDATA[
427+
String sentence = detokenizer.detokenize(tokens, "<SPLIT>");
428+
Assert.assertEquals("A co<SPLIT>-<SPLIT>worker helped<SPLIT>.", sentence);]]>
429+
</programlisting>
430+
The API also offers a method which returns the array of detokenization operations for the input tokens, one operation per token.
431+
<programlisting language="java">
432+
<![CDATA[
433+
DetokenizationOperation[] operations = detokenizer.detokenize(tokens);
434+
for (DetokenizationOperation operation : operations) {
435+
System.out.println(operation);
436+
}]]>
437+
</programlisting>
438+
Output:
439+
<programlisting>
440+
<![CDATA[
441+
NO_OPERATION
442+
NO_OPERATION
443+
MERGE_BOTH
444+
NO_OPERATION
445+
NO_OPERATION
446+
MERGE_TO_LEFT]]>
447+
</programlisting>
448+
</para>
406449
</section>
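The relationship between the operations array and the detokenized String can be illustrated without the library. The following is a minimal sketch in plain Java, not the OpenNLP implementation: the class `DetokenizeSketch`, the enum `Op`, and the method `join` are hypothetical names introduced here. It assumes that a token marked MERGE_TO_LEFT attaches to the previous token, MERGE_TO_RIGHT attaches to the next token, and MERGE_BOTH attaches to both, which is consistent with the outputs shown in the section above.

```java
// Illustrative sketch only; these names are hypothetical and not part of OpenNLP.
class DetokenizeSketch {

    // Stand-in for the operations printed in the documentation above.
    public enum Op { NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_RIGHT, MERGE_BOTH }

    // Joins tokens according to their operations. Where two tokens are merged,
    // the split marker is inserted if it is non-null; otherwise nothing is inserted.
    public static String join(String[] tokens, Op[] ops, String splitMarker) {
        if (tokens.length != ops.length) {
            throw new IllegalArgumentException("tokens and ops must have the same length");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            sb.append(tokens[i]);
            if (i + 1 == tokens.length) {
                break; // nothing follows the last token
            }
            boolean attachesToNext = ops[i] == Op.MERGE_TO_RIGHT || ops[i] == Op.MERGE_BOTH;
            boolean nextAttachesBack = ops[i + 1] == Op.MERGE_TO_LEFT || ops[i + 1] == Op.MERGE_BOTH;
            if (attachesToNext || nextAttachesBack) {
                if (splitMarker != null) {
                    sb.append(splitMarker);
                }
            } else {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"A", "co", "-", "worker", "helped", "."};
        Op[] ops = {Op.NO_OPERATION, Op.NO_OPERATION, Op.MERGE_BOTH,
                    Op.NO_OPERATION, Op.NO_OPERATION, Op.MERGE_TO_LEFT};
        System.out.println(join(tokens, ops, null));      // A co-worker helped.
        System.out.println(join(tokens, ops, "<SPLIT>")); // A co<SPLIT>-<SPLIT>worker helped<SPLIT>.
    }
}
```

The two printed lines match the sentences asserted in the documentation samples for the same tokens and operations.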
407450
<section id="tools.tokenizer.detokenizing.dict">
408451
<title>Detokenizer Dictionary</title>
409-
<para>TODO: Write documentation about the detokenizer dictionary. Any contributions
410-
are very welcome. If you want to contribute please contact us on the mailing list
411-
or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-217">OPENNLP-217</ulink>.</para>
452+
<para>
453+
The DetokenizationDictionary is the rule dictionary used by the detokenizer. It is created from two parallel arrays:
454+
tokens - an array of tokens that should be detokenized according to an operation.
455+
operations - an array of operations, one per token, which specifies which operation
456+
should be used for the corresponding token.
457+
The following code sample shows how a rule dictionary can be created.
458+
<programlisting language="java">
459+
<![CDATA[
460+
String[] tokens = new String[]{".", "!", "(", ")", "\"", "-"};
461+
Operation[] operations = new Operation[]{
462+
Operation.MOVE_LEFT,
463+
Operation.MOVE_LEFT,
464+
Operation.MOVE_RIGHT,
465+
Operation.MOVE_LEFT,
466+
Operation.RIGHT_LEFT_MATCHING,
467+
Operation.MOVE_BOTH};
468+
DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations);]]>
469+
</programlisting>
470+
</para>
412471
</section>
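How dictionary rules might translate into per-token operations can be sketched in plain Java. This is an illustration under stated assumptions, not the OpenNLP implementation: `DictionaryRuleSketch`, `Rule`, `Op`, `exampleRules`, and `toOperations` are hypothetical names. It assumes that MOVE_LEFT, MOVE_RIGHT, and MOVE_BOTH map to MERGE_TO_LEFT, MERGE_TO_RIGHT, and MERGE_BOTH, that tokens without a rule get NO_OPERATION, and that RIGHT_LEFT_MATCHING alternates between attaching to the right (opening) and to the left (closing) for each occurrence of the same token.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; these names are hypothetical and not part of OpenNLP.
class DictionaryRuleSketch {

    public enum Rule { MOVE_LEFT, MOVE_RIGHT, MOVE_BOTH, RIGHT_LEFT_MATCHING }

    public enum Op { NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_RIGHT, MERGE_BOTH }

    // The same token/operation pairs as in the dictionary sample above.
    public static Map<String, Rule> exampleRules() {
        Map<String, Rule> rules = new HashMap<>();
        rules.put(".", Rule.MOVE_LEFT);
        rules.put("!", Rule.MOVE_LEFT);
        rules.put("(", Rule.MOVE_RIGHT);
        rules.put(")", Rule.MOVE_LEFT);
        rules.put("\"", Rule.RIGHT_LEFT_MATCHING);
        rules.put("-", Rule.MOVE_BOTH);
        return rules;
    }

    // Looks up each token and derives one operation per token.
    public static Op[] toOperations(String[] tokens, Map<String, Rule> rules) {
        Op[] ops = new Op[tokens.length];
        Set<String> open = new HashSet<>(); // matching tokens seen once and not yet closed
        for (int i = 0; i < tokens.length; i++) {
            Rule rule = rules.get(tokens[i]);
            if (rule == null) {
                ops[i] = Op.NO_OPERATION;
                continue;
            }
            switch (rule) {
                case MOVE_LEFT:
                    ops[i] = Op.MERGE_TO_LEFT;
                    break;
                case MOVE_RIGHT:
                    ops[i] = Op.MERGE_TO_RIGHT;
                    break;
                case MOVE_BOTH:
                    ops[i] = Op.MERGE_BOTH;
                    break;
                default: // RIGHT_LEFT_MATCHING
                    if (open.remove(tokens[i])) {
                        ops[i] = Op.MERGE_TO_LEFT;  // closing occurrence
                    } else {
                        open.add(tokens[i]);
                        ops[i] = Op.MERGE_TO_RIGHT; // opening occurrence
                    }
                    break;
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "said", "\"", "This", "is", "a", "test", "\"", "."};
        System.out.println(Arrays.toString(toOperations(tokens, exampleRules())));
    }
}
```

Under these assumptions, the opening quote attaches to "This" and the closing quote and the period attach to the left, which corresponds to the sentence `He said "This is a test".` shown earlier in the chapter.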
413472
</section>
414473
</chapter>
