@@ -396,19 +396,78 @@ test -> NO_OPERATION
<![CDATA[
He said "This is a test".]]>
</programlisting >
- TODO: Add documentation about the dictionary format and how to use the API. Contributions are welcome.
</para >
<section id =" tools.tokenizer.detokenizing.api" >
<title >Detokenizing API</title >
- <para >TODO: Write documentation about the detokenizer api. Any contributions
- are very welcome. If you want to contribute please contact us on the mailing list
- or comment on the jira issue <ulink url =" https://issues.apache.org/jira/browse/OPENNLP-216" >OPENNLP-216</ulink >.</para >
+ <para >
+ The Detokenizer converts an array of tokens back into a String.
+ The DictionaryDetokenizer is a rule-based implementation; to instantiate it,
+ a DetokenizationDictionary (the rule dictionary) must be created first.
+ The following code sample shows how a rule dictionary can be loaded from a file.
+ <programlisting language =" java" >
+ <![CDATA[
+ DetokenizationDictionary dict;
+ try (InputStream dictIn = new FileInputStream("latin-detokenizer.xml")) {
+   dict = new DetokenizationDictionary(dictIn);
+ }]]>
+ </programlisting >
+ After the rule dictionary is loaded, the DictionaryDetokenizer can be instantiated.
+ <programlisting language =" java" >
+ <![CDATA[
+ Detokenizer detokenizer = new DictionaryDetokenizer(dict);]]>
+ </programlisting >
+ The detokenizer offers two detokenize methods. The first detokenizes the input tokens into a String;
+ passing null as the second argument means that no split marker is inserted.
+ <programlisting language =" java" >
+ <![CDATA[
+ String[] tokens = new String[]{"A", "co", "-", "worker", "helped", "."};
+ String sentence = detokenizer.detokenize(tokens, null);
+ Assert.assertEquals("A co-worker helped.", sentence);]]>
+ </programlisting >
+ When a split marker is passed as the second argument, it is inserted wherever two tokens
+ are joined without a space in between.
+ <programlisting language =" java" >
+ <![CDATA[
+ String markedSentence = detokenizer.detokenize(tokens, "<SPLIT>");
+ Assert.assertEquals("A co<SPLIT>-<SPLIT>worker helped<SPLIT>.", markedSentence);]]>
+ </programlisting >
+ The second method returns an array of detokenization operations, one for each input token, instead of a String.
+ <programlisting language =" java" >
+ <![CDATA[
+ DetokenizationOperation[] operations = detokenizer.detokenize(tokens);
+ for (DetokenizationOperation operation : operations) {
+ System.out.println(operation);
+ }]]>
+ </programlisting >
+ Output:
+ <programlisting >
+ <![CDATA[
+ NO_OPERATION
+ NO_OPERATION
+ MERGE_BOTH
+ NO_OPERATION
+ NO_OPERATION
+ MERGE_TO_LEFT]]>
+ </programlisting >
+ </para >
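To make the relationship between the operations array and the detokenized String concrete, here is a minimal, self-contained sketch that joins tokens according to such an array. It is not the OpenNLP implementation: the class DetokenizeSketch, the enum Op and the method join are hypothetical names introduced for illustration, and the joining rule (no space when the next token merges to the left or the current token merges to the right) is an assumption chosen to reproduce the results shown above.

```java
public class DetokenizeSketch {

    // Hypothetical stand-in for the operation values printed above; not the library enum.
    public enum Op { NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_RIGHT, MERGE_BOTH }

    // Joins tokens into one String. Where two tokens are merged, the split
    // marker (if not null) is inserted instead of a space.
    public static String join(String[] tokens, Op[] ops, String splitMarker) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            sb.append(tokens[i]);
            if (i + 1 == tokens.length) {
                break; // nothing follows the last token
            }
            boolean nextAttachesLeft =
                ops[i + 1] == Op.MERGE_TO_LEFT || ops[i + 1] == Op.MERGE_BOTH;
            boolean thisAttachesRight =
                ops[i] == Op.MERGE_TO_RIGHT || ops[i] == Op.MERGE_BOTH;
            if (nextAttachesLeft || thisAttachesRight) {
                if (splitMarker != null) {
                    sb.append(splitMarker);
                }
            } else {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"A", "co", "-", "worker", "helped", "."};
        Op[] ops = {Op.NO_OPERATION, Op.NO_OPERATION, Op.MERGE_BOTH,
                    Op.NO_OPERATION, Op.NO_OPERATION, Op.MERGE_TO_LEFT};
        System.out.println(join(tokens, ops, null));      // A co-worker helped.
        System.out.println(join(tokens, ops, "<SPLIT>")); // A co<SPLIT>-<SPLIT>worker helped<SPLIT>.
    }
}
```

Keeping the operations separate from the joining step is what allows a caller to use the operations array directly, for example to decide where whitespace is allowed without building a String.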
</section >
<section id =" tools.tokenizer.detokenizing.dict" >
<title >Detokenizer Dictionary</title >
- <para >TODO: Write documentation about the detokenizer dictionary. Any contributions
- are very welcome. If you want to contribute please contact us on the mailing list
- or comment on the jira issue <ulink url =" https://issues.apache.org/jira/browse/OPENNLP-217" >OPENNLP-217</ulink >.</para >
+ <para >
+ The DetokenizationDictionary holds the rules used by the detokenizer.
+ It can also be created programmatically from two parallel arrays:
+ tokens - an array of tokens that should be detokenized according to an operation.
+ operations - an array of operations, where each entry specifies the operation
+ to apply to the token at the same index.
+ The following code sample shows how a rule dictionary can be created.
+ <programlisting language =" java" >
+ <![CDATA[
+ // Operation refers to the nested enum DetokenizationDictionary.Operation
+ String[] tokens = new String[]{".", "!", "(", ")", "\"", "-"};
+ Operation[] operations = new Operation[]{
+ Operation.MOVE_LEFT,
+ Operation.MOVE_LEFT,
+ Operation.MOVE_RIGHT,
+ Operation.MOVE_LEFT,
+ Operation.RIGHT_LEFT_MATCHING,
+ Operation.MOVE_BOTH};
+ DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations);]]>
+ </programlisting >
+ </para >
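The way such rules could be turned into per-token operations can be sketched as follows. This is an illustration under stated assumptions, not the library code: the class RuleLookupSketch, the enums Rule and Op, and the method operationsFor are hypothetical, and the sketch assumes that a RIGHT_LEFT_MATCHING token alternates between attaching to the right (opening occurrence) and attaching to the left (closing occurrence), while tokens without a rule receive NO_OPERATION.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RuleLookupSketch {

    // Hypothetical stand-ins for the dictionary rules and the resulting operations.
    public enum Rule { MOVE_LEFT, MOVE_RIGHT, MOVE_BOTH, RIGHT_LEFT_MATCHING }
    public enum Op { NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_RIGHT, MERGE_BOTH }

    public static Op[] operationsFor(String[] tokens, Map<String, Rule> rules) {
        // Matching tokens (such as quotes) that are currently "open".
        Set<String> open = new HashSet<>();
        Op[] ops = new Op[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            Rule rule = rules.get(tokens[i]);
            if (rule == null) {
                ops[i] = Op.NO_OPERATION; // no rule for this token
                continue;
            }
            switch (rule) {
                case MOVE_LEFT:
                    ops[i] = Op.MERGE_TO_LEFT;
                    break;
                case MOVE_RIGHT:
                    ops[i] = Op.MERGE_TO_RIGHT;
                    break;
                case MOVE_BOTH:
                    ops[i] = Op.MERGE_BOTH;
                    break;
                default: // RIGHT_LEFT_MATCHING
                    if (open.remove(tokens[i])) {
                        ops[i] = Op.MERGE_TO_LEFT;  // closing occurrence
                    } else {
                        open.add(tokens[i]);
                        ops[i] = Op.MERGE_TO_RIGHT; // opening occurrence
                    }
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        Map<String, Rule> rules = new HashMap<>();
        rules.put(".", Rule.MOVE_LEFT);
        rules.put("\"", Rule.RIGHT_LEFT_MATCHING);
        String[] tokens = {"He", "said", "\"", "This", "is", "a", "test", "\"", "."};
        // Prints: [NO_OPERATION, NO_OPERATION, MERGE_TO_RIGHT, NO_OPERATION, NO_OPERATION,
        //          NO_OPERATION, NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_LEFT]
        System.out.println(Arrays.toString(operationsFor(tokens, rules)));
    }
}
```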
</section >
</section >
</chapter >