<p><span class="citation" data-cites="shalev2022ham2pose">Arkushin, Moryossef, and Fried (<a href="#ref-shalev2022ham2pose" role="doc-biblioref">2023</a>)</span> proposed Ham2Pose, a model to animate HamNoSys into a sequence of poses. They first encode the HamNoSys into a meaningful “context” representation using a transformer encoder and use it to predict the length of the pose sequence to be generated. Then, starting from a still frame, they use an iterative non-autoregressive decoder to gradually refine the sign over <span class="math inline"><em>T</em></span> steps: in each time step <span class="math inline"><em>t</em></span> from <span class="math inline"><em>T</em></span> to <span class="math inline">1</span>, the model predicts the required change from step <span class="math inline"><em>t</em></span> to step <span class="math inline"><em>t</em> − 1</span>. After <span class="math inline"><em>T</em></span> steps, the pose generator outputs the final pose sequence. Their model outperformed previous methods such as <span class="citation" data-cites="saunders2020progressive">Saunders, Camgöz, and Bowden (<a href="#ref-saunders2020progressive" role="doc-biblioref">2020b</a>)</span>, animating HamNoSys into more realistic sign language sequences.</p>
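<p>The decoding schedule described above can be sketched as a simple loop. This is only a minimal illustration of the idea, not the authors’ implementation; the <code>encoder</code>, <code>length_predictor</code>, and <code>refine</code> components are hypothetical placeholders.</p>
<pre><code class="language-python">
import torch

def generate_pose_sequence(hamnosys_tokens, encoder, length_predictor, refine,
                           T=10, num_keypoints=137):
    """Sketch of Ham2Pose-style iterative non-autoregressive decoding.

    Placeholder callables (assumptions, not a real API):
      encoder(tokens)            -> context representation of the HamNoSys string
      length_predictor(context)  -> predicted number of frames (int)
      refine(poses, context, t)  -> predicted change from step t to step t - 1
    """
    context = encoder(hamnosys_tokens)
    seq_len = length_predictor(context)

    # Start from a still frame repeated over the whole sequence.
    still_frame = torch.zeros(num_keypoints, 3)
    poses = still_frame.expand(seq_len, -1, -1).clone()

    # Gradually refine the sequence over T steps, from t = T down to t = 1.
    for t in range(T, 0, -1):
        delta = refine(poses, context, t)
        poses = poses + delta

    return poses
</code></pre>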
<h4 id="evaluation-metrics">Evaluation Metrics</h4>
<p>Methods for automatic evaluation of sign language processing are typically dependent only on the output and independent of the input.</p>
<h5id="text-output">Text output</h5>
<p>For tasks that output spoken language text, standard machine translation metrics such as BLEU, chrF, or COMET are commonly used. <!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: examples --></p>
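<p>As an illustration, BLEU and chrF can both be computed with the <code>sacrebleu</code> library; the sentences below are invented.</p>
<pre><code class="language-python">
import sacrebleu

hypotheses = ["the weather is nice today"]
references = [["the weather is good today"]]  # one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)

print(f"BLEU: {bleu.score:.2f}")
print(f"chrF: {chrf.score:.2f}")
</code></pre>
<p>COMET works similarly but requires downloading a pretrained scoring model, so it is omitted here.</p>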
<h5id="gloss-output">Gloss Output</h5>
<p>Gloss outputs can be automatically scored as well, though not without issues. In particular, <span class="citation" data-cites="muller-etal-2023-considerations">Müller et al. (<a href="#ref-muller-etal-2023-considerations" role="doc-biblioref">2023</a>)</span> analysed these issues and provided a series of recommendations (see the section on “Glosses”, above).</p>
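<p>For example, gloss sequences are often scored with word error rate (WER) over the gloss tokens; a minimal sketch using the <code>jiwer</code> package (the gloss strings are invented, and this does not reflect the specific recommendations cited above):</p>
<pre><code class="language-python">
from jiwer import wer

reference = "INDEX WEATHER NICE TODAY"
hypothesis = "INDEX WEATHER GOOD TODAY"

# Fraction of gloss tokens that were substituted, inserted, or deleted.
error_rate = wer(reference, hypothesis)
print(f"Gloss WER: {error_rate:.2f}")
</code></pre>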
<h5id="pose-output">Pose Output</h5>
<p>For translation from spoken languages to signed languages, automatic evaluation metrics are an open line of research, though some metrics involving back-translation have been developed (see Text-to-Pose and Notation-to-Pose, above). <!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: "Progressive Transformers for End-to-End Sign Language Production" is the one cited in Towards Fast and High-Quality Sign Language Production as a "widely-used setting" for backtranslation. --><!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: Towards Fast and High-Quality Sign Language Production uses back-translation. Discuss results and issues. --></p>
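<p>In a back-translation setup, the generated pose sequence is translated back into spoken language text by a recognition model, and that back-translation is compared to the source text with a text metric. A rough sketch, under the assumption that such a model is available (<code>pose_to_text</code> is a hypothetical placeholder):</p>
<pre><code class="language-python">
import sacrebleu

def back_translation_score(source_texts, generated_pose_sequences, pose_to_text):
    """Score generated poses indirectly via back-translation BLEU.

    `pose_to_text` is a placeholder callable mapping one pose sequence
    to a spoken-language text string.
    """
    back_translations = [pose_to_text(poses) for poses in generated_pose_sequences]
    return sacrebleu.corpus_bleu(back_translations, [source_texts]).score
</code></pre>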
<!-- These three papers are cited in @shalev2022ham2pose as previous work using APE -->
<p>Naively, works in this domain have used metrics such as Mean Squared Error (MSE) or Average Position Error (APE) for pose outputs <span class="citation" data-cites="ahuja2019Language2PoseNaturalLanguage ghosh2021SynthesisCompositionalAnimations petrovich2022TEMOSGeneratingDiverse">(Ahuja and Morency 2019; Ghosh et al. 2021; Petrovich, Black, and Varol 2022)</span>. However, these metrics have significant limitations for Sign Language Production.</p>
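<p>Both baselines are straightforward to compute with NumPy, but note that they assume the generated and reference sequences are frame-aligned and of identical length:</p>
<pre><code class="language-python">
import numpy as np

def mse(pred, ref):
    """Mean squared error over all frames, keypoints, and coordinates."""
    return float(np.mean((pred - ref) ** 2))

def ape(pred, ref):
    """Average position error: mean Euclidean distance per keypoint, per frame."""
    return float(np.mean(np.linalg.norm(pred - ref, axis=-1)))

# pred and ref: arrays of shape (frames, keypoints, coordinates)
pred = np.random.rand(50, 137, 3)
ref = np.random.rand(50, 137, 3)
print(mse(pred, ref), ape(pred, ref))
</code></pre>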
<p>For example, MSE and APE do not account for variations in sequence length. In practice, the same sign will not always take exactly the same amount of time to produce, even by the same signer. To address time variation, <span class="citation" data-cites="huang2021towards">Huang et al. (<a href="#ref-huang2021towards" role="doc-biblioref">2021</a>)</span> introduced DTW-MJE (Dynamic Time Warping - Mean Joint Error), a metric for pose sequence outputs that measures the distance between generated and reference pose sequences at the joint level using dynamic time warping. However, this metric did not clearly address how to handle missing keypoints. <span class="citation" data-cites="shalev2022ham2pose">Arkushin, Moryossef, and Fried (<a href="#ref-shalev2022ham2pose" role="doc-biblioref">2023</a>)</span> experimented with multiple evaluation methods and proposed adding a distance function that accounts for these missing keypoints. They applied this function with normalization of keypoints, naming their metric nDTW-MJE. <!-- They don't explicitly explain that the lowercase n is for "normalized keypoints" but that's my guess. -Colin --></p>
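<p>A simplified NumPy sketch of a DTW-based mean joint error with a frame distance that skips missing keypoints is shown below. It only illustrates the general idea and is not the reference implementation of DTW-MJE or nDTW-MJE; representing missing keypoints as NaN is an assumption of this sketch.</p>
<pre><code class="language-python">
import numpy as np

def frame_distance(a, b):
    """Mean Euclidean distance over keypoints, skipping keypoints that are
    missing (NaN) in either frame. Frames have shape (keypoints, coords)."""
    valid = ~(np.isnan(a).any(axis=-1) | np.isnan(b).any(axis=-1))
    if not valid.any():
        return 0.0  # no keypoint visible in both frames; contributes no cost
    return float(np.mean(np.linalg.norm(a[valid] - b[valid], axis=-1)))

def dtw_mje(pred, ref):
    """Dynamic-time-warping alignment of two pose sequences of shape
    (frames, keypoints, coords), returning the accumulated joint error
    along the optimal warping path, normalized by the longer length."""
    n, m = len(pred), len(ref)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(pred[i - 1], ref[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Normalizing by the longer sequence length is a simple length correction,
    # chosen here for illustration only.
    return cost[n, m] / max(n, m)

# Example: sequences of different lengths, with some missing keypoints.
pred = np.random.rand(40, 137, 3)
ref = np.random.rand(50, 137, 3)
ref[0, :10] = np.nan  # first 10 keypoints missing in the first reference frame
print(dtw_mje(pred, ref))
</code></pre>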
<h5 id="multi-channel-block-output">Multi-Channel Block Output</h5>
<p>As an alternative to gloss sequences, <span class="citation" data-cites="kim-etal-2024-signbleu-automatic">Kim et al. (<a href="#ref-kim-etal-2024-signbleu-automatic" role="doc-biblioref">2024</a>)</span> proposed a multi-channel output representation for sign languages and introduced SignBLEU, a BLEU-like scoring method for these outputs. Instead of a single linear sequence of glosses, the representation segments sign language output into multiple linear channels, each containing discrete “blocks”. These blocks represent both manual and non-manual signals, for example, one channel for each hand and others for non-manual signals such as eyebrow movements. The blocks are then converted to n-grams: temporal grams capture sequences within a channel, and channel grams capture co-occurrences across channels. The SignBLEU score is calculated over these n-grams of varying orders. The authors evaluated SignBLEU on the DGS Corpus v3.0 <span class="citation" data-cites="dataset:Konrad_2020_dgscorpus_3 dataset:prillwitz2008dgs">(Konrad et al. <a href="#ref-dataset:Konrad_2020_dgscorpus_3" role="doc-biblioref">2020</a>; Prillwitz et al. <a href="#ref-dataset:prillwitz2008dgs" role="doc-biblioref">2008</a>)</span>, NIASL2021 <span class="citation" data-cites="dataset:huerta-enochian-etal-2022-kosign">(Huerta-Enochian et al. <a href="#ref-dataset:huerta-enochian-etal-2022-kosign" role="doc-biblioref">2022</a>)</span>, and NCSLGR <span class="citation" data-cites="dataset:Neidle_2020_NCSLGR_ISLRN Vogler2012ASLLRP_data_access_interface">(Neidle and Sclaroff <a href="#ref-dataset:Neidle_2020_NCSLGR_ISLRN" role="doc-biblioref">2012</a>; Vogler and Neidle <a href="#ref-Vogler2012ASLLRP_data_access_interface" role="doc-biblioref">2012</a>)</span> datasets, comparing it with single-channel (gloss) metrics such as BLEU, TER, chrF, and METEOR, as well as with human evaluations by native signers. They found that SignBLEU consistently correlated better with human evaluation than these alternatives. However, one limitation of this approach is the lack of suitable datasets: the authors reviewed a number of sign language corpora and noted the relative scarcity of multi-channel annotations. The <a href="https://github.com/eq4all-projects/SignBLEU">source code for SignBLEU</a> is available. As with SacreBLEU <span class="citation" data-cites="post-2018-call-sacrebleu">(Post <a href="#ref-post-2018-call-sacrebleu" role="doc-biblioref">2018</a>)</span>, it can generate “version signature” strings summarizing key parameters, to enhance reproducibility.</p>
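<p>To make the representation concrete, the toy sketch below extracts temporal grams (within one channel) and channel grams (co-occurrences across channels) from a small multi-channel block segment. The channel names and blocks are invented, and this is not the SignBLEU implementation.</p>
<pre><code class="language-python">
from collections import Counter

# A toy multi-channel segment: each channel is a list of (block, start, end) tuples.
segment = {
    "right_hand": [("HOUSE", 0.0, 0.5), ("BUY", 0.5, 1.0)],
    "left_hand": [("HOUSE", 0.0, 0.5)],
    "eyebrows": [("raised", 0.0, 1.0)],
}

def temporal_grams(channel_blocks, order=2):
    """N-grams of consecutive blocks within a single channel."""
    labels = [block for block, _, _ in channel_blocks]
    return Counter(tuple(labels[i:i + order]) for i in range(len(labels) - order + 1))

def channel_grams(segment):
    """Pairs of blocks from different channels that overlap in time."""
    grams = Counter()
    channels = list(segment.items())
    for i, (name_a, blocks_a) in enumerate(channels):
        for name_b, blocks_b in channels[i + 1:]:
            for block_a, s_a, e_a in blocks_a:
                for block_b, s_b, e_b in blocks_b:
                    if e_b > s_a and e_a > s_b:  # temporal overlap
                        grams[((name_a, block_a), (name_b, block_b))] += 1
    return grams

print(temporal_grams(segment["right_hand"]))
print(channel_grams(segment))
</code></pre>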
<!-- (and SignBLEU can be installed and run! https://colab.research.google.com/drive/1mRCSBQSvjkoSOz5MFiOko1CgtamuCVYO?usp=sharing) -->
<h3 id="sign-language-retrieval">Sign Language Retrieval</h3>
<p>Sign Language Retrieval is the task of finding a particular data item given some input. In contrast to translation, generation, or production tasks, a correct corresponding item may already exist in a collection, and the task is to find it among many candidates, if it exists. Metrics used include recall at rank K (R@K, higher is better) and median rank (MedR, lower is better).</p>
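<p>Both metrics can be computed directly from the rank that each query’s correct item receives; a small sketch with invented ranks:</p>
<pre><code class="language-python">
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of queries whose correct item is ranked within the top k."""
    ranks = np.asarray(ranks)
    return float(np.mean(ranks <= k))

def median_rank(ranks):
    return float(np.median(np.asarray(ranks)))

# 1-based rank of the correct item for each query (invented values).
ranks = [1, 3, 2, 10, 1, 57, 4]
print(f"R@1: {recall_at_k(ranks, 1):.2f}, "
      f"R@5: {recall_at_k(ranks, 5):.2f}, "
      f"MedR: {median_rank(ranks):.1f}")
</code></pre>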
<p>Huang, Wencan, Wenwen Pan, Zhou Zhao, and Qi Tian. 2021. “Towards Fast and High-Quality Sign Language Production.” In <em>Proceedings of the 29th ACM International Conference on Multimedia</em>, 3172–81.</p>
<p>Huerta-Enochian, Mathew, Du Hui Lee, Hye Jin Myung, Kang Suk Byun, and Jun Woo Lee. 2022. “KoSign Sign Language Translation Project: Introducing the NIASL2021 Dataset.” In <em>Proceedings of the 7th International Workshop on Sign Language Translation and Avatar Technology: The Junction of the Visual and the Textual: Challenges and Perspectives</em>, 59–66. Marseille, France: European Language Resources Association. <a href="https://aclanthology.org/2022.sltat-1.9">https://aclanthology.org/2022.sltat-1.9</a>.</p>
</div>
<divid="ref-humphries2016avoiding">
<p>Humphries, Tom, Poorna Kushalnagar, Gaurav Mathur, Donna Jo Napoli, Carol Padden, Christian Rathmann, and Scott Smith. 2016. “Avoiding Linguistic Neglect of Deaf Children.” <em>Social Service Review</em> 90 (4): 589–619.</p>
<p>Kezar, Lee, Jesse Thomason, and Zed Sehyr. 2023. “Improving Sign Recognition with Phonology.” In <em>Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics</em>, 2732–7. Dubrovnik, Croatia: Association for Computational Linguistics. <a href="https://aclanthology.org/2023.eacl-main.200">https://aclanthology.org/2023.eacl-main.200</a>.</p>
</div>
<divid="ref-kim-etal-2024-signbleu-automatic">
<p>Kim, Jung-Ho, Mathew John Huerta-Enochian, Changyong Ko, and Du Hui Lee. 2024. “SignBLEU: Automatic Evaluation of Multi-Channel Sign Language Translation.” In <em>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</em>, 14796–14811. Torino, Italy: ELRA; ICCL. <a href="https://aclanthology.org/2024.lrec-main.1289">https://aclanthology.org/2024.lrec-main.1289</a>.</p>
</div>
<divid="ref-kimmelman2014information">
<p>Kimmelman, Vadim. 2014. “Information Structure in Russian Sign Language and Sign Language of the Netherlands.” <em>Sign Language & Linguistics</em> 18 (1): 142–50.</p>
<p>Koller, Oscar, Jens Forster, and Hermann Ney. 2015. “Continuous Sign Language Recognition: Towards Large Vocabulary Statistical Recognition Systems Handling Multiple Signers.” <em>Computer Vision and Image Understanding</em> 141: 108–25. <a href="https://doi.org/10.1016/j.cviu.2015.09.013">https://doi.org/10.1016/j.cviu.2015.09.013</a>.</p>
</div>
<divid="ref-dataset:Konrad_2020_dgscorpus_3">
<p>Konrad, Reiner, Thomas Hanke, Gabriele Langer, Dolly Blanck, Julian Bleicken, Ilona Hofmann, Olga Jeziorski, et al. 2020. “MEINE DGS – Annotiert. Öffentliches Korpus Der Deutschen Gebärdensprache, 3. Release / MY DGS – Annotated. Public Corpus of German Sign Language, 3rd Release.” Language resource. Universität Hamburg. <a href="https://doi.org/10.25592/dgs.corpus-3.0">https://doi.org/10.25592/dgs.corpus-3.0</a>.</p>
<p>Napier, Jemina, and Lorraine Leeson. 2016. <em>Sign Language in Action</em>. London: Palgrave Macmillan.</p>
</div>
<divid="ref-dataset:Neidle_2020_NCSLGR_ISLRN">
<p>Neidle, Carol, and Stan Sclaroff. 2012. “National Center for Sign Language and Gesture Resources (NCSLGR) Corpus. ISLRN 833-505-711-564-4.” Language resource. Boston University. <a href="https://www.islrn.org/resources/833-505-711-564-4/">https://www.islrn.org/resources/833-505-711-564-4/</a>.</p>
</div>
<divid="ref-neidle2001signstream">
<p>Neidle, Carol, Stan Sclaroff, and Vassilis Athitsos. 2001. “SignStream: A Tool for Linguistic and Computer Vision Research on Visual-Gestural Language Data.” <em>Behavior Research Methods, Instruments, & Computers</em> 33 (3): 311–20.</p>
<p>Vogler, Christian, and Siome Goldenstein. 2005. “Analysis of Facial Expressions in American Sign Language.” In <em>Proc. of the 3rd Int. Conf. on Universal Access in Human-Computer Interaction</em>. Springer.</p>
<p>Vogler, Christian, and Carol Neidle. 2012. “A New Web Interface to Facilitate Access to Corpora: Development of the ASLLRP Data Access Interface.” <a href="https://api.semanticscholar.org/CorpusID:58305327">https://api.semanticscholar.org/CorpusID:58305327</a>.</p>
</div>
<divid="ref-dataset:von2007towards">
<p>Von Agris, Ulrich, and Karl-Friedrich Kraiss. 2007. “Towards a Video Corpus for Signer-Independent Continuous Sign Language Recognition.” <em>Gesture in Human-Computer Interaction and Simulation, Lisbon, Portugal, May</em> 11.</p>