<h3 id="pretraining-and-representation-learning">Pretraining and Representation-learning</h3>
<!-- SignBERT, SignBERT+, BEST. Possibly also Sign-VQ or CV-SLT can be discussed here -->
<p>In this paradigm, rather than targeting a specific task (e.g. pose-to-text), the aim is to learn a generally useful Sign Language Understanding model or representation that can then be applied to, or fine-tuned for, specific downstream tasks.</p>
<p><span class="citation" data-cites="hu2023SignBertPlus">Hu et al. (<a href="#ref-hu2023SignBertPlus" role="doc-biblioref">2023</a>)</span> introduce SignBERT+, a self-supervised pretraining method for sign language understanding (SLU) based on masked modeling of pose sequences. This is an extension of their earlier SignBERT <span class="citation" data-cites="hu2021SignBert">(H. Hu, Zhao, et al. <a href="#ref-hu2021SignBert" role="doc-biblioref">2021</a>)</span>, with several improvements. For pretraining they extract pose sequences from over 230k videos using MMPose <span class="citation" data-cites="mmpose2020">(Contributors <a href="#ref-mmpose2020" role="doc-biblioref">2020</a>)</span>. They then perform multi-level masked modeling (joints, frames, clips) on these sequences, integrating a statistical hand model <span class="citation" data-cites="romero2017MANOHandModel">(Romero, Tzionas, and Black <a href="#ref-romero2017MANOHandModel" role="doc-biblioref">2017</a>)</span> to constrain the decoder’s predictions for anatomical realism and enhanced accuracy. Validation on isolated SLR (MS-ASL <span class="citation" data-cites="dataset:joze2018ms">(Joze and Koller <a href="#ref-dataset:joze2018ms" role="doc-biblioref">2019</a>)</span>, WLASL <span class="citation" data-cites="dataset:li2020word">(Li et al. <a href="#ref-dataset:li2020word" role="doc-biblioref">2020</a>)</span>, SLR500 <span class="citation" data-cites="huang2019attention3DCNNsSLR">(Huang et al. <a href="#ref-huang2019attention3DCNNsSLR" role="doc-biblioref">2019</a>)</span>), continuous SLR (RWTH-PHOENIX-Weather <span class="citation" data-cites="koller2015ContinuousSLR">(Koller, Forster, and Ney <a href="#ref-koller2015ContinuousSLR" role="doc-biblioref">2015</a>)</span>), and SLT (RWTH-PHOENIX-Weather 2014T <span class="citation" data-cites="dataset:forster2014extensions cihan2018neural">(Forster et al. <a href="#ref-dataset:forster2014extensions" role="doc-biblioref">2014</a>; Camgöz et al. <a href="#ref-cihan2018neural" role="doc-biblioref">2018</a>)</span>) demonstrates state-of-the-art performance.</p>
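<p>A minimal sketch of this multi-level masking idea is given below (Python with NumPy). The tensor shapes, mask ratios, and clip length are illustrative assumptions, and the MANO-constrained decoder that reconstructs the masked poses is omitted; this is not the authors’ implementation.</p>
<pre><code class="language-python">import numpy as np

def mask_pose_sequence(poses, joint_ratio=0.15, frame_ratio=0.1, clip_len=8, seed=None):
    """Corrupt a pose sequence of shape (frames, joints, 2) at three levels.

    A rough sketch of joint-, frame-, and clip-level masking; masked positions
    are zeroed so a model can be trained to reconstruct them.
    """
    rng = np.random.default_rng(seed)
    masked = poses.copy()
    num_frames, num_joints, _ = poses.shape

    # Joint-level masking: hide a random subset of joints in every frame.
    joint_idx = rng.choice(num_joints, size=max(1, int(joint_ratio * num_joints)), replace=False)
    masked[:, joint_idx] = 0.0

    # Frame-level masking: hide a random subset of whole frames.
    frame_idx = rng.choice(num_frames, size=max(1, int(frame_ratio * num_frames)), replace=False)
    masked[frame_idx] = 0.0

    # Clip-level masking: hide one contiguous block of frames.
    start = int(rng.integers(0, max(1, num_frames - clip_len)))
    masked[start:start + clip_len] = 0.0

    return masked, joint_idx, frame_idx, (start, start + clip_len)

# Example: 128 frames of 21 hand keypoints with (x, y) coordinates.
poses = np.random.rand(128, 21, 2).astype(np.float32)
corrupted, joint_idx, frame_idx, clip_span = mask_pose_sequence(poses, seed=0)
</code></pre>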
<!-- BEST seems to be **B**ERT pre-training for **S**ign language recognition with coupling **T**okenization -->
<p><span class="citation" data-cites="Zhao2023BESTPretrainingSignLanguageRecognition">Zhao et al. (<a href="#ref-Zhao2023BESTPretrainingSignLanguageRecognition" role="doc-biblioref">2023</a>)</span> introduce BEST (BERT Pre-training for Sign Language Recognition with Coupling Tokenization), a pre-training method based on masked modeling of pose sequences using a coupled tokenization scheme. This method takes pose triplet units (left hand, right hand, and upper-body with arms) as inputs, each tokenized into discrete codes <span class="citation" data-cites="van_den_Oord_2017NeuralDiscreteRepresentationLearning">(Oord, Vinyals, and Kavukcuoglu <a href="#ref-van_den_Oord_2017NeuralDiscreteRepresentationLearning" role="doc-biblioref">2017</a>)</span> that are then coupled together. Masked modeling is then applied, where any or all components of the triplet (left hand, right hand, or upper-body) may be masked, to learn hierarchical correlations among them. Unlike <span class="citation" data-cites="hu2023SignBertPlus">Hu et al. (<a href="#ref-hu2023SignBertPlus" role="doc-biblioref">2023</a>)</span>, BEST does not mask multi-frame pose sequences or individual joints. The authors validate their pre-training method on isolated sign recognition (ISR) tasks using MS-ASL <span class="citation" data-cites="dataset:joze2018ms">(Joze and Koller <a href="#ref-dataset:joze2018ms" role="doc-biblioref">2019</a>)</span>, WLASL <span class="citation" data-cites="dataset:li2020word">(Li et al. <a href="#ref-dataset:li2020word" role="doc-biblioref">2020</a>)</span>, SLR500 <span class="citation" data-cites="huang2019attention3DCNNsSLR">(Huang et al. <a href="#ref-huang2019attention3DCNNsSLR" role="doc-biblioref">2019</a>)</span>, and NMFs-CSL <span class="citation" data-cites="hu2021NMFAwareSLR">(H. Hu, Zhou, et al. <a href="#ref-hu2021NMFAwareSLR" role="doc-biblioref">2021</a>)</span>. Besides pose-to-gloss, they also experiment with video-to-gloss tasks via fusion with I3D <span class="citation" data-cites="carreira2017quo">(Carreira and Zisserman <a href="#ref-carreira2017quo" role="doc-biblioref">2017</a>)</span>. Results on these datasets demonstrate state-of-the-art performance compared to previous methods and are comparable to those of SignBERT+ <span class="citation" data-cites="hu2023SignBertPlus">(Hu et al. <a href="#ref-hu2023SignBertPlus" role="doc-biblioref">2023</a>)</span>.</p>
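<p>The coupled tokenization and triplet masking can be illustrated with a toy sketch. Here random codebooks stand in for the learned vector-quantization codebooks, and the feature dimensions, codebook size, and masking rate are assumptions made for brevity; the model that predicts the masked codes is not shown.</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)

def quantize(features, codebook):
    """Map each frame's feature vector to the index of its nearest codebook entry
    (the discrete-code step of vector quantization, here with a random codebook)."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=-1)

frames, dim, codebook_size = 64, 32, 256
parts = ("left_hand", "right_hand", "body")
codebooks = {part: rng.normal(size=(codebook_size, dim)) for part in parts}

# Per-frame features for each part of the triplet (stand-ins for pose embeddings).
features = {part: rng.normal(size=(frames, dim)) for part in parts}

# Couple the per-part discrete codes into one triplet token per frame: shape (frames, 3).
triplets = np.stack([quantize(features[part], codebooks[part]) for part in parts], axis=-1)

# Masked modeling: in roughly half of the frames, hide any subset of the triplet's
# components; a model would be trained to predict the hidden codes from the visible ones.
MASK = -1
masked = triplets.copy()
for t in rng.choice(frames, size=frames // 2, replace=False):
    hide = rng.choice(3, size=int(rng.integers(1, 4)), replace=False)
    masked[t, hide] = MASK
</code></pre>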
<h2 id="annotation-tools">Annotation Tools</h2>
<h5 id="elan---eudico-linguistic-annotator">ELAN - EUDICO Linguistic Annotator</h5>
<p><a href="https://archive.mpi.nl/tla/elan">ELAN</a> <span class="citation" data-cites="wittenburg2006elan">(Wittenburg et al. <a href="#ref-wittenburg2006elan" role="doc-biblioref">2006</a>)</span> is an annotation tool for audio and video recordings. With ELAN, a user can add an unlimited number of textual annotations to audio and/or video recordings. An annotation can be a sentence, word, gloss, comment, translation, or description of any feature observed in the media. Annotations can be created on multiple layers, called tiers, which can be hierarchically interconnected. An annotation can either be time-aligned to the media or refer to other existing annotations. The content of annotations consists of Unicode text, and annotation documents are stored in an XML format (EAF). ELAN is open source (<a href="https://en.wikipedia.org/wiki/GNU_General_Public_License#Version_3">GPLv3</a>), and installation is <a href="https://archive.mpi.nl/tla/elan/download">available</a> for Windows, macOS, and Linux. pympi <span class="citation" data-cites="pympi-1.69">(Lubbers and Torreira <a href="#ref-pympi-1.69" role="doc-biblioref">2013</a>)</span> allows for simple Python interaction with ELAN files.</p>
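<p>For example, a few lines of pympi suffice to read and extend an EAF document. In this sketch the file path and tier names are placeholders, and the exact tuple layout returned for annotations may vary between pympi versions.</p>
<pre><code class="language-python">import pympi  # pip install pympi-ling

# Load an existing ELAN annotation document (the path is a placeholder).
eaf = pympi.Elan.Eaf("recording.eaf")

# List the tiers, then print the time-aligned annotations on one of them.
for tier in eaf.get_tier_names():
    print(tier)
for start_ms, end_ms, value in eaf.get_annotation_data_for_tier("GLOSS"):
    print(start_ms, end_ms, value)

# Add a new tier with one annotation and write out a copy of the document.
eaf.add_tier("COMMENTS")
eaf.add_annotation("COMMENTS", 1000, 2500, "example annotation")
eaf.to_file("recording_updated.eaf")
</code></pre>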
<p>Cao, Z., G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. “OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>.</p>
</div>
<div id="ref-carreira2017quo">
<p>Carreira, Joao, and Andrew Zisserman. 2017. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” <em>ArXiv Preprint</em> abs/1705.07750. <a href="https://arxiv.org/abs/1705.07750">https://arxiv.org/abs/1705.07750</a>.</p>
</div>
<div id="ref-dataset:chai2014devisign">
<p>Chai, Xiujuan, Hanjie Wang, and Xilin Chen. 2014. “The Devisign Large Vocabulary of Chinese Sign Language Database and Baseline Evaluations.” <em>Technical Report VIPL-TR-14-SLR-001. Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS</em>.</p>
<p>Hu, Hezhen, Weichao Zhao, Wengang Zhou, and Houqiang Li. 2023. “SignBERT+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding.” <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em> 45 (9): 11221–39. <a href="https://doi.org/10.1109/TPAMI.2023.3269220">https://doi.org/10.1109/TPAMI.2023.3269220</a>.</p>
</div>
<div id="ref-hu2021SignBert">
<p>Hu, Hezhen, Weichao Zhao, Wengang Zhou, Yuechen Wang, and Houqiang Li. 2021. “SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition.” In <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em>, 11087–96.</p>
</div>
<div id="ref-hu2021NMFAwareSLR">
<p>Hu, Hezhen, Wengang Zhou, Junfu Pu, and Houqiang Li. 2021. “Global-Local Enhancement Network for NMF-Aware Sign Language Recognition.” <em>ACM Trans. Multimedia Comput. Commun. Appl.</em> 17 (3). <a href="https://doi.org/10.1145/3436754">https://doi.org/10.1145/3436754</a>.</p>
<p>Oliveira, Marlon, Houssem Chatbri, Ylva Ferstl, Mohamed Farouk, Suzanne Little, Noel O’Connor, and A. Sutherland. 2017. “A Dataset for Irish Sign Language Recognition.” In <em>Proceedings of the Irish Machine Vision and Image Processing Conference (IMVIP)</em>.</p>
<p>Oord, Aaron van den, Oriol Vinyals, and Koray Kavukcuoglu. 2017. “Neural Discrete Representation Learning.” In <em>Proceedings of the 31st International Conference on Neural Information Processing Systems</em>, 6309–18. NIPS’17. Red Hook, NY, USA: Curran Associates Inc.</p>
</div>
<div id="ref-ormel2012prosodic">
<p>Ormel, Ellen, and Onno Crasborn. 2012. “Prosodic Correlates of Sentences in Signed Languages: A Literature Review and Suggestions for New Types of Studies.” <em>Sign Language Studies</em> 12 (2): 279–315.</p>
<p>Zhao, Rui, Liang Zhang, Biao Fu, Cong Hu, Jinsong Su, and Yidong Chen. 2024. “Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment.” <em>Proceedings of the AAAI Conference on Artificial Intelligence</em> 38 (17): 19643–51. <a href="https://doi.org/10.1609/aaai.v38i17.29937">https://doi.org/10.1609/aaai.v38i17.29937</a>.</p>
<p>Zhao, Weichao, Hezhen Hu, Wengang Zhou, Jiaxin Shi, and Houqiang Li. 2023. “BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization.” <em>Proceedings of the AAAI Conference on Artificial Intelligence</em> 37 (3): 3597–3605. <a href="https://doi.org/10.1609/aaai.v37i3.25470">https://doi.org/10.1609/aaai.v37i3.25470</a>.</p>
</div>
<div id="ref-zwitserlood2004synthetic">
<p>Zwitserlood, Inge, Margriet Verlinden, Johan Ros, and Sanny Van Der Schoot. 2004. “Synthetic Signing for the Deaf: eSIGN.” In <em>Proceedings of the Conference and Workshop on Assistive Technologies for Vision and Hearing Impairment (CVHI)</em>.</p>