
Commit e3b1b6e

Deploying to gh-pages from @ a901d72 🚀
1 parent 18f772b commit e3b1b6e

3 files changed: +22 -5 lines changed


index.html (+12 -2)
@@ -422,8 +422,9 @@ <h4 id="production">Production</h4>
<h3 id="pretraining-and-representation-learning">Pretraining and Representation-learning</h3>
<!-- SignBERT, SignBERT+, BEST. Possibly also Sign-VQ or CV-SLT can be discussed here -->
<p>In this paradigm, rather than targeting a specific task (e.g. pose-to-text), the aim is to learn a generally-useful Sign Language Understanding model or representation which can be applied or finetuned to specific downstream tasks.</p>
-<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: talk about BEST here. Results are not as good as SignBERT+ but the do some things differently. Compare/contrast things like: left+right+body triplets in BEST vs left+right only in SignBERT+ -->
<p><span class="citation" data-cites="hu2023SignBertPlus">Hu et al. (<a href="#ref-hu2023SignBertPlus" role="doc-biblioref">2023</a>)</span> introduce SignBERT+, a self-supervised pretraining method for sign language understanding (SLU) based on masked modeling of pose sequences. This is an extension of their earlier SignBERT <span class="citation" data-cites="hu2021SignBert">(H. Hu, Zhao, et al. <a href="#ref-hu2021SignBert" role="doc-biblioref">2021</a>)</span>, with several improvements. For pretraining they extract pose sequences from over 230k videos using MMPose <span class="citation" data-cites="mmpose2020">(Contributors <a href="#ref-mmpose2020" role="doc-biblioref">2020</a>)</span>. They then perform multi-level masked modeling (joints, frames, clips) on these sequences, integrating a statistical hand model <span class="citation" data-cites="romero2017MANOHandModel">(Romero, Tzionas, and Black <a href="#ref-romero2017MANOHandModel" role="doc-biblioref">2017</a>)</span> to constrain the decoder’s predictions for anatomical realism and enhanced accuracy. Validation on isolated SLR (MS-ASL <span class="citation" data-cites="dataset:joze2018ms">(Joze and Koller <a href="#ref-dataset:joze2018ms" role="doc-biblioref">2019</a>)</span>, WLASL <span class="citation" data-cites="dataset:li2020word">(Li et al. <a href="#ref-dataset:li2020word" role="doc-biblioref">2020</a>)</span>, SLR500 <span class="citation" data-cites="huang2019attention3DCNNsSLR">(Huang et al. <a href="#ref-huang2019attention3DCNNsSLR" role="doc-biblioref">2019</a>)</span>), continuous SLR (RWTH-PHOENIX-Weather <span class="citation" data-cites="koller2015ContinuousSLR">(Koller, Forster, and Ney <a href="#ref-koller2015ContinuousSLR" role="doc-biblioref">2015</a>)</span>), and SLT (RWTH-PHOENIX-Weather 2014T <span class="citation" data-cites="dataset:forster2014extensions cihan2018neural">(Forster et al. <a href="#ref-dataset:forster2014extensions" role="doc-biblioref">2014</a>; Camgöz et al. <a href="#ref-cihan2018neural" role="doc-biblioref">2018</a>)</span>) demonstrates state-of-the-art performance.</p>
+<!-- BEST seems to be **B**ERT pre-training for **S**ign language recognition with coupling **T**okenization -->
+<p><span class="citation" data-cites="Zhao2023BESTPretrainingSignLanguageRecognition">Zhao et al. (<a href="#ref-Zhao2023BESTPretrainingSignLanguageRecognition" role="doc-biblioref">2023</a>)</span> introduce BEST (BERT Pre-training for Sign Language Recognition with Coupling Tokenization), a pre-training method based on masked modeling of pose sequences using a coupled tokenization scheme. This method takes pose triplet units (left hand, right hand, and upper-body with arms) as inputs, each tokenized into discrete codes <span class="citation" data-cites="van_den_Oord_2017NeuralDiscreteRepresentationLearning">(Oord, Vinyals, and Kavukcuoglu <a href="#ref-van_den_Oord_2017NeuralDiscreteRepresentationLearning" role="doc-biblioref">2017</a>)</span> that are then coupled together. Masked modeling is then applied, where any or all components of the triplet (left hand, right hand, or upper-body) may be masked, to learn hierarchical correlations among them. Unlike <span class="citation" data-cites="hu2023SignBertPlus">Hu et al. (<a href="#ref-hu2023SignBertPlus" role="doc-biblioref">2023</a>)</span>, BEST does not mask multi-frame pose sequences or individual joints. The authors validate their pre-training method on isolated sign recognition (ISR) tasks using MS-ASL <span class="citation" data-cites="dataset:joze2018ms">(Joze and Koller <a href="#ref-dataset:joze2018ms" role="doc-biblioref">2019</a>)</span>, WLASL <span class="citation" data-cites="dataset:li2020word">(Li et al. <a href="#ref-dataset:li2020word" role="doc-biblioref">2020</a>)</span>, SLR500 <span class="citation" data-cites="huang2019attention3DCNNsSLR">(Huang et al. <a href="#ref-huang2019attention3DCNNsSLR" role="doc-biblioref">2019</a>)</span>, and NMFs-CSL <span class="citation" data-cites="hu2021NMFAwareSLR">(H. Hu, Zhou, et al. <a href="#ref-hu2021NMFAwareSLR" role="doc-biblioref">2021</a>)</span>. Besides pose-to-gloss, they also experiment with video-to-gloss tasks via fusion with I3D <span class="citation" data-cites="carreira2017quo">(Carreira and Zisserman <a href="#ref-carreira2017quo" role="doc-biblioref">2017</a>)</span>. Results on these datasets demonstrate state-of-the-art performance compared to previous methods and are comparable to those of SignBERT+ <span class="citation" data-cites="hu2023SignBertPlus">(Hu et al. <a href="#ref-hu2023SignBertPlus" role="doc-biblioref">2023</a>)</span>.</p>
<h2 id="annotation-tools">Annotation Tools</h2>
<h5 id="elan---eudico-linguistic-annotator">ELAN - EUDICO Linguistic Annotator</h5>
<p><a href="https://archive.mpi.nl/tla/elan">ELAN</a> <span class="citation" data-cites="wittenburg2006elan">(Wittenburg et al. <a href="#ref-wittenburg2006elan" role="doc-biblioref">2006</a>)</span> is an annotation tool for audio and video recordings. With ELAN, a user can add an unlimited number of textual annotations to audio and/or video recordings. An annotation can be a sentence, word, gloss, comment, translation, or description of any feature observed in the media. Annotations can be created on multiple layers, called tiers, which can be hierarchically interconnected. An annotation can either be time-aligned to the media or refer to other existing annotations. The content of annotations consists of Unicode text, and annotation documents are stored in an XML format (EAF). ELAN is open source (<a href="https://en.wikipedia.org/wiki/GNU_General_Public_License#Version_3">GPLv3</a>), and installation is <a href="https://archive.mpi.nl/tla/elan/download">available</a> for Windows, macOS, and Linux. PyMPI <span class="citation" data-cites="pympi-1.69">(Lubbers and Torreira <a href="#ref-pympi-1.69" role="doc-biblioref">2013</a>)</span> allows for simple Python interaction with ELAN files.</p>
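As a rough illustration of the pympi interaction mentioned in the ELAN paragraph above, the sketch below reads annotations from a hypothetical EAF file. The file name and tier name are placeholders, and the calls shown follow pympi's commonly documented API; verify them against the pympi documentation for the installed version.

```python
import pympi  # pympi-ling, cited above as (Lubbers and Torreira 2013)

# Hypothetical EAF file exported from ELAN; "GLOSS" is a placeholder tier name.
eaf = pympi.Elan.Eaf("recording.eaf")

# Tiers are the annotation layers described above (glosses, translations, comments, ...).
print(eaf.get_tier_names())

# Each annotation comes back with at least (start_ms, end_ms, value).
for annotation in eaf.get_annotation_data_for_tier("GLOSS"):
    start_ms, end_ms, value = annotation[0], annotation[1], annotation[2]
    print(f"{start_ms}-{end_ms}: {value}")
```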
@@ -1128,6 +1129,9 @@ <h2 id="references">References</h2>
<div id="ref-pose:cao2018openpose">
<p>Cao, Z., G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. “OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em>.</p>
</div>
+<div id="ref-carreira2017quo">
+<p>Carreira, Joao, and Andrew Zisserman. 2017. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” <em>ArXiv Preprint</em> abs/1705.07750. <a href="https://arxiv.org/abs/1705.07750">https://arxiv.org/abs/1705.07750</a>.</p>
+</div>
<div id="ref-dataset:chai2014devisign">
<p>Chai, Xiujuan, Hanjie Wang, and Xilin Chen. 2014. “The Devisign Large Vocabulary of Chinese Sign Language Database and Baseline Evaluations.” <em>Technical Report VIPL-TR-14-SLR-001. Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS</em>.</p>
</div>
@@ -1297,7 +1301,7 @@ <h2 id="references">References</h2>
<p>Hu, Hezhen, Weichao Zhao, Wengang Zhou, and Houqiang Li. 2023. “SignBERT+: Hand-Model-Aware Self-Supervised Pre-Training for Sign Language Understanding.” <em>IEEE Transactions on Pattern Analysis and Machine Intelligence</em> 45 (9): 11221–39. <a href="https://doi.org/10.1109/TPAMI.2023.3269220">https://doi.org/10.1109/TPAMI.2023.3269220</a>.</p>
</div>
<div id="ref-hu2021SignBert">
-<p>Hu, Hezhen, Weichao Zhao, Wengang Zhou, Yuechen Wang, and Houqiang Li. 2021. “SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition.” In <em>2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, Qc, Canada, October 10-17, 2021</em>, 11067–76. IEEE. <a href="https://doi.org/10.1109/ICCV48922.2021.01090">https://doi.org/10.1109/ICCV48922.2021.01090</a>.</p>
+<p>Hu, Hezhen, Weichao Zhao, Wengang Zhou, Yuechen Wang, and Houqiang Li. 2021. “SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition.” In <em>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</em>, 11087–96.</p>
</div>
<div id="ref-hu2021NMFAwareSLR">
<p>Hu, Hezhen, Wengang Zhou, Junfu Pu, and Houqiang Li. 2021. “Global-Local Enhancement Network for NMF-Aware Sign Language Recognition.” <em>ACM Trans. Multimedia Comput. Commun. Appl.</em> 17 (3). <a href="https://doi.org/10.1145/3436754">https://doi.org/10.1145/3436754</a>.</p>
@@ -1506,6 +1510,9 @@ <h2 id="references">References</h2>
<div id="ref-dataset:oliveiraDatasetIrishSign2017">
<p>Oliveira, Marlon, Houssem Chatbri, Ylva Ferstl, Mohamed Farouk, Suzanne Little, Noel O’Connor, and A. Sutherland. 2017. “A Dataset for Irish Sign Language Recognition.” In <em>Proceedings of the Irish Machine Vision and Image Processing Conference (IMVIP)</em>.</p>
</div>
+<div id="ref-van_den_Oord_2017NeuralDiscreteRepresentationLearning">
+<p>Oord, Aaron van den, Oriol Vinyals, and Koray Kavukcuoglu. 2017. “Neural Discrete Representation Learning.” In <em>Proceedings of the 31st International Conference on Neural Information Processing Systems</em>, 6309–18. NIPS’17. Red Hook, NY, USA: Curran Associates Inc.</p>
+</div>
<div id="ref-ormel2012prosodic">
<p>Ormel, Ellen, and Onno Crasborn. 2012. “Prosodic Correlates of Sentences in Signed Languages: A Literature Review and Suggestions for New Types of Studies.” <em>Sign Language Studies</em> 12 (2): 279–315.</p>
</div>
@@ -1785,6 +1792,9 @@ <h2 id="references">References</h2>
<div id="ref-Zhao_Zhang_Fu_Hu_Su_Chen_2024">
<p>Zhao, Rui, Liang Zhang, Biao Fu, Cong Hu, Jinsong Su, and Yidong Chen. 2024. “Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment.” <em>Proceedings of the AAAI Conference on Artificial Intelligence</em> 38 (17): 19643–51. <a href="https://doi.org/10.1609/aaai.v38i17.29937">https://doi.org/10.1609/aaai.v38i17.29937</a>.</p>
</div>
+<div id="ref-Zhao2023BESTPretrainingSignLanguageRecognition">
+<p>Zhao, Weichao, Hezhen Hu, Wengang Zhou, Jiaxin Shi, and Houqiang Li. 2023. “BEST: BERT Pre-Training for Sign Language Recognition with Coupling Tokenization.” <em>Proceedings of the AAAI Conference on Artificial Intelligence</em> 37 (3): 3597–3605. <a href="https://doi.org/10.1609/aaai.v37i3.25470">https://doi.org/10.1609/aaai.v37i3.25470</a>.</p>
+</div>
<div id="ref-zwitserlood2004synthetic">
<p>Zwitserlood, Inge, Margriet Verlinden, Johan Ros, Sanny Van Der Schoot, and T Netherlands. 2004. “Synthetic Signing for the Deaf: eSIGN.” In <em>Proceedings of the Conference and Workshop on Assistive Technologies for Vision and Hearing Impairment, CVHI</em>.</p>
</div>

index.md (+9 -2)
@@ -957,14 +957,21 @@ Finally, they used this information to construct an animation system using lette
In this paradigm, rather than targeting a specific task (e.g. pose-to-text), the aim is to learn a generally-useful Sign Language Understanding model or representation which can be applied or finetuned to specific downstream tasks.

-<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: talk about BEST here. Results are not as good as SignBERT+ but the do some things differently. Compare/contrast things like: left+right+body triplets in BEST vs left+right only in SignBERT+ -->
-
@hu2023SignBertPlus introduce SignBERT+, a self-supervised pretraining method for sign language understanding (SLU) based on masked modeling of pose sequences.
This is an extension of their earlier SignBERT [@hu2021SignBert], with several improvements.
For pretraining they extract pose sequences from over 230k videos using MMPose [@mmpose2020].
They then perform multi-level masked modeling (joints, frames, clips) on these sequences, integrating a statistical hand model [@romero2017MANOHandModel] to constrain the decoder's predictions for anatomical realism and enhanced accuracy.
Validation on isolated SLR (MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR]), continuous SLR (RWTH-PHOENIX-Weather [@koller2015ContinuousSLR]), and SLT (RWTH-PHOENIX-Weather 2014T [@dataset:forster2014extensions;@cihan2018neural]) demonstrates state-of-the-art performance.

+<!-- BEST seems to be **B**ERT pre-training for **S**ign language recognition with coupling **T**okenization -->
+@Zhao2023BESTPretrainingSignLanguageRecognition introduce BEST (BERT Pre-training for Sign Language Recognition with Coupling Tokenization), a pre-training method based on masked modeling of pose sequences using a coupled tokenization scheme.
+This method takes pose triplet units (left hand, right hand, and upper-body with arms) as inputs, each tokenized into discrete codes [@van_den_Oord_2017NeuralDiscreteRepresentationLearning] that are then coupled together.
+Masked modeling is then applied, where any or all components of the triplet (left hand, right hand, or upper-body) may be masked, to learn hierarchical correlations among them.
+Unlike @hu2023SignBertPlus, BEST does not mask multi-frame pose sequences or individual joints.
+The authors validate their pre-training method on isolated sign recognition (ISR) tasks using MS-ASL [@dataset:joze2018ms], WLASL [@dataset:li2020word], SLR500 [@huang2019attention3DCNNsSLR], and NMFs-CSL [@hu2021NMFAwareSLR].
+Besides pose-to-gloss, they also experiment with video-to-gloss tasks via fusion with I3D [@carreira2017quo].
+Results on these datasets demonstrate state-of-the-art performance compared to previous methods and are comparable to those of SignBERT+ [@hu2023SignBertPlus].
+
## Annotation Tools

##### ELAN - EUDICO Linguistic Annotator
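To make the coupled tokenization described in the added BEST paragraph more concrete, here is a minimal sketch. It is not the authors' implementation: the codebooks are random rather than learned VQ codebooks, the joint counts and codebook size are invented, and only the tokenize-then-mask step is shown (no encoder or training objective).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, purely illustrative; the paper's joint sets and codebook sizes differ.
N_JOINTS = {"left_hand": 21, "right_hand": 21, "body": 9}  # joints per triplet component
CODEBOOK_SIZE = 64                                          # discrete codes per component
MASK_ID = CODEBOOK_SIZE                                     # reserved [MASK] token id

# Random (untrained) codebooks standing in for learned VQ codebooks, one per component.
codebooks = {part: rng.normal(size=(CODEBOOK_SIZE, n * 2))  # flattened (x, y) per joint
             for part, n in N_JOINTS.items()}


def tokenize(part: str, pose: np.ndarray) -> int:
    """Quantize one component's pose (n_joints, 2) to the index of its nearest codebook entry."""
    dists = np.linalg.norm(codebooks[part] - pose.reshape(-1), axis=1)
    return int(np.argmin(dists))


def tokenize_frame(frame: dict) -> dict:
    """Couple the per-component discrete codes into one triplet token for a frame."""
    return {part: tokenize(part, pose) for part, pose in frame.items()}


def mask_triplet(tokens: dict, p_mask: float = 0.5) -> dict:
    """Mask any or all components of a triplet token (left hand, right hand, or body)."""
    return {part: (MASK_ID if rng.random() < p_mask else code)
            for part, code in tokens.items()}


# A toy 4-frame pose sequence: each frame maps a component name to an (n_joints, 2) array.
sequence = [{part: rng.normal(size=(n, 2)) for part, n in N_JOINTS.items()}
            for _ in range(4)]

coupled = [tokenize_frame(frame) for frame in sequence]   # discrete triplet tokens per frame
corrupted = [mask_triplet(tokens) for tokens in coupled]  # inputs for masked modeling
print(coupled[0], "->", corrupted[0])
```

A BERT-style encoder would then be trained to recover the original codes of the masked components; SignBERT+, by contrast, masks joints, frames, or clips of the continuous pose sequence rather than components of a discrete triplet token.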

sitemap.xml (+1 -1)
@@ -8,7 +8,7 @@
<url>
<loc>https://sign-language-processing.github.io/</loc>
-<lastmod>2024-06-11T09:46:11+00:00</lastmod>
+<lastmod>2024-06-12T07:25:25+00:00</lastmod>
</url>