<p>SiMAX is a software application developed to transform textual input into 3D animated sign language representations. Utilizing a comprehensive database and the expertise of deaf sign language professionals, SiMAX ensures accurate translations of both written and spoken content. The process begins with the generation of a translation suggestion, which is subsequently reviewed and, if necessary, modified by deaf translators to ensure accuracy and cultural appropriateness. These translations are carried out by a customizable digital avatar that can be adapted to reflect the corporate identity or target audience of the user. This approach offers a cost-effective alternative to traditional sign language video production, as it eliminates the need for expensive film studios and complex video technology typically associated with such productions.</p>
</section>
<h5 id="image-and-video-generation-models">Image and Video Generation Models</h5>
<p>Most recently in the field of image and video generation, there have been notable advances in methods such as Style-Based Generator Architecture for Generative Adversarial Networks <span class="citation" data-cites="style-to-image:Karras2018ASG style-to-image:Karras2019stylegan2 style-to-image:Karras2021">(Karras, Laine, and Aila <a href="#ref-style-to-image:Karras2018ASG" role="doc-biblioref">2019</a>; Karras et al. <a href="#ref-style-to-image:Karras2019stylegan2" role="doc-biblioref">2020</a>, <a href="#ref-style-to-image:Karras2021" role="doc-biblioref">2021</a>)</span>, Variational Diffusion Models <span class="citation" data-cites="text-to-image:Kingma2021VariationalDM">(Kingma et al. <a href="#ref-text-to-image:Kingma2021VariationalDM" role="doc-biblioref">2021</a>)</span>, High-Resolution Image Synthesis with Latent Diffusion Models <span class="citation" data-cites="text-to-image:Rombach2021HighResolutionIS">(Rombach et al. <a href="#ref-text-to-image:Rombach2021HighResolutionIS" role="doc-biblioref">2021</a>)</span>, High Definition Video Generation with Diffusion Models <span class="citation" data-cites="text-to-video:Ho2022ImagenVH">(Ho et al. <a href="#ref-text-to-video:Ho2022ImagenVH" role="doc-biblioref">2022</a>)</span>, and High-Resolution Video Synthesis with Latent Diffusion Models <span class="citation" data-cites="text-to-video:blattmann2023videoldm">(Blattmann et al. <a href="#ref-text-to-video:blattmann2023videoldm" role="doc-biblioref">2023</a>)</span>. These methods have significantly improved image and video synthesis quality, providing stunningly realistic and visually appealing results.</p>
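<p>For illustration only, models of this family can be run through the Hugging Face <code>diffusers</code> library. The minimal sketch below assumes a CUDA device and uses the <code>stabilityai/stable-diffusion-2-1</code> checkpoint with an arbitrary prompt as stand-ins for the systems cited above; it is not the reference code of any of these papers.</p>
<pre><code class="language-python"># Minimal sketch: text-to-image synthesis with a latent diffusion model
# via the Hugging Face `diffusers` library (illustrative checkpoint and prompt).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# The prompt is an illustrative placeholder.
image = pipe("a person signing 'hello' in a bright studio", num_inference_steps=30).images[0]
image.save("sample.png")
</code></pre>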
<p>However, despite their remarkable progress in generating high-quality images and videos, these models sacrifice computational efficiency. The complexity of these algorithms often results in slow inference, making real-time applications challenging. On-device deployment of these models provides benefits such as lower server costs, offline functionality, and improved user privacy. While compute-aware optimizations, specifically targeting the hardware capabilities of different devices, could improve the inference latency of these models, <span class="citation" data-cites="Chen2023SpeedIA">Chen et al. (<a href="#ref-Chen2023SpeedIA" role="doc-biblioref">2023</a>)</span> found that even on top-of-the-line mobile devices such as the Samsung S23 Ultra or iPhone 14 Pro Max, such optimizations only decrease per-frame inference latency from around 23 seconds to around 12 seconds.</p>
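<p>To make the latency figures concrete, the following generic sketch measures average wall-clock time per generated frame. It is not the benchmarking protocol of Chen et al. (2023); the callable, frame count, and step count are placeholders.</p>
<pre><code class="language-python"># Generic sketch for measuring average per-frame generation latency.
import time

def mean_frame_latency(generate_frame, num_frames=5):
    """generate_frame is any callable that synthesizes one frame."""
    timings = []
    for _ in range(num_frames):
        start = time.perf_counter()
        generate_frame()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Example (using the pipeline from the sketch above):
# latency = mean_frame_latency(lambda: pipe("a signing person", num_inference_steps=20))
</code></pre>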
<p>ControlNet <span class="citation" data-cites="pose-to-image:zhang2023adding">(L. Zhang and Agrawala <a href="#ref-pose-to-image:zhang2023adding" role="doc-biblioref">2023</a>)</span> recently presented a neural network structure for controlling pretrained large diffusion models with additional input conditions. This approach enables end-to-end learning of task-specific conditions, even with a small training dataset. Training a ControlNet is as fast as fine-tuning a diffusion model and can be executed on personal devices or scaled to large amounts of data using powerful computation clusters. ControlNet has been demonstrated to augment large diffusion models like Stable Diffusion with conditional inputs such as edge maps, segmentation maps, and keypoints. One of the applications of ControlNet is pose-to-image translation control, which allows the generation of images based on pose information. Although this method has shown promising results, it still requires retraining the model and does not inherently support temporal coherency, which is important for tasks like sign language translation.</p>
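<p>As a rough illustration (not the authors' original code), publicly released ControlNet checkpoints can be attached to Stable Diffusion through the <code>diffusers</code> library. The OpenPose-conditioned checkpoint, base model, prompt, and local pose image path below are assumptions.</p>
<pre><code class="language-python"># Minimal sketch: pose-conditioned image generation with a pretrained
# ControlNet attached to Stable Diffusion via the `diffusers` library.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# OpenPose-conditioned ControlNet weights released alongside the paper.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose_image = load_image("pose_frame.png")  # assumed precomputed pose visualization
result = pipe(
    "a person signing in front of a plain background",
    image=pose_image,
    num_inference_steps=30,
).images[0]
result.save("controlled_frame.png")
</code></pre>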
<p>In the near future, we can expect many works on controlling video diffusion models directly from text for sign language translation. These models will likely generate visually appealing and realistic videos. However, they may still make mistakes and be limited to scenarios with more training data available. Developing models that can accurately generate sign language videos from text or pose information while maintaining visual quality and temporal coherency will be essential for advancing the field of sign language production.</p>
<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: the "previous gloss-free frameworks" that gongLLMsAreGood2024 cite are: Gloss Attention for Gloss-free Sign Language Translation (2023) and Gloss-free sign language translation: Improving from visual-language pretraining, 2023 aka GFSLT-VLP. Could be good to lead into it with explanations of those? -->
<p><span class="citation" data-cites="gongLLMsAreGood2024">Gong et al. (<a href="#ref-gongLLMsAreGood2024" role="doc-biblioref">2024</a>)</span> introduce SignLLM, a framework for gloss-free sign language translation that leverages the strengths of Large Language Models (LLMs). SignLLM converts sign videos into discrete and hierarchical representations compatible with LLMs through two modules: (1) the Vector-Quantized Visual Sign (VQ-Sign) module, which translates sign videos into discrete “character-level” tokens, and (2) the Codebook Reconstruction and Alignment (CRA) module, which restructures these tokens into “word-level” representations. During inference, the “word-level” tokens are projected into the LLM’s embedding space, and the LLM is then prompted for translation. The LLM itself can be taken “off the shelf” and does not need to be trained. In training, the VQ-Sign “character-level” module is trained with a context prediction task, the CRA “word-level” module with an optimal transport technique, and a sign-text alignment loss further enhances the semantic alignment between sign and text tokens. The framework achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014T <span class="citation" data-cites="cihan2018neural">(Camgöz et al. <a href="#ref-cihan2018neural" role="doc-biblioref">2018</a>)</span> and CSL-Daily <span class="citation" data-cites="dataset:huang2018video">(Huang et al. <a href="#ref-dataset:huang2018video" role="doc-biblioref">2018</a>)</span> datasets without relying on gloss annotations. <!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: c.f. SignLLM with https://github.com/sign-language-processing/sign-vq? --></p>
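<p>The PyTorch sketch below is purely conceptual and is not the SignLLM implementation: it only illustrates the two underlying ideas of quantizing continuous sign-video features against a learned codebook to obtain discrete tokens, and projecting those tokens into the embedding space of a frozen LLM. All module names, dimensions, and the nearest-neighbour lookup are assumptions for illustration.</p>
<pre><code class="language-python"># Conceptual sketch of VQ-style discretization of sign-video features and
# projection into an LLM embedding space (not the SignLLM reference code).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=512):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # "character-level" codes

    def forward(self, features):  # features: (batch, time, dim)
        flat = features.reshape(-1, features.size(-1))
        # Nearest codebook entry by Euclidean distance.
        dists = torch.cdist(flat, self.codebook.weight)
        codes = dists.argmin(dim=-1).view(features.size(0), features.size(1))
        quantized = self.codebook(codes)
        return codes, quantized

class SignToLLMProjector(nn.Module):
    def __init__(self, dim=512, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(dim, llm_dim)  # maps sign tokens into the LLM space

    def forward(self, quantized):
        return self.proj(quantized)

# Usage with random stand-in features for a 3-second clip at 25 fps.
features = torch.randn(1, 75, 512)
codes, quantized = VectorQuantizer()(features)
llm_inputs = SignToLLMProjector()(quantized)  # would be fed to a frozen LLM as embeddings
</code></pre>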
<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: YoutubeASL explanation would fit nicely here before Rust et al 2024. They don't just do data IIRC. -->
<p><span class="citation" data-cites="rust2024PrivacyAwareSign">Rust et al. (<a href="#ref-rust2024PrivacyAwareSign" role="doc-biblioref">2024</a>)</span> introduce a two-stage privacy-aware method for sign language translation (SLT) at scale, termed Self-Supervised Video Pretraining for Sign Language Translation (SSVP-SLT). The first stage involves self-supervised pretraining of a Hiera vision transformer <span class="citation" data-cites="ryali2023HieraVisionTransformer">(Ryali et al. <a href="#ref-ryali2023HieraVisionTransformer" role="doc-biblioref">2023</a>)</span> on large unannotated video datasets <span class="citation" data-cites="dataset:duarte2020how2sign dataset:uthus2023YoutubeASL">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>; Uthus, Tanzer, and Georg <a href="#ref-dataset:uthus2023YoutubeASL" role="doc-biblioref">2023</a>)</span>. In the second stage, the vision model’s outputs are fed into a multilingual language model <span class="citation" data-cites="raffel2020T5Transformer">(Raffel et al. <a href="#ref-raffel2020T5Transformer" role="doc-biblioref">2020</a>)</span> for finetuning on the How2Sign dataset <span class="citation" data-cites="dataset:duarte2020how2sign">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>)</span>. To mitigate privacy risks, the framework employs facial blurring during pretraining. They find that while pretraining with blurring hurts performance, some of the lost performance can be recovered by finetuning with unblurred data. SSVP-SLT achieves state-of-the-art performance on How2Sign <span class="citation" data-cites="dataset:duarte2020how2sign">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>)</span>. They conclude that SLT models can be pretrained in a privacy-aware manner without sacrificing too much performance. Additionally, the authors release DailyMoth-70h, a new 70-hour ASL dataset from <a href="https://www.dailymoth.com/">The Daily Moth</a>.</p>
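<p>As a hedged sketch of the second stage only (not the released SSVP-SLT code), features from a frozen video encoder can be projected and fed to a sequence-to-sequence language model through its <code>inputs_embeds</code> interface; the <code>google/mt5-small</code> checkpoint, feature dimensions, projection layer, and target sentence below are assumptions.</p>
<pre><code class="language-python"># Sketch of stage two: feeding (frozen) video-encoder features into a
# multilingual seq2seq language model for sign language translation finetuning.
import torch
import torch.nn as nn
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

# Stand-in for features produced by a self-supervised video encoder
# (e.g. one clip of 64 frame-level feature vectors with 768 dimensions).
video_features = torch.randn(1, 64, 768)
project = nn.Linear(768, model.config.d_model)  # maps features into the LM space

labels = tokenizer("how are you doing today", return_tensors="pt").input_ids
outputs = model(inputs_embeds=project(video_features), labels=labels)
outputs.loss.backward()  # trains the projection (and, if left unfrozen, the LM)
</code></pre>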
One notable dictionary, SpreadTheSign\footnote{\url{https://www.spreadthesign.com/}} is a parallel dictionary containing around 25,000 words with up to 42 different spoken-signed language pairs and more than 600,000 videos in total. Unfortunately, while dictionaries may help create lexical rules between languages, they do not demonstrate the grammar or the usage of signs in context.
###### Fingerspelling corpora {-}
usually consist of videos of words borrowed from spoken languages that are signed letter-by-letter. They can be synthetically created [@dataset:dreuw2006modeling] or mined from online resources [@dataset:fs18slt;@dataset:fs18iccv]. However, they only capture one aspect of signed languages.
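As a toy sketch of how such corpora can be created synthetically, a word can be mapped to a sequence of per-letter clips; the directory layout and file naming below are hypothetical.

```python
# Toy sketch: compose a fingerspelled sequence from per-letter video clips.
# The directory layout ("clips/A.mp4", ...) is a hypothetical convention.
def fingerspelling_clips(word, clip_dir="clips"):
    letters = [ch.upper() for ch in word if ch.isalpha()]
    return [f"{clip_dir}/{letter}.mp4" for letter in letters]

print(fingerspelling_clips("hello"))
# ['clips/H.mp4', 'clips/E.mp4', 'clips/L.mp4', 'clips/L.mp4', 'clips/O.mp4']
```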
###### Isolated sign corpora {-}
are collections of annotated single signs. They are synthesized [@dataset:ebling2018smile;@dataset:huang2018video;@dataset:sincan2020autsl;@dataset:hassan-etal-2020-isolated] or mined from online resources [@dataset:joze2018ms;@dataset:li2020word], and can be used for isolated sign language recognition or contrastive analysis of minimal signing pairs [@dataset:imashev2020dataset]. However, like dictionaries, they do not describe relations between signs, nor do they capture coarticulation during signing, and are often limited in vocabulary size (20–1,000 signs).