
Commit 45050e0

Deploying to gh-pages from @ 1525580 🚀
1 parent 81d976c · commit 45050e0

File tree: index.html · index.md · sitemap.xml

3 files changed (+7 -7 lines)


index.html (+3 -3)
@@ -325,7 +325,7 @@ <h6 class="unnumbered">SiMAX <span class="citation" data-cites="SiMAX_2020">(<sp
<p>is a software application developed to transform textual input into 3D animated sign language representations. Utilizing a comprehensive database and the expertise of deaf sign language professionals, SiMAX ensures accurate translations of both written and spoken content. The process begins with the generation of a translation suggestion, which is subsequently reviewed and, if necessary, modified by deaf translators to ensure accuracy and cultural appropriateness. These translations are carried out by a customizable digital avatar that can be adapted to reflect the corporate identity or target audience of the user. This approach offers a cost-effective alternative to traditional sign language video production, as it eliminates the need for expensive film studios and complex video technology typically associated with such productions.</p>
</section>
<h5 id="image-and-video-generation-models">Image and Video Generation Models</h5>
-<p>Most recently in the field of image and video generation, there have been notable advances in methods such as Style-Based Generator Architecture for Generative Adversarial Networks <span class="citation" data-cites="style-to-image:Karras2018ASG">(Karras, Laine, and Aila <a href="#ref-style-to-image:Karras2018ASG" role="doc-biblioref">2019</a>, @style–to–image:Karras2019stylegan2, @style–to–image:Karras2021)</span>, Variational Diffusion Models <span class="citation" data-cites="text-to-image:Kingma2021VariationalDM">(Kingma et al. <a href="#ref-text-to-image:Kingma2021VariationalDM" role="doc-biblioref">2021</a>)</span>, High-Resolution Image Synthesis with Latent Diffusion Models <span class="citation" data-cites="text-to-image:Rombach2021HighResolutionIS">(Rombach et al. <a href="#ref-text-to-image:Rombach2021HighResolutionIS" role="doc-biblioref">2021</a>)</span>, High Definition Video Generation with Diffusion Models <span class="citation" data-cites="text-to-video:Ho2022ImagenVH">(Ho et al. <a href="#ref-text-to-video:Ho2022ImagenVH" role="doc-biblioref">2022</a>)</span>, and High-Resolution Video Synthesis with Latent Diffusion Models <span class="citation" data-cites="text-to-video:blattmann2023videoldm">(Blattmann et al. <a href="#ref-text-to-video:blattmann2023videoldm" role="doc-biblioref">2023</a>)</span>. These methods have significantly improved image and video synthesis quality, providing stunningly realistic and visually appealing results.</p>
+<p>Most recently in the field of image and video generation, there have been notable advances in methods such as Style-Based Generator Architecture for Generative Adversarial Networks <span class="citation" data-cites="style-to-image:Karras2018ASG style-to-image:Karras2019stylegan2 style-to-image:Karras2021">(Karras, Laine, and Aila <a href="#ref-style-to-image:Karras2018ASG" role="doc-biblioref">2019</a>; Karras et al. <a href="#ref-style-to-image:Karras2019stylegan2" role="doc-biblioref">2020</a>, <a href="#ref-style-to-image:Karras2021" role="doc-biblioref">2021</a>)</span>, Variational Diffusion Models <span class="citation" data-cites="text-to-image:Kingma2021VariationalDM">(Kingma et al. <a href="#ref-text-to-image:Kingma2021VariationalDM" role="doc-biblioref">2021</a>)</span>, High-Resolution Image Synthesis with Latent Diffusion Models <span class="citation" data-cites="text-to-image:Rombach2021HighResolutionIS">(Rombach et al. <a href="#ref-text-to-image:Rombach2021HighResolutionIS" role="doc-biblioref">2021</a>)</span>, High Definition Video Generation with Diffusion Models <span class="citation" data-cites="text-to-video:Ho2022ImagenVH">(Ho et al. <a href="#ref-text-to-video:Ho2022ImagenVH" role="doc-biblioref">2022</a>)</span>, and High-Resolution Video Synthesis with Latent Diffusion Models <span class="citation" data-cites="text-to-video:blattmann2023videoldm">(Blattmann et al. <a href="#ref-text-to-video:blattmann2023videoldm" role="doc-biblioref">2023</a>)</span>. These methods have significantly improved image and video synthesis quality, providing stunningly realistic and visually appealing results.</p>
<p>However, despite their remarkable progress in generating high-quality images and videos, these models trade off computational efficiency. The complexity of these algorithms often results in slower inference times, making real-time applications challenging. On-device deployment of these models provides benefits such as lower server costs, offline functionality, and improved user privacy. While compute-aware optimizations, specifically targeting hardware capabilities of different devices, could improve the inference latency of these models, <span class="citation" data-cites="Chen2023SpeedIA">Chen et al. (<a href="#ref-Chen2023SpeedIA" role="doc-biblioref">2023</a>)</span> found that optimizing such models on top-of-the-line mobile devices such as the Samsung S23 Ultra or iPhone 14 Pro Max can decrease per-frame inference latency from around 23 seconds to around 12 seconds.</p>
<p>ControlNet <span class="citation" data-cites="pose-to-image:zhang2023adding">(L. Zhang and Agrawala <a href="#ref-pose-to-image:zhang2023adding" role="doc-biblioref">2023</a>)</span> recently presented a neural network structure for controlling pretrained large diffusion models with additional input conditions. This approach enables end-to-end learning of task-specific conditions, even with a small training dataset. Training a ControlNet is as fast as fine-tuning a diffusion model and can be executed on personal devices or scaled to large amounts of data using powerful computation clusters. ControlNet has been demonstrated to augment large diffusion models like Stable Diffusion with conditional inputs such as edge maps, segmentation maps, and keypoints. One of the applications of ControlNet is pose-to-image translation control, which allows the generation of images based on pose information. Although this method has shown promising results, it still requires retraining the model and does not inherently support temporal coherency, which is important for tasks like sign language translation.</p>
<p>In the near future, we can expect many works on controlling video diffusion models directly from text for sign language translation. These models will likely generate visually appealing and realistic videos. However, they may still make mistakes and be limited to scenarios with more training data available. Developing models that can accurately generate sign language videos from text or pose information while maintaining visual quality and temporal coherency will be essential for advancing the field of sign language production.</p>
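To make the pose-conditioning idea in the hunk above concrete, the sketch below wires a pretrained OpenPose ControlNet into Stable Diffusion via the Hugging Face diffusers library. It is a minimal sketch under assumptions: the checkpoint names, prompt, and local pose image are illustrative, not taken from the paper or from this commit, and because each frame is generated independently it also shows where the missing temporal coherency becomes a problem.

```python
# Hedged sketch: pose-conditioned image generation with a pretrained ControlNet.
# Checkpoint names, the prompt, and the local pose image are assumptions for illustration.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load a ControlNet trained on OpenPose keypoint renderings and attach it to Stable Diffusion.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose = load_image("pose_skeleton.png")  # hypothetical rendered keypoint image
frame = pipe(
    "a person signing against a plain background",  # appearance comes from the text prompt
    image=pose,                                     # pose comes from the ControlNet condition
    num_inference_steps=30,
).images[0]
frame.save("generated_frame.png")  # one frame; consecutive frames are not temporally coherent
```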
@@ -382,7 +382,7 @@ <h4 id="video-to-text">Video-to-Text</h4>
<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: the "previous gloss-free frameworks" that gongLLMsAreGood2024 cite are: Gloss Attention for Gloss-free Sign Language Translation (2023) and Gloss-free sign language translation: Improving from visual-language pretraining, 2023 aka GFSLT-VLP. Could be good to lead into it with explanations of those? -->
<p><span class="citation" data-cites="gongLLMsAreGood2024">Gong et al. (<a href="#ref-gongLLMsAreGood2024" role="doc-biblioref">2024</a>)</span> introduce SignLLM, a framework for gloss-free sign language translation that leverages the strengths of Large Language Models (LLMs). SignLLM converts sign videos into discrete and hierarchical representations compatible with LLMs through two modules: (1) The Vector-Quantized Visual Sign (VQ-Sign) module, which translates sign videos into discrete “character-level” tokens, and (2) the Codebook Reconstruction and Alignment (CRA) module, which restructures these tokens into “word-level” representations. During inference, the “word-level” tokens are projected into the LLM’s embedding space, which is then prompted for translation. The LLM itself can be taken “off the shelf” and does not need to be trained. In training, the VQ-Sign “character-level” module is trained with a context prediction task, the CRA “word-level” module with an optimal transport technique, and a sign-text alignment loss further enhances the semantic alignment between sign and text tokens. The framework achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014T <span class="citation" data-cites="cihan2018neural">(Camgöz et al. <a href="#ref-cihan2018neural" role="doc-biblioref">2018</a>)</span> and CSL-Daily <span class="citation" data-cites="dataset:huang2018video">(Huang et al. <a href="#ref-dataset:huang2018video" role="doc-biblioref">2018</a>)</span> datasets without relying on gloss annotations. <!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: c.f. SignLLM with https://github.com/sign-language-processing/sign-vq? --></p>
<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: YoutubeASL explanation would fit nicely here before Rust et al 2024. They don't just do data IIRC. -->
-<p><span class="citation" data-cites="rust2024PrivacyAwareSign">Rust et al. (<a href="#ref-rust2024PrivacyAwareSign" role="doc-biblioref">2024</a>)</span> introduce a two-stage privacy-aware method for sign language translation (SLT) at scale, termed Self-Supervised Video Pretraining for Sign Language Translation (SSVP-SLT). The first stage involves self-supervised pretraining of a Hiera vision transformer <span class="citation" data-cites="ryali2023HieraVisionTransformer">(Ryali et al. <a href="#ref-ryali2023HieraVisionTransformer" role="doc-biblioref">2023</a>)</span> on large unannotated video datasets <span class="citation" data-cites="dataset:duarte2020how2sign">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>, @dataset:uthus2023YoutubeASL)</span>. In the second stage, the vision model’s outputs are fed into a multilingual language model <span class="citation" data-cites="raffel2020T5Transformer">(Raffel et al. <a href="#ref-raffel2020T5Transformer" role="doc-biblioref">2020</a>)</span> for finetuning on the How2Sign dataset <span class="citation" data-cites="dataset:duarte2020how2sign">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>)</span>. To mitigate privacy risks, the framework employs facial blurring during pretraining. They find that while pretraining with blurring hurts performance, some can be recovered when finetuning with unblurred data. SSVP-SLT achieves state-of-the-art performance on How2Sign <span class="citation" data-cites="dataset:duarte2020how2sign">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>)</span>. They conclude that SLT models can be pretrained in a privacy-aware manner without sacrificing too much performance. Additionally, the authors release DailyMoth-70h, a new 70-hour ASL dataset from <a href="https://www.dailymoth.com/">The Daily Moth</a>.</p>
+<p><span class="citation" data-cites="rust2024PrivacyAwareSign">Rust et al. (<a href="#ref-rust2024PrivacyAwareSign" role="doc-biblioref">2024</a>)</span> introduce a two-stage privacy-aware method for sign language translation (SLT) at scale, termed Self-Supervised Video Pretraining for Sign Language Translation (SSVP-SLT). The first stage involves self-supervised pretraining of a Hiera vision transformer <span class="citation" data-cites="ryali2023HieraVisionTransformer">(Ryali et al. <a href="#ref-ryali2023HieraVisionTransformer" role="doc-biblioref">2023</a>)</span> on large unannotated video datasets <span class="citation" data-cites="dataset:duarte2020how2sign dataset:uthus2023YoutubeASL">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>; Uthus, Tanzer, and Georg <a href="#ref-dataset:uthus2023YoutubeASL" role="doc-biblioref">2023</a>)</span>. In the second stage, the vision model’s outputs are fed into a multilingual language model <span class="citation" data-cites="raffel2020T5Transformer">(Raffel et al. <a href="#ref-raffel2020T5Transformer" role="doc-biblioref">2020</a>)</span> for finetuning on the How2Sign dataset <span class="citation" data-cites="dataset:duarte2020how2sign">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>)</span>. To mitigate privacy risks, the framework employs facial blurring during pretraining. They find that while pretraining with blurring hurts performance, some can be recovered when finetuning with unblurred data. SSVP-SLT achieves state-of-the-art performance on How2Sign <span class="citation" data-cites="dataset:duarte2020how2sign">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>)</span>. They conclude that SLT models can be pretrained in a privacy-aware manner without sacrificing too much performance. Additionally, the authors release DailyMoth-70h, a new 70-hour ASL dataset from <a href="https://www.dailymoth.com/">The Daily Moth</a>.</p>
<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: BLEURT explanation -->
<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: add DailyMoth to datasets list. Table 8 has stats: 497 videos, 70 hours, 1 signer, vocabulary of words 19 740, segmented video clips, -->
<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: AFRISIGN (Shester and Mathias at AfricaNLP, ICLR 2023 workshop) -->
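As a rough illustration of the "discretization" step described in the SignLLM paragraph above, the toy snippet below replaces continuous per-frame sign features with the indices of their nearest codebook vectors, which is the basic operation behind a VQ-Sign-style module. The shapes and the random codebook are invented for illustration; this is not the authors' implementation.

```python
# Toy sketch of vector-quantizing continuous sign-video features into discrete tokens.
# All dimensions and the codebook itself are invented; a real system learns the codebook.
import torch

torch.manual_seed(0)
num_codes, feat_dim = 1024, 256
codebook = torch.randn(num_codes, feat_dim)     # stand-in for a learned codebook
frame_features = torch.randn(16, feat_dim)      # e.g. 16 frames encoded by a visual backbone

distances = torch.cdist(frame_features, codebook)  # (16, 1024) pairwise L2 distances
tokens = distances.argmin(dim=-1)                  # (16,) "character-level" token ids
print(tokens.tolist())                             # a discrete sequence an LLM could consume
```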
@@ -454,7 +454,7 @@ <h6 class="unnumbered">Bilingual dictionaries</h6>
</section>
<section id="fingerspelling-corpora" class="unnumbered">
<h6 class="unnumbered">Fingerspelling corpora</h6>
-<p>usually consist of videos of words borrowed from spoken languages that are signed letter-by-letter. They can be synthetically created <span class="citation" data-cites="dataset:dreuw2006modeling">(Dreuw et al. <a href="#ref-dataset:dreuw2006modeling" role="doc-biblioref">2006</a>)</span> or mined from online resources <span class="citation" data-cites="dataset:fs18slt">(Shi et al. <a href="#ref-dataset:fs18slt" role="doc-biblioref">2018</a>, @dataset:fs18iccv)</span>. However, they only capture one aspect of signed languages.</p>
+<p>usually consist of videos of words borrowed from spoken languages that are signed letter-by-letter. They can be synthetically created <span class="citation" data-cites="dataset:dreuw2006modeling">(Dreuw et al. <a href="#ref-dataset:dreuw2006modeling" role="doc-biblioref">2006</a>)</span> or mined from online resources <span class="citation" data-cites="dataset:fs18slt dataset:fs18iccv">(Shi et al. <a href="#ref-dataset:fs18slt" role="doc-biblioref">2018</a>, <a href="#ref-dataset:fs18iccv" role="doc-biblioref">2019</a>)</span>. However, they only capture one aspect of signed languages.</p>
</section>
<section id="isolated-sign-corpora" class="unnumbered">
<h6 class="unnumbered">Isolated sign corpora</h6>

index.md (+3 -3)
@@ -498,7 +498,7 @@ the need for expensive film studios and complex video technology typically assoc

Most recently in the field of image and video generation,
there have been notable advances in methods such as
-Style-Based Generator Architecture for Generative Adversarial Networks [@style-to-image:Karras2018ASG,@style-to-image:Karras2019stylegan2,@style-to-image:Karras2021],
+Style-Based Generator Architecture for Generative Adversarial Networks [@style-to-image:Karras2018ASG;@style-to-image:Karras2019stylegan2;@style-to-image:Karras2021],
Variational Diffusion Models [@text-to-image:Kingma2021VariationalDM],
High-Resolution Image Synthesis with Latent Diffusion Models [@text-to-image:Rombach2021HighResolutionIS],
High Definition Video Generation with Diffusion Models [@text-to-video:Ho2022ImagenVH], and
@@ -775,7 +775,7 @@ The framework achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014
<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: YoutubeASL explanation would fit nicely here before Rust et al 2024. They don't just do data IIRC. -->

@rust2024PrivacyAwareSign introduce a two-stage privacy-aware method for sign language translation (SLT) at scale, termed Self-Supervised Video Pretraining for Sign Language Translation (SSVP-SLT).
-The first stage involves self-supervised pretraining of a Hiera vision transformer [@ryali2023HieraVisionTransformer] on large unannotated video datasets [@dataset:duarte2020how2sign, @dataset:uthus2023YoutubeASL].
+The first stage involves self-supervised pretraining of a Hiera vision transformer [@ryali2023HieraVisionTransformer] on large unannotated video datasets [@dataset:duarte2020how2sign;@dataset:uthus2023YoutubeASL].
In the second stage, the vision model's outputs are fed into a multilingual language model [@raffel2020T5Transformer] for finetuning on the How2Sign dataset [@dataset:duarte2020how2sign].
To mitigate privacy risks, the framework employs facial blurring during pretraining.
They find that while pretraining with blurring hurts performance, some can be recovered when finetuning with unblurred data.
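The facial blurring mentioned in this hunk can be approximated per frame with off-the-shelf tools; the sketch below uses OpenCV's bundled Haar cascade face detector and a Gaussian blur. It is a rough stand-in rather than the SSVP-SLT pipeline, and the file names and blur kernel size are arbitrary assumptions. Note that blurring also removes non-manual facial cues, which is consistent with the reported performance drop when pretraining on blurred video.

```python
# Rough sketch of privacy-oriented face blurring for a single video frame.
# Not the SSVP-SLT implementation; detector choice, file names, and blur strength are assumptions.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

frame = cv2.imread("frame.png")                 # hypothetical extracted video frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
    face = frame[y:y + h, x:x + w]
    frame[y:y + h, x:x + w] = cv2.GaussianBlur(face, (51, 51), 0)  # heavy blur over the face box

cv2.imwrite("frame_blurred.png", frame)
```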
@@ -1027,7 +1027,7 @@ for signed language [@dataset:mesch2012meaning;@fenlon2015building;@crasborn2016
One notable dictionary, SpreadTheSign\footnote{\url{https://www.spreadthesign.com/}} is a parallel dictionary containing around 25,000 words with up to 42 different spoken-signed language pairs and more than 600,000 videos in total. Unfortunately, while dictionaries may help create lexical rules between languages, they do not demonstrate the grammar or the usage of signs in context.

###### Fingerspelling corpora {-}
-usually consist of videos of words borrowed from spoken languages that are signed letter-by-letter. They can be synthetically created [@dataset:dreuw2006modeling] or mined from online resources [@dataset:fs18slt,@dataset:fs18iccv]. However, they only capture one aspect of signed languages.
+usually consist of videos of words borrowed from spoken languages that are signed letter-by-letter. They can be synthetically created [@dataset:dreuw2006modeling] or mined from online resources [@dataset:fs18slt;@dataset:fs18iccv]. However, they only capture one aspect of signed languages.

###### Isolated sign corpora {-}
are collections of annotated single signs. They are synthesized [@dataset:ebling2018smile;@dataset:huang2018video;@dataset:sincan2020autsl;@dataset:hassan-etal-2020-isolated] or mined from online resources [@dataset:joze2018ms;@dataset:li2020word], and can be used for isolated sign language recognition or contrastive analysis of minimal signing pairs [@dataset:imashev2020dataset]. However, like dictionaries, they do not describe relations between signs, nor do they capture coarticulation during the signing, and are often limited in vocabulary size (20-1000 signs).
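To illustrate why the fingerspelling corpora described in this hunk capture only one narrow aspect of signed languages, the toy function below maps a borrowed spoken-language word to a flat sequence of per-letter handshape labels; the FS-<letter> labeling scheme is invented for this sketch and carries no grammar, prosody, or sign-internal structure.

```python
# Toy sketch: fingerspelling reduces a borrowed word to a letter-by-letter label sequence.
# The "FS-<letter>" labels are an invented convention for illustration only.
def fingerspell(word: str) -> list[str]:
    return [f"FS-{ch}" for ch in word.upper() if ch.isalpha()]

print(fingerspell("Moth"))  # ['FS-M', 'FS-O', 'FS-T', 'FS-H']
```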

sitemap.xml (+1 -1)
@@ -8,7 +8,7 @@

<url>
<loc>https://sign-language-processing.github.io/</loc>
-<lastmod>2024-06-06T21:55:27+00:00</lastmod>
+<lastmod>2024-06-07T19:11:55+00:00</lastmod>
</url>