<p>SiMAX is a software application developed to transform textual input into 3D animated sign language representations. Utilizing a comprehensive database and the expertise of deaf sign language professionals, SiMAX ensures accurate translations of both written and spoken content. The process begins with the generation of a translation suggestion, which is subsequently reviewed and, if necessary, modified by deaf translators to ensure accuracy and cultural appropriateness. These translations are carried out by a customizable digital avatar that can be adapted to reflect the corporate identity or target audience of the user. This approach offers a cost-effective alternative to traditional sign language video production, as it eliminates the need for expensive film studios and complex video technology typically associated with such productions.</p>
</section>
<h5 id="image-and-video-generation-models">Image and Video Generation Models</h5>
<p>Most recently in the field of image and video generation, there have been notable advances in methods such as Style-Based Generator Architecture for Generative Adversarial Networks <span class="citation" data-cites="style-to-image:Karras2018ASG style-to-image:Karras2019stylegan2 style-to-image:Karras2021">(Karras, Laine, and Aila <a href="#ref-style-to-image:Karras2018ASG" role="doc-biblioref">2019</a>; Karras et al. <a href="#ref-style-to-image:Karras2019stylegan2" role="doc-biblioref">2020</a>, <a href="#ref-style-to-image:Karras2021" role="doc-biblioref">2021</a>)</span>, Variational Diffusion Models <span class="citation" data-cites="text-to-image:Kingma2021VariationalDM">(Kingma et al. <a href="#ref-text-to-image:Kingma2021VariationalDM" role="doc-biblioref">2021</a>)</span>, High-Resolution Image Synthesis with Latent Diffusion Models <span class="citation" data-cites="text-to-image:Rombach2021HighResolutionIS">(Rombach et al. <a href="#ref-text-to-image:Rombach2021HighResolutionIS" role="doc-biblioref">2021</a>)</span>, High Definition Video Generation with Diffusion Models <span class="citation" data-cites="text-to-video:Ho2022ImagenVH">(Ho et al. <a href="#ref-text-to-video:Ho2022ImagenVH" role="doc-biblioref">2022</a>)</span>, and High-Resolution Video Synthesis with Latent Diffusion Models <span class="citation" data-cites="text-to-video:blattmann2023videoldm">(Blattmann et al. <a href="#ref-text-to-video:blattmann2023videoldm" role="doc-biblioref">2023</a>)</span>. These methods have significantly improved image and video synthesis quality, providing stunningly realistic and visually appealing results.</p>
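<p>For illustration only, models of this family can be run through the Hugging Face <code>diffusers</code> library. The minimal sketch below assumes a CUDA device and uses the <code>stabilityai/stable-diffusion-2-1</code> checkpoint with an arbitrary prompt as stand-ins for the systems cited above; it is not the reference code of any of these papers.</p>
<pre><code class="language-python"># Minimal sketch: text-to-image synthesis with a latent diffusion model
# via the Hugging Face `diffusers` library (illustrative checkpoint and prompt).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# The prompt is an illustrative placeholder.
image = pipe("a person signing 'hello' in a bright studio", num_inference_steps=30).images[0]
image.save("sample.png")
</code></pre>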
<p>However, despite their remarkable progress in generating high-quality images and videos, these models sacrifice computational efficiency. The complexity of these algorithms often results in slow inference, making real-time applications challenging. On-device deployment of these models provides benefits such as lower server costs, offline functionality, and improved user privacy. While compute-aware optimizations, specifically targeting the hardware capabilities of different devices, could improve the inference latency of these models, <span class="citation" data-cites="Chen2023SpeedIA">Chen et al. (<a href="#ref-Chen2023SpeedIA" role="doc-biblioref">2023</a>)</span> found that even on top-of-the-line mobile devices such as the Samsung S23 Ultra or iPhone 14 Pro Max, such optimizations only decrease per-frame inference latency from around 23 seconds to around 12 seconds.</p>
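<p>To make the latency figures concrete, the following generic sketch measures average wall-clock time per generated frame. It is not the benchmarking protocol of Chen et al. (2023); the callable, frame count, and step count are placeholders.</p>
<pre><code class="language-python"># Generic sketch for measuring average per-frame generation latency.
import time

def mean_frame_latency(generate_frame, num_frames=5):
    """generate_frame is any callable that synthesizes one frame."""
    timings = []
    for _ in range(num_frames):
        start = time.perf_counter()
        generate_frame()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Example (using the pipeline from the sketch above):
# latency = mean_frame_latency(lambda: pipe("a signing person", num_inference_steps=20))
</code></pre>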
<p>ControlNet <span class="citation" data-cites="pose-to-image:zhang2023adding">(L. Zhang and Agrawala <a href="#ref-pose-to-image:zhang2023adding" role="doc-biblioref">2023</a>)</span> recently presented a neural network structure for controlling pretrained large diffusion models with additional input conditions. This approach enables end-to-end learning of task-specific conditions, even with a small training dataset. Training a ControlNet is as fast as fine-tuning a diffusion model and can be executed on personal devices or scaled to large amounts of data using powerful computation clusters. ControlNet has been demonstrated to augment large diffusion models like Stable Diffusion with conditional inputs such as edge maps, segmentation maps, and keypoints. One of the applications of ControlNet is pose-to-image translation control, which allows the generation of images based on pose information. Although this method has shown promising results, it still requires retraining the model and does not inherently support temporal coherency, which is important for tasks like sign language translation.</p>
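<p>As a rough illustration (not the authors' original code), publicly released ControlNet checkpoints can be attached to Stable Diffusion through the <code>diffusers</code> library. The OpenPose-conditioned checkpoint, base model, prompt, and local pose image path below are assumptions.</p>
<pre><code class="language-python"># Minimal sketch: pose-conditioned image generation with a pretrained
# ControlNet attached to Stable Diffusion via the `diffusers` library.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# OpenPose-conditioned ControlNet weights released alongside the paper.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

pose_image = load_image("pose_frame.png")  # assumed precomputed pose visualization
result = pipe(
    "a person signing in front of a plain background",
    image=pose_image,
    num_inference_steps=30,
).images[0]
result.save("controlled_frame.png")
</code></pre>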
<p>In the near future, we can expect many works on controlling video diffusion models directly from text for sign language translation. These models will likely generate visually appealing and realistic videos. However, they may still make mistakes and be limited to scenarios with more training data available. Developing models that can accurately generate sign language videos from text or pose information while maintaining visual quality and temporal coherency will be essential for advancing the field of sign language production.</p>
<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: the "previous gloss-free frameworks" that gongLLMsAreGood2024 cite are: Gloss Attention for Gloss-free Sign Language Translation (2023) and Gloss-free sign language translation: Improving from visual-language pretraining, 2023 aka GFSLT-VLP. Could be good to lead into it with explanations of those? -->
<p><span class="citation" data-cites="gongLLMsAreGood2024">Gong et al. (<a href="#ref-gongLLMsAreGood2024" role="doc-biblioref">2024</a>)</span> introduce SignLLM, a framework for gloss-free sign language translation that leverages the strengths of Large Language Models (LLMs). SignLLM converts sign videos into discrete and hierarchical representations compatible with LLMs through two modules: (1) the Vector-Quantized Visual Sign (VQ-Sign) module, which translates sign videos into discrete “character-level” tokens, and (2) the Codebook Reconstruction and Alignment (CRA) module, which restructures these tokens into “word-level” representations. During inference, the “word-level” tokens are projected into the LLM’s embedding space, and the LLM is then prompted for translation. The LLM itself can be taken “off the shelf” and does not need to be trained. In training, the VQ-Sign “character-level” module is trained with a context prediction task, the CRA “word-level” module with an optimal transport technique, and a sign-text alignment loss further enhances the semantic alignment between sign and text tokens. The framework achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014T <span class="citation" data-cites="cihan2018neural">(Camgöz et al. <a href="#ref-cihan2018neural" role="doc-biblioref">2018</a>)</span> and CSL-Daily <span class="citation" data-cites="dataset:huang2018video">(Huang et al. <a href="#ref-dataset:huang2018video" role="doc-biblioref">2018</a>)</span> datasets without relying on gloss annotations. <!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: c.f. SignLLM with https://github.com/sign-language-processing/sign-vq? --></p>
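<p>The PyTorch sketch below is purely conceptual and is not the SignLLM implementation: it only illustrates the two underlying ideas of quantizing continuous sign-video features against a learned codebook to obtain discrete tokens, and projecting those tokens into the embedding space of a frozen LLM. All module names, dimensions, and the nearest-neighbour lookup are assumptions for illustration.</p>
<pre><code class="language-python"># Conceptual sketch of VQ-style discretization of sign-video features and
# projection into an LLM embedding space (not the SignLLM reference code).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, dim=512):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # "character-level" codes

    def forward(self, features):  # features: (batch, time, dim)
        flat = features.reshape(-1, features.size(-1))
        # Nearest codebook entry by Euclidean distance.
        dists = torch.cdist(flat, self.codebook.weight)
        codes = dists.argmin(dim=-1).view(features.size(0), features.size(1))
        quantized = self.codebook(codes)
        return codes, quantized

class SignToLLMProjector(nn.Module):
    def __init__(self, dim=512, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(dim, llm_dim)  # maps sign tokens into the LLM space

    def forward(self, quantized):
        return self.proj(quantized)

# Usage with random stand-in features for a 3-second clip at 25 fps.
features = torch.randn(1, 75, 512)
codes, quantized = VectorQuantizer()(features)
llm_inputs = SignToLLMProjector()(quantized)  # would be fed to a frozen LLM as embeddings
</code></pre>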
<!-- <span style="background-color: red; color: white; padding: 0 2px !important;">**TODO**</span>: YoutubeASL explanation would fit nicely here before Rust et al 2024. They don't just do data IIRC. -->
<p><span class="citation" data-cites="rust2024PrivacyAwareSign">Rust et al. (<a href="#ref-rust2024PrivacyAwareSign" role="doc-biblioref">2024</a>)</span> introduce a two-stage privacy-aware method for sign language translation (SLT) at scale, termed Self-Supervised Video Pretraining for Sign Language Translation (SSVP-SLT). The first stage involves self-supervised pretraining of a Hiera vision transformer <span class="citation" data-cites="ryali2023HieraVisionTransformer">(Ryali et al. <a href="#ref-ryali2023HieraVisionTransformer" role="doc-biblioref">2023</a>)</span> on large unannotated video datasets <span class="citation" data-cites="dataset:duarte2020how2sign dataset:uthus2023YoutubeASL">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>; Uthus, Tanzer, and Georg <a href="#ref-dataset:uthus2023YoutubeASL" role="doc-biblioref">2023</a>)</span>. In the second stage, the vision model’s outputs are fed into a multilingual language model <span class="citation" data-cites="raffel2020T5Transformer">(Raffel et al. <a href="#ref-raffel2020T5Transformer" role="doc-biblioref">2020</a>)</span> for finetuning on the How2Sign dataset <span class="citation" data-cites="dataset:duarte2020how2sign">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>)</span>. To mitigate privacy risks, the framework employs facial blurring during pretraining. They find that while pretraining with blurring hurts performance, some of the lost performance can be recovered by finetuning with unblurred data. SSVP-SLT achieves state-of-the-art performance on How2Sign <span class="citation" data-cites="dataset:duarte2020how2sign">(Duarte et al. <a href="#ref-dataset:duarte2020how2sign" role="doc-biblioref">2021</a>)</span>. They conclude that SLT models can be pretrained in a privacy-aware manner without sacrificing too much performance. Additionally, the authors release DailyMoth-70h, a new 70-hour ASL dataset from <a href="https://www.dailymoth.com/">The Daily Moth</a>.</p>
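<p>As a hedged sketch of the second stage only (not the released SSVP-SLT code), features from a frozen video encoder can be projected and fed to a sequence-to-sequence language model through its <code>inputs_embeds</code> interface; the <code>google/mt5-small</code> checkpoint, feature dimensions, projection layer, and target sentence below are assumptions.</p>
<pre><code class="language-python"># Sketch of stage two: feeding (frozen) video-encoder features into a
# multilingual seq2seq language model for sign language translation finetuning.
import torch
import torch.nn as nn
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

# Stand-in for features produced by a self-supervised video encoder
# (e.g. one clip of 64 frame-level feature vectors with 768 dimensions).
video_features = torch.randn(1, 64, 768)
project = nn.Linear(768, model.config.d_model)  # maps features into the LM space

labels = tokenizer("how are you doing today", return_tensors="pt").input_ids
outputs = model(inputs_embeds=project(video_features), labels=labels)
outputs.loss.backward()  # trains the projection (and, if left unfrozen, the LM)
</code></pre>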
One notable dictionary, SpreadTheSign\footnote{\url{https://www.spreadthesign.com/}} is a parallel dictionary containing around 25,000 words with up to 42 different spoken-signed language pairs and more than 600,000 videos in total. Unfortunately, while dictionaries may help create lexical rules between languages, they do not demonstrate the grammar or the usage of signs in context.
###### Fingerspelling corpora {-}
usually consist of videos of words borrowed from spoken languages that are signed letter-by-letter. They can be synthetically created [@dataset:dreuw2006modeling] or mined from online resources [@dataset:fs18slt;@dataset:fs18iccv]. However, they only capture one aspect of signed languages.
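As a toy sketch of how such corpora can be created synthetically, a word can be mapped to a sequence of per-letter clips; the directory layout and file naming below are hypothetical.

```python
# Toy sketch: compose a fingerspelled sequence from per-letter video clips.
# The directory layout ("clips/A.mp4", ...) is a hypothetical convention.
def fingerspelling_clips(word, clip_dir="clips"):
    letters = [ch.upper() for ch in word if ch.isalpha()]
    return [f"{clip_dir}/{letter}.mp4" for letter in letters]

print(fingerspelling_clips("hello"))
# ['clips/H.mp4', 'clips/E.mp4', 'clips/L.mp4', 'clips/L.mp4', 'clips/O.mp4']
```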
###### Isolated sign corpora {-}
are collections of annotated single signs. They are synthesized [@dataset:ebling2018smile;@dataset:huang2018video;@dataset:sincan2020autsl;@dataset:hassan-etal-2020-isolated] or mined from online resources [@dataset:joze2018ms;@dataset:li2020word], and can be used for isolated sign language recognition or contrastive analysis of minimal signing pairs [@dataset:imashev2020dataset]. However, like dictionaries, they do not describe relations between signs, nor do they capture coarticulation during signing, and are often limited in vocabulary size (20–1,000 signs).