Add some missing references #117
Changes from 1 commit
@@ -538,10 +538,9 @@ Though some previous works have referred to this as "sign language translation," | |
without handling the syntax and morphology of the signed language [@padden1988interaction] to create a spoken language output. | ||
Instead, SLR has often been used as an intermediate step during translation to produce glosses from signed language videos. | ||
|
||
@jiang2021sign proposed a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse multimodal feature representations. Specifically, they use a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The proposed late-fusion GEM fuses the skeleton-based predictions with other RGB and depth-based modalities to provide global information and make an accurate SLR prediction. | ||
@jiang2021sign propose a novel Skeleton Aware Multi-modal Framework with a Global Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse multimodal feature representations. Specifically, they use a Sign Language Graph Convolution Network (SL-GCN) to model the embedded dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The proposed late-fusion GEM fuses the skeleton-based predictions with other RGB and depth-based modalities to provide global information and make an accurate SLR prediction. @jiao2023cosign explore co-occurrence signals in skeleton data to better exploit the knowledge of each signal for continuous SLR. Specifically, they use a Group-specific GCN to abstract skeleton features from co-occurrence signals (Body, Hands, and Mouth) and introduce complementary regularization to ensure consistency between predictions based on two complementary subsets of signals. Additionally, they propose a two-stream framework to fuse static and dynamic information. The model demonstrates competitive performance compared to video-to-gloss methods on the RWTH-PHOENIX-Weather-2014 [@koller2015ContinuousSLR], RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily] datasets. | ||
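To make the late-fusion idea concrete, here is a minimal sketch of weighted late fusion over per-modality gloss probabilities. It is an illustration under our own assumptions (the modality names, vocabulary size, and fusion weights are made up), not the SAM-SLR-v2 implementation:

```python
import torch

# Hypothetical per-modality gloss probabilities for a batch of isolated signs,
# e.g. from skeleton, RGB, and depth streams: each is (batch, num_glosses).
num_glosses = 2000
modality_probs = {
    "skeleton_joints": torch.softmax(torch.randn(4, num_glosses), dim=-1),
    "skeleton_bones":  torch.softmax(torch.randn(4, num_glosses), dim=-1),
    "rgb":             torch.softmax(torch.randn(4, num_glosses), dim=-1),
    "depth":           torch.softmax(torch.randn(4, num_glosses), dim=-1),
}

# Fusion weights; in a global ensemble these would be tuned on held-out data.
weights = {"skeleton_joints": 1.0, "skeleton_bones": 0.9, "rgb": 0.4, "depth": 0.4}

fused = sum(w * modality_probs[name] for name, w in weights.items())
prediction = fused.argmax(dim=-1)  # one gloss label per video in the batch
```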
|
||
@dafnis2022bidirectional work on the same modified WLASL dataset as @jiang2021sign, but do not require multimodal data input. Instead, they propose a bidirectional skeleton-based graph convolutional network framework with linguistically motivated parameters and attention to the start and end | ||
frames of signs. They cooperatively use forward and backward data streams, including various sub-streams, as input. They also use pre-training to leverage transfer learning. | ||
@dafnis2022bidirectional work on the same modified WLASL dataset as @jiang2021sign, but do not require multimodal data input. Instead, they propose a bidirectional skeleton-based graph convolutional network framework with linguistically motivated parameters and attention to the start and end frames of signs. They cooperatively use forward and backward data streams, including various sub-streams, as input. They also use pre-training to leverage transfer learning. | ||
AmitMY marked this conversation as resolved.
|
||
|
||
@selvaraj-etal-2022-openhands introduced an open-source [OpenHands](https://github.com/AI4Bharat/OpenHands) library, | ||
which consists of standardized pose datasets for different existing sign language datasets and trained checkpoints | ||
|
@@ -587,7 +586,11 @@ For this recognition, @cui2017recurrent constructs a three-step optimization mod | |
First, they train a video-to-gloss end-to-end model, where they encode the video using a spatio-temporal CNN encoder | ||
and predict the gloss using a Connectionist Temporal Classification (CTC) [@graves2006connectionist]. | ||
Then, from the CTC alignment and category proposal, they encode each gloss-level segment independently, trained to predict the gloss category, | ||
and use this gloss video segments encoding to optimize the sequence learning model. | ||
and use this gloss video segment encoding to optimize the sequence learning model. @cheng2020fully propose a fully convolutional network for continuous SLR, | ||
Review comment: (here as well, new line before sentence)
Review comment: "a fully convolutional networks" should be "fully convolutional networks" or "a fully convolutional network"
Reply: Thanks for the suggestion, I have revised this sentence. |
||
moving away from LSTM-based methods to achieve end-to-end learning. They introduce a Gloss Feature Enhancement (GFE) module to provide additional rectified supervision and | ||
Review comment: gloss feature enhancement should be capitalized (Gloss Feature Enhancement) because an acronym is introduced
Reply: Thanks for the suggestion, I have revised this sentence. |
||
accelerate the training process. @min2021visual attribute the success of iterative training to its ability to reduce overfitting. They propose a Visual Enhancement
Constraint (VEC) and a Visual Alignment Constraint (VAC) to strengthen the visual extractor and align long- and short-term predictions, enabling LSTM-based methods to be trained in an end-to-end manner. | ||
Review comment: "visual enhancement constraint" should be capitalized, same for "visual alignment constraint"
Reply: Thanks for the suggestion, I have capitalized them. |
||
They provide a [code implementation](https://github.com/VIPL-SLP/VAC_CSLR). | ||
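For readers unfamiliar with CTC-based video-to-gloss training, the following is a minimal PyTorch sketch of the loss computation, assuming frame-level gloss log-probabilities produced by a visual encoder; it is not the code of any of the cited works, and the tensor sizes are arbitrary:

```python
import torch
import torch.nn as nn

T, N, V = 120, 2, 1000  # frames, batch size, gloss vocabulary size (blank label is index 0)

# Frame-wise gloss log-probabilities; in practice these come from the spatio-temporal encoder.
log_probs = torch.randn(T, N, V + 1, requires_grad=True).log_softmax(dim=-1)

targets = torch.randint(1, V + 1, (N, 12))                # reference gloss sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)     # number of frames per video
target_lengths = torch.full((N,), 12, dtype=torch.long)   # number of glosses per reference

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back into the visual encoder during training
```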
|
||
@cihan2018neural fundamentally differ from that approach and formulate this problem as if it is a natural-language translation problem. | ||
They encode each video frame using AlexNet [@krizhevsky2012imagenet], initialized using weights trained on ImageNet [@deng2009imagenet]. | ||
|
@@ -742,6 +745,10 @@ The model features shared representations for different modalities such as text | |
on several tasks such as video-to-gloss, gloss-to-text, and video-to-text. | ||
The approach allows leveraging external data such as parallel data for spoken language machine translation. | ||
|
||
@zhou2023gloss propose the GFSLT-VLP framework for gloss-free sign language translation, which improves SLT performance through visual-alignment pretraining. In the pretraining stage, they design a pretext task that aligns visual and textual | ||
Review comment: better imo from
to
Reply: Thanks for the suggestion, I have revised this sentence. |
||
representations within a joint multimodal semantic space, enabling the Visual Encoder to learn language-indicated visual representations. Additionally, they incorporate masked self-supervised learning into the pre-training | ||
process to help the text decoder capture the syntactic and semantic properties of sign language sentences more effectively. The approach achieves competitive results on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily] datasets. They provide a [code implementation](https://github.com/zhoubenjia/GFSLT-VLP). | ||
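The visual-language pretraining objective is conceptually close to CLIP-style contrastive alignment. The sketch below shows one way such an alignment loss could be written; it is a simplified illustration (pooled embeddings, symmetric InfoNCE, made-up dimensions), not the released GFSLT-VLP code:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling matched video/sentence embeddings together."""
    v = F.normalize(visual_emb, dim=-1)   # (batch, dim) pooled outputs of the visual encoder
    t = F.normalize(text_emb, dim=-1)     # (batch, dim) pooled outputs of the text encoder
    logits = v @ t.T / temperature        # (batch, batch) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)  # the i-th video matches the i-th sentence
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# toy usage with random embeddings
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```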
|
||
@Zhao_Zhang_Fu_Hu_Su_Chen_2024 introduce CV-SLT, employing conditional variational autoencoders to address the modality gap between video and text. | ||
Their approach involves guiding the model to encode visual and textual data similarly through two paths: one with visual data alone and one with both modalities. | ||
Using KL divergences, they steer the model towards generating consistent embeddings and accurate outputs regardless of the path. | ||
|
@@ -750,7 +757,6 @@ Evaluation on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural] and CSL-Daily [@ | |
They provide a [code implementation](https://github.com/rzhao-zhsq/CV-SLT) based largely on @chenSimpleMultiModalityTransfer2022a. | ||
<!-- The CV-SLT code looks pretty nice! Conda env file, data prep, not too old, paths in .yaml files, checkpoints provided (including the ones for replication), commands to train and evaluate, very nice --> | ||
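To illustrate the two-path idea, the sketch below computes a KL term between a visual-only (prior) path and a visual-plus-text (posterior) path, assuming both paths predict a diagonal Gaussian over the latent space; this is a hedged toy example, not the released CV-SLT code:

```python
import torch
import torch.distributions as D

def two_path_kl(prior_mu, prior_logvar, post_mu, post_logvar):
    """KL(posterior || prior): encourage the two paths to produce consistent latent distributions."""
    prior = D.Normal(prior_mu, (0.5 * prior_logvar).exp())
    posterior = D.Normal(post_mu, (0.5 * post_logvar).exp())
    return D.kl_divergence(posterior, prior).sum(dim=-1).mean()

# prior path: encoder(video); posterior path: encoder(video, text); shapes are (batch, latent_dim)
kl = two_path_kl(torch.zeros(4, 64), torch.zeros(4, 64),
                 torch.randn(4, 64), torch.randn(4, 64))
```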
|
||
|
||
<!-- TODO: the "previous gloss-free frameworks" that gongLLMsAreGood2024 cite are: Gloss Attention for Gloss-free Sign Language Translation (2023) and Gloss-free sign language translation: Improving from visual-language pretraining, 2023 aka GFSLT-VLP. Could be good to lead into it with explanations of those? --> | ||
|
||
@gongLLMsAreGood2024 introduce SignLLM, a framework for gloss-free sign language translation that leverages the strengths of Large Language Models (LLMs). | ||
|
@@ -792,6 +798,10 @@ and showed similar performance, with the transformer underperforming on the vali | |
They experimented with various normalization schemes, mainly subtracting the mean and dividing by the standard deviation of every individual keypoint | ||
computed either over the entire frame or over the relevant "object" (Body, Face, and Hand). | ||
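A minimal NumPy sketch of this kind of normalization is given below. The array layout and the keypoint index ranges for the body/face/hand groups are assumptions for illustration, not the exact scheme used by the authors:

```python
import numpy as np

def normalize_keypoints(pose, groups=None, eps=1e-6):
    """Zero-mean / unit-variance normalization of pose keypoints.

    pose:   (frames, keypoints, 2) array of x/y coordinates.
    groups: optional mapping from group name to keypoint indices; if given,
            statistics are computed per "object" instead of over the whole frame.
    """
    pose = pose.astype(np.float32)
    if groups is None:
        mean = pose.mean(axis=(0, 1), keepdims=True)
        std = pose.std(axis=(0, 1), keepdims=True)
        return (pose - mean) / (std + eps)
    out = pose.copy()
    for idx in groups.values():
        part = pose[:, idx]
        out[:, idx] = (part - part.mean(axis=(0, 1))) / (part.std(axis=(0, 1)) + eps)
    return out

# Toy usage with made-up index ranges for body, face, and one hand.
pose = np.random.rand(100, 75, 2)
groups = {"body": list(range(0, 25)), "face": list(range(25, 54)), "hand": list(range(54, 75))}
normalized = normalize_keypoints(pose, groups)
```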
|
||
@jiao2024visual propose a visual alignment pre-training framework for gloss-free sign language translation. Specifically, they adopt CoSign-1s [@jiao2023cosign] to obtain skeleton features from estimated pose sequences | ||
Reply: CoSign, thanks! |
||
and a pretrained text encoder to obtain corresponding textual features. During pretraining, these visual and textual features are aligned in a greedy manner. In the finetuning stage, they replace the shallow translation module | ||
used in pretraining with a pretrained translation module. This skeleton-based approach achieves state-of-the-art results on the RWTH-PHOENIX-Weather-2014T [@cihan2018neural], CSL-Daily [@dataset:Zhou2021_SignBackTranslation_CSLDaily], OpenASL [@shi-etal-2022-open], and How2Sign [@dataset:duarte2020how2sign] datasets without relying on gloss annotations. | ||
Review comment: missing space after How2Sign
Reply: The space has been added. |
||
|
||
#### Text-to-Pose | ||
Text-to-Pose, also known as sign language production, is the task of producing a sequence of poses that adequately represent | ||
a spoken language text in sign language, as an intermediate representation to overcome challenges in animation. | ||
|
@@ -1546,6 +1546,14 @@ @article{jiang2021sign | |
year = {2021} | ||
} | ||
|
||
@inproceedings{jiao2023cosign, | ||
title = {CoSign: Exploring co-occurrence signals in skeleton-based continuous sign language recognition}, | ||
Review comment: need to add
Reply: Thanks for the suggestion, I have revised this reference item. |
||
author = {Jiao, Peiqi and Min, Yuecong and Li, Yanan and Wang, Xiaotao and Lei, Lei and Chen, Xilin}, | ||
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision}, | ||
pages = {20676--20686}, | ||
year = {2023} | ||
} | ||
|
||
@inproceedings{dafnis2022bidirectional, | ||
author = {Dafnis, Konstantinos M and Chroni, Evgenia and Neidle, Carol and Metaxas, Dimitris N}, | ||
booktitle = {Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), Marseille, 20-25 June 2022.}, | ||
|
@@ -1626,6 +1634,23 @@ @article{cui2019deep | |
year = {2019} | ||
} | ||
|
||
@inproceedings{cheng2020fully, | ||
title = {Fully convolutional networks for continuous sign language recognition}, | ||
author = {Cheng, Ka Leong and Yang, Zhaoyang and Chen, Qifeng and Tai, Yu-Wing}, | ||
booktitle = {Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXIV}, | ||
pages = {697--714}, | ||
year = {2020}, | ||
organization = {Springer} | ||
} | ||
|
||
@inproceedings{min2021visual, | ||
title = {Visual alignment constraint for continuous sign language recognition}, | ||
author = {Min, Yuecong and Hao, Aiming and Chai, Xiujuan and Chen, Xilin}, | ||
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision}, | ||
pages = {11542--11551}, | ||
year = {2021} | ||
} | ||
|
||
@article{carreira2017quo, | ||
author = {Carreira, Joao and Zisserman, Andrew}, | ||
journal = {ArXiv preprint}, | ||
|
@@ -3044,6 +3069,23 @@ @inproceedings{chen2022TwoStreamNetworkSign | |
year = {2022} | ||
} | ||
|
||
@inproceedings{zhou2023gloss, | ||
title = {Gloss-free sign language translation: Improving from visual-language pretraining}, | ||
author = {Zhou, Benjia and Chen, Zhigang and Clap{\'e}s, Albert and Wan, Jun and Liang, Yanyan and Escalera, Sergio and Lei, Zhen and Zhang, Du}, | ||
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision}, | ||
pages = {20871--20881}, | ||
year = {2023} | ||
} | ||
|
||
@inproceedings{jiao2024visual, | ||
title = {Visual Alignment Pre-training for Sign Language Translation}, | ||
author = {Jiao, Peiqi and Min, Yuecong and Chen, Xilin}, | ||
booktitle = {European Conference on Computer Vision}, | ||
pages = {349--367}, | ||
year = {2024}, | ||
organization = {Springer} | ||
} | ||
|
||
@inproceedings{xie2018SpatiotemporalS3D, | ||
address = {Cham}, | ||
author = {Xie, Saining | ||
|
Review comment: to minimize the diff, and for organization in more than one line, please add a new line before @jiao2023cosign (it will still show in one paragraph). I'd even propose to add a new line after every end of sentence, to make it easier to give comments.
Review comment: (but this paragraph looks good to me otherwise!)
Reply: I have divided this paragraph into individual sentences to more clearly highlight the distinctions. Also, in the previous version I changed the tense of the preceding sentence from past to present (@jiang2021sign proposed -> @jiang2021sign propose); I have restored the original wording in the updated version.
Currently, the tenses in this project are not consistent and may require an overall review and correction.