The lack of automatic pose evaluation metrics is a major obstacle in the development of
sign language generation models.

## Goals
The primary objective of this repository is to house a suite of automatic pose evaluation metrics for sign language, including established metrics from prior work as well as custom-developed metrics unique to our approach.
We recognize the distinct challenges in evaluating single signs versus continuous signing,
and our methods reflect this differentiation.

---
<!-- ## Usage
```bash
# (TODO) pip install the package
# (TODO) how to construct a metric
# Metric signatures, preprocessors
``` -->
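
The usage section above is still a placeholder. As a rough illustration of what a pose metric is (a minimal sketch under our own assumptions, not the package's actual API), each metric can be thought of as a callable that scores a hypothesis pose sequence against a reference, typically after preprocessing such as normalization or trimming:

```python
from typing import Callable

import numpy as np

# Assumption for this sketch: a pose sequence is an array of shape (frames, keypoints, dims).
PoseArray = np.ndarray
PoseMetric = Callable[[PoseArray, PoseArray], float]


def mean_joint_error(hypothesis: PoseArray, reference: PoseArray) -> float:
    """Toy metric: mean Euclidean keypoint distance over the frames both sequences share."""
    frames = min(len(hypothesis), len(reference))
    diff = hypothesis[:frames] - reference[:frames]
    return float(np.linalg.norm(diff, axis=-1).mean())


metric: PoseMetric = mean_joint_error  # lower is better: 0 means identical poses
```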
## Quantitative Evaluation
### Isolated Sign Evaluation
Given an isolated sign corpus such as ASL Citizen[^2], we repeat the evaluation of Ham2Pose[^1] on our metrics, ranking distance metrics by retrieval performance.
Evaluation is conducted on a combined dataset of ASL Citizen, Sem-Lex[^3], and PopSign ASL[^4].
For each sign class, we use all available samples as targets and sample four times as many distractors, yielding a 1:4 target-to-distractor ratio.
For instance, for the sign _HOUSE_ with 40 samples (11 from ASL Citizen, 29 from Sem-Lex), we add 160 distractors and compute pairwise metrics from each target to all 199 other examples. (We consistently discard scores for pose files where either the target or the distractor could not be embedded with SignCLIP.)
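
To make the sampling protocol concrete, the sketch below reproduces the 1:4 scheme in plain Python (the `(gloss, pose_id)` record layout and the fixed seed are our illustrative assumptions, not the toolkit's data model):

```python
import random
from collections import defaultdict


def build_trials(samples, ratio=4, seed=42):
    """Group samples by gloss and pair each class's targets with `ratio` times as many distractors."""
    rng = random.Random(seed)
    by_gloss = defaultdict(list)
    for gloss, pose_id in samples:
        by_gloss[gloss].append(pose_id)

    trials = {}
    for gloss, targets in by_gloss.items():
        pool = [p for g, ids in by_gloss.items() if g != gloss for p in ids]
        distractors = rng.sample(pool, k=min(ratio * len(targets), len(pool)))
        # e.g. HOUSE: 40 targets + 160 distractors, so each target is scored
        # against the 199 other examples in its trial pool.
        trials[gloss] = {"targets": targets, "distractors": distractors}
    return trials
```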
Retrieval quality is measured using Mean Average Precision (`mAP↑`) and Precision@10 (`P@10↑`). The complete evaluation covers 5,362 unique sign classes and 82,099 pose sequences.
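
Both scores can be computed per target query from its distances to the other examples in the trial pool; the helper names below are ours, not the toolkit's:

```python
import numpy as np


def average_precision(distances, is_same_sign):
    """AP for one query: rank candidates by distance, average precision at each relevant hit."""
    order = np.argsort(distances)                  # closest candidates first
    relevant = np.asarray(is_same_sign)[order]
    if not relevant.any():
        return 0.0
    hits = np.cumsum(relevant)
    precision_at_k = hits / (np.arange(len(relevant)) + 1)
    return float((precision_at_k * relevant).sum() / relevant.sum())


def precision_at_10(distances, is_same_sign):
    """Fraction of the ten closest candidates that share the query's sign class."""
    top10 = np.argsort(distances)[:10]
    return float(np.asarray(is_same_sign)[top10].mean())


# mAP↑ averages average_precision over all target queries; P@10↑ averages precision_at_10.
```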
After several pilot runs, we finalized a subset of 169 sign classes with at most 20 samples each, ensuring consistent coverage across metrics. On this subset we evaluated 1,200 distance-based metric variants, as well as SignCLIP models using different checkpoints provided by the authors.
The overall results show that DTW-based metrics outperform padding-based baselines. Embedding-based methods, particularly SignCLIP models fine-tuned on in-domain ASL data, achieve the strongest retrieval scores.
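
The distinction between the two distance families can be sketched as follows (simplified NumPy, assuming `(frames, keypoints, dims)` arrays; an illustration, not the toolkit's implementation): a padding-based baseline compares frame *i* with frame *i* after length-matching, while DTW searches for the best frame alignment first.

```python
import numpy as np


def frame_cost(a, b):
    """Mean Euclidean distance between the keypoints of two frames."""
    return float(np.linalg.norm(a - b, axis=-1).mean())


def padded_distance(hyp, ref):
    """Padding baseline: repeat the last frame of the shorter sequence, then average frame costs."""
    n = max(len(hyp), len(ref))
    pad = lambda s: np.concatenate([s, np.repeat(s[-1:], n - len(s), axis=0)])
    hyp, ref = pad(hyp), pad(ref)
    return float(np.mean([frame_cost(h, r) for h, r in zip(hyp, ref)]))


def dtw_distance(hyp, ref):
    """DTW: find the cheapest monotonic frame alignment before accumulating costs."""
    n, m = len(hyp), len(ref)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_cost(hyp[i - 1], ref[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])
```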
<!-- Atwell style evaluations didn't get done. Nor did AUTSL -->
## Evaluation Metrics
For this study, we evaluated over 1,200 pose distance metrics, recording mAP and other retrieval performance characteristics.

We find that the top-performing metrics are embedding-based, with DTW-based distance metrics performing best among the purely pose-distance approaches.
### Contributing
Please make sure to run `black pose_evaluation` before submitting a pull request.

If you use our toolkit in your research or projects, please consider citing the work.
```bib
@misc{pose-evaluation2025,
  title={Meaningful Pose-Based Sign Language Evaluation},
  author={Zifan Jiang and Colin Leong and Amit Moryossef and Anne Göhring and Annette Rios and Oliver Cory and Maksym Ivashechkin and Neha Tarigopula and Biao Zhang and Rico Sennrich and Sarah Ebling},
}
```

### Contributions

- Zifan, Colin, and Amit developed the evaluation metrics and tools. Zifan conducted the correlation and human evaluations; Colin conducted the automated meta-evaluations (KNN retrieval, etc.).
- Colin and Amit developed the library code.
- Zifan, Anne, and Lisa conducted the qualitative and quantitative evaluations.
## References
[^1]: Rotem Shalev-Arkushin, Amit Moryossef, and Ohad Fried. 2022. [Ham2Pose: Animating Sign Language Notation into Pose Sequences](https://arxiv.org/abs/2211.13613).
[^2]: Aashaka Desai, Lauren Berger, Fyodor O. Minakov, Vanessa Milan, Chinmay Singh, Kriston Pumphrey, Richard E. Ladner, Hal Daumé III, Alex X. Lu, Naomi K. Caselli, and Danielle Bragg. 2023. [ASL Citizen: A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition](https://arxiv.org/abs/2304.05934). _ArXiv_, abs/2304.05934.

[^3]: Lee Kezar, Elana Pontecorvo, Adele Daniels, Connor Baer, Ruth Ferster, Lauren Berger, Jesse Thomason, Zed Sevcikova Sehyr, and Naomi Caselli. 2023. [The Sem-Lex Benchmark: Modeling ASL Signs and Their Phonemes](https://api.semanticscholar.org/CorpusID:263334197). In _Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility_.

[^4]: Thad Starner, Sean Forbes, Matthew So, David Martin, Rohit Sridhar, Gururaj Deshpande, Sam S. Sepah, Sahir Shahryar, Khushi Bhardwaj, Tyler Kwok, Daksh Sehgal, Saad Hassan, Bill Neubauer, Sofia Anandi Vempala, Alec Tan, Jocelyn Heath, Unnathi Kumar, Priyanka Mosur, Tavenner Hall, Rajandeep Singh, Christopher Cui, Glenn Cameron, Sohier Dane, and Garrett Tanzer. 2023. [PopSign ASL v1.0: An Isolated American Sign Language Dataset Collected via Smartphones](https://api.semanticscholar.org/CorpusID:268030720).