
Commit bfc1989

Code from pose-eval paper (#30)
1 parent e4704ba commit bfc1989


55 files changed: +9505 −108 lines

.gitignore

Lines changed: 15 additions & 1 deletion
@@ -7,4 +7,18 @@ pose_evaluation.egg-info/
 *.npz
 *.code-workspace
 .vscode/
-coverage.lcov
+coverage.lcov
+
+*.csv
+**/out/*
+**out*.txt
+**metric_results**
+*.parquet
+*.tex
+*.png
+**/dtw_plots/*
+**/plots/*
+**/misc_out/*
+**/debug*/*
+*.tar.zst
+*.zip

README.md

Lines changed: 44 additions & 40 deletions
@@ -3,6 +3,8 @@
 The lack of automatic pose evaluation metrics is a major obstacle in the development of
 sign language generation models.

+![Distribution of scores](assets/pose-eval-title-picture.png)
+
 ## Goals

 The primary objective of this repository is to house a suite of
@@ -12,49 +14,41 @@ as well as custom-developed metrics unique to our approach.
 We recognize the distinct challenges in evaluating single signs versus continuous signing,
 and our methods reflect this differentiation.

-
 ---

-# TODO:
-
-- [ ] Qualitative Evaluation
-- [ ] Quantitative Evaluation
+<!-- ## Usage

-## Qualitative Evaluation
+```bash
+# (TODO) pip install the package
+# (TODO) how to construct a metric
+# Metric signatures, preprocessors
+``` -->

-To qualitatively demonstrate the efficacy of these evaluation metrics,
-we implement a nearest-neighbor search for selected signs from the **TODO** corpus.
-The rationale is straightforward: the closer the sign is to its nearest neighbor in the corpus,
-the more effective the evaluation metric is in capturing the nuances of sign language transcription and translation.
-
-### Distribution of Scores
-
-Using a sample of the corpus, we compute the any-to-any scores for each metric.
-Intuitively, we expect a good metric given any two random signs to produce a bad score, since most signs are unrelated.
-This should be reflected in the distribution of scores, which should be skewed towards lower scores.
+## Quantitative Evaluation

-![Distribution of scores](assets/distribution/all.png)
+### Isolated Sign Evaluation

-### Nearest Neighbor Search
+Given an isolated sign corpus such as ASL Citizen[^2], we repeat the evaluation of Ham2Pose[^1] on our metrics, ranking distance metrics by retrieval performance.

-INSERT TABLE HERE
+Evaluation is conducted on a combined dataset of ASL Citizen, Sem-Lex[^3], and PopSign ASL[^4].

-## Quantitative Evaluation
+For each sign class, we use all available samples as targets and sample four times as many distractors, yielding a 1:4 target-to-distractor ratio.

-### Isolated Sign Evaluation
+For instance, for the sign _HOUSE_ with 40 samples (11 from ASL Citizen, 29 from Sem-Lex), we add 160 distractors and compute pairwise metrics from each target to all 199 other examples (we consistently discard scores for pose files where either the target or distractor could not be embedded with SignCLIP).

-Given an isolated sign corpus such as AUTSL[^2], we repeat the evaluation of Ham2Pose[^1] on our metrics.
+Retrieval quality is measured using Mean Average Precision (`mAP↑`) and Precision@10 (`P@10↑`). The complete evaluation covers 5,362 unique sign classes and 82,099 pose sequences.

-We also repeat the experiments of Atwell et al.[^3] to evaluate the bias of our metrics on different protected attributes.
+After several pilot runs, we finalized a subset of 169 sign classes with at most 20 samples each, ensuring consistent metric coverage. On this subset, we evaluated 1,200 distance-based metric variants and SignCLIP models with different checkpoints provided by the authors.

-### Continuous Sign Evaluation
+The overall results show that DTW-based metrics outperform padding-based baselines. Embedding-based methods, particularly SignCLIP models fine-tuned on in-domain ASL data, achieve the strongest retrieval scores.

-We evaluate each metric in the context of continuous signing with our continuous metrics alongside our segmented metrics
-and correlate to human judgments.
+<!-- Atwell-style evaluations didn't get done, nor did AUTSL. -->

 ## Evaluation Metrics

-**TODO** list evaluation metrics here.
+For the study, we evaluated over 1,200 pose distance metrics, recording mAP and other retrieval performance characteristics.
+
+We find that the top metric

 ### Contributing

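The protocol in the hunk above reduces to ranking candidates by distance and scoring the ranking. A minimal sketch of that scoring, assuming pairwise distances have already been computed; the helpers `average_precision` and `retrieval_scores` are illustrative names, not the repository's API:

```python
import numpy as np

def average_precision(ranked_labels: np.ndarray) -> float:
    """AP for one target: ranked_labels holds 1 (same sign class) or 0,
    ordered by ascending distance to the target."""
    relevant = ranked_labels.sum()
    if relevant == 0:
        return 0.0
    hits = np.cumsum(ranked_labels)
    precision_at_i = hits / np.arange(1, len(ranked_labels) + 1)
    return float((precision_at_i * ranked_labels).sum() / relevant)

def retrieval_scores(distances: np.ndarray, labels: np.ndarray, k: int = 10):
    """mAP and P@k averaged over targets.

    distances: (n_targets, n_candidates) metric values, lower = closer.
    labels:    (n_targets, n_candidates) 1 where the candidate shares
               the target's sign class.
    """
    aps, p_at_k = [], []
    for dist_row, label_row in zip(distances, labels):
        ranked = label_row[np.argsort(dist_row)]  # rank candidates by distance
        aps.append(average_precision(ranked))
        p_at_k.append(float(ranked[:k].mean()))   # Precision@10 by default
    return float(np.mean(aps)), float(np.mean(p_at_k))
```

For the _HOUSE_ example, `distances` would be a 40 × 199 array, with `labels` marking the 39 other _HOUSE_ productions as relevant among the 160 distractors.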

@@ -65,24 +59,34 @@ Please make sure to run `black pose_evaluation` before submitting a pull request
 If you use our toolkit in your research or projects, please consider citing the work.

 ```bib
-@misc{pose-evaluation2024,
-  title={Pose Evaluation: Metrics for Evaluating Sign Langauge Generation Models},
-  author={Zifan Jiang, Colin Leong, Amit Moryossef},
+@misc{pose-evaluation2025,
+  title={Meaningful Pose-Based Sign Language Evaluation},
+  author={Zifan Jiang, Colin Leong, Amit Moryossef, Anne Göhring, Annette Rios, Oliver Cory, Maksym Ivashechkin, Neha Tarigopula, Biao Zhang, Rico Sennrich, Sarah Ebling},
   howpublished={\url{https://github.com/sign-language-processing/pose-evaluation}},
-  year={2024}
+  year={2025}
 }
 ```

-#### Contributions:
-- Zifan, Colin, and Amit developed the evaluation metrics and tools.
+### Contributions
+
+- Zifan, Colin, and Amit developed the evaluation metrics and tools. Zifan did correlation and human evaluations; Colin did automated meta-evaluation, KNN, etc.
+- Colin and Amit developed the library code.
 - Zifan, Anne, and Lisa conducted the qualitative and quantitative evaluations.

 ## References

-[^1]: Rotem Shalev-Arkushin, Amit Moryossef, and Ohad Fried.
-2022. [Ham2Pose: Animating Sign Language Notation into Pose Sequences](https://arxiv.org/abs/2211.13613).
-[^2]: Ozge Mercanoglu Sincan and Hacer Yalim Keles.
-2020. [AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and Baseline Methods](https://arxiv.org/abs/2008.00932).
-[^3]: Katherine Atwell, Danielle Bragg, and Malihe Alikhani.
-2024. [Studying and Mitigating Biases in Sign Language Understanding Models](https://aclanthology.org/2024.emnlp-main.17/).
-In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 268–283, Miami, Florida, USA. Association for Computational Linguistics.
+[^1]: Rotem Shalev-Arkushin, Amit Moryossef, and Ohad Fried. 2022. [Ham2Pose: Animating Sign Language Notation into Pose Sequences](https://arxiv.org/abs/2211.13613).
+[^2]:
+    Aashaka Desai, Lauren Berger, Fyodor O. Minakov, Vanessa Milan, Chinmay Singh, Kriston Pumphrey, Richard E. Ladner, Hal Daumé III, Alex X. Lu, Naomi K. Caselli, and Danielle Bragg.
+    2023. [ASL Citizen: A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition](https://arxiv.org/abs/2304.05934).
+    _ArXiv_, abs/2304.05934.
+
+[^3]:
+    Lee Kezar, Elana Pontecorvo, Adele Daniels, Connor Baer, Ruth Ferster, Lauren Berger, Jesse Thomason, Zed Sevcikova Sehyr, and Naomi Caselli.
+    2023. [The Sem-Lex Benchmark: Modeling ASL Signs and Their Phonemes](https://api.semanticscholar.org/CorpusID:263334197).
+    _Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility_.
+
+[^4]:
+    Thad Starner, Sean Forbes, Matthew So, David Martin, Rohit Sridhar, Gururaj Deshpande, Sam S. Sepah, Sahir Shahryar, Khushi Bhardwaj, Tyler Kwok, Daksh Sehgal, Saad Hassan, Bill Neubauer, Sofia Anandi Vempala, Alec Tan, Jocelyn Heath, Unnathi Kumar, Priyanka Mosur, Tavenner Hall, Rajandeep Singh, Christopher Cui, Glenn Cameron, Sohier Dane, and Garrett Tanzer.
+    2023. [PopSign ASL v1.0: An Isolated American Sign Language Dataset Collected via Smartphones](https://api.semanticscholar.org/CorpusID:268030720).
+    _Neural Information Processing Systems_.
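The finding above, that DTW-based metrics beat padding-based baselines, hinges on how two pose sequences of unequal length are aligned before frame-wise distances are accumulated. A minimal sketch of both strategies, assuming each sequence is a `(frames, features)` NumPy array; `dtw_distance` and `padded_distance` are illustrative stand-ins, not the repository's implementations:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic-time-warping distance between two pose sequences (T, D)."""
    # Pairwise frame costs: Euclidean distance between every frame pair.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (Ta, Tb)
    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Best of match, insertion, deletion along the warping path.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    return float(acc[-1, -1])

def padded_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Padding baseline: zero-pad the shorter sequence, then average
    frame-wise Euclidean distances without any temporal alignment."""
    t = max(len(a), len(b))
    pad = lambda x: np.pad(x, ((0, t - len(x)), (0, 0)))
    return float(np.linalg.norm(pad(a) - pad(b), axis=-1).mean())
```

Because DTW warps the time axis, two productions of the same sign at different speeds can still score as close, whereas zero-padding penalizes any difference in timing or length.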

assets/pose-eval-title-picture.png

Binary file added (248 KB).
