Project Extension #2

Now that we know that using a translation model is beneficial, we would like to make it more robust.
Specifically:

  1. We find that the model works decently when the input is a single word or a short sentence,
    but not when the input is a long sentence or a paragraph. (In practice, we split the input
    into sentences before translating, but this discards context-dependent information, so it is not ideal.)
  2. The model might not be robust to simple semantic variations (e.g., "desk" vs. "table"),
    likely because it is trained from scratch in a low-data setting.

To address these issues, we propose curating multiple data sources and fine-tuning LLMs.

  1. The parallel data from SignBank+ is of good (though not perfect) quality.
  2. We can use monolingual data alongside language models to generate synthetic sentence-level data.
    This would be similar to this paper, with the rule-based approach replaced by a large language model
    (see the first sketch after this list).
  3. Key phrases can be extracted from the SignBank+ data and understood as "template + slots".
    Templates that include a fingerspelled entity can be used to generate high-quality synthetic data
    by swapping out that entity (see the second sketch after this list).
  4. Large sign language translation datasets can be automatically segmented and transcribed.
    This would create a large, multilingual, document-level parallel dataset with low-quality SignWriting.
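
For (2), here is a minimal sketch of the generation loop, assuming an OpenAI-style chat client; the client, model name, prompt wording, and the naive word-by-word composition of SignWriting are all placeholder assumptions, not a fixed design:

```python
# Sketch: LLM-backed synthetic pairs for the text -> SignWriting direction.
# Assumes a toy dictionary with lowercase word keys and FSW values, and that
# concatenating word-level SignWriting approximates a sentence (it does not,
# strictly, but may be good enough as synthetic training data).
from openai import OpenAI

client = OpenAI()

def generate_synthetic_pair(dictionary: dict[str, str]) -> tuple[str, str]:
    """dictionary maps a spoken-language word to its FSW SignWriting string."""
    vocabulary = ", ".join(dictionary)
    prompt = (
        "Write one short, natural sentence using only these words: "
        f"{vocabulary}. Return the sentence only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    sentence = response.choices[0].message.content.strip()
    # Naive composition: look up each word, skipping words with no entry.
    words = sentence.lower().rstrip(".?!").split()
    signwriting = " ".join(dictionary[w] for w in words if w in dictionary)
    return sentence, signwriting
```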
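
For (3), a sketch of the "template + slots" expansion; the letter-to-symbol table and the FSW strings are illustrative placeholders, not the real fingerspelling alphabet:

```python
# Sketch: expand one extracted key phrase into many pairs by filling the
# fingerspelled {name} slot on both sides. The toy symbol table covers only
# the letters used in the example names below.
LETTER_TO_FSW = {
    "A": "S1f720", "I": "S19220", "M": "S18d20", "N": "S11502", "T": "S1fb20",
}

def fingerspell(name: str) -> str:
    """Naively concatenate one symbol per letter (spatial placement omitted)."""
    return "".join(LETTER_TO_FSW[letter] for letter in name.upper())

def expand_template(text_template: str, sw_template: str, names: list[str]) -> list[tuple[str, str]]:
    """Fill the {name} slot in the text and SignWriting templates in parallel."""
    return [
        (text_template.format(name=name), sw_template.format(name=fingerspell(name)))
        for name in names
    ]

# One key phrase yields as many high-quality pairs as we have entity names.
pairs = expand_template(
    "My name is {name}",
    "M518x533S10043482x483 {name}",  # hypothetical FSW prefix for the phrase
    ["ANNA", "TIM", "MIA"],
)
```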

Once the data is collected, we will need to find a training recipe that makes sense for
multiple languages and varying data proportions, in either translation direction.
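
One standard recipe for this (an assumption on our part, not something settled in this issue) is temperature-based sampling, where a source with n examples is drawn with probability proportional to n^(1/T):

```python
# Sketch: temperature-based sampling over heterogeneous data sources.
# T=1 keeps natural proportions; larger T flattens toward uniform, so the
# small curated sources are not drowned out by the large noisy ones.
# All corpus sizes below are made up for illustration.
def sampling_weights(sizes: dict[str, int], temperature: float = 3.0) -> dict[str, float]:
    scaled = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}

sources = {
    "signbank_plus": 200_000,       # curated parallel data (1)
    "llm_synthetic": 1_000_000,     # synthetic sentence-level data (2)
    "template_slots": 50_000,       # fingerspelling templates (3)
    "auto_transcribed": 5_000_000,  # segmented, low-quality SignWriting (4)
}
print(sampling_weights(sources))
```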

We would treat the existing models as baselines, and evaluate SignWriting output using signwriting-evaluation (a rough sketch follows).
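
The evaluation loop might look as follows; the import path, class name, and `score` signature are assumptions to verify against the signwriting-evaluation repository, not a confirmed API:

```python
# Sketch: score system output against gold SignWriting, both in FSW notation.
# The metric class below is assumed from the repository layout; check the
# signwriting-evaluation README for the actual names before using this.
from signwriting_evaluation.metrics.similarity import SignWritingSimilarityMetric  # assumed path

hypotheses = ["M518x533S10043482x483"]  # model output (toy FSW example)
references = ["M518x533S10043482x483"]  # gold SignWriting

metric = SignWritingSimilarityMetric()
scores = [metric.score(hyp, ref) for hyp, ref in zip(hypotheses, references)]
print(f"average similarity: {sum(scores) / len(scores):.3f}")
```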
