Amend drop duplicate behaviour in starter notebook #5

Open
Ari-Ramkilowan wants to merge 2 commits into masakhane-io:master from Ari-Ramkilowan:master

Conversation

@Ari-Ramkilowan
Collaborator

Changed the drop-duplicate behaviour to remove rows only when both the source AND target text are duplicated. This allows for cases where a source text has multiple valid translations.
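To illustrate the behaviour described above, here is a minimal sketch in plain Python. It is not the notebook's actual code: the function name and the toy isiZulu pairs are assumptions made for illustration. In pandas, the same effect would presumably come from calling `DataFrame.drop_duplicates` with `subset` naming both the source and target columns.

```python
def drop_exact_duplicate_pairs(pairs):
    """Keep the first occurrence of each (source, target) pair, preserving order.

    Hypothetical helper: a row is dropped only when BOTH source and target
    repeat, so one source sentence may keep several different translations.
    """
    seen = set()
    unique = []
    for source, target in pairs:
        key = (source, target)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Toy data, for illustration only.
pairs = [
    ("hello", "sawubona"),
    ("hello", "sawubona"),        # exact duplicate: dropped
    ("hello", "sanibonani"),      # same source, different target: kept
    ("thank you", "ngiyabonga"),
]
deduplicated = drop_exact_duplicate_pairs(pairs)
```

Deduplicating on the source text alone would have kept only the first "hello" row and discarded the second translation.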

@jaderabbit
Member

Let's hold off on merging this one until we've discussed @dwhitena's ideas

I think that taking 100 of the duplicates and getting an isiZulu/isiXhosa speaker to review them would be ideal. We work with an amazing isiZulu linguist if you'd like an expert to check whether the duplicates are valid translations. Let me know and I'll do an intro email!
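One way to pull such a review sample could look like the sketch below. It assumes the data is available as a list of (source, target) string pairs; the function name, the fixed seed, and the toy pairs are all hypothetical and not taken from the notebook.

```python
import random
from collections import defaultdict


def sample_conflicting_sources(pairs, k=100, seed=0):
    """Return up to k source sentences that have more than one distinct target.

    Hypothetical helper: the result maps each sampled source to its sorted
    list of competing translations, ready to hand to a reviewer.
    """
    targets_by_source = defaultdict(set)
    for source, target in pairs:
        targets_by_source[source].add(target)

    # A source is "conflicting" when it maps to several different targets.
    conflicting = {
        source: sorted(targets)
        for source, targets in targets_by_source.items()
        if len(targets) > 1
    }

    # Fixed seed so the same sample can be regenerated for the reviewer.
    rng = random.Random(seed)
    chosen = rng.sample(sorted(conflicting), min(k, len(conflicting)))
    return {source: conflicting[source] for source in chosen}


# Toy data, for illustration only.
review_pairs = [
    ("hello", "sawubona"),
    ("hello", "sanibonani"),
    ("thank you", "ngiyabonga"),
]
review_sample = sample_conflicting_sources(review_pairs, k=100)
```

Only sources with competing translations end up in the sample, so the reviewer's time goes to the cases that are actually in question.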

@dwhitena
Contributor

@jaderabbit and @Ari-Ramkilowan, thanks for the PR and discussion here. My ideas are the following:

If we are able to get human review of the conflicting translations, that would be ideal. @jaderabbit might know how feasible this is, but it seems like it may be possible based on the above comments.

If we can't get human supervision, we could try something like fast-align to score sentence pairs and weed out bad pairs, in which case the removal of conflicting translations is probably moot. The downside of this sort of approach is that it is rather slow to create and run such a model, so we may only want to run it on conflicting translations, for language pairs where human review isn't possible.
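The second idea, scoring only the conflicting translations, could be sketched roughly as follows. This is an assumption-laden illustration: `score_fn` is a hypothetical callback standing in for whatever alignment scorer is used, the threshold is arbitrary, and the toy scores are made up. No real alignment tool is invoked here.

```python
from collections import defaultdict


def filter_conflicts_by_score(pairs, score_fn, threshold):
    """Keep all non-conflicting pairs; keep conflicting pairs only if they score well.

    score_fn(source, target) -> float is a hypothetical, possibly slow scorer.
    It is called only for sources that have more than one distinct target.
    """
    targets_by_source = defaultdict(set)
    for source, target in pairs:
        targets_by_source[source].add(target)

    kept = []
    for source, target in pairs:
        is_conflicting = len(targets_by_source[source]) > 1
        # Short-circuit: the scorer never runs for unambiguous sources.
        if not is_conflicting or score_fn(source, target) >= threshold:
            kept.append((source, target))
    return kept


# Made-up scores for placeholder sentences, for illustration only.
toy_scores = {("a", "x"): 0.9, ("a", "y"): 0.2}
scored_calls = []


def toy_score_fn(source, target):
    scored_calls.append((source, target))
    return toy_scores[(source, target)]


filtered = filter_conflicts_by_score(
    [("a", "x"), ("a", "y"), ("b", "z")], toy_score_fn, threshold=0.5
)
```

Because the scorer runs only on conflicting sources, the slow step scales with the number of conflicts rather than with the size of the corpus.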

Any other thoughts?

@juliakreutzer
Collaborator

Thanks for adding the additional files @Ari-Ramkilowan! Do you have a link to the checkpoint of a trained model?
