Open
Description
cirrus-scripts/bitextor-buildTMX.py
Lines 180 to 184 in 61765e3
elif prev_hash == line_hash and options.dedup:
    urls1.update(fieldsdict['url1'].split(' '))
    urls2.update(fieldsdict['url2'].split(' '))
    if 'collection' in fieldsdict.keys():
        collections.add(fieldsdict['collection'])
Martin Popel pointed out a problem with deduplicating this way: if the data contains, say, 10,000 pairs of Yes -> Ja and one pair of Yes -> Fuck off, each of the two ends up in the TMX as a single entry, with nothing to tell them apart. When someone later wants to deduplicate on the source side of the sentence pairs and has to decide which pair to keep, having the frequency information would be quite helpful.
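A minimal sketch of what keeping the frequency could look like, not a patch against `bitextor-buildTMX.py`. It assumes rows arrive sorted so that equal hashes are adjacent (as the `prev_hash == line_hash` check suggests); the function name `merge_duplicates`, the dict keys `hash`, `seg1`, `seg2`, and the output field `count` are hypothetical names chosen for illustration.

```python
from itertools import groupby


def merge_duplicates(rows):
    """Collapse adjacent rows that share a hash and record how many were merged.

    `rows` is an iterable of dicts with the (hypothetical) keys
    'hash', 'seg1', 'seg2', 'url1', 'url2', sorted so that rows
    with the same hash are next to each other.
    """
    merged = []
    for _, group in groupby(rows, key=lambda row: row['hash']):
        urls1, urls2 = set(), set()
        count = 0
        first = None
        for row in group:
            if first is None:
                first = row  # keep the text of the first occurrence
            urls1.update(row['url1'].split(' '))
            urls2.update(row['url2'].split(' '))
            count += 1  # the frequency that is currently discarded
        merged.append({
            'seg1': first['seg1'],
            'seg2': first['seg2'],
            'urls1': sorted(urls1),
            'urls2': sorted(urls2),
            'count': count,
        })
    return merged
```

With the example above, the Yes -> Ja entry would carry a count of 10,000 and the other entry a count of 1, so a later source-side deduplication step could simply keep the pair with the highest count. How the count would be stored in the TMX itself (for instance as an extra property on the translation unit) is left open here.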