Use custom dataset without hard negatives #34

@ryangawei


Hi,

I'm trying to create preprocessed training files from my custom data. My data doesn't include any hard negatives, and when I run your script create_training_files.py, the log shows that no triplets are constructed:

2021-08-20 14:30:58,836,836 INFO [create_training_files.py:453] loading metadata: ../../data/specter/metadata.json
2021-08-20 14:30:58,907,907 INFO [create_training_files.py:457] loading data file: ../../data/specter/data.json
2021-08-20 14:30:59,040,40 INFO [create_training_files.py:466] getting instances for `data` and `train` set
2021-08-20 14:30:59,041,41 INFO [create_training_files.py:468] writing output ../../data/specter/preprocessed/data-train.p
2021-08-20 14:30:59,101,101 INFO [create_training_files.py:303] Generating triplets ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 85452/85452 [01:00<00:00, 1404.12it/s]
INFO:/home/guoao/anaconda3/envs/specter/lib/python3.7/site-packages/specter-0.0.1-py3.7.egg/specter/data_utils/triplet_sampling.py:Done generating triplets, #successful queries: 0,#skipped queries: 85452
2021-08-20 14:32:01,745,745 INFO [create_training_files.py:365] done getting triplets, success rate:0.00%,total: 0
2021-08-20 14:32:01,746,746 INFO [create_training_files.py:407] converting raw instances to allennlp instances:
0it [00:00, ?it/s]

I then dug into specter/data_utils/triplet_sampling.py and ran TripletGenerator directly to see what happens (since I can't use breakpoints in a multiprocess program). I found that, since there are no hard negatives, the margin here becomes 0.0, which makes candidates_pos an empty list.

If I change the line to if candidates[j][1] >= margin + candidates[-1][1]:, the function works. I don't really understand what margin means, and I'm not sure whether changing the line will affect the generated triplets. Is it safe to do so?
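
To make the behaviour above concrete, here is a minimal toy sketch, not the actual code from triplet_sampling.py. It assumes candidates is a list of (paper_id, score) tuples sorted by descending score, that the original comparison is a strict >, and that margin is some fraction of the gap between the highest and lowest score. The helper select_positives and the value of margin_fraction are hypothetical and used only for illustration.

```python
def select_positives(candidates, margin, inclusive=False):
    """Return the candidates whose score clears the lowest score by `margin`.

    `candidates` is assumed to be sorted by descending score, so
    candidates[-1][1] is the lowest score in the list.
    """
    lowest = candidates[-1][1]
    if inclusive:
        # The modified comparison from this issue (>=).
        return [c for c in candidates if c[1] >= margin + lowest]
    # The assumed original comparison (strict >).
    return [c for c in candidates if c[1] > margin + lowest]


# Toy data with no hard negatives: every candidate has the same score.
candidates = [("p1", 5), ("p2", 5), ("p3", 5)]

margin_fraction = 0.5  # hypothetical value, for illustration only
scores = [score for _, score in candidates]
margin = margin_fraction * (max(scores) - min(scores))  # 0.0 when all scores are equal

strict = select_positives(candidates, margin)
inclusive = select_positives(candidates, margin, inclusive=True)

print("margin:", margin)
print("strict (>):", strict)
print("inclusive (>=):", inclusive)
```

Under these assumptions, the strict comparison selects nothing because no score exceeds the lowest one, while >= selects every candidate, including the lowest-scoring one. So with >= and a margin of 0.0, a selected positive is no longer guaranteed to score strictly higher than the lowest candidate.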

Thanks!
