Description
The dataset contains many duplicated papers. In a sample of ~100 mentions, we found 15 duplicated papers, that is, papers that have different DOIs but are actually the same paper in different revision states or on reviewer discussion pages. For example, the following 6 DOIs point to exactly the same paper: the first one is the final published version, and the other five are the revision and the reviewers' and authors' comments on that revision, all linking to the same PDF. As a result, every package mention is counted 6 times for one paper:
http://doi.org/10.5194/esd-12-621-2021
http://doi.org/10.5194/esd-2021-5
http://doi.org/10.5194/esd-2021-5-rc2
http://doi.org/10.5194/esd-2021-5-rc1
http://doi.org/10.5194/esd-2021-5-ac2
http://doi.org/10.5194/esd-2021-5-ac1
And this is just one of many examples. I know that this is a DOI-related problem that you might not be able to account for, but did you try to handle it in your dataset, and is there a way to filter all these duplicates out? I see that this is far from trivial, but I am curious whether you are aware of the problem and have any suggestions for how we could deal with this and similar duplicates. (This might also apply to different versions of the same paper uploaded to arXiv, which could even have different titles while still being versions of the same paper.)
I can also provide a second example: all four of the following DOIs are different review comments on the same paper. (The paper itself is missing from this example.)
http://doi.org/10.7287/peerj-cs.544v0.2/reviews/2
http://doi.org/10.7287/peerj-cs.544v0.2/reviews/1
http://doi.org/10.7287/peerj-cs.544v0.1/reviews/1
http://doi.org/10.7287/peerj-cs.544v0.1/reviews/2
And here is a third example - again, three DOIs that belong to the same paper:
http://doi.org/10.2139/ssrn.3132206
http://doi.org/10.2139/ssrn.3132204
http://doi.org/10.2139/ssrn.3135767
And here is a fourth example:
http://doi.org/10.7287/peerj-cs.175v0.2/reviews/3
http://doi.org/10.7287/peerj-cs.175v0.1/reviews/1
http://doi.org/10.7287/peerj-cs.175v0.1/reviews/2
And yet another example: The first one is the actual paper, the second one a download page containing the same PDF together with a discussion:
http://doi.org/10.5194/gmd-11-2475-2018
http://doi.org/10.5194/gmd-2018-15
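For the same-publisher cases above, some of the duplicates could probably be collapsed with a simple pattern-based heuristic. The following is only a minimal sketch: the suffix patterns (`-rcN`/`-acN` for the `10.5194` DOIs, `vX.Y/reviews/N` for the `10.7287` DOIs) are assumptions inferred solely from the DOIs listed in this issue, and `canonical_key` and `group_duplicates` are hypothetical helper names, not part of the dataset tooling.

```python
import re
from collections import defaultdict

# Assumed patterns, inferred only from the example DOIs in this issue.
_DOI_PREFIX = re.compile(r"^https?://(?:dx\.)?doi\.org/", re.IGNORECASE)
_COMMENT_SUFFIX = re.compile(r"-(?:rc|ac)\d+$")          # e.g. "-rc1", "-ac2"
_REVIEW_SUFFIX = re.compile(r"v\d+\.\d+/reviews/\d+$")   # e.g. "v0.2/reviews/1"


def canonical_key(doi_url: str) -> str:
    """Map a DOI URL to a key shared by its review/comment variants."""
    doi = _DOI_PREFIX.sub("", doi_url.strip()).lower()
    if doi.startswith("10.5194/"):
        # Strip referee-comment / author-comment suffixes.
        doi = _COMMENT_SUFFIX.sub("", doi)
    elif doi.startswith("10.7287/"):
        # Strip the version and review number, keeping the article stem.
        doi = _REVIEW_SUFFIX.sub("", doi)
    return doi


def group_duplicates(doi_urls):
    """Group DOI URLs that share the same canonical key."""
    groups = defaultdict(list)
    for url in doi_urls:
        groups[canonical_key(url)].append(url)
    return dict(groups)
```

Applied to the first example, this would merge the five `esd-2021-5*` DOIs into one group, but the final DOI `10.5194/esd-12-621-2021` would still end up in a separate group, because nothing in the DOI string ties it to the discussion version. The SSRN example is not covered at all, since those DOIs share no visible pattern.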
And two more examples, which are different in kind: in the examples above, all DOIs belong to the same paper at the same publisher, whereas here the paper is published twice, once as a preprint and once as the actual paper on the journal's website:
http://doi.org/10.1007/s10055-022-00685-9
http://doi.org/10.21203/rs.3.rs-1361876/v1
http://doi.org/10.1162/qss_a_00245
http://doi.org/10.31222/osf.io/w5szk
So, I guess there is no way to link preprint and paper?