cleanDOI accepts malformed DOI prefix; general behaviour when input doesn't match #24

@zoe-translates


utilities/utilities.js

Lines 415 to 426 in b93f16d

/**
 * Strip info:doi prefix and any suffixes from a DOI
 * @type String
 */
cleanDOI: function(/**String**/ x) {
    if(typeof(x) != "string") {
        throw new Error("cleanDOI: argument must be a string");
    }
    var doi = x.match(/10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/);
    return doi ? doi[0] : null;
},

The regular expression will match a string like "10/example" or "10/10" (as in ten out of ten, great, A+), which is not a valid DOI (the prefix does not start with "10.").
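
A quick way to confirm this is to run the pattern on its own. The regular expression below is copied verbatim from the quoted snippet; the name `DOI_RE` and the sample inputs are only illustrative and do not come from utilities.js.

```javascript
// Regular expression copied verbatim from cleanDOI() in the quoted snippet.
// DOI_RE is an illustrative name, not something defined in utilities.js.
const DOI_RE = /10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/;

// The "(?:\.[0-9]{4,})?" group is optional, so a bare "10/" is enough to match.
const samples = ["10/example", "10/10", "score: 10/10, great", "10.1000/182"];

const results = samples.map((s) => {
  const m = s.match(DOI_RE);
  return m ? m[0] : null;
});

console.log(results);
```

The first three inputs all produce a match even though none of them contains a "10." prefix; only the last one looks like a real DOI.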

In general, cleanDOI() returns the first matching substring if any. I'm not sure whether this is the intended behaviour that we expect the caller to depend on. Its doc says "Strip info:doi prefix and any suffixes from a DOI", which isn't very clear to me.
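
To make the current behaviour concrete, here is a minimal sketch that wraps the same logic in a stand-alone function so it can run outside Zotero. The inputs are invented for illustration.

```javascript
// Stand-alone copy of the cleanDOI() logic from the quoted snippet,
// so that this runs outside Zotero.
function cleanDOI(x) {
  if (typeof x != "string") {
    throw new Error("cleanDOI: argument must be a string");
  }
  const doi = x.match(/10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/);
  return doi ? doi[0] : null;
}

// Leading text (including an "info:doi/" prefix) is skipped, not validated,
// and a trailing "." is dropped.
const stripped = cleanDOI("info:doi/10.1000/182.");

// With two candidates in the input, only the first one is returned.
const first = cleanDOI("See 10.1000/aaa and 10.2000/bbb");

// No candidate at all yields null rather than the input or an exception.
const none = cleanDOI("no identifier here");

console.log(stripped, first, none);
```

So the function behaves more like "extract the first DOI-looking substring" than "strip a prefix and suffix", and callers may be relying on either reading.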

In the translators repository, there are currently 36 translators making use of cleanDOI. I haven't checked the other Zotero components. I think we need to check how many of those calls depend on the current behaviour, before we go on to improve the regexp here.
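
One rough way to start that audit is a purely textual scan for call sites. The sketch below is an assumption about how one might do it, not an existing tool: `findCleanDOICalls` is a hypothetical helper, and the translator fragment is made up just to exercise it.

```javascript
// Hypothetical helper: list the 1-based line numbers in one translator's
// source text that call cleanDOI(). Purely textual; it does not parse JavaScript.
function findCleanDOICalls(source) {
  const hits = [];
  source.split(/\r?\n/).forEach((line, i) => {
    if (/\bcleanDOI\s*\(/.test(line)) {
      hits.push({ line: i + 1, text: line.trim() });
    }
  });
  return hits;
}

// Made-up translator fragment used only to exercise the helper.
const sampleSource = [
  'var doi = ZU.xpathText(doc, "//meta[@name=\'citation_doi\']/@content");',
  "if (doi) item.DOI = ZU.cleanDOI(doi);",
  "item.complete();",
].join("\n");

const calls = findCleanDOICalls(sampleSource);
console.log(calls);
```

In practice one would feed each translator file's contents to the helper. Every hit still needs a manual read to see whether the caller depends on the null return or on the first-match extraction.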


My own thoughts

In general, the best that any code here could realistically do is to provide some reasonable baseline accuracy. There are bound to be false positives and false negatives when trying to "parse" or clean a DOI, because DOIs don't obey a grammar. The DOI spec basically says a DOI could be anything (printable Unicode graphic characters). It is explicitly said to be an opaque string, and nothing is supposed to be deduced from features of the string alone.
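
Two invented inputs illustrate how the current pattern can err in both directions. This is only a sketch: `DOI_RE` and `clean` are illustrative names, and whether a particular DOI really ends in a period cannot be known from the string alone.

```javascript
// Same pattern as in the quoted cleanDOI() snippet; names are illustrative.
const DOI_RE = /10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/;
const clean = (s) => {
  const m = s.match(DOI_RE);
  return m ? m[0] : null;
};

// Over-capture: the closing parenthesis belongs to the surrounding text,
// but the pattern only trims trailing "." and ",", so ")" is kept.
const overCaptured = clean("(doi:10.1000/abc)");

// Under-capture: if a DOI really did end in ".", the pattern would drop it,
// because a trailing "." is assumed to be sentence punctuation.
const underCaptured = clean("10.1000/abc.");

console.log(overCaptured, underCaptured);
```

Trimming more characters would fix the first case but make the second kind of error more likely, which is the trade-off a generic filter cannot escape.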

So for a general-purpose utility function like this, we need to set realistic expectations and clearly document exactly what it does. We may have to leave it to the caller (especially translators) to implement their own further processing. The reason is that, given the opaque and arbitrary nature of DOIs and the diverse ways in which they can appear, the best improvements should come not from the sophistication of a generic filter but from domain-specific knowledge, which is what goes into the individual translators.
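
As a sketch of what caller-side processing could look like, suppose a translator knows which registrant prefix its site uses. `cleanDOIWithPrefix` is a hypothetical wrapper and "10.1234" is a made-up prefix; neither exists in utilities.js.

```javascript
// Stand-alone copy of the cleanDOI() logic from the quoted snippet.
function cleanDOI(x) {
  if (typeof x != "string") {
    throw new Error("cleanDOI: argument must be a string");
  }
  const doi = x.match(/10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/);
  return doi ? doi[0] : null;
}

// Hypothetical translator-side wrapper: the translator knows the prefix its
// site uses, so it can reject any candidate that does not carry it.
function cleanDOIWithPrefix(raw, expectedPrefix) {
  const doi = cleanDOI(raw);
  if (doi === null) return null;
  return doi.startsWith(expectedPrefix + "/") ? doi : null;
}

// "10.1234" is a made-up prefix used only for this example.
const accepted = cleanDOIWithPrefix("doi:10.1234/xyz.", "10.1234");
const rejected = cleanDOIWithPrefix("rated 10/10", "10.1234");

console.log(accepted, rejected);
```

The generic function stays simple, and the site-specific check catches the "10/10" case without cleanDOI() having to guess what a DOI looks like. Note that this wrapper still inspects only the first candidate that cleanDOI() finds.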

But we need to check how many of the translators calling cleanDOI agree with this...
