`cleanDOI` accepts malformed DOI prefix; general behaviour when input doesn't match #24

https://github.com/zotero/utilities/blob/b93f16dba483891c0ab4627cbaa303de5c7fa0c0/utilities.js#L415-L426

Because the `(?:\.[0-9]{4,})?` group is optional, the regular expression will match a string like "10/example" or "10/10" (as in ten out of ten, great, A+), which is not a valid DOI (the prefix does not start with `10.`).

In general, `cleanDOI()` returns the first matching *substring* if any. I'm not sure whether this is the intended behaviour that we expect the caller to depend on. Its doc says "Strip info:doi prefix and any suffixes from a DOI", which isn't very clear to me.
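To make the current behaviour concrete, here is a small standalone sketch. It copies the regular expression from the linked snippet into a local stand-in for `cleanDOI` so it can run outside Zotero; the sample inputs are made up for illustration.

```javascript
// Pattern copied from the cleanDOI snippet linked in this issue.
const CURRENT_DOI_RE = /10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/;

// Local stand-in for the utility function, so the example is self-contained.
function cleanDOI(x) {
	if (typeof x != "string") {
		throw new Error("cleanDOI: argument must be a string");
	}
	const doi = x.match(CURRENT_DOI_RE);
	return doi ? doi[0] : null;
}

// Illustrative inputs (not taken from any translator).
const samples = [
	"info:doi:10.1000/182",             // prefix stripped
	"See https://doi.org/10.1000/182.", // URL and trailing period stripped
	"I'd rate it 10/10, great",         // not a DOI, but still matches
	"no DOI here",                      // no match at all
];

const results = samples.map(cleanDOI);
for (let i = 0; i < samples.length; i++) {
	console.log(JSON.stringify(samples[i]), "->", JSON.stringify(results[i]));
}
```

The first two inputs yield `"10.1000/182"`, the third yields `"10/10"`, and the last yields `null`. So the function behaves as "find the first DOI-looking substring", not as "validate and normalise a DOI".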

In the translators repository, there are currently 36 translators making use of `cleanDOI`. I haven't checked the other Zotero components. I think we need to check how many of those calls depend on the current behaviour, before we go on to improve the regexp here.
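For reference while auditing those callers, the smallest tightening I can think of is to make the `10.` plus digits part mandatory. This is only a sketch under that assumption, not a proposed patch: `strictCleanDOI` is a hypothetical name, and the pattern makes no attempt to handle any other prefix shape.

```javascript
// Same pattern as the current one, except the ".NNNN" group is no longer optional.
const STRICT_DOI_RE = /10\.[0-9]{4,}\/[^\s]*[^\s\.,]/;

// Hypothetical stricter variant; the name is for illustration only.
function strictCleanDOI(x) {
	if (typeof x != "string") {
		throw new Error("strictCleanDOI: argument must be a string");
	}
	const doi = x.match(STRICT_DOI_RE);
	return doi ? doi[0] : null;
}

console.log(strictCleanDOI("Score: 10/10"));      // no longer matches
console.log(strictCleanDOI("doi:10.1000/182."));  // still extracts the DOI
```

Any caller that currently relies on getting a value back for input without a `10.NNNN` prefix would start receiving `null` with a change like this, which is why the audit has to come first.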

---

### My own thoughts

In general, the best that any code here could realistically do is to provide some reasonable baseline accuracy. There are bound to be false positives and false negatives when trying to "parse" or clean a DOI, because DOIs don't obey a grammar. The DOI spec basically says they can be anything (printable Unicode graphic characters). A DOI is explicitly said to be an opaque string, and nothing is supposed to be deduced from features of the string alone.

So for a general-purpose utility function like this, we need to set realistic expectations and clearly document exactly what it does. We may have to leave it to the caller (especially translators) to implement their own further processing. The reason is that, given the opaque and arbitrary nature of DOIs and the diverse ways they can appear, the best improvements should come not from the sophistication of a generic filter but from domain-specific knowledge, which is what goes into the individual translators.
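As a rough illustration of what I mean by caller-side processing, here is a sketch of a translator that knows its site appends view suffixes to DOI URLs. Everything site-specific here is hypothetical (the suffix list, the URL, and the helper name `doiFromArticleURL`); `cleanDOI` is a local copy of the current behaviour so the example runs on its own.

```javascript
// Local copy of the current generic behaviour.
function cleanDOI(x) {
	if (typeof x != "string") {
		throw new Error("cleanDOI: argument must be a string");
	}
	const doi = x.match(/10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/);
	return doi ? doi[0] : null;
}

// Hypothetical site knowledge: this publisher appends these to DOI URLs.
const SITE_SUFFIXES = ["/abstract", "/full", "/pdf"];

// Hypothetical translator helper: generic extraction first, then a
// site-specific clean-up the generic function cannot know about.
function doiFromArticleURL(url) {
	let doi = cleanDOI(url);
	if (!doi) return null;
	for (const suffix of SITE_SUFFIXES) {
		if (doi.endsWith(suffix)) {
			doi = doi.slice(0, -suffix.length);
			break;
		}
	}
	return doi;
}

console.log(doiFromArticleURL("https://example.org/doi/10.1234/abc.def/full"));
```

The generic function alone returns `10.1234/abc.def/full` here, since `/full` is a perfectly legal part of a DOI suffix. Only the translator knows that on this site it is a view selector and can be dropped.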

But we need to check how many of the translators calling `cleanDOI` agree with this...

	/**
	 * Strip info:doi prefix and any suffixes from a DOI
	 * @type String
	 */
	cleanDOI: function(/**String*/ x) {
		if(typeof(x) != "string") {
			throw new Error("cleanDOI: argument must be a string");
		}

		var doi = x.match(/10(?:\.[0-9]{4,})?\/[^\s]*[^\s\.,]/);
		return doi ? doi[0] : null;
	},
