-
Notifications
You must be signed in to change notification settings - Fork 359
Use Citation Style Language (CSL) in Citation view helper. #1324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
- Derived from work by Mathieu Grimault on VUFIND-889
I was thinking for configuration, something like this in config.ini (Record section) ; This setting controls which citations are available;
; comment all lines to disable citations
; use ini array syntax to activate only selected formats.
; for available options,
; - see the repository: https://github.com/citation-style-language/styles
; - see the Citation Styles website: https://citationstyles.org/authors/
citation_formats[] = apa:APA Citation ; default
citation_formats[] = chicago-annotated-bibliography:Chicago Style Citation ; default
citation_formats[] = modern-language-association:MLA Citation ; default
citation_formats[] = acta-anaesthesiologica-scandinavica:Custom Citation Example |
I (maybe foolishly) think it would be quite simple to add an area where users can enter a CSL to render but the names aren't very straightforward and I think a dropdown or an autocomplete might leave something to be desired... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for moving this forward, @crhallberg -- I've left a few comments on the latest changes. I also merged in the latest dev and resolved composer.json/lock conflicts.
return trim($text, " \n\r\t\v\0,:;/"); | ||
} | ||
|
||
protected function removeDateRange($name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use the existing cleanNameDates function instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cleanNameDates
didn't cover all the cases. I do intend to clean unused functions out when the new config is more integrated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be interested to know what cases aren't covered, because I think this is exactly the intended purpose for cleanNameDates
. Maybe we need to fix that one rather than add a second function.
return preg_replace('/\d+[ ]*\-[ ]*\d*/', '', $name); | ||
} | ||
|
||
protected function nameToGivenFamily($name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this is going to need some more logic added. What if the name is already in "First Last" format? What if there is a suffix like ", Jr." or a title like ", Mrs."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am looking at the other functions in this class, and they don't seem to cover all those cases. Doesn't mean we shouldn't though! I'll see if I can find a more robust parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a lot of the other functions are already assuming names are formatted in a particular way, and most of the name processing is designed to fix non-conformant strings. But if we need to parse out specific elements, it can get complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a composer library: https://github.com/joshfraser/PHP-Name-Parser.
We would have to convert LastName, FirstName
to FirstName LastName
it seems.
Their tests actually have a list of unhandled cases: https://github.com/joshfraser/PHP-Name-Parser/blob/master/tests/FullNameParserTest.php#L559
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the lack of "Last, First" support is disappointing, since we'd have to write our own parser to deal with complex names with multiple components (e.g. "Miller, Alex. McVeigh, Mrs." or "Smith, John, III") before we could use the third-party parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one will likely work better for our purposes: https://github.com/theiconic/name-parser. I've found a few more robust libraries in other languages, but I don't know if it's worth the conversion effort. I see a lot of unit tests in my future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That library does look potentially like a better fit, though the lack of activity in two years is somewhat concerning.
@@ -192,6 +192,47 @@ public function getCitation($format) | |||
return ''; | |||
} | |||
|
|||
protected function trimPunctuation($text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use the existing stripPunctuation method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stripPunctuation
was removing all punctuation when I just wanted to trim it from the ends.
stripPunctuation('J. J. J. Smith \.')
== 'J J J Smith'trimPunctuation('J. J. J. Smith \.')
=> 'J. J. J. Smith'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? The function comment is "Strip unwanted punctuation from the right side of a string." The code doesn't look to me like it should be doing anything other than trimming the end. If anything, it looks like it might not be aggressive enough for some scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you're right. I misread the functionality. We'll keep only one at the end, regardless of which we go with.
@@ -223,56 +262,71 @@ public function getDataCSL() | |||
default: | |||
$item['type'] = 'book'; | |||
} | |||
// subtitle -> shortTile | |||
if (!empty($this->details['subtitle'])) { | |||
$item['shortTitle'] = $this->details['subtitle']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mapping subtitle to short title seems like the wrong thing to do, so I understand why you removed it... but do we need to use subtitle in a different field, or is it not needed at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see support for subtitle in the schema. The schema has a few strange choices in it like that.
[ | ||
'ISSN' => $this->driver->getISSNs(), | ||
'volume' => $this->driver->getContainerIssue(), | ||
'volume-title' => $this->driver->getContainerVolume(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this is right? What if there's a container title, container volume number and container issue number, like "New York Weekly, v. 54, no. 7"? (I can't remember exactly how this is represented in the driver object, but we should double check that we're doing it right here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be worth having a quick one on one and go through the spec to make sure I'm seeing all the options.
These are the confusing/possibly misimplemented bits:
- subtitle
- volume
- dates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might also be a good idea to create an articles.mrc file in the standard tests/data directory along with an articles.mrc.properties file that adds the article-specific indexing rules (see relevant comments in marc_local.properties). This way, we could have some baseline article data in our standard test index to work with... would make integration testing easier as well as comparison between new and old code. Would you like me to try to add something here? I agree that a conversation would be useful, but I think it will be easier when we have realistic test data to look at as we talk.
Also, @crhallberg, I think your configuration suggestion makes sense -- if we add array support to the citation_formats setting it will improve readability if things get complicated. We should probably retain support for a comma-separated string for backward compatibility, though. The other question is whether we retain the legacy options or force use of CSL. We could, in theory, add a rule that if there's a colon in the citation name in the configuration, that means we're using CSL and mapping the CSL name to a human-readable name... and if there's no colon, it's a legacy rule. But maybe maintaining legacy is not worth it. (It would be nice at least in the short term so that we could compare the output of our old code and the new CSL code). Regarding a custom control, I wonder if there's a database of CSL sheets to human-readable names somewhere that we could take advantage of... and in general, whether or not we can automatically provide a long list, we should think about a UI for situations where there are dozens of formats turned on, since that's now theoretically possible. |
I don't think maintaining legacy in this case would be difficult because we have very limited citation formats. I think we could map them behind the scenes. |
One thing that I did which I want feedback on: deprecated With so many formats now supported, it seems insane to try and keep this list up to date. I did add a |
Thanks, @crhallberg, this is interesting! I'm curious why unwanted dates are slipping through in the legacy code (I thought we had handled that pretty well, so I wonder if you've found a new edge case in your test record). I also wonder whether the CSL is for the stated versions of the various standards, since I'm seeing some authorship inconsistencies that may relate to differences between editions. (Susan might have some insight into this). Also curious about the "Accessed" part -- is something set to imply that these are online resources, and if so, should it be? |
I guess the question is whether there's ever a need for some records to support different citation options than other records. It seems to me more likely to be a binary switch -- either a record is worth citing, or it's out of scope for citations -- so maybe a |
Failing because of debug output.
The only case I can think of is a legacy case where the setting is set to I think perhaps I'll remove the new
|
@crhallberg, actually, on reviewing this further, I think maybe we want to keep getSupportedCitationFormats, but add support for a new "true" option which means "all options enabled in config.ini". Then we just change all of the drivers that define an array of options to return true instead of the array, and not much else needs to change. We can still keep the default behavior where citations are disabled in the lowest-level abstract driver but get turned on later in the inheritance chain. Not to say I'm unwilling to consider other changes, but this seems like potentially the least complicated and least disruptive next step. |
I like your option, but it does break the true option in config.ini. True (config) and true (supported) will have no useful info to make a list from. // Legacy: use all supported options.
if ($formatSetting === true || $formatSetting === 'true') {
throw new \Error('"true" option for citation_formats no longer valid.');
} |
@crhallberg, I think the simple solution to the "true" option in config.ini is to simply redefine it as "enable VuFind's default set of citation options" instead of "all supported options." And we can define the default set as ALA/MLA/Chicago based on legacy behavior. |
I'm not sure why my composer.lock keeps getting off base. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the progress, @crhallberg -- a few minor comments and suggestions based on a review of the latest code. Also note that several tests are now failing; the question is whether this exposes any problems or just indicates that tests need to be updated. I haven't done any in-depth analysis yet, but let me know if you need more input!
@@ -1870,17 +1870,20 @@ related[] = "Similar" | |||
|
|||
; This setting controls which citations are available. | |||
; | |||
; Use default formats (APA, MLA, Chicago): | |||
; citation_formats = true | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extreme nitpick: I'd put a semi-colon at the start of this blank line to indicate that the comment continues below; otherwise it seems to me like we're moving on to a different topic.
@@ -343,23 +343,42 @@ public function setExtraDetail($key, $val) | |||
} | |||
|
|||
/** | |||
* Expand deprecated abbreviations to | |||
* @param string $abbr ALA, MLA, Chicago, Vancouver | |||
* Check for an expand deprecated abbreviations to library format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: should be "Check for and expand..."
); | ||
// Trim and convert legacy to 9.x format | ||
foreach ($formatSetting as $i => $format) { | ||
$formatSetting[$i] = trim($this->expandLegacyAbbreviation($format)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make sense to create an anonymous function like:
$formatCleanup = function ($format) { return trim($this->expandLegacyAbbreviation($format)); }
...then you can replace the foreach loops here and below with array_map
calls, which might make the code a little more concise (as well as ensuring that format reformatting is always done consistently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could also make the trim part of the expandLegacyAbbreviation function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, that might be simpler, and then you could array_map with the [$this, 'expandLegacyAbbreviation']
callback.
Any time there is an upstream dependency change, it's going to cause a composer.lock conflict. To resolve it, just merge upstream dev, delete composer.json, run |
A question regarding addSupportedCitationFormats: |
@mtrojan-ub, do you mean getSupportedCitationFormats rather than addSupportedCitationFormats? (Just want to be sure we're talking about the same thing). My thought is that, at the record driver level, it may still be useful to implement a mechanism that can say "no citations are supported" or "all citations are supported" or "some specific formats are supported." However, in core code, I would expect this to be "all" or "nothing." I would reserve the "some formats" option for local customization, where people may have very specific requirements for specific types of records. More commonly, I would expect the user to control available options via config.ini, and for any records of roughly bibliographic nature to support all the active options. Does that make any sense? |
Yeah that makes sense, thanks for the explanation! |
Drivers wouldn't need to know how to format the citation, but it would be up to the driver developer to know "we can't get publication dates so there's a smaller subset of formats we can support". |
{ | ||
"type": "package", | ||
"package": { | ||
"name": "citation-style-language/styles-distribution", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anyone know why this repository hasn't updated in almost two years, while the repository it is derived from (https://github.com/citation-style-language/styles) seems to update very frequently? I just want to be sure we're not locking ourselves into something that's outdated/unsupported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, an answer -- see seboettg/citeproc-php#123 -- I think we need to make updates here.
@@ -716,7 +716,7 @@ Message From Sender = "Message From Sender" | |||
Metadata Prefix = "Metadata Prefix" | |||
Microfilm = "Microfilm" | |||
MLA Citation = "MLA Citation" | |||
MLA Edition Citation = "MLA (8th ed.) Citation" | |||
MLA Edition Citation = "MLA (9th ed.) Citation" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we really change editions? If so, we need to update all the other languages too.
@@ -72,6 +100,7 @@ | |||
"phing/phing": "2.17.3", | |||
"ppito/laminas-whoops": "2.2.0", | |||
"scssphp/scssphp": "1.6.0", | |||
"seboettg/citeproc-php": "2.1.6", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite an old version now, but doing a straight upgrade breaks everything -- we'll have to investigate why.
@@ -263,16 +265,215 @@ function (string $value) use ($callables): string { | |||
*/ | |||
public function getCitation($format) | |||
{ | |||
$data = $this->getDataCSL(); | |||
|
|||
$locale = 'en'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we default to the default locale from config.ini instead of always assuming 'en'?
@crhallberg / @mtrojan-ub, I've merged the latest dev code into this branch and reviewed the state of this PR in preparation for today's Community Call. There is still a lot of work to do here -- I've added a few new comments and some TODO checkboxes to help keep track of outstanding issues. |
$locale = 'en'; | ||
|
||
try { | ||
// will fail during unit tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this try-catch only exists to work around a testing issue, maybe we need to either do better mocking in the test, or a more targeted check for when we should skip this logic.
Merged with dev but tests are still breaking (as noted in the summary). |
Quick update to note that I've once again merged dev into this branch and resolved conflicts. There's still a lot of work that needs to be done, as reflected by the TODO list above. Unfortunately, I don't currently have the bandwidth to move this forward. |
The TODO section is very clear but I'm in the same boat here. |
Initial proof of concept derived from work by Mathieu Grimault on VUFIND-889.
TODO: