Skip to content

Use Citation Style Language (CSL) in Citation view helper. #1324

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 23 commits into
base: dev
Choose a base branch
from

Conversation

demiankatz
Copy link
Member

@demiankatz demiankatz commented Feb 25, 2019

Initial proof of concept derived from work by Mathieu Grimault on VUFIND-889.

TODO:

  • Get authors working correctly.
  • Clean up trailing punctuation issues in titles and publishers (and elsewhere, if necessary)
  • Determine the best way to dynamically determine locale (currently hard-coded)
  • Determine the best way to allow configuration of arbitrary CSL stylesheets (currently Vancouver is hard-coded as an example)
  • Update to latest release of seboettg/citeproc-php (and make sure styles are being pulled from an appropriate repo)
  • Add test records and custom import properties to fully exercise all the fields involved in citations, for easier/more comprehensive testing. This should include not just monographic items but also journal articles that utilize VuFind's "container" fields.
  • Update test class
  • Remove unused functions
  • Remove debug display
  • Resolve VUFIND-889 when finished

- Derived from work by Mathieu Grimault on VUFIND-889
@demiankatz demiankatz changed the base branch from master to dev August 3, 2020 14:47
@crhallberg
Copy link
Contributor

crhallberg commented Dec 7, 2021

I was thinking for configuration, something like this in config.ini (Record section)

; This setting controls which citations are available; 
; comment all lines to disable citations

; use ini array syntax to activate only selected formats.
; for available options, 
; - see the repository: https://github.com/citation-style-language/styles 
; - see the Citation Styles website: https://citationstyles.org/authors/
citation_formats[] = apa:APA Citation                                      ; default
citation_formats[] = chicago-annotated-bibliography:Chicago Style Citation ; default
citation_formats[] = modern-language-association:MLA Citation              ; default
citation_formats[] = acta-anaesthesiologica-scandinavica:Custom Citation Example

@crhallberg
Copy link
Contributor

I (maybe foolishly) think it would be quite simple to add an area where users can enter a CSL to render but the names aren't very straightforward and I think a dropdown or an autocomplete might leave something to be desired...

Copy link
Member Author

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for moving this forward, @crhallberg -- I've left a few comments on the latest changes. I also merged in the latest dev and resolved composer.json/lock conflicts.

return trim($text, " \n\r\t\v\0,:;/");
}

protected function removeDateRange($name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the existing cleanNameDates function instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanNameDates didn't cover all the cases. I do intend to clean unused functions out when the new config is more integrated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be interested to know what cases aren't covered, because I think this is exactly the intended purpose for cleanNameDates. Maybe we need to fix that one rather than add a second function.

return preg_replace('/\d+[ ]*\-[ ]*\d*/', '', $name);
}

protected function nameToGivenFamily($name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this is going to need some more logic added. What if the name is already in "First Last" format? What if there is a suffix like ", Jr." or a title like ", Mrs."?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am looking at the other functions in this class, and they don't seem to cover all those cases. Doesn't mean we shouldn't though! I'll see if I can find a more robust parser.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a lot of the other functions are already assuming names are formatted in a particular way, and most of the name processing is designed to fix non-conformant strings. But if we need to parse out specific elements, it can get complicated.

Copy link
Contributor

@crhallberg crhallberg Dec 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a composer library: https://github.com/joshfraser/PHP-Name-Parser.

We would have to convert LastName, FirstName to FirstName LastName it seems.

Their tests actually have a list of unhandled cases: https://github.com/joshfraser/PHP-Name-Parser/blob/master/tests/FullNameParserTest.php#L559

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the lack of "Last, First" support is disappointing, since we'd have to write our own parser to deal with complex names with multiple components (e.g. "Miller, Alex. McVeigh, Mrs." or "Smith, John, III") before we could use the third-party parser.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one will likely work better for our purposes: https://github.com/theiconic/name-parser. I've found a few more robust libraries in other languages, but I don't know if it's worth the conversion effort. I see a lot of unit tests in my future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That library does look potentially like a better fit, though the lack of activity in two years is somewhat concerning.

@@ -192,6 +192,47 @@ public function getCitation($format)
return '';
}

protected function trimPunctuation($text)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the existing stripPunctuation method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stripPunctuation was removing all punctuation when I just wanted to trim it from the ends.

  • stripPunctuation('J. J. J. Smith \.') == 'J J J Smith'
  • trimPunctuation('J. J. J. Smith \.') => 'J. J. J. Smith'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? The function comment is "Strip unwanted punctuation from the right side of a string." The code doesn't look to me like it should be doing anything other than trimming the end. If anything, it looks like it might not be aggressive enough for some scenarios.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you're right. I misread the functionality. We'll keep only one at the end, regardless of which we go with.

@@ -223,56 +262,71 @@ public function getDataCSL()
default:
$item['type'] = 'book';
}
// subtitle -> shortTile
if (!empty($this->details['subtitle'])) {
$item['shortTitle'] = $this->details['subtitle'];
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mapping subtitle to short title seems like the wrong thing to do, so I understand why you removed it... but do we need to use subtitle in a different field, or is it not needed at all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see support for subtitle in the schema. The schema has a few strange choices in it like that.

[
'ISSN' => $this->driver->getISSNs(),
'volume' => $this->driver->getContainerIssue(),
'volume-title' => $this->driver->getContainerVolume(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this is right? What if there's a container title, container volume number and container issue number, like "New York Weekly, v. 54, no. 7"? (I can't remember exactly how this is represented in the driver object, but we should double check that we're doing it right here).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be worth having a quick one on one and go through the spec to make sure I'm seeing all the options.

These are the confusing/possibly misimplemented bits:

  • subtitle
  • volume
  • dates

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be a good idea to create an articles.mrc file in the standard tests/data directory along with an articles.mrc.properties file that adds the article-specific indexing rules (see relevant comments in marc_local.properties). This way, we could have some baseline article data in our standard test index to work with... would make integration testing easier as well as comparison between new and old code. Would you like me to try to add something here? I agree that a conversation would be useful, but I think it will be easier when we have realistic test data to look at as we talk.

@demiankatz
Copy link
Member Author

Also, @crhallberg, I think your configuration suggestion makes sense -- if we add array support to the citation_formats setting it will improve readability if things get complicated. We should probably retain support for a comma-separated string for backward compatibility, though. The other question is whether we retain the legacy options or force use of CSL. We could, in theory, add a rule that if there's a colon in the citation name in the configuration, that means we're using CSL and mapping the CSL name to a human-readable name... and if there's no colon, it's a legacy rule. But maybe maintaining legacy is not worth it. (It would be nice at least in the short term so that we could compare the output of our old code and the new CSL code).

Regarding a custom control, I wonder if there's a database of CSL sheets to human-readable names somewhere that we could take advantage of... and in general, whether or not we can automatically provide a long list, we should think about a UI for situations where there are dozens of formats turned on, since that's now theoretically possible.

@crhallberg
Copy link
Contributor

I don't think maintaining legacy in this case would be difficult because we have very limited citation formats. I think we could map them behind the scenes.

@crhallberg
Copy link
Contributor

I converted to the new config format and currently have a side-by-side comparison to make sure we're using the style spec correctly:
image

@crhallberg
Copy link
Contributor

One thing that I did which I want feedback on: deprecated getSupportedCitationFormats.

With so many formats now supported, it seems insane to try and keep this list up to date. I did add a getDefaultCitationFormats. That way we can use that information as a default to support citation_formats = true.

@demiankatz
Copy link
Member Author

Thanks, @crhallberg, this is interesting! I'm curious why unwanted dates are slipping through in the legacy code (I thought we had handled that pretty well, so I wonder if you've found a new edge case in your test record). I also wonder whether the CSL is for the stated versions of the various standards, since I'm seeing some authorship inconsistencies that may relate to differences between editions. (Susan might have some insight into this). Also curious about the "Accessed" part -- is something set to imply that these are online resources, and if so, should it be?

@demiankatz
Copy link
Member Author

One thing that I did which I want feedback on: deprecated getSupportedCitationFormats.

With so many formats now supported, it seems insane to try and keep this list up to date. I did add a getDefaultCitationFormats. That way we can use that information as a default to support citation_formats = true.

I guess the question is whether there's ever a need for some records to support different citation options than other records. It seems to me more likely to be a binary switch -- either a record is worth citing, or it's out of scope for citations -- so maybe a supportsCitations() method would be a more sensible successor to getSupportedCitationFormats(). But maybe there are use cases I'm not thinking of where greater granularity is needed. I definitely agree that we don't want to hard-code a long list of citation options into record drivers, but I'd have to do a deeper study to come up with a stronger opinion on how best to move forward.

Failing because of debug output.
@crhallberg
Copy link
Contributor

The only case I can think of is a legacy case where the setting is set to true.

I think perhaps I'll remove the new getDefaultCitationFormats and just fall back to getSupportedCitationFormats() when true is encountered.

supportsCitations would definitely be useful for drivers that can't provide the basic data. Default to true in AbstractBase?

@demiankatz
Copy link
Member Author

@crhallberg, actually, on reviewing this further, I think maybe we want to keep getSupportedCitationFormats, but add support for a new "true" option which means "all options enabled in config.ini". Then we just change all of the drivers that define an array of options to return true instead of the array, and not much else needs to change. We can still keep the default behavior where citations are disabled in the lowest-level abstract driver but get turned on later in the inheritance chain. Not to say I'm unwilling to consider other changes, but this seems like potentially the least complicated and least disruptive next step.

@crhallberg
Copy link
Contributor

I like your option, but it does break the true option in config.ini. True (config) and true (supported) will have no useful info to make a list from.

// Legacy: use all supported options.
if ($formatSetting === true || $formatSetting === 'true') {
    throw new \Error('"true" option for citation_formats no longer valid.');
}

@demiankatz
Copy link
Member Author

@crhallberg, I think the simple solution to the "true" option in config.ini is to simply redefine it as "enable VuFind's default set of citation options" instead of "all supported options." And we can define the default set as ALA/MLA/Chicago based on legacy behavior.

@crhallberg
Copy link
Contributor

I'm not sure why my composer.lock keeps getting off base.

Copy link
Member Author

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the progress, @crhallberg -- a few minor comments and suggestions based on a review of the latest code. Also note that several tests are now failing; the question is whether this exposes any problems or just indicates that tests need to be updated. I haven't done any in-depth analysis yet, but let me know if you need more input!

@@ -1870,17 +1870,20 @@ related[] = "Similar"

; This setting controls which citations are available.
;
; Use default formats (APA, MLA, Chicago):
; citation_formats = true

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extreme nitpick: I'd put a semi-colon at the start of this blank line to indicate that the comment continues below; otherwise it seems to me like we're moving on to a different topic.

@@ -343,23 +343,42 @@ public function setExtraDetail($key, $val)
}

/**
* Expand deprecated abbreviations to
* @param string $abbr ALA, MLA, Chicago, Vancouver
* Check for an expand deprecated abbreviations to library format.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: should be "Check for and expand..."

);
// Trim and convert legacy to 9.x format
foreach ($formatSetting as $i => $format) {
$formatSetting[$i] = trim($this->expandLegacyAbbreviation($format));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to create an anonymous function like:

$formatCleanup = function ($format) {
    return trim($this->expandLegacyAbbreviation($format));
}

...then you can replace the foreach loops here and below with array_map calls, which might make the code a little more concise (as well as ensuring that format reformatting is always done consistently).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could also make the trim part of the expandLegacyAbbreviation function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, that might be simpler, and then you could array_map with the [$this, 'expandLegacyAbbreviation'] callback.

@demiankatz
Copy link
Member Author

I'm not sure why my composer.lock keeps getting off base.

Any time there is an upstream dependency change, it's going to cause a composer.lock conflict. To resolve it, just merge upstream dev, delete composer.json, run composer install and commit the new composer.lock.

@mtrojan-ub
Copy link
Contributor

A question regarding addSupportedCitationFormats:
Is it still viable to have a capability check like this in the RecordDriver? Due to the huge variety of possible CSL templates, it might not be possible to have application logic for all of them. Would it be better to actually execute a template to see whether its possible, or to deprecate this kind of capability checking?

@demiankatz
Copy link
Member Author

@mtrojan-ub, do you mean getSupportedCitationFormats rather than addSupportedCitationFormats? (Just want to be sure we're talking about the same thing).

My thought is that, at the record driver level, it may still be useful to implement a mechanism that can say "no citations are supported" or "all citations are supported" or "some specific formats are supported." However, in core code, I would expect this to be "all" or "nothing." I would reserve the "some formats" option for local customization, where people may have very specific requirements for specific types of records. More commonly, I would expect the user to control available options via config.ini, and for any records of roughly bibliographic nature to support all the active options.

Does that make any sense?

@mtrojan-ub
Copy link
Contributor

Yeah that makes sense, thanks for the explanation!

@crhallberg
Copy link
Contributor

Drivers wouldn't need to know how to format the citation, but it would be up to the driver developer to know "we can't get publication dates so there's a smaller subset of formats we can support".

{
"type": "package",
"package": {
"name": "citation-style-language/styles-distribution",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anyone know why this repository hasn't updated in almost two years, while the repository it is derived from (https://github.com/citation-style-language/styles) seems to update very frequently? I just want to be sure we're not locking ourselves into something that's outdated/unsupported.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, an answer -- see seboettg/citeproc-php#123 -- I think we need to make updates here.

@@ -716,7 +716,7 @@ Message From Sender = "Message From Sender"
Metadata Prefix = "Metadata Prefix"
Microfilm = "Microfilm"
MLA Citation = "MLA Citation"
MLA Edition Citation = "MLA (8th ed.) Citation"
MLA Edition Citation = "MLA (9th ed.) Citation"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we really change editions? If so, we need to update all the other languages too.

@@ -72,6 +100,7 @@
"phing/phing": "2.17.3",
"ppito/laminas-whoops": "2.2.0",
"scssphp/scssphp": "1.6.0",
"seboettg/citeproc-php": "2.1.6",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite an old version now, but doing a straight upgrade breaks everything -- we'll have to investigate why.

@@ -263,16 +265,215 @@ function (string $value) use ($callables): string {
*/
public function getCitation($format)
{
$data = $this->getDataCSL();

$locale = 'en';
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we default to the default locale from config.ini instead of always assuming 'en'?

@demiankatz
Copy link
Member Author

@crhallberg / @mtrojan-ub, I've merged the latest dev code into this branch and reviewed the state of this PR in preparation for today's Community Call. There is still a lot of work to do here -- I've added a few new comments and some TODO checkboxes to help keep track of outstanding issues.

$locale = 'en';

try {
// will fail during unit tests
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this try-catch only exists to work around a testing issue, maybe we need to either do better mocking in the test, or a more targeted check for when we should skip this logic.

@crhallberg
Copy link
Contributor

Merged with dev but tests are still breaking (as noted in the summary).

@demiankatz
Copy link
Member Author

Quick update to note that I've once again merged dev into this branch and resolved conflicts. There's still a lot of work that needs to be done, as reflected by the TODO list above. Unfortunately, I don't currently have the bandwidth to move this forward.

@crhallberg
Copy link
Contributor

The TODO section is very clear but I'm in the same boat here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants