Skip to content

Conversation

@SomeoneInParticular
Copy link
Collaborator

This PR is mainly to resolve a number of potential issues and code quality concerns in the current state of the repository. In particular, this PR aims to fix the following (among other minor fixes):

  • Over-reliance on the os package, which can be unsafe (or, at least, less safe than pathlib for what its being used for)
  • Output fails to work as expected, often crashing out in common use cases (i.e. relative output paths, see Designating where to place output fails in many cases #26)
  • Many hard-coded variables in the code itself, rather than in separate configuration files, making the code overly specific to only out use-case
  • Inefficient use of Pandas utilities (making code take much longer than it could, or being harder to debug/generalize)
  • Overzealous imports, which could result in global variable pollution (and by extension, the potential for nightmare bugs)

Once this pass is completed, the next step will be implementing automated testing to help ensure that further refactors/tests don't break existing functionality (which I am currently testing manually). Cheers!

kalum.ost added 20 commits January 14, 2025 17:41
…ion types; it will now process all available types when this occurs.
…io.test.docx") and be placed in directories that do not initially exist at runtime, through the power of PathLib!
…for the NeuroPoly wiki) is now placed in the same place as everything else.
…ueries doesn't exist; it will now just notify the user with a warning instead + fix for user requesting 0 publication types
…ollution in preparation for future refactoring
…now you can mix both files and raw identifiers if you want! This is part of an overall CLI refactor, to preparation for automated testing.
…tions can be called *without* needing a shell to call them!
…(Google Doc or local) into its own CLI interface
@SomeoneInParticular
Copy link
Collaborator Author

SomeoneInParticular commented Jan 21, 2025

Change Log:

New: Custom CLI added for formatting the contents of a bibliography.

  • This has NOT replaced the original CLI used previously, though it does replicate a large amount of its functionality. The old CLI will be deprecated (and eventually removes) once all of its functionality has been replicated in other CLI managers.
  • Accessible via calling bibeasy_format. See bibeasy_format -h for detailed use-case examples
  • Currently only outputs into Word or our custom NeuroPoly format, set up to be extended going forward though
  • Changes to input file handling:
    • A Google Sheet cache is now (re)generated when you give the URL explicitly. For example, the following command will generate a cache using the Google Sheet data present at the provided URL:
      bibeasy_format -i https://docs.google.com/spreadsheets/d/1dEUBYf17hNM22dqV4zx1gsh3Q-d97STnRB4q7p9nQ54/edit?gid=566297787#gid=566297787
    • Alternatively, you can specify an Excel file you have on your local desktop instead, if you have a local spreadsheet you want to use instead:
      bibeasy_format -i "path/to/excel/spreadsheet.xlsx"
    • If no input is given, the program will try to use an existing cache. Otherwise it will error out with a message explaining why the program could not proceed.
  • Changes to output file designation:
    • The user no longer needs to define an output file; instead, the user may optionally use either or both of the following to define where output is placed and how it is named:
      • -o/--output: A directory where output files will be placed. Defaults to the current directory if not specified.
      • -l.--label: A label which will be appended to the front of all output files within the same directory. Defaults to bibeasy_out if not specified.
        • In the case of multiple output files, a 'sub-label' is added after the main label to delineate between them, with a double underscore between the two (i.e. bibeasy_out__article)
  • Changes to output formatting arguments
    • The (file) format of the output now MUST be specified by the user explicitly with the -f flag
      • Currently only docx and neuro_md (our custom format for the NeuroPoly website) formats are supported.
      • The format used is used to determine the extension the resulting output file(s), rather than the user specifying it.
    • --combine has been replaced with --keep_separate
      • If the flag is used, the output files will be separated by the publication type (determined by the sheets used in the source spreadsheet)
        • This was changed as, given we want this to be a bibliography manager first, I suspect the majority of people (myself included) use the sub-sheets for organization, rather than to enforce stratification. This is trivial to reverse if I'm wrong, though
    • The citation style (currently only used in docx formats) can be specified with the --style flag.
      • Default is now APA, rather than custom. This was done under the assumption that most people wanting a bibliography want a known standardized format, rather than our current custom one.
    • The order (by year) can be made to be descending with the --reverse flag
    • A path containing a list of organization labels (for use in the NeuroPoly MD website) can be provided using the --valid_labels flag
  • Changes to output filtering arguments:
    • The --year flag can be used to specify the earliest year for which publications should be considered. This is inclusive (that is, if it is set to 2015, anything published on 2015 is included, but nothing before).
    • The --type flag can be used to specify the types of publication which should be preserved and output. The 'type' of a publication is determined by the name of it's sub-sheet within the source Excel file.
    • The --required_columns flag can be used to specify a set of columns which cannot be null in entries. Entries which lack a value in any of the provided columns will not be present in the resulting output file(s).
    • The --filters flag can be used to define a list of custom (SQL-like) query filters you want applied to the dataset prior to the results being output. For example, adding --filter "IVADO17 == 'x'" would result in the output containing only entries in which the IVADO17 columns contained an x.
      • This uses pandas.DataFrame.query under the hood, so any query string which is valid in that context should work here as well.

Changed:

  • The methods to format a DataFrame into the requested output format have been extracted from csv_to_txt_pubtype into their own unique functions
    • These are now mapped by formatting "labels" in the new CLI script and called directly. This should make the code much easier to extend with new formats going forward.
  • Parsing an Excel file into a unified DataFrame is now handled by its own utility function in the CLI (utils:read_excel) to better modularize it.
    • Unlike its precursor (gsheet:gsheet_to_df), it will only filter by the 'type' of publication, with the remaining filters and re-formatting operations being done in the new CLI script.
    • The 'type' exception exists solely because it reduces read-from-disk time (as we reduce the number of sub-sheets in the Excel spreadsheet being loaded based on the filter)

Remaining TODOS:

  • Allow you to (re)generate the Google Sheet cache without also defining an output format (and, by extension, generating new outputs)
  • Choose a default output format (i.e. docx which will be used if one is not defined by the user).
  • Allow for more customizable sorting order (i.e. by Title, or first-author then year)
  • When a labels file is specified with --valid_labels, allow the user to request that the program validates all entries in the provided Excel file fall under at least one of the label's provided.

@SomeoneInParticular
Copy link
Collaborator Author

@jcohenadad @namgo could one of you poke around with the new CLI and let me know if theres any features that were left out/broken by mistake? I don't really have a "feature list" to go off of here, so I'm just aiming to re-implement existing functionality in a new (cleaner) CLI format. Thank you in advance!

@jcohenadad
Copy link
Member

jcohenadad commented Jan 21, 2025

@SomeoneInParticular this is great-- thank you for working on this-- i tested " bibeasy -o publi.docx -st custom -r -c" and it produces the same output as it used to--

however, glancing at #28 (comment), i see that some behaviours changed, which is not great-- it means that a command i used to do in the past won't work anymore or will produce different behaviours-- i do not have the time to dive into your lengthy description so could you pls just give me the TL;DR of what is different now-- thx

also: the lab website is relying on bibeasy, so here to we need to make sure you didn't break anything-- thx-- tagging @nullnik-0 @namgo so they are aware

@SomeoneInParticular
Copy link
Collaborator Author

@jcohenadad Yes; the original bibeasy CLI should work identically to how it did before (functions, bugs, edge cases, and everything else). The changes I detailed in the change log did not replace this "main" CLI, and we can keep it around for as long as we want to avoid breaking existing scripts.

TL:DR: theirs now a second CLI, bibeasy_format, which isolates the output formatting functionality to its own call. As it stands, if we proceed with the current setup, the bibeasy command is going to be nigh impossible to maintain as more and more arguments get added. Hence, I've done this to avoid bloating the "main" CLI with additional arguments, in anticipation for extended functionality later.

To use your command as an example (bibeasy -o publi.docx -st custom -r -c), the equivalent using the new CLI would be:

bibeasy_format -l publi -f docx --style custom --reverse

AGAIN: This new CLI has not replaced the old one, and we can change existing scripts over to this new CLI at our leisure without anything breaking. It (and any other CLIs we add later for other functionality, such as CCV parsing) exist solely to get ahead of the bloat problem before technical debt makes it too much a nightmare to untangle.

@SomeoneInParticular
Copy link
Collaborator Author

SomeoneInParticular commented Jan 22, 2025

Note as well; the CLI changes are not set in stone, and were done mainly based on my intuition on what good "defaults" would be. Its easy to change them if I was wrong, or to make them "closer" to the original CLI. Just say the word

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants