performant-software/JUEL-faircopy-scripts

faircopy-scripts

This is a workspace for processing documents for FairData and FairCopy. It provides three scripts:

  1. merge_pages: Combine single-page PDFs or images into multipage PDF documents.
  2. extract_images: Create individual image files from a IIIF manifest.
  3. create_tei: Create valid TEI documents out of IIIF manifests and (optionally) corresponding translations in the form of .docx files.

More detail on each of these commands can be found below.

General Document Workflow

This script library is designed to enable the creation of TEI documents for use in the FairCopy editor, using as input IIIF manifests generated for Media records in FairData and (optionally) transcription files generated by the LEO transcription tool. The basic workflow is as follows:

  1. Create the PDF document. This step may be done outside of this workspace, if you already have a PDF in your desired form. If you wish to combine individual page files into a PDF, place those files within folders inside the merging_input directory and run the merge_pages command. The output PDF will appear in the merging_output folder.
  2. Upload your PDF as a Media record in FairData. This will give you a IIIF manifest URL.
  3. Create a Document record in FairData and link the uploaded PDF to it. This Document record is where the relational data for your document will be recorded. The unique identifier associated with this Document record (not the identifier of the associated Media record) is the fairdataID for this document.
  4. If needed, extract image files from the IIIF manifest. If LEO needs individual image files for processing, run extract_images -u <fairdataID> to extract them from the IIIF manifest. These files will be generated in the image_output folder.
  5. Generate a transcription .docx file using LEO. Once this file is generated, put it in the transcriptions folder.
  6. Run create_tei -m <manifest URL> -u <fairdataID> -x <xmlid> -n <name> -t <transcription file> to generate TEI. The xmlid should be a valid XML ID that you wish to assign to this document. Once generated, the TEI file will appear in the TEI folder.
  7. Import the TEI file to FairCopy. You can now download the TEI file and import it to the FairCopy editor for further editing and annotation.
  8. Import data from FairCopy Cloud into FairData. Once you've marked up the TEI file, you can publish it using FairCopy. Then go back to the Document record in FairData, fill in the XML ID field, and import from FairCopy Cloud. This will add any people or places you've marked up in the TEI as related records.

Bulk Processing

The process outlined above handles one document at a time. You can also process documents in bulk; for this you will need some familiarity with FairData bulk imports and access to the DigitalOcean juel-box bucket. The process is as follows:

  1. Create Merged PDFs. Place folders containing the single-page PDFs to be merged in the merging_input folder. The name of each folder should be the desired XML ID of the resulting document. It should be alphanumeric with no spaces. Run merge_pages -c. The -c flag tells the script to create CSV assets that can be directly imported to FairData.
  2. Upload PDFs to DigitalOcean. Running merge_pages -c prints instructions in the console for doing this step with the rclone command line utility. Alternatively, you can copy the files from the newly created timestamped directory inside merging_output and upload them via Cyberduck or any other preferred interface. The files should end up in a folder /processed/[timestamp] in the DigitalOcean bucket.
  3. Import to FairData. Once the files are on DigitalOcean, import the .zip file that was created in the csvs folder into the FairData project.
  4. Create Transcriptions. The filenames of the transcription files must match the XML IDs of the documents (that is, the folder names from step 1). Place transcription files directly in the transcriptions folder.
  5. Create TEI. Run create_tei -f [timestamp]/items.csv, where [timestamp] is the timestamped directory that holds the items.csv file created in step 1. This will create TEI documents with aligned transcriptions in the TEI folder. These can then be imported to a FairCopy project.
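
Before step 5, it can help to confirm that the folder names and transcription filenames line up. The following is a minimal sketch and not part of this repository: the function name check_bulk_inputs is hypothetical, and it assumes that "match" means the transcription filename without its .docx extension equals the folder name.

```python
from pathlib import Path


def check_bulk_inputs(merging_input: Path, transcriptions: Path) -> dict:
    """Compare document folder names (the XML IDs) with transcription names.

    Assumption: a transcription "matches" a document when its filename,
    minus the .docx extension, equals the folder name.
    """
    xml_ids = {p.name for p in merging_input.iterdir() if p.is_dir()}
    transcripts = {p.stem for p in transcriptions.glob("*.docx")}
    return {
        # Folder names should be alphanumeric with no spaces.
        "not_alphanumeric": sorted(
            i for i in xml_ids if not (i.isascii() and i.isalnum())
        ),
        # Documents that have no transcription file yet.
        "missing_transcription": sorted(xml_ids - transcripts),
        # Transcription files that do not correspond to any folder.
        "unmatched_transcription": sorted(transcripts - xml_ids),
    }
```

Calling check_bulk_inputs(Path("merging_input"), Path("transcriptions")) from the repository root returns three lists; all three being empty suggests the inputs are consistent.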

Script Details

merge_pages

For each document you wish to create, make a folder inside merging_input, and put the individual page files that you wish to merge in that folder. Make sure they appear in the correct order. The name of the folder should be the name you want the output PDF to have. The page files should either all be .pdf files or all be .jpg files.
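
This README does not say how merge_pages orders the files, so one cautious check is to see whether plain alphabetical order and numeric-aware order agree (they differ for names like page2 and page10). The sketch below is an illustration only; check_page_folder and natural_key are hypothetical helpers, not part of the scripts.

```python
import re
from pathlib import Path


def natural_key(name: str) -> list:
    """Split digit runs from text so that 'page10' sorts after 'page2'."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def check_page_folder(folder: Path) -> dict:
    """Report the file extensions and two candidate orderings of a folder."""
    names = [p.name for p in folder.iterdir() if p.is_file()]
    plain = sorted(names)
    natural = sorted(names, key=natural_key)
    return {
        # Should contain exactly one entry: '.pdf' or '.jpg'.
        "extensions": sorted({Path(n).suffix.lower() for n in names}),
        "plain_order": plain,
        "natural_order": natural,
        # True when the two orderings disagree, e.g. missing zero-padding.
        "order_ambiguous": plain != natural,
    }
```

If order_ambiguous is true, renaming the files with zero-padded numbers (page01, page02, ...) makes both orderings agree.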

Once you've added the files, simply run:

merge_pages

The output multipage documents will appear in the merging_output folder, from which you can download them.

extract_images

For this script you need either a fairdataID for the Document record (not Media record) you wish to process, or a manifest URL. In the first case, run:

extract_images -u <fairdataID>

In the second case, run:

extract_images -m <manifest>

The extracted images will appear in the image_output folder.
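
For orientation, here is a rough sketch of how image URLs can be read out of a IIIF manifest. It is not the implementation used by extract_images. It assumes the manifest follows the common layout of IIIF Presentation API 2 (sequences, canvases, images, resource) or API 3 (nested items with a single body object), and the function names are hypothetical.

```python
import json
from urllib.request import urlopen


def load_manifest(url: str) -> dict:
    """Download a manifest and parse it as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def image_urls(manifest: dict) -> list:
    """Collect image URLs from a Presentation API 2 or 3 manifest.

    Simplification: in API 3 a body may also be a list or a Choice;
    this sketch only handles a single body object.
    """
    urls = []
    # API 2: sequences -> canvases -> images -> resource["@id"]
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                urls.append(image["resource"]["@id"])
    # API 3: canvases -> annotation pages -> annotations -> body["id"]
    for canvas in manifest.get("items", []):
        for page in canvas.get("items", []):
            for annotation in page.get("items", []):
                urls.append(annotation["body"]["id"])
    return urls
```

With a manifest URL in hand, image_urls(load_manifest(url)) lists one URL per page image, in canvas order.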

Using a CSV

To extract images from multiple documents in one go, you can instead pass a CSV file from the csvs folder as an argument. The CSV must have either a column called fairdataID or a column called manifest. (If both columns are present, the manifest URL will take precedence.) In this case, run:

extract_images -f <filename.csv>
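
The precedence rule can be previewed before running the script. The sketch below is not the script's own code; pick_sources is a hypothetical helper, and it assumes that a row with an empty manifest cell falls back to fairdataID, which this README does not state.

```python
import csv
import io


def pick_sources(csv_text: str) -> list:
    """Return, per row, which column would be used and its value."""
    picked = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        manifest = (row.get("manifest") or "").strip()
        fairdata_id = (row.get("fairdataID") or "").strip()
        if manifest:
            # The manifest URL takes precedence when both are present.
            picked.append(("manifest", manifest))
        elif fairdata_id:
            # Assumption: fall back to fairdataID when manifest is empty.
            picked.append(("fairdataID", fairdata_id))
        else:
            raise ValueError("row has neither manifest nor fairdataID")
    return picked
```

Passing the text of a CSV from the csvs folder shows which identifier each row would resolve to under these assumptions.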

create_tei

For each document you wish to process, you should collect the following information:

  • manifest: Required. A URL for the IIIF manifest containing the images of your document. If you are using FairData, this URL can be found by clicking "View IIIF" on the media record for your document.
  • xmlid: Required. The desired XML ID of the resulting TEI document.
  • title: The document title.
  • fairdataID: If relevant, the unique FairData identifier for the document. Note that, unlike the manifest URL, this identifier is found by navigating to the relevant record in the Documents model and copying the identifier from the upper right corner.
  • transcription: The full filename (including the .docx extension) of the transcription for the document. This file should be placed in the transcriptions folder of this repo.
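
Since xmlid must be a valid XML ID, a quick check before running create_tei can catch common mistakes such as spaces or a leading digit. This is a deliberately conservative sketch: XML itself permits many non-ASCII characters that this ASCII-only pattern rejects, and is_safe_xml_id is a hypothetical name, not something the scripts provide.

```python
import re

# ASCII subset of the XML NCName rule: start with a letter or underscore,
# then letters, digits, underscores, hyphens, or periods. No colons or spaces.
_XML_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_safe_xml_id(candidate: str) -> bool:
    """Return True when the candidate fits the conservative pattern above."""
    return _XML_ID.fullmatch(candidate) is not None
```

For example, MY_EX_ARC passes, while "1doc" (leading digit) and "my doc" (contains a space) do not.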

Processing a single file

If you have the above information for a single document, and you have uploaded the transcription file to the transcriptions folder of this repository, then you can process your document with the following command from the root folder of the repository:

create_tei -m <manifest url> -x <xmlid> -n <title> -u <fairdataID> -t <transcription>

Be sure to enclose your arguments in quotation marks if they contain spaces. For example, if you have a document called "My Excellent Archival Document" and you've uploaded the transcription file "MEAD Transcript.docx" to the transcriptions folder, your command might look like this:

create_tei -m https://someiiifserver.com/abc/manifest -x MY_EX_ARC -n "My Excellent Archival Document" -t "MEAD Transcript.docx"

If the processing finishes successfully, your TEI document will be generated with the name <xmlid>.xml in the TEI folder. You can then import that document into FairCopy for further editing and annotating.
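
If you assemble these commands programmatically, the quoting of arguments with spaces can be delegated to Python's standard shlex module rather than done by hand. The helper below is a hypothetical convenience, not part of this repository; it only builds the command string using the flags documented above, with POSIX shell quoting.

```python
import shlex


def build_create_tei_command(manifest, xmlid, title=None,
                             fairdata_id=None, transcription=None) -> str:
    """Build a create_tei command line, quoting arguments where needed."""
    args = ["create_tei", "-m", manifest, "-x", xmlid]
    if title:
        args += ["-n", title]
    if fairdata_id:
        args += ["-u", fairdata_id]
    if transcription:
        args += ["-t", transcription]
    # shlex.join quotes any argument containing spaces or shell characters.
    return shlex.join(args)


print(build_create_tei_command(
    "https://someiiifserver.com/abc/manifest",
    "MY_EX_ARC",
    title="My Excellent Archival Document",
    transcription="MEAD Transcript.docx",
))
```

The printed command wraps the title and transcription filename in single quotes, which a POSIX shell treats the same way as the double quotes in the example above.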

Processing multiple files

If you have multiple files you wish to process at once, create a CSV with columns manifest, xmlid, title, fairdataID, transcription and a row for each document you wish to process. Your CSV should have a header row. Upload the CSV into the csvs folder of this repository, and then run:

create_tei -f <filename>

where filename is the name of your CSV, including the extension .csv.
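
One way to produce such a CSV is with Python's csv module, which takes care of the header row and of quoting titles that contain commas. This is a minimal sketch; write_batch_csv is a hypothetical helper, and the rule that manifest and xmlid must be non-empty follows the "Required" labels in the field list above.

```python
import csv

COLUMNS = ["manifest", "xmlid", "title", "fairdataID", "transcription"]


def write_batch_csv(path, documents):
    """Write one row per document, with the header row create_tei expects."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        for doc in documents:
            # manifest and xmlid are the two required fields.
            if not doc.get("manifest") or not doc.get("xmlid"):
                raise ValueError("each document needs a manifest and an xmlid")
            # Optional fields are left blank when not supplied.
            writer.writerow({column: doc.get(column, "") for column in COLUMNS})
```

For example, write_batch_csv("csvs/batch.csv", [{"manifest": "https://someiiifserver.com/abc/manifest", "xmlid": "MY_EX_ARC", "title": "My Excellent Archival Document", "transcription": "MEAD Transcript.docx"}]) creates a one-row file; the name batch.csv is only an illustration.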
