This is a workspace for processing documents for FairData and FairCopy. There are three scripts that can be run:
- merge_pages: Combine single-page PDFs or images into multipage PDF documents.
- extract_images: Create individual image files from a IIIF manifest.
- create_tei: Create valid TEI documents out of IIIF manifests and (optionally) corresponding transcriptions in the form of .docx files.
More detail on each of these commands can be found below.
This script library is designed to enable the creation of TEI documents for use in the FairCopy editor, using as input IIIF manifests generated for Media records in FairData and (optionally) transcription files generated by the LEO transcription tool. The basic workflow is as follows:
- Create the PDF document. This step may be done outside of this workspace if you already have a PDF in your desired form. If you wish to combine individual page files into a PDF, place those files within folders inside the merging_input directory and run the merge_pages command. The output PDF will appear in the merging_output folder.
- Upload your PDF as a Media record in FairData. This will give you a IIIF manifest URL.
- Create a Document record in FairData and link the uploaded PDF to it. This document record is where the relational data for your document will be recorded. The unique identifier associated with this document record (not the identifier of the associated Media record) is the fairdataID for this document.
- If needed, extract image files from the IIIF manifest. If individual image files are necessary for processing in LEO, you can use the extract_images -u <fairdataID> script to extract image files from the IIIF manifest. These files will be generated in the image_output folder.
- Generate a transcription .docx file using LEO. Once this file is generated, put it in the transcriptions folder.
- Run create_tei -m <manifest URL> -u <fairdataID> -x <xml_id> -n <name> -t <transcription file> to generate TEI. The xml_id should be a valid XML ID that you wish to assign to this document. Once generated, the TEI file will appear in the TEI folder.
- Import the TEI file to FairCopy. You can now download the TEI file and import it to the FairCopy editor for further editing and annotation.
- Import data from FairCopy Cloud into FairData. Once you've marked up the TEI file, you can publish it using FairCopy. Going back to the Document record on FairData, add the XML ID field and import from FairCopy Cloud. This will add any people or places you've marked up in the TEI as related records.
The process outlined above handles one document at a time. You can also process documents in bulk; for this you will need some familiarity with FairData bulk imports and access to the DigitalOcean juel-box bucket. The process is as follows:
- Create Merged PDFs. Place folders containing single-page PDFs to be merged in the merging_input folder. The name of each folder should be the desired XML ID of the resulting document. It should be alphanumeric with no spaces. Run merge_pages -c. The -c flag tells the script to create CSV assets that can be directly imported to FairData.
- Upload PDFs to DigitalOcean. Running merge_pages -c will output instructions in the console for doing this step using the rclone command line utility; alternatively, you can copy the files from the created timestamped directory inside merging_output and upload them via Cyberduck or any other preferred interface. The files should end up in a folder /processed/[timestamp] in the DigitalOcean bucket.
- Import to FairData. Once the files are on DigitalOcean, you can go ahead and import the .zip file that was created in the csvs folder to the FairData project.
- Create Transcriptions. The filenames of the transcription files must match the XML IDs of the documents (that is, the folder names from the first step). Place transcription files directly in the transcriptions folder.
- Create TEI. Run create_tei -f [timestamp]/items.csv, where the timestamp corresponds to the path to the items CSV file created in the first step. This will create TEI documents with aligned transcriptions in the TEI folder. These can then be imported to a FairCopy project.
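Before running the bulk workflow, it can help to check the naming rules from the steps above. The following is a minimal sketch, not part of this workspace: check_bulk_inputs is a hypothetical helper, the leading-letter rule comes from general XML naming rules rather than from this README, and it assumes a transcription "matches" an XML ID when the filename without its .docx extension equals the ID.

```python
import re
from pathlib import PurePath

# Alphanumeric with no spaces, as the bulk workflow requires.
# Assumption: the leading letter is added here because an XML ID
# cannot begin with a digit; the README itself only says "alphanumeric".
XML_ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def check_bulk_inputs(folder_names, transcription_files):
    """Return (bad_folder_names, ids_without_transcription), both sorted."""
    # Assumption: matching means "<xmlid>.docx".
    stems = {
        PurePath(name).stem
        for name in transcription_files
        if PurePath(name).suffix.lower() == ".docx"
    }
    folder_ids = sorted(folder_names)
    bad_names = [name for name in folder_ids if not XML_ID_PATTERN.fullmatch(name)]
    missing = [name for name in folder_ids if name not in stems]
    return bad_names, missing
```

You could call it with the folder names found in merging_input and the filenames found in transcriptions, for example via os.listdir, and fix anything it reports before running merge_pages -c.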
For each document you wish to create, make a folder inside merging_input, and in that folder put the individual page files that you wish to merge. Make sure the files appear in the correct order. The name of the folder should be the name you want the output PDF to have. The page files should either all be .pdf files or all be .jpg files.
Once you've added the files, simply run:
merge_pages
The output multipage documents will appear in the merging_output folder, from which you can download them.
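The two requirements on page files (one file type per folder, correct order) can be checked ahead of time. This is a rough sketch with hypothetical helper names. It assumes pages are merged in filename order, which this README does not state, so it only flags folders where plain alphabetical order and numeric page order disagree (for example page10 sorting before page2).

```python
import re
from pathlib import PurePath

ALLOWED_EXTENSIONS = {".pdf", ".jpg"}


def natural_key(filename):
    """Sort key that compares digit runs as numbers, so page2 < page10."""
    return [
        int(part) if part.isdecimal() else part.lower()
        for part in re.split(r"(\d+)", filename)
    ]


def check_page_files(filenames):
    """Return the single shared extension, or raise ValueError."""
    # Case-insensitive comparison is a convenience of this sketch; whether
    # merge_pages itself accepts ".JPG" is not stated in the README.
    extensions = {PurePath(name).suffix.lower() for name in filenames}
    if len(extensions) != 1 or not extensions <= ALLOWED_EXTENSIONS:
        raise ValueError(f"expected all .pdf or all .jpg, found {sorted(extensions)}")
    return extensions.pop()


def order_differs(filenames):
    """True when alphabetical order disagrees with numeric page order."""
    return sorted(filenames) != sorted(filenames, key=natural_key)
```

If order_differs returns True for a folder, zero-padding the page numbers in the filenames (page01, page02, ...) makes both orderings agree.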
For this script you need either a fairdataID for the Document record (not Media record) you wish to process, or a manifest URL. In the first case, run:
extract_images -u <fairdataID>
In the second case, run:
extract_images -m <manifest>
The extracted images will appear in the image_output folder.
Alternatively, you can pass a CSV file from the csvs folder as an argument instead, if you wish to extract images from multiple documents in one go. The CSV must have either a column called fairdataID or a column called manifest. (If both columns are present, the manifest URL will take precedence.) In this case, run:
extract_images -f <filename.csv>
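To preview which identifier each CSV row would contribute, the column rule above can be sketched as follows. This illustrates the documented precedence only and is not the script's actual code; image_sources is a hypothetical name, and falling back to fairdataID when a row's manifest cell is empty is an assumption.

```python
import csv
import io


def image_sources(csv_text):
    """Return a list of ("manifest" | "fairdataID", value), one per usable row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    if "manifest" not in fields and "fairdataID" not in fields:
        raise ValueError("CSV needs a 'manifest' or a 'fairdataID' column")
    sources = []
    for row in reader:
        manifest = (row.get("manifest") or "").strip()
        fairdata_id = (row.get("fairdataID") or "").strip()
        if manifest:
            # The manifest URL takes precedence when both are present.
            sources.append(("manifest", manifest))
        elif fairdata_id:
            # Assumption: rows with an empty manifest cell use the fairdataID.
            sources.append(("fairdataID", fairdata_id))
    return sources
```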
For each document you wish to process, you should collect the following information:
- manifest: Required. A URL for the IIIF manifest containing the images of your document. If you are using FairData, this URL can be found by clicking "View IIIF" on the media record for your document.
- xmlid: Required. The desired XML ID of the resulting TEI document.
- title: The document title.
- fairdataID: If relevant, the unique FairData identifier for the document. Note that, unlike the manifest, this should be found by navigating to the relevant record in the Documents model and copying the identifier from the upper right corner.
- transcription: The full filename (including the .docx extension) of the transcription for the document. This file should be placed in the transcriptions folder of this repo.
If you have the above information for a single document, and you have uploaded the transcription file to the transcriptions folder of this repository, then you can process your document with the following command from the root folder of the repository:
create_tei -m <manifest url> -x <xmlid> -n <title> -u <fairdataID> -t <transcription>
Be sure to enclose your arguments in quotation marks if they contain spaces. For example, if you have a document called "My Excellent Archival Document" and you've uploaded the transcription file "MEAD Transcript.docx" to the transcriptions folder, your command might look like:
create_tei -m https://someiiifserver.com/abc/manifest -x MY_EX_ARC -n "My Excellent Archival Document" -t "MEAD Transcript.docx"
If the processing finishes successfully, your TEI document will be generated with the name <xmlid>.xml in the TEI folder. You can then import that document into FairCopy for further editing and annotating.
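If you assemble these commands programmatically, the quoting rule can be handled for you. The sketch below uses Python's standard shlex module to build the command line; build_create_tei_command is a hypothetical helper, and treating -n, -u and -t as optional is inferred from the example above rather than confirmed.

```python
import shlex


def build_create_tei_command(manifest, xmlid, title=None, fairdata_id=None,
                             transcription=None):
    """Return a shell-safe create_tei command line as a single string."""
    args = ["create_tei", "-m", manifest, "-x", xmlid]
    if title:
        args += ["-n", title]
    if fairdata_id:
        args += ["-u", fairdata_id]
    if transcription:
        args += ["-t", transcription]
    # shlex.join quotes only the arguments that need it (e.g. those with spaces).
    return shlex.join(args)
```

Note that shlex uses single quotes where the example above uses double quotes; a POSIX shell treats both the same for these values.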
If you have multiple files you wish to process at once, create a CSV with columns manifest, xmlid, title, fairdataID, transcription and a row for each document you wish to process. Your CSV should have a header row. Upload the CSV into the csvs folder of this repository, and then run:
create_tei -f <filename>
where filename is the name of your CSV, including the extension .csv.
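A CSV in this shape can also be generated rather than typed by hand. Here is a minimal sketch using Python's csv module; documents_to_csv is a hypothetical helper, and leaving optional cells blank is an assumption about what create_tei accepts.

```python
import csv
import io

COLUMNS = ["manifest", "xmlid", "title", "fairdataID", "transcription"]


def documents_to_csv(documents):
    """Serialize a list of dicts to CSV text with the expected header row."""
    for doc in documents:
        for required in ("manifest", "xmlid"):
            if not doc.get(required):
                raise ValueError(f"missing required field: {required}")
    buffer = io.StringIO(newline="")
    # restval="" leaves optional columns blank when a key is absent (assumption).
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(documents)
    return buffer.getvalue()
```

Save the returned text to a .csv file in the csvs folder and pass that filename to create_tei -f.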