This repository is part of the Find Case Law project at The National Archives. For more information on the project, check the documentation.
When a file is uploaded to the S3 bucket and ends in .docx, create a PDF file at the same key (but ending .pdf instead). Uses LibreOffice to perform the conversion.
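The key mapping described above can be sketched as follows. This is a minimal illustration rather than the repository's own code; `pdf_key_for` is a hypothetical helper name, and the strict suffix check is an assumption.

```python
def pdf_key_for(docx_key: str) -> str:
    """Return the S3 key of the PDF that corresponds to a .docx key.

    Hypothetical helper: the name and the strict suffix check are
    assumptions made for illustration.
    """
    suffix = ".docx"
    if not docx_key.lower().endswith(suffix):
        raise ValueError(f"not a .docx key: {docx_key!r}")
    # Swap only the trailing extension; the rest of the key is untouched.
    return docx_key[: -len(suffix)] + ".pdf"


if __name__ == "__main__":
    print(pdf_key_for("eat/2022/1/eat_2022_1.docx"))
```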
Warning: This ECS task performs no filetype checking whatsoever on the input document.
The service is typically triggered by S3 bucket notifications when .docx files are uploaded, but the actual Python code blindly:
- Downloads whatever file the SQS message points to
- Saves it locally with a `.docx` extension (regardless of actual type)
- Runs `soffice --convert-to pdf` on it without validation
- Uploads whatever LibreOffice outputs as `.pdf`
Implications:
- If you process a PDF: LibreOffice will accept it and output a PDF (potentially re-converted or degraded)
- If you process other filetypes: Behavior depends entirely on whether LibreOffice's conversion command succeeds or errors. Some formats (`.rtf`, `.odt`, `.txt`) may convert successfully; others will fail silently with no PDF output.
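Because a failed conversion may leave no PDF behind, a caller can check for the output file explicitly. The sketch below is an illustration under stated assumptions, not the service's implementation: the function names are hypothetical, and the `--headless` and `--outdir` flags are added here for a non-interactive run (only `soffice --convert-to pdf` appears in this README).

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(input_path: str, outdir: str) -> List[str]:
    """Build the LibreOffice command line (hypothetical helper)."""
    return [
        "soffice",
        "--headless",  # assumption: run without a GUI
        "--convert-to", "pdf",
        "--outdir", outdir,  # assumption: write the PDF to a known directory
        input_path,
    ]


def convert_to_pdf(input_path: str, outdir: str) -> Path:
    """Run the conversion and fail loudly if no PDF was produced."""
    subprocess.run(build_convert_command(input_path, outdir), check=True)
    expected = Path(outdir) / (Path(input_path).stem + ".pdf")
    if not expected.exists():
        # Treat a missing output file as an error instead of a silent no-op.
        raise RuntimeError(f"LibreOffice produced no PDF for {input_path}")
    return expected
```

Calling `convert_to_pdf("/tmp/judgment.docx", "/tmp")` requires LibreOffice to be installed; `build_convert_command` can be inspected without it.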
The service completely trusts the S3 notification configuration and SQS messages. Be careful when manually sending messages to the queue or modifying S3 notification rules.
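The service itself performs no such check, but if you want to verify a file before sending a message by hand, a heuristic along the following lines may help. It relies on the fact that a .docx file is a ZIP archive containing a `word/document.xml` member; `looks_like_docx` is a hypothetical name.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Heuristically decide whether `data` is a Word .docx file.

    Hypothetical pre-flight check, not part of the service.
    """
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False  # e.g. a PDF or plain text file
    buffer.seek(0)
    try:
        with zipfile.ZipFile(buffer) as archive:
            # Other ZIP-based formats (.odt, .xlsx) lack this member.
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False
```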
The main branch is automatically deployed to staging with each commit.
To deploy to production:
- Create a new release.
- Set the tag and release name to `vX.Y.Z`, following semantic versioning.
- Publish the release.
- An automated workflow will then force-push that release to the `production` branch, which will then be deployed to the production environment.
You can republish a PDF by uploading the PDF again, or by sending JSON of the following form to the Send and Receive Messages page of the Simple Queue Service (SQS) on AWS:

    {
      "Records": [
        {
          "s3": {
            "bucket": {
              "name": "tna-caselaw-assets"
            },
            "object": {
              "key": "eat/2022/1/eat_2022_1.docx",
              "eTag": "fa2ef6e8abadbd5cc5cedf3f32834f1f"
            }
          }
        }
      ]
    }

The script scripts/create_json_for_bulk_pdf_regeneration will make that JSON file for you, if you want to remake every PDF that's backed by a docx file.
(The eTag is arbitrary but should be a sensible filename fragment containing no `/`.)
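For a single document, the message body can also be generated programmatically. The following is a minimal sketch, separate from the bulk regeneration script; `build_republish_message` and its default eTag value are hypothetical.

```python
import json


def build_republish_message(bucket: str, key: str, etag: str = "republish") -> str:
    """Build a message body with the same shape as the JSON shown above.

    Hypothetical helper; the default eTag is an arbitrary placeholder.
    """
    if "/" in etag:
        raise ValueError("eTag must not contain '/'")
    message = {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": bucket},
                    "object": {"key": key, "eTag": etag},
                }
            }
        ]
    }
    return json.dumps(message, indent=2)


if __name__ == "__main__":
    print(build_republish_message("tna-caselaw-assets", "eat/2022/1/eat_2022_1.docx"))
```

The printed JSON can be pasted into the message body field on the SQS page.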
- Copy `.env.example` to `.env`
- From ds-caselaw-ingester, run `docker compose up` to launch the Localstack container
- From ds-caselaw-pdfconversion, run `scripts/setup-localstack.sh` to set up the queues etc.
- From ds-caselaw-pdfconversion, run `docker compose up --build` to launch the LibreOffice container (`--build` will ensure the converter script is in the docker container)
You might want to look at the localstack S3 bucket
The project contains both unit tests and integration tests:
- Unit tests: Basic functionality testing
- Integration tests: Full PDF conversion testing (requires LibreOffice)
We provide a convenience script that can run tests either locally or in Docker:
# Run unit tests locally using Poetry
./run-tests.sh local
# Run all tests in Docker (recommended)
./run-tests.sh docker
# Run Docker tests with a specific tag (useful for CI/CD)
./run-tests.sh docker my-feature-branch
# Run specific test files
./run-tests.sh docker -- -k test_unit.py # Run only unit tests
./run-tests.sh docker -- -k test_integration.py # Run only integration tests

The Docker approach is recommended as it:
- Matches the CI environment exactly
- Includes all required dependencies (LibreOffice, fonts, etc.)
- Ensures consistent test environment across all developers
- Uses Docker layer caching for faster builds
- Automatically cleans up containers after test runs
- Local mode (`./run-tests.sh local`):
  - Runs unit tests only
  - Uses local Poetry installation
  - Quick for development
  - No LibreOffice required
- Docker mode (`./run-tests.sh docker`):
  - Runs both unit and integration tests
  - Builds and uses a Docker image
  - Includes LibreOffice for PDF conversion
  - Uses buildx caching when available
  - Automatically removes containers after testing
Having run the Local Setup tasks above, you should see output like the following on startup:

    Downloading judgment.docx
    ...
    Uploaded judgment.pdf

- `upload_file` will upload a docx, which should generate a PDF.
- `upload_custom_pdf` will upload a tagged PDF, which should cause `upload_file` to fail with the message "judgment.pdf is from custom-pdfs, not replacing".
- `upload_named_pdf` will upload a docx of your choosing.
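The refusal message above suggests that the converter checks where an existing PDF came from before overwriting it. A minimal sketch of such a guard follows, assuming the origin is recorded as an S3 object tag. The tag name `pdfsource` and the function name are hypothetical; only the `custom-pdfs` value and the message text are taken from the output quoted above. The `tag_set` argument uses the list-of-`Key`/`Value` shape that S3 returns for object tags.

```python
from typing import Dict, List, Optional


def custom_pdf_refusal(
    filename: str,
    tag_set: List[Dict[str, str]],
    source_tag: str = "pdfsource",  # assumption: hypothetical tag name
) -> Optional[str]:
    """Return a refusal message if the existing PDF is a custom one.

    Returns None when the PDF may be replaced.
    """
    tags = {tag["Key"]: tag["Value"] for tag in tag_set}
    if tags.get(source_tag) == "custom-pdfs":
        return f"{filename} is from custom-pdfs, not replacing"
    return None
```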
The output of fc-match gibberish should be something like
Times_New_Roman.ttf: "Times New Roman" "Regular"