woolworm (Pre-Alpha State)

Hello Northwestern Digitization team (and anyone else who may be following along), and welcome to woolworm, your new (hopefully) one-stop shop for digitization. I have attempted to abstract away as many of the intricacies of image transformation in Python as I can. While we are working on this grant, I will be working on build automation and a CLI for you all so that it can be even easier to use. The point of this repo is that, in case I die, development can carry on without me. Here is my current feature list. I am open to suggestions and requests, because I like this sort of thing:

Road to v0.1.0

Road to v1.0.0

- Pipelines
  - Image processing
  - OCR (do we need a pipeline for this? It is a single function)
  - HathiTrust (migrated Brendan's Ruby script to Python)
  - ???
  - Profit
- CLI (to be done later)
- Figure out how the hell I publish a Python package

Automation, supercomputing interfaces, and remote directories will be handled in a different repository. This repository tracks one step of the data science process: data cleaning.

Prerequisites

You will want to familiarize yourself with the absolute basics of calling methods on objects. If you want to use any LLMs, you will need to install Ollama. Feel free to contact me if you need assistance setting up Ollama.
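If you are not sure whether Ollama is already set up on your machine, a quick check like the one below can tell you whether the `ollama` executable is on your PATH. This is a minimal sketch and not part of woolworm: the helper name `ollama_installed` is made up for illustration, and it only checks that the program can be found, not that any model has been downloaded.

```python
import shutil


def ollama_installed(executable: str = "ollama") -> bool:
    """Return True if the given executable can be found on the PATH."""
    # shutil.which returns the full path to the program, or None if it is missing
    return shutil.which(executable) is not None


if ollama_installed():
    print("Ollama found on PATH.")
else:
    print("Ollama not found; install it before using any LLM features.")
```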

Quickstart

If you are extremely impatient, you can get started with two lines of code:

from woolworm import Woolworm

Woolworm.Pipelines.process_image("inputfilename.jpg", "outputfilename.jpg")
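Once the two-line version works for a single file, the same call could be repeated over a whole folder. The sketch below is one way to do that, assuming `process_image` takes an input path and an output path as shown above. The helpers `plan_batch` and `run_batch` and the folder names are hypothetical and are not part of woolworm; the processing function is passed in as an argument so the loop can be tried without woolworm installed.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def plan_batch(input_dir: str, output_dir: str, suffix: str = ".jpg") -> List[Tuple[Path, Path]]:
    """Pair every matching file in input_dir with a same-named path in output_dir."""
    src_root, dst_root = Path(input_dir), Path(output_dir)
    sources = sorted(
        p for p in src_root.iterdir() if p.is_file() and p.suffix.lower() == suffix
    )
    return [(src, dst_root / src.name) for src in sources]


def run_batch(pairs: List[Tuple[Path, Path]], process: Callable[[str, str], object]) -> int:
    """Call process(input_path, output_path) for each pair and return the count."""
    count = 0
    for src, dst in pairs:
        dst.parent.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
        process(str(src), str(dst))
        count += 1
    return count


# Hypothetical usage, assuming process_image accepts (input, output) as in the Quickstart:
# from woolworm import Woolworm
# run_batch(plan_batch("scans", "cleaned"), Woolworm.Pipelines.process_image)
```

Files are matched by extension without regard to case, so `page.JPG` is picked up along with `page.jpg`.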

Behind the scenes, process_image looks like this. You can find this code in the cookbook directory.

from woolworm import Woolworm

p = Woolworm()  # Creates a Woolworm instance

f = "filename.jpg"
base_name = f.replace(".jpg", "")

# Step 1: Load original
img = p.load(f)

# Step 2: de-skew
img = p.deskew_with_hough(img)

# Step 3: This is kinda weird, and currently fine-tuned for use with NU's environmental impact statements.
# Long story short, the program uses some heuristics to detect whether the image is a diagram or mostly text.
# If it thinks the image is text, it will binarize; if it thinks it is a diagram, it will not.
img = p.binarize_or_gray(img)

p.save_image(img)
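To make the idea in Step 3 concrete, here is a small, self-contained sketch of one way a text-versus-diagram decision could work. It is not woolworm's actual heuristic, which is not spelled out here. It assumes a grayscale image given as a list of rows of 0-255 values, treats a page as text when few pixels are mid-gray, and binarizes with a fixed threshold of 128. The names `midtone_fraction` and `binarize_or_gray_sketch` and all of the cutoffs are invented for illustration.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def midtone_fraction(img: Gray, low: int = 64, high: int = 192) -> float:
    """Fraction of pixels that are neither near-black nor near-white."""
    pixels = [v for row in img for v in row]
    if not pixels:
        return 0.0
    return sum(low <= v < high for v in pixels) / len(pixels)


def binarize_or_gray_sketch(img: Gray, max_midtones: float = 0.2, threshold: int = 128) -> Gray:
    """Binarize pages that look like text, leave everything else in grayscale."""
    if midtone_fraction(img) <= max_midtones:
        # Mostly ink and paper: push every pixel to pure black or pure white
        return [[255 if v >= threshold else 0 for v in row] for row in img]
    # Many gray shades suggest a diagram or photo, so keep the original values
    return [row[:] for row in img]


text_like = [[250, 10, 245], [5, 240, 12]]
diagram_like = [[100, 120, 140], [90, 160, 130]]
print(binarize_or_gray_sketch(text_like))     # binarized to 0 and 255 only
print(binarize_or_gray_sketch(diagram_like))  # returned unchanged
```

A real scan would need a more careful measure than a single mid-tone count, but the shape of the decision is the same: measure something about the page, then choose between binarizing and keeping grayscale.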

Sample output from the image processing pipeline:
