
DH2022: Digital Humanities conference

Installation and setup
Tutorials
Credits and re-use terms

Installation and setup

We strongly recommend installing via Anaconda (see the Anaconda website for installation instructions).

Create a new environment:

conda create -n py39deezy python=3.9

Activate the environment:

conda activate py39deezy

Install DeezyMatch via PyPI:

pip install DeezyMatch

Install other dependencies:

pip install "thefuzz"
pip install "gensim"
pip install "nltk"

⚠️ (Optional) You can also install thefuzz[speedup], which provides a 4-10x speedup in string matching (reference: https://github.com/seatgeek/thefuzz):

pip install "thefuzz[speedup]"
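
Before moving on, it can help to confirm that the packages installed above are visible from the active environment. The following is a minimal sketch, not part of the tutorials: the helper name `missing_modules` is hypothetical, and the import names in `REQUIRED_MODULES` are assumed to match the pip package names used above.

```python
from importlib.util import find_spec

# Assumption: each pip package above is importable under the same name.
REQUIRED_MODULES = ["DeezyMatch", "thefuzz", "gensim", "nltk"]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All tutorial dependencies were found.")
```

If any package is reported as missing, check that the py39deezy environment is activated and rerun the corresponding pip install command.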

Clone the DeezyMatch tutorials repository:

git clone https://github.com/Living-with-machines/DeezyMatch_tutorials.git

Download data:

For the libyan_gazetteer tutorial, we need to download data from download.geonames.org and slsgazetteer.org (see Credits and re-use terms for info on licenses).

python prepare_dirs.py

This will create the following files/directories in libyan_gazetteer/data:

libyan_gazetteer
├── data
│   ├── LY.txt
│   ├── alternateNamesV2.txt
│   └── hgl_data.json
├── inputs
│   └── ... 
└── ...
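
To check that the download completed, one option is to test for the three files listed above. This is a minimal sketch under stated assumptions: the helper `missing_data_files` is hypothetical (it is not part of prepare_dirs.py), and the script is assumed to run from the directory that contains libyan_gazetteer.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = ["LY.txt", "alternateNamesV2.txt", "hgl_data.json"]


def missing_data_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # Assumption: the current working directory contains libyan_gazetteer.
    missing = missing_data_files("libyan_gazetteer/data")
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All data files are in place.")
```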

Tutorials

We have prepared three sets of tutorials for the DH2022 conference:

DM_101: Dummy string matching and ranking between a small number of queries and candidates
ocr_with_w2v: Fuzzy string matching and ranking between:
- queries: tokens in English containing OCR errors
- candidates: list of words in the English language
libyan_gazetteer: Toponym matching and ranking between:
- queries: toponyms obtained from the Heritage Gazetteer of Libya (HGL)
- candidates: list of toponyms (and alternate names) belonging to places in current-day Libya, from Geonames.
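
All three tutorials share the same task shape: given a query string, score a list of candidate strings and return the best matches. The sketch below illustrates only that shape, using `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. It is not DeezyMatch's method or API; the function `rank_candidates` and the example strings are hypothetical.

```python
from difflib import SequenceMatcher


def rank_candidates(query, candidates, top_n=3):
    """Return up to top_n (candidate, score) pairs, most similar first."""
    scored = [
        (cand, SequenceMatcher(None, query.lower(), cand.lower()).ratio())
        for cand in candidates
    ]
    # Higher ratio means the two strings share more matching characters.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # A query with an OCR-style error ("0" in place of "o").
    for candidate, score in rank_candidates("Lond0n", ["London", "Paris", "Londonderry"]):
        print(f"{candidate}\t{score:.2f}")
```

In the tutorials, the similarity measure is a trained DeezyMatch model rather than a fixed character-overlap ratio, but the inputs (queries and candidates) and the output (a ranked list per query) play the same roles.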

We start with the DM_101 tutorial:

Go to the DM_101 directory:

cd DH2022/DM_101

Open Jupyter Notebook: tutorial_101.

Here, we use already created datasets (i.e., queries/candidates/pairs) for training and using a DeezyMatch model. This includes training a pair classifier and using the trained model for candidate ranking.

We continue with the ocr_with_w2v tutorial:

Go to the ocr_with_w2v directory:

cd DH2022/ocr_with_w2v

Open Jupyter Notebook: tutorial_ocr_w2v.

The datasets used in this tutorial were created in Jupyter Notebook: prepare_dataset.

Next, we go to the libyan_gazetteer tutorial:

Go to the libyan_gazetteer directory:

cd DH2022/libyan_gazetteer

Open Jupyter Notebook: tutorial_hgl.

Similar to the previous tutorial, we use already created datasets (i.e., queries/candidates/pairs) for training and using a DeezyMatch model. This includes training a pair classifier and using the trained model for candidate ranking.

These datasets were created in Jupyter Notebook: prepare_dataset.

Credits and re-use terms

GeoNames Gazetteer extract files are licensed under a Creative Commons Attribution 4.0 License (see https://creativecommons.org/licenses/by/4.0/).

Heritage Gazetteer of Libya (HGL) is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
