- The data and most of the code in this repository are part of the paper "Sign Detection for Cuneiform Tablets" by Yunus Cobanoglu, Luis Sáenz, Ilya Khait, and Enrique Jiménez.
- The data is available on Zenodo.
- See https://github.com/ElectronicBabylonianLiterature/cuneiform-ocr/blob/main/README.md for an overview and general information on all repositories associated with the paper above.
- In addition, the folder `crop_ocr_signs` refers to a subsequent part of the project, where the resulting crops from the OCR are processed and selected. See "Crop OCR Signs" below.
- requirements.txt (optionally: includes opencv-python)

```sh
pip3 install torch=="2.0.1" torchvision --index-url https://download.pytorch.org/whl/cpu
pip install -U openmim
# Use exactly mmocr 1.0.0rc5: prepare_data.py won't work in newer versions
# (DATA_PARSERS are not backward compatible).
mim install "mmocr==1.0.0rc5"
mim install "mmcv==2.0.0"
```
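To confirm the pinned versions are importable after installation, a quick check (illustrative, assuming the commands above succeeded):

```python
# Quick sanity check of the pinned versions (illustrative).
import torch, mmcv, mmocr

print("torch:", torch.__version__)   # expected 2.0.1
print("mmcv:", mmcv.__version__)     # expected 2.0.0
print("mmocr:", mmocr.__version__)   # expected 1.0.0rc5
```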
Make sure `PYTHONPATH` is set to the root of the repository (e.g. `export PYTHONPATH="$(pwd)"` from the repository root).
See the explanatory video here.
You can use fetch_ocr_data.sh to fetch the newest eBL data. To run the script, modify the configuration at the beginning of the script or pass the values as environment variables.
The data is fetched from our API with https://github.com/ElectronicBabylonianLiterature/ebl-api/blob/master/ebl/fragmentarium/retrieve_annotations.py and then filtered to keep only the fragments applicable for training.
The initial training data and models from the paper are available for download at . The raw data were processed following the instructions below. This code processes both raw and processed data, outputting them in training-ready formats (ICDAR2015 and COCO2017), as shown in the ready-for-training.tar.gz file on Zenodo.
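For orientation, COCO-style detection annotations follow the standard layout sketched below. This is the generic COCO 2017 structure, shown for illustration only; the exact fields written by the conversion scripts in this repo may differ.

```python
# Generic COCO-style detection annotation skeleton (illustration only;
# the converters in this repo may emit slightly different fields).
import json

coco = {
    "images": [
        {"id": 1, "file_name": "P3310-0.jpg", "width": 1000, "height": 800},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [0, 0, 10, 10],  # x, y, width, height
            "area": 100,
            "iscrowd": 0,
        },
    ],
    "categories": [
        {"id": 1, "name": "KUR"},  # sign classes; example name
    ],
}

print(json.dumps(coco, indent=2))
```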
Directory Structure

```
data
    processed-data
        data-coco
        data-icdar2015
        detection
            ...
        classification
            data (after gather_all.py)
                ...
    raw-data
        ebl
        heidelberg
        jooch
        urschrei-CDP
```
1. Preprocess the Heidelberg data; all details are in `cuneiform_ocr_data/heidelberg/README.md`.
2. eBL (our) data is in `data/raw-data/ebl` (it is generally better to create the test set from eBL data because its quality is better).
   2.1. Run `extract_contours.py` with `EXTRACT_AUTOMATICALLY=False` on `data/raw-data/ebl/detection`.
   2.2. Run `display_bboxes.py` and use the keys to delete all images which are not of good quality (N.B.: the `filter_annotations.py` script makes this step unnecessary).
3. Run `select_test_set.py`, which will select 50 random images from `data/processed-data/ebl/ebl-detection-extracted-deleted` (there is currently no option to create a validation set because of the small size of the dataset).
4. `data/processed-data/ebl/ebl-detection-extracted-test` has a .txt file with the names of all images in the test set (this will be necessary to create the train/test split for classification later).
5. Now merge `data/processed-data/heidelberg/heidelberg-extracted-deleted` and `ebl-detection-extracted-train`, which will be your train set (see `data/processed-data/detection`; around 295 train and 50 test instances).
6. Optionally: create an ICDAR2015-style dataset using `convert_to_icdar2015.py`.
7. Optionally: create a COCO-style dataset; `convert_to_coco.py` will create only a test set in COCO style.
The following commands can be used to do these steps:
```sh
# Run fetch_ocr_data.sh
export MONGODB_URI="Your Mongo DB connection"
./fetch_ocr_data.sh
# Run extract_contours.py
DATE_TAG="$(date +%Y-%m-%d)"
export EBL_ANNOTATIONS_PATH="./data/filtered_annotations"
export EBL_DETECTION_EXTRACTED_PATH="./data/processed-data/ebl/ebl-$DATE_TAG/"
python -m cuneiform_ocr_data.extract_contours
# (If necessary) Remove the largest YBC* images and annotations because they are very large and cannot be separated by contours
DATE_TAG="$(date +%Y-%m-%d)"
BASE_DIR="data/processed-data/ebl/ebl-$DATE_TAG"
find "$BASE_DIR/imgs" -name "YBC*.jpg" -size +10M # first have a look
for img in $(find "$BASE_DIR/imgs" -name "YBC*.jpg" -size +10M -exec basename {} .jpg \;); do
rm -f "$BASE_DIR/annotations/gt_$img.txt"
done
find "$BASE_DIR/imgs" -name "YBC*.jpg" -size +10M -exec rm -v {} \;
# Run select_test_set.py
DATE_TAG="$(date +%Y-%m-%d)"
export EBL_DETECTION_EXTRACTED_PATH="./data/processed-data/ebl/ebl-$DATE_TAG/"
export TEST_SET_SIZE="100"
export TEST_PATH="./data/processed-data/ebl/test"
export DELETE_EMPTY_IMGS="yes"
python -m cuneiform_ocr_data.select_test_set
# Convert to coco dataset
export DATA_TRAIN="data/processed-data/ebl/train"
export DATA_TEST="data/processed-data/ebl/test"
export COCO_OUTPUT_PATH="data/processed-data/data-coco"
python -m cuneiform_ocr_data.convert_to_coco_recognition
# Convert to icdar dataset
export ICDAR_DATA_PATH="./data/processed-data/ebl/ebl-$DATE_TAG/"
export ICDAR_TEST_SET_PATH="./data/processed-data/ebl/test/test_imgs.txt"
# export ICDAR_OUTPUT_PATH="./data/processed-data/data-icdar"
unset ICDAR_OUTPUT_PATH
python -m cuneiform_ocr_data.convert_to_icdar2015_detection
```

Example image: P3310-0.jpg, with ground truth gt_P3310-0.txt.
The ground truth contains the top-left x, top-left y, width, height, and the sign. A sign followed by ? means it is partially broken. Unclear signs have the value 'UnclearSign'.
Example line: 0,0,10,10,KUR
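A minimal sketch of parsing one such ground-truth file under the format just described (an illustrative helper, not part of this repo):

```python
# Parse a gt_*.txt detection ground-truth file: each line is
# "top-left x, top-left y, width, height, sign", e.g. "0,0,10,10,KUR".
from pathlib import Path

def parse_gt_file(path: Path) -> list[dict]:
    boxes = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        x, y, w, h, sign = line.split(",", maxsplit=4)
        boxes.append({
            "bbox": (int(x), int(y), int(w), int(h)),
            "sign": sign.rstrip("?"),             # "KUR?" -> "KUR"
            "partially_broken": sign.endswith("?"),
            "unclear": sign == "UnclearSign",
        })
    return boxes

# e.g. parse_gt_file(Path("gt_P3310-0.txt"))
```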
1. Fetch sign images from https://labasi.acdh.oeaw.ac.at/ using their API in `cuneiform_ocr_data/labasi`.
2. Run `classification/cdp/main.py`; it will map the data using our ABZ sign mapping. Some signs can't be mapped (this should be checked by an Assyriologist for correctness).
3. Run `classification/jooch/main.py`; it will map the data using our ABZ sign mapping. Some signs can't be mapped (this should be checked by an Assyriologist for correctness).
4. Merge `data/processed-data/heidelberg/heidelberg` and `data/raw-data/ebl/ebl-classification` into `data/processed-data/ebl+heidelberg-classification` (see the sketch after this list).
5. Split `data/processed-data/ebl+heidelberg-classification` into `data/processed-data/ebl+heidelberg-classification-train` and `data/processed-data/ebl+heidelberg-classification-test` by copying all files which are part of the detection test set, using the script `move_test_set_for_classification.py`.
6. Run `crop_signs.py` on `data/processed-data/ebl+heidelberg-classification-train` and `data/processed-data/ebl+heidelberg-classification-test`; you can modify `crop_signs.py` to include/exclude partially broken or UnclearSign crops.
7. `data/processed-data/classification` should contain Cuneiform Dataset JOOCH, ebl+heidelberg/ebl+heidelberg-train, ebl+heidelberg/ebl+heidelberg-test, labasi, and urschrei-CDP-processed. `gather_all.py` will gather and finalize the format for training/testing of all the folders from step 7 (gather_all.py will create the `cuneiform_ocr_data/classification/data` directory with `classes.txt`, which lists all classes used for training/testing).
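The merge in step 4 is essentially a recursive copy of both folders into one target. A rough sketch under that assumption (the collision handling here is illustrative, not the project's own script):

```python
# Hedged sketch of step 4: copy both classification folders into one target.
import shutil
from pathlib import Path

sources = [
    Path("data/processed-data/heidelberg/heidelberg"),
    Path("data/raw-data/ebl/ebl-classification"),
]
target = Path("data/processed-data/ebl+heidelberg-classification")

for src in sources:
    for f in src.rglob("*"):
        if f.is_file():
            dest = target / f.relative_to(src)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, dest)  # later sources overwrite on name collisions
```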
- Use `move_test_set_for_classification.py` to extract all images belonging to the detection test set for classification.
- Images from LMU and Heidelberg are cropped using `crop_signs.py` and converted to the ABZ sign list via the `ebl.txt` mapping from OraccGlobalSignList/MZL to ABZ number (a hedged illustration follows this list). Partially broken and unclear signs can be included or excluded via a parameter in the script.
- Images from CDP (urschrei-cdp) are renamed using the mapping from the urschrei repo https://github.com/urschrei/CDP/csvs (see cuneiform_ocr/preprocessing_cdp):
  - Images are renamed with `rename_to_mzl.py`.
  - Images are mapped via the urschrei-cdp corrected_instances_forimport.xlsx and a custom mapping via `convert_cdp_and_jooch.py`.
- Cuneiform JOOCH images are currently not used due to bad quality.
- The Labasi project is scraped with `labasi/crawl_labasi_page.py` (this can take multiple hours with interruptions) and renamed manually to fit the `ebl.txt` mapping. It should also be possible to query Labasi through its API.
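As a hedged illustration of the sign-name mapping mentioned above: ASSUMING `ebl.txt` is a simple two-column file mapping MZL numbers to ABZ numbers (check the actual layout in the repo before relying on this), loading it could look like:

```python
# ASSUMPTION: ebl.txt has two whitespace-separated columns per line,
# mapping an MZL number to an ABZ number. Verify against the real file.
from pathlib import Path

def load_mzl_to_abz(path: Path) -> dict[str, str]:
    mapping = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) >= 2:
            mzl, abz = parts[0], parts[1]
            mapping[mzl] = abz
    return mapping

# mapping = load_mzl_to_abz(Path("ebl.txt"))
# abz = mapping.get("748")  # hypothetical MZL number
```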
- Deep Learning of Cuneiform Sign Detection with Weak Supervision using Transliteration Alignment: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0243039
  - Annotated tablets (75 tablets): https://compvis.github.io/cuneiform-sign-detection-dataset/ -> Heidelberg data
- Towards Query-by-eXpression Retrieval of Cuneiform Signs: https://patrec.cs.tu-dortmund.de/pubs/papers/Rusakov2020-TQX
  - -> JOOCH dataset: https://graphics-data.cs.tu-dortmund.de/docs/publications/cuneiform/
- Labasi Project: https://labasi.acdh.oeaw.ac.at/
- CDP Project: https://github.com/urschrei/CDP
- LMU: https://www.ebl.lmu.de/
```bibtex
@article{CobanogluSáenzKhaitJiménez+2024,
    url = {https://doi.org/10.1515/itit-2024-0028},
    title = {Sign detection for cuneiform tablets},
    author = {Yunus Cobanoglu and Luis Sáenz and Ilya Khait and Enrique Jiménez},
    journal = {it - Information Technology},
    doi = {10.1515/itit-2024-0028},
    year = {2024},
    lastchecked = {2024-06-01}
}
```
The results from the first iteration of the 2024 OCR model are in the file eBL_OCRed_Signs.json. To run the code, you will need to download that file from the 2024 paper on Zenodo and put it in the top-level folder of the repo.
To crop the signs into organised folders, run crop_ocr_signs/extract_data.py.
If you get a 'no module found' or similar error, run `export PYTHONPATH="{your_local_path_to_the_repo}/cuneiform-ocr-data"`.
The logic for deciding which subset of the cropped signs to check is in crop_ocr_signs/verify_signs/get_partial_order_signs.py. The signs it selects are, by rough approximation, those more likely to have been wrongly read by the OCR.
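For a first look at eBL_OCRed_Signs.json before running the scripts above, a minimal sketch; the file's internal field names are not documented in this README, so inspect them before relying on any of them:

```python
# Illustrative first look at eBL_OCRed_Signs.json; the internal field
# names are not documented in this README, so inspect them before use.
import json

with open("eBL_OCRed_Signs.json", encoding="utf-8") as f:
    ocr_results = json.load(f)

print(type(ocr_results))
if isinstance(ocr_results, list) and ocr_results:
    print(ocr_results[0])          # sample record
elif isinstance(ocr_results, dict):
    print(list(ocr_results)[:10])  # top-level keys
```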