
ElectronicBabylonianLiterature/cuneiform-ocr-data


Cuneiform OCR Data Preprocessing and Post-Processing, part of the eBL Project (Website, GitHub)

  1. The data and most of the code in this repository are part of the paper Sign Detection for Cuneiform Tablets by Yunus Cobanoglu, Luis Sáenz, Ilya Khait, and Enrique Jiménez.
    The data are available on Zenodo (DOI).

    See https://github.com/ElectronicBabylonianLiterature/cuneiform-ocr/blob/main/README.md for an overview and general information about all repositories associated with the paper above.

  2. In addition, the folder crop_ocr_signs covers a subsequent part of the project, in which the crops produced by the OCR are processed and selected. See Crop OCR Signs below.

Installation

  • Install the packages in requirements.txt (opencv-python is optional)
  • pip3 install torch=="2.0.1" torchvision --index-url https://download.pytorch.org/whl/cpu
  • pip install -U openmim
  • mim install "mmocr==1.0.0rc5" it is important to use this exact version because prepare_data.py won't work in newer versions (DATA_PARSERS are not backward compatible)
  • mim install "mmcv==2.0.0"

Make sure PYTHONPATH is set to the root of the repository.

See the explanatory video here.

Data

You can use fetch_ocr_data.sh to fetch the newest eBL data. To run the script, modify the configuration at the beginning of the script or pass the values as environment variables.

The data are fetched from our API with https://github.com/ElectronicBabylonianLiterature/ebl-api/blob/master/ebl/fragmentarium/retrieve_annotations.py.

The fragments are then filtered to keep those applicable for training.
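
The exact criteria live in the repository's filtering code; purely as an illustration of the idea, a minimal sketch (the MIN_BBOXES threshold and the directory layout are assumptions, not the actual logic):

# Hypothetical sketch: keep only fragments whose annotation files contain
# enough bounding boxes to be useful for training. Threshold and paths
# are illustrative assumptions.
from pathlib import Path

MIN_BBOXES = 5  # assumed threshold

def is_applicable(annotation_file: Path) -> bool:
    lines = annotation_file.read_text(encoding="utf-8").splitlines()
    return sum(1 for line in lines if line.strip()) >= MIN_BBOXES

annotations = Path("data/filtered_annotations")
applicable = [f for f in annotations.glob("gt_*.txt") if is_applicable(f)]
print(f"{len(applicable)} fragments pass the filter")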

The initial training data and models from the paper are available for download at DOI. The raw data were processed following the instructions below. This code processes both raw and processed data, outputting them in training-ready formats (ICDAR2015 and COCO2017), as shown in the ready-for-training.tar.gz file on Zenodo.

Directory Structure

data
  processed-data
    data-coco
    data-icdar2015
    detection
    ...
    classification
      data (after gather_all.py)
      ...
  raw-data
    ebl
    heidelberg
    jooch
    urschrei-CDP

Data Preprocessing for Text Detection (Predict only Bounding Boxes)

  1. Preprocess the Heidelberg data; all details are in cuneiform_ocr_data/heidelberg/README.md

  2. eBL (our) data is in data/raw-data/ebl (it is generally better to create the test set from the eBL data because its quality is better)

    2.1. Run extract_contours.py with EXTRACT_AUTOMATICALLY=False on data/raw-data/ebl/detection

    2.2. Run display_bboxes.py and use the keyboard shortcuts to delete all images that are not of good quality (N.B.: the filter_annotations.py script makes this step unnecessary).

  3. Run select_test_set.py, which will randomly select 50 images from data/processed-data/ebl/ebl-detection-extracted-deleted (there is currently no option to create a validation set because of the small dataset size)

  4. data/processed-data/ebl/ebl-detection-extracted-test contains a .txt file with the names of all images in the test set (this will be needed to create the train/test split for classification later)

  5. Now merge data/processed-data/heidelberg/heidelberg-extracted-deleted and ebl-detection-extracted-train; this will be your train set (see data/processed-data/detection, around 295 train and 50 test instances).

  6. Optionally: create an ICDAR2015-style dataset using convert_to_icdar2015.py

  7. Optionally: create a COCO-style dataset with convert_to_coco.py (it will create only a COCO-style test set)

The following commands can be used to do these steps:

# Run fetch_ocr_data.sh
export MONGODB_URI="Your Mongo DB connection"
./fetch_ocr_data.sh 

# Run extract_contours.py
DATE_TAG="$(date +%Y-%m-%d)"
export EBL_ANNOTATIONS_PATH="./data/filtered_annotations"
export EBL_DETECTION_EXTRACTED_PATH="./data/processed-data/ebl/ebl-$DATE_TAG/"
python -m cuneiform_ocr_data.extract_contours 

# (If necessary) Remove the largest YBC* images and their annotations because they are very large and cannot be separated by contours
DATE_TAG="$(date +%Y-%m-%d)"
BASE_DIR="data/processed-data/ebl/ebl-$DATE_TAG"
find "$BASE_DIR/imgs" -name "YBC*.jpg" -size +10M # first have a look

for img in $(find "$BASE_DIR/imgs" -name "YBC*.jpg" -size +10M -exec basename {} .jpg \;); do
    rm -f "$BASE_DIR/annotations/gt_$img.txt"
done
find "$BASE_DIR/imgs" -name "YBC*.jpg" -size +10M -exec rm -v {} \;

# Run select_test_set.py
DATE_TAG="$(date +%Y-%m-%d)"
export EBL_DETECTION_EXTRACTED_PATH="./data/processed-data/ebl/ebl-$DATE_TAG/"
export TEST_SET_SIZE="100"
export TEST_PATH="./data/processed-data/ebl/test"
export DELETE_EMPTY_IMGS="yes"
python -m cuneiform_ocr_data.select_test_set

# Convert to coco dataset
export DATA_TRAIN="data/processed-data/ebl/train"
export DATA_TEST="data/processed-data/ebl/test"
export COCO_OUTPUT_PATH="data/processed-data/data-coco"
python -m cuneiform_ocr_data.convert_to_coco_recognition

# Convert to icdar dataset
DATE_TAG="$(date +%Y-%m-%d)"
export ICDAR_DATA_PATH="./data/processed-data/ebl/ebl-$DATE_TAG/"
export ICDAR_TEST_SET_PATH="./data/processed-data/ebl/test/test_imgs.txt"
# export ICDAR_OUTPUT_PATH="./data/processed-data/data-icdar"
unset ICDAR_OUTPUT_PATH
python -m cuneiform_ocr_data.convert_to_icdar2015_detection
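
For orientation, the conversion targets the standard COCO detection layout. A minimal hand-written sketch of one image with one annotation (ids, file name, and dimensions are invented for illustration; this is not the converter's exact output):

# Minimal sketch of the COCO detection structure; all concrete values are invented.
coco = {
    "images": [
        {"id": 1, "file_name": "P3310-0.jpg", "width": 2000, "height": 3000}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 0,
            "bbox": [0, 0, 10, 10],  # top-left x, top-left y, width, height
            "area": 100,
            "iscrowd": 0,
        }
    ],
    "categories": [{"id": 0, "name": "KUR"}],
}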

Data Format

Image: P3310-0.jpg, with ground truth gt_P3310-0.txt.

The ground truth contains top-left x, top-left y, width, height, and the sign.

A sign followed by ? is partially broken. Unclear signs have the value 'UnclearSign'.

Example: 0,0,10,10,KUR
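
For clarity, a small sketch of how one might read this format (the helper below is illustrative, not a script from the repository):

# Parse the ground-truth format described above: one "x,y,width,height,sign"
# entry per line, e.g. "0,0,10,10,KUR".
from pathlib import Path

def parse_gt(path: Path):
    boxes = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        x, y, w, h, sign = line.strip().split(",", 4)
        boxes.append({
            "bbox": (int(x), int(y), int(w), int(h)),
            "sign": sign.rstrip("?"),
            "partially_broken": sign.endswith("?"),
            "unclear": sign == "UnclearSign",
        })
    return boxes

for box in parse_gt(Path("gt_P3310-0.txt")):
    print(box)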

Data Preprocessing for Image Classification

  1. Fetch sign images from https://labasi.acdh.oeaw.ac.at/ using their API, see cuneiform_ocr_data/labasi
  2. Run classification/cdp/main.py, which will map the data using our ABZ sign mapping; some signs cannot be mapped (these should be checked by an Assyriologist for correctness)
  3. Run classification/jooch/main.py, which will map the data using our ABZ sign mapping; some signs cannot be mapped (these should be checked by an Assyriologist for correctness)
  4. Merge data/processed-data/heidelberg/heidelberg and data/raw-data/ebl/ebl-classification into data/processed-data/ebl+heidelberg-classification
  5. Split data/processed-data/ebl+heidelberg-classification into data/processed-data/ebl+heidelberg-classification-train and data/processed-data/ebl+heidelberg-classification-test by copying all files which are part of the detection test set, using the script move_test_set_for_classification.py
  6. Run crop_signs.py on data/processed-data/ebl+heidelberg-classification-train and data/processed-data/ebl+heidelberg-classification-test; you can modify crop_signs.py to include/exclude partially broken or unclear signs
  7. data/processed-data/classification should contain Cuneiform Dataset JOOCH, ebl+heidelberg/ebl+heidelberg-train, ebl+heidelberg/ebl+heidelberg-test, labasi, and urschrei-CDP-processed
  8. gather_all.py will gather and finalize the format for training/testing from all the folders in step 7 (it will create the "cuneiform_ocr_data/classification/data" directory with a classes.txt listing all classes used for training/testing); a minimal sketch of this gathering step follows below
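
As a rough illustration of what the gathering step does, a minimal sketch (the source folder list and layout are assumptions; see gather_all.py for the actual logic):

# Hypothetical sketch: merge per-source class folders into one training
# directory and record the class list. Paths are illustrative assumptions.
import shutil
from pathlib import Path

sources = [
    Path("data/processed-data/classification/ebl+heidelberg/ebl+heidelberg-train"),
    Path("data/processed-data/classification/labasi"),
]
target = Path("cuneiform_ocr_data/classification/data")
target.mkdir(parents=True, exist_ok=True)

classes = set()
for source in sources:
    for class_dir in source.iterdir():
        if not class_dir.is_dir():
            continue
        classes.add(class_dir.name)
        dest = target / class_dir.name
        dest.mkdir(exist_ok=True)
        for img in class_dir.iterdir():
            shutil.copy(img, dest / img.name)

(target / "classes.txt").write_text("\n".join(sorted(classes)), encoding="utf-8")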

Data Preprocessing for Image (Sign) Classification (Details)

  1. Use move_test_set_for_classification.py to extract all images belonging to the detection test set for classification
  2. Images from LMU and Heidelberg are cropped using crop_signs.py and converted to the ABZ sign list via the ebl.txt mapping from OraccGlobalSignList/MZL to ABZ numbers (a rough sketch of this kind of mapping follows after this list)
    • Partially broken and unclear signs can be included or excluded via a parameter in the script
  3. Images from CDP (urschrei-cdp) are renamed using the mapping from the urschrei repo https://github.com/urschrei/CDP/csvs (see cuneiform_ocr/preprocessing_cdp)
    • Images are renamed with rename_to_mzl.py
    • Images are mapped via the urschrei-cdp corrected_instances_forimport.xlsx and a custom mapping in convert_cdp_and_jooch.py
  4. Cuneiform JOOCH images are currently not used due to their poor quality
  5. The Labasi project is scraped with labasi/crawl_labasi_page.py (this can take multiple hours, with interruptions), and the images are renamed manually to fit the ebl.txt mapping. It should also be possible to query Labasi through its API.
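
As a rough sketch of the sign-to-ABZ mapping and renaming idea (the "source_name,ABZ_number" format assumed for ebl.txt and the file naming scheme are guesses, not the repository's actual formats):

# Hypothetical sketch: load a sign mapping and rename image files to their
# ABZ numbers. The ebl.txt format and file naming are assumptions.
from pathlib import Path

def load_mapping(path: Path):
    mapping = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        if "," in line:
            source, abz = line.split(",", 1)
            mapping[source.strip()] = abz.strip()
    return mapping

mapping = load_mapping(Path("ebl.txt"))
for img in Path("data/raw-data/urschrei-CDP").glob("*.png"):
    sign = img.stem.split("_")[0]  # assumed naming scheme
    if sign in mapping:
        img.rename(img.with_name(f"ABZ{mapping[sign]}_{img.stem}.png"))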

Acknowledgements / Citation

Cite this paper:

@article{CobanogluSáenzKhaitJiménez+2024,
url = {https://doi.org/10.1515/itit-2024-0028},
title = {Sign detection for cuneiform tablets},
author = {Yunus Cobanoglu and Luis Sáenz and Ilya Khait and Enrique Jiménez},
journal = {it - Information Technology},
doi = {10.1515/itit-2024-0028},
year = {2024}
}

Crop OCR Signs

The results from the first iteration of the 2024 OCR model are in the file eBL_OCRed_Signs.json. To run the code, you will need to download that file from the 2024 paper's Zenodo record and place it at the top level of the repository.

To crop the signs into organised folders, run crop_ocr_signs/extract_data.py.

If you get a 'no module found' or similar error, run export PYTHONPATH="{your_local_path_to_the_repo}/cuneiform-ocr-data"
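
The structure of eBL_OCRed_Signs.json is not documented here; purely for illustration, a sketch of the cropping idea (the field names "image", "bbox", and "sign" are assumptions; consult crop_ocr_signs/extract_data.py for the real logic):

# Hypothetical sketch: crop OCR'ed signs into per-sign folders.
# JSON field names are assumptions for illustration only.
import json
from pathlib import Path
from PIL import Image

entries = json.loads(Path("eBL_OCRed_Signs.json").read_text(encoding="utf-8"))
out_root = Path("crops")

for entry in entries:
    img = Image.open(entry["image"])  # assumed field
    x, y, w, h = entry["bbox"]  # assumed field
    crop = img.crop((x, y, x + w, y + h))
    sign_dir = out_root / entry["sign"]  # assumed field
    sign_dir.mkdir(parents=True, exist_ok=True)
    crop.save(sign_dir / f"{Path(entry['image']).stem}_{x}_{y}.png")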

The logic for deciding which subset of the cropped signs to check is in crop_ocr_signs/verify_signs/get_partial_order_signs.py. The signs selected are, by rough approximation, those more likely to have been misread by the OCR.
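
As a rough illustration of that selection idea (the "confidence" field and the review budget are hypothetical; the actual ordering criteria are in get_partial_order_signs.py):

# Hypothetical sketch: rank crops so that the signs the OCR is least
# confident about are checked first. Field names are assumptions.
import json
from pathlib import Path

entries = json.loads(Path("eBL_OCRed_Signs.json").read_text(encoding="utf-8"))
likely_wrong_first = sorted(entries, key=lambda e: e.get("confidence", 0.0))
to_check = likely_wrong_first[:500]  # assumed review budget
print(f"Selected {len(to_check)} crops for manual verification")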
