This project presents version 1.0 of the Arabic ChainBank, the first derivational resource for Modern Standard Arabic. The ChainBank links all derivatives that belong to the same derivational family in a chain that begins with the root and progresses through successive derived forms. Each chain captures the relationships between a base form and its derivatives, showing how words evolve morphologically while retaining their semantic and grammatical connections. The Arabic ChainBank draws on two primary data sources: the abstract network, a knowledge graph-based network of abstract patterns, and the database of the multidialectal morphological analyzer camel_morph.
The data for this version is extracted from CamelMorph v1.0, which provides the morphological features required for constructing the ChainBank lemmas. If you only want to replicate our experiment, the sample data provided in the data folder is sufficient.
The ChainBank aligns the abstract network dataset with the CamelMorph dataset to construct a derivational subtree for each root. It streamlines the creation of structured derivational knowledge graphs, making it a valuable resource for linguistic analysis and computational morphology.
This process is carried out in three main stages:
Merging Data: Integrating root data with the abstract network, src/DataMerging/merge_data.py.
Building Trees: Recursively generating derivational trees starting from each root, src/TreeBuilding/build_tree.py.
Visualizing Trees: Creating graphical representations of the derivational structures, src/visualize_trees.py.
If you want to run a sample of the data to test the method used to build the ChainBank, follow the instructions in the following section. To run the scripts on the full data extracted from CamelMorph, you need to provide the path to the directory containing that data.
Below are instructions on how to generate a ChainBank using an input dataset as well as how to generate visualization graphs for the ChainBank.
Note: In the code snippets below, we use a smaller set of lemmas as input, the data file data/lemmas_sample.csv. To generate the full ChainBank, use the file data/lemmas.csv instead where appropriate.
Before using the provided scripts, ensure you have the required libraries installed. You can install them using the following command:
pip install -r requirements.txt
Run the following command to merge the lemma data with the abstract network from ChainBank:
python src/merge_data.py data/abstract_network.csv data/lemmas_sample.csv output/merged.csv
This will generate a processed dataset that combines lemmas and the abstract network.
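As a rough illustration of what such a merge could look like, the sketch below joins lemma rows to abstract-network edges on a shared pattern key, using only the standard library. The column names (`root`, `lemma`, `pattern`, `parent_pattern`, `child_pattern`, `relation`), the pattern and relation labels, and the sample rows are all assumptions made for this example; they are not taken from the actual CSV files or from the merge script. The sketch also assumes each pattern has a single parent, which may not hold in the real network.

```python
import csv
import io

# Hypothetical sample data; the real files may use different columns.
NETWORK_CSV = """parent_pattern,child_pattern,relation
ROOT,1a2a3,verb_form_I
1a2a3,1A2i3,active_participle
1a2a3,ma12u3,passive_participle
"""

LEMMAS_CSV = """root,lemma,pattern
k.t.b,katab,1a2a3
k.t.b,kAtib,1A2i3
k.t.b,maktub,ma12u3
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def merge(network_rows, lemma_rows):
    """Attach to each lemma the network edge that leads to its pattern."""
    # Index the network edges by the pattern they produce
    # (assumption: one parent per pattern).
    edge_by_child = {row["child_pattern"]: row for row in network_rows}
    merged = []
    for lemma in lemma_rows:
        edge = edge_by_child.get(lemma["pattern"])
        if edge is None:
            continue  # pattern not covered by the network: skip the lemma
        merged.append({**lemma,
                       "parent_pattern": edge["parent_pattern"],
                       "relation": edge["relation"]})
    return merged


merged = merge(read_rows(NETWORK_CSV), read_rows(LEMMAS_CSV))
for row in merged:
    print(row["root"], row["lemma"], "<-", row["parent_pattern"], row["relation"])
```

Each merged row carries both the lemma's own features and the pattern it derives from, which is the information a later tree-building step would need.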
To construct the derivational trees using the merged data, execute:
python src/build_trees.py output/merged.csv output/trees.csv
This script recursively builds the tree structures for derivational relationships.
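The recursion can be pictured with a minimal sketch such as the one below. It assumes merged records of the hypothetical form (root, lemma, pattern, parent pattern), with a made-up `ROOT` marker for patterns derived directly from the root; the record layout, the sample lemmas, and the function names are illustrative only and do not reflect the actual script.

```python
from collections import defaultdict

# Hypothetical merged records: (root, lemma, pattern, parent_pattern).
MERGED = [
    ("k.t.b", "katab", "1a2a3", "ROOT"),
    ("k.t.b", "kAtib", "1A2i3", "1a2a3"),
    ("k.t.b", "maktub", "ma12u3", "1a2a3"),
    ("d.r.s", "daras", "1a2a3", "ROOT"),
]


def build_trees(records):
    """Return {root: list of nested nodes} built recursively from each root."""
    # Group lemmas by (root, parent pattern) so children are easy to find.
    children = defaultdict(list)
    for root, lemma, pattern, parent in records:
        children[(root, parent)].append((lemma, pattern))

    def expand(root, pattern, seen):
        """Recursively collect the subtree hanging below `pattern`."""
        nodes = []
        for lemma, child_pattern in children.get((root, pattern), []):
            if child_pattern in seen:
                continue  # guard against cycles in the pattern network
            nodes.append({
                "lemma": lemma,
                "pattern": child_pattern,
                "children": expand(root, child_pattern, seen | {child_pattern}),
            })
        return nodes

    roots = sorted({record[0] for record in records})
    return {root: expand(root, "ROOT", frozenset()) for root in roots}


trees = build_trees(MERGED)
```

Here `trees["k.t.b"]` holds one top-level node for the base verb, with the two participles nested under it as children.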
To generate and visualize the derivational trees, run:
python src/visualize_trees.py output/trees.csv output/trees.pdf
This will create a PDF file containing a graphical representation, in the form of trees, of the derivational relationships in the ChainBank dataset.
IMPORTANT: It is advised not to run this on the full dataset. The process is resource-heavy and will generate a huge PDF file for the full data.
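To give a feel for how a derivational tree can be turned into a drawable graph, the sketch below converts a small nested tree into Graphviz DOT source text. The nested-dictionary shape, the sample lemmas, and the `to_dot` helper are assumptions made for this example; the visualization script in this repository may use a different library and a different input format.

```python
# Hypothetical tree in a nested-dictionary shape chosen for this example.
TREE = {
    "lemma": "k.t.b",
    "children": [
        {"lemma": "katab", "children": [
            {"lemma": "kAtib", "children": []},
            {"lemma": "maktub", "children": []},
        ]},
    ],
}


def to_dot(tree, name="chain"):
    """Render a nested tree as Graphviz DOT source text."""
    lines = [f"digraph {name} {{"]

    def walk(node):
        # Emit one edge per parent/child pair, then descend into the child.
        for child in node["children"]:
            lines.append(f'  "{node["lemma"]}" -> "{child["lemma"]}";')
            walk(child)

    walk(tree)
    lines.append("}")
    return "\n".join(lines)


dot_source = to_dot(TREE)
print(dot_source)
```

If Graphviz is installed, the printed text can be saved to a file such as `chain.dot` and rendered with `dot -Tpdf chain.dot -o chain.pdf`. Producing one small graph per root in this way also keeps individual outputs manageable.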
The data files accessed through the below links are licensed under a Creative Commons Attribution 4.0 International License.
The scripts used for generating the data files are released under the MIT license.
- Reham Marzouk
- Nizar Habash
The framework of ChainBank 1.0 is introduced in a paper published at COLING 2025. If you find the Arabic ChainBank database useful in your research, please cite our paper:
Reham Marzouk, Sondos Krouna, and Nizar Habash. 2025. A Derivational ChainBank for Modern Standard Arabic. In Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script, pages 78–87, Abu Dhabi, UAE. Association for Computational Linguistics.
@inproceedings{marzouk-etal-2025-derivational,
title = "A Derivational {C}hain{B}ank for {M}odern {S}tandard {A}rabic",
author = "Marzouk, Reham and
Krouna, Sondos and
Habash, Nizar",
editor = "El-Haj, Mo",
booktitle = "Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.abjadnlp-1.9/",
pages = "78--87",
abstract = "We introduce the new concept of an Arabic Derivational Chain Bank (CHAINBANK) to leverage the relationship between form and meaning in modeling Arabic derivational morphology. We constructed a knowledge graph network of abstract patterns and their derivational relations, and aligned it with the lemmas of the CAMELMORPH morphological analyzer database. This process produced chains of derived words' lemmas linked to their corresponding lemma bases through derivational relations, encompassing 23,333 derivational connections. The CHAINBANK is publicly available."
}