Introduction

A program to align segmented Qurʾān translation text to Arabic word ranges.

Installation

You can use either pip or pipx for installation.

If you only want to align:

pip install git+https://git.sr.ht/~rehandaphedar/rabtize

If you want to generate embeddings as well:

pip install "rabtize[embed] @ git+https://git.sr.ht/~rehandaphedar/rabtize"

If you want to use XPU:

pip install "rabtize[embed,embed-xpu] @ git+https://git.sr.ht/~rehandaphedar/rabtize" --extra-index-url https://download.pytorch.org/whl/xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

With pipx, the extra index URLs must be passed through --pip-args:

pipx install "rabtize[embed,embed-xpu] @ git+https://git.sr.ht/~rehandaphedar/rabtize" --pip-args="--extra-index-url https://download.pytorch.org/whl/xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/"

As for other hardware acceleration backends, I cannot test them myself. However, contributions to add support for them are welcome.

Usage

From the Quranic Universal Library (QUL), or from any other source with the same schema, obtain the following:

A Quran script (qpc-hafs-word-by-word.json).
A translation (-simple.json) with segments in jumlize format.

Pass these files as --words/-w and --translation/-t, respectively, to all other commands.

Further documentation for CLI flags can be accessed by appending -h to the program or to any subcommand.

Generating Embeddings

Generate two sets of embeddings:

rabtize embed spans embeddings/spans.npz
rabtize embed segments embeddings/en-sahih-international-simple.npz

Spans embeddings: These are the embeddings for each possible (sequential) range of words found in the Quran script. Currently, these amount to just under 700k texts. These embeddings only have to be generated once (per model) and can be reused across different translations.
Segments embeddings: These are the embeddings for each segment in the translation. These must be generated separately for each translation.
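
To illustrate what "each possible (sequential) range of words" means, the following minimal sketch enumerates every contiguous word range of a single verse. It is an illustration only, not rabtize's actual implementation: the helper name enumerate_spans is hypothetical, and the 1-based inclusive indexing is assumed here to match the output format.

```python
def enumerate_spans(word_count):
    """Return every contiguous (start, end) word range for one verse.

    Hypothetical helper for illustration. Indices are 1-based and inclusive.
    """
    return [
        (start, end)
        for start in range(1, word_count + 1)
        for end in range(start, word_count + 1)
    ]


# A verse of n words yields n * (n + 1) // 2 spans.
spans = enumerate_spans(4)
print(len(spans))
print(spans[:3])
```

Because the number of ranges grows quadratically with verse length, long verses contribute most of the texts, which is why generating the spans embeddings once and reusing them across translations is worthwhile.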

Aligning Segments

rabtize align -sp embeddings/spans.npz -se embeddings/en-sahih-international-simple.npz results/en-sahih-international-simple.json

Output Format

The output has the following format. Only the word_range field is added; the rest of the output is the same as jumlize's output format:

{
	"[verse_key]": {
		"t": "[text]",
		"segments": [
			{
				"t": "[segment_text]",
				"word_range": {
					"start": [start_index],
					"end": [end_index]
				}
			},
			{
				"t": "[segment_text]",
				"word_range": {
					"start": [start_index],
					"end": [end_index]
				}
			}
		]
	}
}

Both start_index and end_index are 1-based and inclusive.
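
As a minimal sketch of how a consumer might use these ranges, the snippet below maps each segment back to the words it covers. The helper name segment_words, the sample verse key, and the placeholder words (w1 to w4) are all hypothetical; only the JSON shape and the 1-based inclusive convention come from the format described above.

```python
import json


def segment_words(verse_words, word_range):
    """Return the words covered by a 1-based, inclusive word_range."""
    start, end = word_range["start"], word_range["end"]
    # A 1-based inclusive range [start, end] is the 0-based slice [start - 1:end].
    return verse_words[start - 1:end]


# Placeholder data in the documented output shape (not real alignment results).
sample = json.loads("""
{
  "1:1": {
    "t": "segment one segment two",
    "segments": [
      {"t": "segment one", "word_range": {"start": 1, "end": 2}},
      {"t": "segment two", "word_range": {"start": 3, "end": 4}}
    ]
  }
}
""")

# Placeholder words of the verse, in order; in practice these would come
# from the word-by-word script file.
verse_words = ["w1", "w2", "w3", "w4"]

for verse_key, verse in sample.items():
    for segment in verse["segments"]:
        words = segment_words(verse_words, segment["word_range"])
        print(verse_key, repr(segment["t"]), words)
```

Note the off-by-one handling: because end is inclusive, it is used unchanged as the slice's exclusive upper bound, while start is reduced by one.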

Results

Generated embeddings can be found in the rabtize dataset on Hugging Face.

Aligned translations can be found under results/.
