Kalimmat Arabic Words Collection

Introduction

This March, I observed that the website kalimmat.com was experiencing internal server errors, making certain pages and specific regex inaccessible. After attempting to notify them without receiving any response, I devised a script to salvage the dataset available on their site to ensure this valuable resources were preserved. Currently, the website is operational, and I have validated its functionality using my script.

Script Overview

words.py: This script is utilized to scrape Arabic words from kalimmat.com. Depending on the specified URL, it can fetch words that either start or end with a particular Arabic letter. The words are then processed, cleaned, and saved as a JSON file.
readjson.py: As the name suggests, this script reads the JSON files created by words.py. It specifically retrieves words that end with a defined sequence of letters, but this behavior can be easily customized.

Dataset Files

startWithWords.json: Contains words from the common dictionary scraped under the "starts-with" URL.
startsWithWordsSome.json: Features words from the official dictionary scraped under the "starts-with" URL.
startWithWordsOLD.json: An older dataset, representing the initial attempt to salvage the words from the site.
endsWithWords.json: Contains words from the common dictionary scraped under the "ends-with" URL.
endsWithWordsSome.json: Contains words from the official dictionary scraped under the "ends-with" URL.
endsWithWordsOLD.json: An older version of the dataset scraped under the "ends-with" URL.

Usage

Scraping Words: Run the words.py script after selecting the desired URL (starts-with or ends-with). The scraped words will be saved to the respective JSON file.
Reading the JSON Files: Use the readjson.py script to read and process data from the JSON files. For instance, to retrieve words that end with a specific sequence of letters.

Acknowledgments

I would like to express my appreciation to the team behind kalimmat.com for providing such a valuable resource for the Arabic language. This dataset was created purely for preservation purposes, and all credit for the content goes to them.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Datasets		Datasets
LICENSE		LICENSE
README.md		README.md
readJson.py		readJson.py
words.py		words.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Kalimmat Arabic Words Collection

Introduction

Script Overview

Dataset Files

Usage

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Kalimmat Arabic Words Collection

Introduction

Script Overview

Dataset Files

Usage

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages