This March, I noticed that the website kalimmat.com was returning internal server errors, which made certain pages and specific regex queries inaccessible. After trying to notify the maintainers and receiving no response, I wrote a script to salvage the dataset available on the site so that this valuable resource would be preserved. The website is currently operational again, and I have verified that it works using my script.
- words.py: This script scrapes Arabic words from kalimmat.com. Depending on the specified URL, it fetches words that either start or end with a particular Arabic letter. The words are then processed, cleaned, and saved as a JSON file.
- readjson.py: As the name suggests, this script reads the JSON files created by words.py. It retrieves words that end with a defined sequence of letters, but this behavior can be easily customized.
- startWithWords.json: Contains words from the common dictionary scraped under the "starts-with" URL.
- startsWithWordsSome.json: Features words from the official dictionary scraped under the "starts-with" URL.
- startWithWordsOLD.json: An older dataset, representing the initial attempt to salvage the words from the site.
- endsWithWords.json: Contains words from the common dictionary scraped under the "ends-with" URL.
- endsWithWordsSome.json: Contains words from the official dictionary scraped under the "ends-with" URL.
- endsWithWordsOLD.json: An older version of the dataset scraped under the "ends-with" URL.
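
The "processed, cleaned, and saved as a JSON file" step described for words.py could look roughly like the sketch below. This is not the actual code from words.py. It assumes that cleaning means trimming whitespace, dropping empty or non-Arabic entries, and removing duplicates, and that each JSON file holds a flat list of strings. The function names `clean_words` and `save_words` are hypothetical.

```python
import json
import re
from pathlib import Path

# Assumption: a valid entry consists only of characters in the range
# U+0621..U+0652 (basic Arabic letters plus the common diacritics).
ARABIC_WORD = re.compile(r"[\u0621-\u0652]+")


def clean_words(raw_words):
    """Trim whitespace, drop non-Arabic or empty entries, and remove
    duplicates while keeping the original order."""
    seen = set()
    cleaned = []
    for word in raw_words:
        word = word.strip()
        if word and ARABIC_WORD.fullmatch(word) and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned


def save_words(words, path):
    """Write the words as a JSON list. ensure_ascii=False keeps the
    Arabic text readable instead of escaping it as \\uXXXX."""
    text = json.dumps(words, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")
```

For example, `clean_words(["  كتاب ", "كتاب", "", "abc", "قلم"])` keeps only the two distinct Arabic words, which can then be passed to `save_words` together with a target file name.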
- Scraping Words: Run the words.py script after selecting the desired URL (starts-with or ends-with). The scraped words will be saved to the respective JSON file.
- Reading the JSON Files: Use the readjson.py script to read and process data from the JSON files, for instance to retrieve words that end with a specific sequence of letters.
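
As a rough illustration of the reading step, the sketch below loads a JSON file and filters words by their ending. It is not the actual code from readjson.py. It assumes each JSON file holds a flat list of strings, and the function names `load_words` and `words_ending_with` are hypothetical.

```python
import json


def load_words(path):
    """Load a list of words from a UTF-8 encoded JSON file.
    Assumption: the file contains a flat JSON array of strings."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def words_ending_with(words, suffix):
    """Return the words whose final letters match the given suffix."""
    return [word for word in words if word.endswith(suffix)]


# Small in-memory sample so the example runs without any data file.
sample = ["كلمات", "كتاب", "بنات"]
matches = words_ending_with(sample, "ات")
print(matches)
```

With the real data, the call would be something like `words_ending_with(load_words("endsWithWords.json"), "ات")`. Changing the filter to `word.startswith(prefix)` would adapt the same approach to the "starts-with" files.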
I would like to express my appreciation to the team behind kalimmat.com for providing such a valuable resource for the Arabic language. This dataset was created purely for preservation purposes, and all credit for the content goes to them.