
feat: Implement CI Pipeline for Dictionary Data Validation #8

@itsbilolbek

Description

Goal

Introduce a Continuous Integration (CI) pipeline that automatically validates the quality, formatting, and structural consistency of the dictionary entries (within the translations.toml) upon every code change. This ensures data integrity and prevents common data entry errors.

Implementation Details (Tasks)

The CI pipeline must include an automated script (e.g., a Python or JavaScript script run by GitHub Actions) to perform the following checks:

1. Data Structure and Completeness Checks

  • Field Presence: Check if all expected fields (keys) are present in every dictionary entry. The fields every entry must have are as follows: en, uz, part_of_speech, description, pronunciation_uz, similar, status.
  • Field Completion: Check if all fields that require translation are filled out.
  • Unique Keys: Verify that no two dictionary entries (keys) are identical.
  • Conditional Completion: If the status field is set to "Needs translation", the corresponding translation fields are allowed to be empty.
  • Multiple Choice Fields: Validate that values for multiple-choice fields (e.g., part_of_speech, status) are selected from an approved, predefined list of values. part_of_speech can only have these values: "noun", "verb", "adjective", "adverb", "interjection". status can only have these values: "Needs translation", "Pending review", "Obsolete", "Approved", "Do not translate".
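
A minimal sketch of the structural checks above in Python. It assumes each dictionary entry has already been parsed into a plain dict of fields, and that `uz` and `pronunciation_uz` are the fields that "require translation" (the issue does not spell this out, so `TRANSLATION_FIELDS` is a guess). The function name `check_structure` is hypothetical. Duplicate keys are not checked here: TOML forbids defining the same key or table twice, so a compliant parser should already reject such a file when loading it.

```python
REQUIRED_FIELDS = (
    "en", "uz", "part_of_speech", "description",
    "pronunciation_uz", "similar", "status",
)

ALLOWED_VALUES = {
    "part_of_speech": {"noun", "verb", "adjective", "adverb", "interjection"},
    "status": {
        "Needs translation", "Pending review", "Obsolete",
        "Approved", "Do not translate",
    },
}

# Assumption: the issue does not list which fields "require translation".
TRANSLATION_FIELDS = ("uz", "pronunciation_uz")


def check_structure(key, entry):
    """Return a list of error messages for one dictionary entry."""
    errors = []

    # Field presence: every expected field must exist.
    for field in REQUIRED_FIELDS:
        if field not in entry:
            errors.append(f"{key}: missing field '{field}'")

    # Multiple-choice fields: the value must come from the approved list.
    for field, allowed in ALLOWED_VALUES.items():
        if field in entry:
            value = entry[field]
            if not isinstance(value, str) or value not in allowed:
                errors.append(f"{key}: invalid {field} value {value!r}")

    # Field completion, relaxed when the entry still needs translation.
    if entry.get("status") != "Needs translation":
        for field in TRANSLATION_FIELDS:
            if field in entry and not entry[field]:
                errors.append(
                    f"{key}: '{field}' is empty but status is not 'Needs translation'"
                )

    return errors
```

An entry with status "Needs translation" and an empty `uz` passes, while the same entry with status "Approved" is reported.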

2. Content and Linguistic Quality Checks

  • Punctuation/Typographical Use: Verify the correct usage of the apostrophe-like marks used in Uzbek, specifically the tutuq belgisi (ʼ) and the okina (ʻ), and flag incorrect symbols such as the straight apostrophe ('), curly quotes (‘, ’), or the grave accent (`) in places where the correct mark is required.
  • Case Rule: All text in the en and uz fields must be in lowercase, unless the word is an abbreviation or a proper name.
  • Leading/Trailing Whitespace: Check that there are no unnecessary blank spaces at the beginning or end of any translation string.
  • Empty String Check: Validate that no required field contains an empty string ("").
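
The content checks above could look roughly like the following Python sketch. Several points are assumptions rather than requirements from this issue: the apostrophe check is applied only to the `uz` field, abbreviations are recognized as all-uppercase words, proper names come from a caller-supplied allowlist (`case_exceptions`), and the default set of always-required fields (`non_empty`) is a placeholder. The function name `check_content` is hypothetical.

```python
# Look-alike characters that should be replaced by ʻ (U+02BB) or ʼ (U+02BC).
WRONG_APOSTROPHES = {
    "'": "straight apostrophe",
    "\u2018": "left curly quote",
    "\u2019": "right curly quote",
    "`": "grave accent",
}


def check_content(key, entry, case_exceptions=frozenset(),
                  non_empty=("en", "part_of_speech", "status")):
    """Return a list of error messages for one dictionary entry."""
    errors = []

    for field, value in entry.items():
        if not isinstance(value, str):
            continue  # only string fields are checked here
        # Leading/trailing whitespace.
        if value != value.strip():
            errors.append(f"{key}: '{field}' has leading or trailing whitespace")
        # Empty string in a field assumed to be always required.
        if value == "" and field in non_empty:
            errors.append(f"{key}: '{field}' is an empty string")

    # Case rule: lowercase unless the word is an abbreviation or allowlisted name.
    for field in ("en", "uz"):
        value = entry.get(field)
        if not isinstance(value, str):
            continue
        for word in value.split():
            if word in case_exceptions or word.isupper():
                continue
            if word != word.lower():
                errors.append(f"{key}: '{field}' contains uppercase in {word!r}")

    # Apostrophes: assumption, only the Uzbek text is checked.
    uz = entry.get("uz")
    if isinstance(uz, str):
        for char, name in WRONG_APOSTROPHES.items():
            if char in uz:
                errors.append(f"{key}: 'uz' contains a {name}; use ʻ or ʼ")

    return errors
```

Proper names cannot be detected reliably from the text alone, which is why this sketch relies on an explicit allowlist instead of guessing.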

3. Optional Formatting and Utility

  • Entry Sorting (Optional): Check if the dictionary entries (keys) are sorted alphabetically. This is optional but highly recommended for maintainability.
  • Sorting Script: Create an accompanying script (e.g., scripts/sort_dictionary.py) that can be run locally or within the CI to automatically sort the dictionary entries based on their primary key, allowing maintainers to easily fix sorting issues.
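
One possible shape for the sorting check and the core of the sorting script, as a Python sketch. It assumes the entries are available as a dict whose iteration order matches the file order, and that "alphabetically" means a case-insensitive comparison of the keys; both are assumptions, and the helper names are hypothetical. Writing the sorted result back to translations.toml is left out: the standard library can read TOML but not write it, so that step would need a separate TOML writer and would likely drop any comments in the file.

```python
def sort_key(key):
    """Case-insensitive ordering for entry keys (an assumption)."""
    return key.casefold()


def first_unsorted(keys):
    """Return the first key that sorts before its predecessor, or None."""
    keys = list(keys)
    for previous, current in zip(keys, keys[1:]):
        if sort_key(current) < sort_key(previous):
            return current
    return None


def sorted_entries(entries):
    """Return a new dict with the same entries in sorted key order."""
    return {key: entries[key] for key in sorted(entries, key=sort_key)}
```

In CI, `first_unsorted(entries)` returning a key would be reported as a failure or a warning, depending on whether sorting stays optional.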

Acceptance Criteria

  • A new CI job is added to the pipeline.
  • This job is triggered on pushes to the main branch and on pull requests.
  • The script executes all specified checks.
  • If any check fails (e.g., a field is missing, or an incorrect apostrophe is used), the CI pipeline fails, preventing changes that break data integrity from being merged.
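
To make the CI job fail when a check fails, the validation script only needs to exit with a non-zero status. The Python sketch below shows one way to wire that up. It assumes each check is a function taking `(key, entry)` and returning a list of error messages, and that the top level of translations.toml maps entry keys to tables of fields; neither is confirmed by this issue. `load_entries`, `run_checks`, and `main` are hypothetical names.

```python
import sys


def load_entries(path):
    """Parse the TOML file into a dict (tomllib needs Python 3.11+)."""
    import tomllib
    with open(path, "rb") as handle:
        return tomllib.load(handle)


def run_checks(entries, checks):
    """Apply every check to every entry and collect all error messages."""
    errors = []
    for key, entry in entries.items():
        for check in checks:
            errors.extend(check(key, entry))
    return errors


def main(argv, checks):
    """Return 0 if all checks pass, 1 otherwise."""
    path = argv[1] if len(argv) > 1 else "translations.toml"
    errors = run_checks(load_entries(path), checks)
    for message in errors:
        print(message, file=sys.stderr)
    return 1 if errors else 0
```

The script would end with `sys.exit(main(sys.argv, checks))`, where `checks` is the list of check functions. Collecting all errors before exiting lets a contributor fix everything in one pass instead of rerunning the job after each fix.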

Labels: enhancement
