NepaliKit is a Python library for natural language processing tasks in the Nepali language.
You can install NepaliKit using pip:
pip install nepalikit
Alternatively, you can clone the repository and install it manually:
git clone https://github.com/prabhashj07/nepalikit.git
cd nepalikit
pip install .
NepaliKit provides the following features:
- Tokenization: Tokenize Nepali text at the sentence, word, or character level, or with a SentencePiece tokenizer.
- Preprocessing: Clean and preprocess Nepali text data, including removing HTML tags, special characters, and other cleaning tasks.
- Stopword Management: Load and remove stopwords from Nepali text.
- Sentence Operations: Segment Nepali text into sentences based on punctuation marks.
- SentencePiece Model Training: Train custom SentencePiece models for Nepali text data (see the training sketch after this list).
- Utility Functions: Various utility functions for text processing and manipulation.
- Integration with PyTorch: Utilities for integrating with PyTorch for machine learning tasks.
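NepaliKit's own SentencePiece training entry point is not shown in this README, so here is a minimal training sketch using the underlying sentencepiece package directly; the corpus file nepali_corpus.txt, the vocabulary size, and the character coverage are illustrative assumptions:

import sentencepiece as spm

# Train a model on a plain-text Nepali corpus (one sentence per line);
# 'nepali_corpus.txt' is a hypothetical file, and vocab_size should fit your data.
spm.SentencePieceTrainer.train(
    input='nepali_corpus.txt',
    model_prefix='nepali_sp',
    vocab_size=8000,
    character_coverage=0.9995,  # near-total coverage suits Devanagari's character set
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file='nepali_sp.model')
print(sp.encode('नमस्ते, के छ खबर?', out_type=str))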
from nepalikit.tokenization import Tokenizer
text = "नमस्ते, के छ खबर? यो एउटा वाक्य हो।"
tokenizer = Tokenizer()
# Sentence tokenization
sentences = tokenizer.tokenize(text, level='sentence')
print(sentences)
# Word tokenization
words = tokenizer.tokenize(text, level='word')
print(words)
# Character tokenization
characters = tokenizer.tokenize(text, level='character')
print(characters)

from nepalikit.tokenization import SentencePieceTokenizer
text = "नमस्ते, के छ खबर?"
tokenizer = SentencePieceTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)
# Detokenization
original_text = tokenizer.detokenize(tokens)
print(original_text)

from nepalikit.preprocessing import TextProcessor
text = "<p>नमस्ते, के छ खबर?</p>"
processor = TextProcessor()
clean_text = processor.remove_html_tags(text)
clean_text = processor.remove_special_characters(clean_text)
print(clean_text)

from nepalikit.manage_stopwords import load_stopwords, remove_stopword
stopwords = load_stopwords('/path/to/stopword/directory')
remove_stopword('कुनै_स्टापवर्ड')

The TextProcessor class provides various methods for text preprocessing:
- remove_html_tags(text): Removes HTML tags from the text.
- remove_special_characters(text): Removes special characters, keeping only Devanagari characters and spaces.
- remove_extra_whitespace(text): Removes extra whitespace from the text.
- remove_stopwords(text): Removes stopwords from the text.
- normalize_text(text): Converts the text to lowercase.
- preprocess_text(text): Applies all preprocessing steps to the text.
- get_word_frequency(tokens): Returns the frequency of words in a list of tokens.
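A short sketch tying these methods together; it assumes TextProcessor takes no constructor arguments and that preprocess_text chains the steps listed above:

from nepalikit.preprocessing import TextProcessor

processor = TextProcessor()
raw_text = "<p>नमस्ते,   के छ खबर?</p>"

# Run the full preprocessing pipeline in one call
clean_text = processor.preprocess_text(raw_text)
print(clean_text)

# Word frequencies from a list of tokens
print(processor.get_word_frequency(clean_text.split()))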
The urls_emails class provides methods to remove or replace URLs and email addresses in the text:
- replace_urls_emails(text): Replaces URLs and email addresses with specified replacements.
- remove_urls_emails(text): Removes URLs and email addresses from the text.
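A minimal sketch of the urls_emails helpers; the import path and the zero-argument constructor are assumptions, so check the package layout for the exact module:

from nepalikit.preprocessing import urls_emails  # import path is an assumption

handler = urls_emails()
text = "सम्पर्क: [email protected] वा https://example.com हेर्नुहोस्।"

# Strip URLs and email addresses entirely
print(handler.remove_urls_emails(text))

# Or swap them for placeholder tokens
print(handler.replace_urls_emails(text))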
The sentence_operation folder contains various modules for sentence-level operations:
- extract_sentences.py: Extracts sentences from text.
- load_abbreviation.py: Loads abbreviations for text processing.
- normalize_text.py: Normalizes text.
- segment_sentences.py: Segments text into sentences.
- sentence_stats.py: Provides statistics about sentences.
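As an illustration of sentence segmentation, here is a hedged sketch; the function name segment_sentences and its signature are guesses inferred from the module name, not a documented API:

from nepalikit.sentence_operation.segment_sentences import segment_sentences  # hypothetical name

text = "नमस्ते, के छ खबर? यो एउटा वाक्य हो। यो अर्को वाक्य हो।"

# Split the text into sentences on Nepali sentence-final punctuation
for sentence in segment_sentences(text):
    print(sentence)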
The Tokenizer class provides the following methods:
- sentence_tokenize(text): Tokenizes input text into sentences based on the '।' character.
- word_tokenize(sentence, new_punctuation=None): Tokenizes an input sentence into words, handling specified punctuation.
- character_tokenize(word): Tokenizes an input word into characters.
- tokenize(text, level='word', new_punctuation=None): General tokenization method for sentence, word, or character level.
- sentence_detokenize(sentences): Detokenizes a list of sentences back into the original text.
- word_detokenize(words): Detokenizes a list of words back into the original sentence.
- character_detokenize(characters): Detokenizes a list of characters back into the original word.
- detokenize(tokens, level='word'): General detokenization method for sentence, word, or character level.
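These signatures allow round-trips like the sketch below; passing new_punctuation as a list is an assumption about the expected type:

from nepalikit.tokenization import Tokenizer

tokenizer = Tokenizer()
sentence = "नमस्ते, के छ खबर?"

# Tokenize into words, also treating '?' as punctuation to split on
words = tokenizer.word_tokenize(sentence, new_punctuation=['?'])
print(words)

# Reassemble the words into a sentence
print(tokenizer.word_detokenize(words))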
The SentencePieceTokenizer class provides the following methods:
- tokenize(text): Tokenizes text using the SentencePiece model.
- detokenize(tokens): Detokenizes a list of tokens back into text using the SentencePiece model.
The NepaliTextProcessor class in utils.py offers additional text processing capabilities:
- merge_text(tokens): Merges a list of tokens into a single string.
- split_text(text): Splits a text string into a list of tokens.
- count_words(text): Counts the number of words in a text string.
- count_words_in_paragraph(paragraph): Counts the total number of words in a paragraph.
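A brief sketch of these helpers; the zero-argument constructor is an assumption, since utils.py itself is not shown here:

from nepalikit.utils import NepaliTextProcessor

processor = NepaliTextProcessor()
text = "यो एउटा वाक्य हो"

tokens = processor.split_text(text)  # text string -> list of tokens
print(processor.merge_text(tokens))  # list of tokens -> single string
print(processor.count_words(text))   # number of words in the string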
This project is licensed under the MIT License.
- Prabhash Kumar Jha
- Email: [email protected]