Library

embedprepro is a command-line tool for text analysis tasks: embedding, clustering, dimensionality reduction, and visualization. Each task is available as a separate command, and the same functionality can be used from Python.

Installation

You can install the package directly using pip:

pip install -U embedprepro

If you prefer to install the package from the source, clone the repository and install it using pip:

git clone https://github.com/Elma-dev/embedprepro-lib.git
cd embedprepro-lib
pip install .

Usage

Command-line Interface

The embedprepro command provides a command-line interface for performing text analysis tasks.

embedprepro [OPTIONS] COMMAND [ARGS]...

The main commands available are:

clustering
embedding
reduction
visualization

To get help on any command, use the --help option:


embedprepro COMMAND --help

Embedding

Generate embeddings for text data using a specified model and embedder type.

embedprepro embedding <input_file> <output_file> [options]

input_file: the file containing your data (for example, example.csv)
output_file: the file where the result is saved (for example, result.npy)
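As an illustration of the expected input, the following sketch writes a small CSV file with a single text column and reads it back. It uses only Python's csv module. The helper names write_example_csv and read_text_column are hypothetical and not part of embedprepro, and the column name text simply matches the default of the --col option.

```python
import csv


def write_example_csv(path, texts, column="text"):
    """Write one text per row under a single header column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])  # header row, e.g. "text"
        for text in texts:
            writer.writerow([text])  # csv handles quoting of commas


def read_text_column(path, column="text"):
    """Return the values of one named column as a list of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row[column] for row in csv.DictReader(handle)]
```

Calling write_example_csv("example.csv", ["first document", "second document"]) produces a file that can then be passed as input_file.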

Options

Option	Description	Default
--et	Embedder type	sentence_transformer
--mn	Model name	all-MiniLM-L6-v2
--col	Column name in the input CSV file containing text	text
--bs	Batch size	32
--p	Number of parallel processes	1

Available embedder types

Embedder	Available
sentence_transformer	✅

embedprepro embedding input.csv output_embeddings.npy --et sentence_transformer --mn all-MiniLM-L6-v2 --col text --bs 32 --p 2
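If you want to launch the same command from a Python script, one possible approach is to assemble the argument list and hand it to subprocess. This is a minimal sketch: build_embedding_command and run_embedding are hypothetical helper names, only the flags documented in the table above are used, and the embedprepro executable is assumed to be on your PATH.

```python
import subprocess


def build_embedding_command(input_file, output_file, et="sentence_transformer",
                            mn="all-MiniLM-L6-v2", col="text", bs=32, p=1):
    """Build the argument list for `embedprepro embedding` (documented flags only)."""
    return [
        "embedprepro", "embedding", input_file, output_file,
        "--et", et,
        "--mn", mn,
        "--col", col,
        "--bs", str(bs),
        "--p", str(p),
    ]


def run_embedding(input_file, output_file, **options):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    command = build_embedding_command(input_file, output_file, **options)
    subprocess.run(command, check=True)
    return command
```

Passing a list instead of a single string avoids shell quoting problems when file names contain spaces.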

Dimensionality Reduction

Reduce the dimensionality of text embeddings.

embedprepro reduction <input_file> <output_file> [options]

If the input is text data, reduction embeds it first and then reduces the embeddings.
You can also pass precomputed embeddings (for example, embedding.npy) as input_file by adding the --with_embedding 1 option.

Options

Option	Description	Default
--nc	Number of components to reduce to	2
--ng	Number of neighbors	15
--md	Minimum distance	0.5
--metric	Distance metric	euclidean
--et	Embedder type	sentence_transformer
--mn	Model name	all-MiniLM-L6-v2
--col	Column name in the input CSV file containing text	text
--algorithm	Dimensionality reduction algorithm	PCA
--with_embedding	Use precomputed embeddings	False (0)

Available reduction algorithms

Algorithm	Available
PCA	✅
UMAP	✅

embedprepro reduction input.csv dimreduction.npy --nc 2 --ng 15 --md 0.5 --metric euclidean --et sentence_transformer --mn all-MiniLM-L6-v2 --col text --algorithm PCA --with_embedding False
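To make the --nc option more concrete, here is a dependency-free sketch of what reducing vectors to a given number of components with PCA means. It is an illustration only and not how embedprepro implements its PCA option (the source does not show that). The name pca_project and its parameters are hypothetical, and the sketch assumes n_components is no larger than the number of input dimensions.

```python
import math
import random


def pca_project(rows, n_components=2, iterations=200, seed=0):
    """Project rows onto their top principal components (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            # Remove directions already captured by earlier components.
            for comp in components:
                dot = sum(a * b for a, b in zip(w, comp))
                w = [a - dot * b for a, b in zip(w, comp)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm == 0.0:
                break
            v = [a / norm for a in w]
        components.append(v)
    # Coordinates of each centered row along each component.
    return [[sum(a * b for a, b in zip(c, comp)) for comp in components]
            for c in centered]
```

Each returned row has n_components values, ordered so that the first coordinate carries the most variance.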

Clustering

Perform agglomerative clustering on text data or embeddings.

embedprepro clustering <input_file> <output_file> [options]

If the input is text data, clustering embeds it first and then clusters the embeddings.

Options

Option	Description	Default
--et	Embedder type	sentence_transformer
--mn	Model name	all-MiniLM-L6-v2
--col	Column name in the input CSV file containing text	text
--bs	Batch size	32
--p	Number of parallel processes	1
--threshold	Clustering threshold	0.5
--min_cluster_size	Minimum cluster size	1
--show_progress_bar	Show progress bar	True (1)
--with_embedding	Use precomputed embeddings	False (0)

Example

embedprepro clustering input.csv output_clusters.npy --et sentence_transformer --mn all-MiniLM-L6-v2 --col text --bs 32 --p 2 --threshold 0.5 --min_cluster_size 1 --show_progress_bar True --with_embedding False
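The roles of a threshold and a minimum cluster size in agglomerative clustering can be sketched in a few lines of plain Python. This is a conceptual illustration under stated assumptions, not embedprepro's implementation: it assumes the threshold is a maximum Euclidean distance and uses average linkage, neither of which is confirmed by the source, and agglomerative_sketch is a hypothetical name.

```python
import math


def agglomerative_sketch(points, threshold, min_cluster_size=1):
    """Merge the closest clusters until the closest pair is farther than threshold."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def linkage(a, b):
        # Average pairwise distance between two clusters (average linkage).
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # no remaining pair is close enough to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    # Drop clusters that are too small.
    return [sorted(c) for c in clusters if len(c) >= min_cluster_size]
```

With this reading, a larger threshold yields fewer, larger clusters, and raising the minimum cluster size filters out isolated points.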

Visualization

Visualize the results of dimensionality reduction and clustering.

embedprepro visualization <clusters_data> <reduced_data> [options]

The visualization command plots your clustered and reduced data in 2D or 3D.
For a 3D plot, add the --zi option with the index of the third dimension.

Options

Option	Description	Default
--xi	Index of the first dimension	0
--yi	Index of the second dimension	1
--zi	Index of the third dimension	-1
--title	Title of the plot	Clusters
--xlabel	Label of the x-axis	X
--ylabel	Label of the y-axis	Y
--zlabel	Label of the z-axis	Z
--save	Save path for the plot	None

Example

embedprepro visualization dimreduction.npy output_clusters.npy --xi 0 --yi 1 --zi 2 --title "Clusters" --xlabel "X" --ylabel "Y" --zlabel "Z"
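One way to read the --xi, --yi, and --zi options is as column indices into the reduced data. The sketch below shows that selection step only, without plotting. It assumes that a negative --zi (the default is -1) means "no third axis"; this is inferred from the defaults table rather than confirmed by the source, and select_axes is a hypothetical helper name.

```python
def select_axes(reduced, xi=0, yi=1, zi=-1):
    """Return one list of coordinates per plotted axis (2 or 3 lists)."""
    # Assumption: a negative zi disables the third axis, giving a 2D plot.
    indices = [xi, yi] if zi < 0 else [xi, yi, zi]
    return tuple([row[i] for row in reduced] for i in indices)
```

For example, select_axes(data, zi=2) returns three coordinate lists, which corresponds to requesting a 3D plot.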

Python Project

After installation, you can use embedprepro inside your Python project like this:

from preprocessing import *

From the preprocessing package you can import:

embedding_service
agglomerative_clustering
dimensionality_reduction
visualization_service
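The source does not document the functions or classes inside these names, so a reasonable first step is to inspect what is actually installed. The following sketch lists the submodules of a package without importing them. list_submodules is a hypothetical helper, and it simply returns an empty list when the package is not installed.

```python
import importlib.util
import pkgutil


def list_submodules(package_name):
    """Return the sorted submodule names of an installed package, or []."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.submodule_search_locations is None:
        return []  # not installed, or a plain module rather than a package
    return sorted(
        info.name for info in pkgutil.iter_modules(spec.submodule_search_locations)
    )
```

Running list_submodules("preprocessing") after installation should show which of the names above are available, and Python's built-in help() can then be used on each one.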
