This project explores state-of-the-art techniques for optimizing input to Large Language Models (LLMs), focusing on reducing token costs while maintaining or improving performance. Techniques include stop-word removal, Named Entity Recognition (NER), keyword extraction, and TF-IDF.
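As a minimal illustration of the stop-word removal idea, the sketch below drops common filler words from a prompt before it is sent to an LLM. The stop-word list and whitespace tokenization here are toy stand-ins for illustration only, not the project's actual resources (which would typically come from spaCy or NLTK):

```python
# Illustrative sketch: compress a prompt by dropping stop words.
# STOP_WORDS is a tiny hand-picked set for demonstration purposes.

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "that"}

def compress_prompt(text: str) -> str:
    """Remove stop words, keeping original word order and casing."""
    kept = [w for w in text.split() if w.lower() not in STOP_WORDS]
    return " ".join(kept)

prompt = "Summarize the main findings of the report that is attached in an email"
compressed = compress_prompt(prompt)
print(compressed)
print(len(prompt.split()), "->", len(compressed.split()))  # 13 -> 6 words
```

The key trade-off is that aggressive removal can change meaning (e.g., negations), so real pipelines evaluate output quality alongside token savings.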
- `config/`: Configuration files and scripts.
- `data/`: Datasets and data-related documentation.
- `scripts/`: Scripts to run experiments and pipelines.
- `src/`: Source code for preprocessing, optimization techniques, the LLM interface, and evaluation.
  - `optimization_techniques/`: Modules for each input optimization method (stop words, NER, TF-IDF, etc.).
- `outputs/`: Output files generated by scripts, such as test results.
- `requirements.txt`: Python dependencies.
- `.env` / `.env.example`: Environment variable configuration (e.g., API keys).
- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

  If you run into compatibility problems with gensim, force an upgrade of its dependencies:

  ```
  pip install --force-reinstall --upgrade scipy gensim
  ```

- Install the required spaCy English model:

  ```
  python -m spacy download en_core_web_sm
  ```

- Configure your environment variables in `.env`.
- Run experiments using the scripts in the `scripts/` folder.
- Test outputs are also saved to `outputs/test_optimization_techniques_output.txt` for easier review.
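To make the TF-IDF technique concrete, here is a dependency-free sketch of how tokens in a document can be scored so that only the highest-weighted ones are kept. The toy corpus and the smoothed IDF formula are illustrative assumptions, not the project's actual implementation:

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the project's real datasets.
docs = [
    "transformer models process tokens in parallel",
    "token costs grow with prompt length",
    "stop words add tokens but little meaning",
]

def tfidf_scores(doc: str, corpus: list) -> dict:
    """Score each token in `doc` by term frequency x inverse document frequency."""
    tokens = doc.split()
    tf = Counter(tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d.split())
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        scores[term] = (count / len(tokens)) * idf
    return scores

scores = tfidf_scores(docs[1], docs)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)  # the three highest-scoring tokens of the second document
```

In a compression pipeline, tokens below a score threshold would be dropped from the prompt; libraries such as scikit-learn or gensim provide production-grade TF-IDF implementations.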