This is the repository for the paper "A Semantic-Driven Framework for Adaptive Compression in Document Retrieval".
To install the required packages, run the following command:
pip install -r requirements.txt
To use a PyTorch build that matches your CUDA version, install that build before running the command above.
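After installing, a quick check like the one below can confirm which PyTorch build is active and whether it sees a GPU. This is a minimal sketch, not part of the repository; the helper name `describe_torch_build` is hypothetical, and it only uses `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
def describe_torch_build():
    """Return a one-line summary of the installed PyTorch build (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    # torch.version.cuda is None for CPU-only builds
    return (
        f"torch {torch.__version__}, built for CUDA {torch.version.cuda}, "
        f"GPU available: {torch.cuda.is_available()}"
    )


print(describe_torch_build())
```

If the reported CUDA version is `None` or does not match your driver, reinstall the appropriate PyTorch build before installing the remaining requirements.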
Input test files: Google Drive
We have prepared the original retrieval documents of the Natural Questions and TriviaQA datasets in the inputs folder, sourced from here.
To use your own dataset, format it as follows:
[
  {
    "question": "Question text",  // The question field, provide the specific question text
    "answers": [
      "Answer 1",  // The answer field, contains one or more possible answers
      "Answer 2"
    ],
    "ctxs": [  // Context array, each context contains the following fields
      {
        "id": "Context ID",  // Unique identifier for the context
        "title": "Context title",  // The title or name of the context
        "text": "Context content",  // The detailed text of the context, usually a paragraph or description
        "score": "Relevance score",  // (Optional) The relevance score of the context, higher values indicate stronger relevance
        "has_answer": true  // (Optional) Boolean value, indicates whether the context contains an answer
      },
      ...
    ]
  },
  ...
]

To run keyword-based compression, use the following command:
python keyword_compression.py \
    --input_file $INPUT_FILE \
    --output_file $OUTPUT_FILE \
    --model_name $MODEL_NAME \
    --compression_ratio 0.5

To run key statement-based compression, use the following command:
python key_statement_compression.py \
    --input_file $INPUT_FILE \
    --output_file $OUTPUT_FILE \
    --model_name $MODEL_NAME \
    --threshold 0.5
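
Before running either command on a custom dataset, it can help to check that the file follows the input format described above. The sketch below is not part of the repository; the function names `validate_dataset` and `validate_file` are hypothetical, and it assumes that `id`, `title`, and `text` are required in every context because only `score` and `has_answer` are marked optional. Note that the `//` comments in the format template are explanatory only: Python's `json.load` rejects them, so a real input file must leave them out.

```python
import json

# Assumption: every context needs these; "score" and "has_answer" are optional.
REQUIRED_CTX_FIELDS = ("id", "title", "text")


def validate_dataset(records):
    """Return a list of problems found; an empty list means the data looks fine."""
    if not isinstance(records, list):
        return ["top level must be a list of records"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: must be an object")
            continue
        if not isinstance(rec.get("question"), str):
            problems.append(f"record {i}: 'question' must be a string")
        answers = rec.get("answers")
        if not isinstance(answers, list) or not answers:
            problems.append(f"record {i}: 'answers' must be a non-empty list")
        ctxs = rec.get("ctxs")
        if not isinstance(ctxs, list):
            problems.append(f"record {i}: 'ctxs' must be a list")
            continue
        for j, ctx in enumerate(ctxs):
            if not isinstance(ctx, dict):
                problems.append(f"record {i}, ctx {j}: must be an object")
                continue
            for field in REQUIRED_CTX_FIELDS:
                if field not in ctx:
                    problems.append(f"record {i}, ctx {j}: missing '{field}'")
    return problems


def validate_file(path):
    """Load a JSON file and validate it against the expected structure."""
    with open(path, encoding="utf-8") as f:
        return validate_dataset(json.load(f))
```

Calling `validate_file` on the path you intend to pass as `$INPUT_FILE` and printing the returned list shows any structural problems before a compression run starts.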