SDAC

This is the repository for the paper "A Semantic-Driven Framework for Adaptive Compression in Document Retrieval".

0. Setup

To install the required packages, please run the following command:

pip install -r requirements.txt

If you need a PyTorch build that matches your CUDA version, install it before running the command above.

Input test files: Google Drive

1. Data

We have prepared the original retrieval documents of the Natural Questions and TriviaQA datasets in the inputs folder, sourced from here.

Dataset Format Instructions

To use your own dataset, follow the format below:

[
    {
        "question": "Question text",  // The question field, provide the specific question text
        "answers": [
            "Answer 1",  // The answer field, contains one or more possible answers
            "Answer 2"
        ],
        "ctxs": [  // Context array, each context contains the following fields
            {
                "id": "Context ID",  // Unique identifier for the context
                "title": "Context title",  // The title or name of the context
                "text": "Context content",  // The detailed text of the context, usually a paragraph or description
                "score": "Relevance score",  // (Optional) The relevance score of the context, higher values indicate stronger relevance
                "has_answer": true  // (Optional) Boolean value, indicates whether the context contains an answer
            },
            ...
        ]
    },
    ...
]
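As a rough sketch of how a custom dataset could be checked against this format before running compression, the snippet below validates the fields described above. The function name `validate_dataset` and the example record are illustrative assumptions, not part of this repository. Note that the `//` comments in the format description are annotations only and must not appear in an actual JSON file.

```python
import json

# "score" and "has_answer" are optional according to the format description.
REQUIRED_CTX_FIELDS = ("id", "title", "text")


def validate_dataset(records):
    """Return a list of problems found; an empty list means the data looks valid."""
    if not isinstance(records, list):
        return ["top-level value must be a list of records"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: must be an object")
            continue
        if not isinstance(rec.get("question"), str):
            problems.append(f"record {i}: 'question' must be a string")
        answers = rec.get("answers")
        if not isinstance(answers, list) or not answers:
            problems.append(f"record {i}: 'answers' must be a non-empty list")
        ctxs = rec.get("ctxs")
        if not isinstance(ctxs, list):
            problems.append(f"record {i}: 'ctxs' must be a list")
            continue
        for j, ctx in enumerate(ctxs):
            for field in REQUIRED_CTX_FIELDS:
                if not isinstance(ctx, dict) or field not in ctx:
                    problems.append(f"record {i}, ctx {j}: missing '{field}'")
    return problems


# A made-up record, used only to illustrate the structure.
example = [
    {
        "question": "Who wrote Hamlet?",
        "answers": ["William Shakespeare", "Shakespeare"],
        "ctxs": [
            {
                "id": "1",
                "title": "Hamlet",
                "text": "Hamlet is a tragedy written by William Shakespeare.",
                "has_answer": True,
            }
        ],
    }
]

# Round-trip through JSON to confirm the structure serializes cleanly.
serialized = json.dumps(example, ensure_ascii=False, indent=4)
print(validate_dataset(json.loads(serialized)))
```

Writing `serialized` to a file then gives something that can be passed as `--input_file`, assuming the scripts read the format exactly as documented here.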

2. Model

Keyword compression uses gpt2-xl by default.

Key statement compression uses e5 by default. Either default can be replaced as needed.

3. Keyword-based compression quick start

To run keyword-based compression, use the following command:

python keyword_compression.py \
    --input_file  $INPUT_FILE \
    --output_file $OUTPUT_FILE \
    --model_name  $MODEL_NAME \
    --compression_ratio 0.5

4. Key statement-based compression quick start

To run key statement-based compression, use the following command:

python key_statement_compression.py \
    --input_file  $INPUT_FILE \
    --output_file $OUTPUT_FILE \
    --model_name  $MODEL_NAME \
    --threshold 0.5
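If you prefer to launch the two scripts from Python rather than from a shell, the argument lists can be assembled as sketched below. This is only an illustration: the helper names `build_command` and `run`, the file paths, and the `<your-e5-model>` placeholder are assumptions, and only the script names and flags are taken from the commands above.

```python
import subprocess
import sys


def build_command(script, input_file, output_file, model_name, **options):
    """Assemble the argument list for one of the compression scripts."""
    cmd = [
        sys.executable, script,
        "--input_file", input_file,
        "--output_file", output_file,
        "--model_name", model_name,
    ]
    # Extra options such as compression_ratio or threshold become "--name value".
    for flag, value in options.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


def run(cmd):
    """Execute a command and raise an error if the script exits with a failure."""
    subprocess.run(cmd, check=True)


# Hypothetical paths; replace them with your own files.
keyword_cmd = build_command(
    "keyword_compression.py",
    "inputs/my_dataset.json",
    "outputs/keyword_compressed.json",
    "gpt2-xl",
    compression_ratio=0.5,
)

statement_cmd = build_command(
    "key_statement_compression.py",
    "inputs/my_dataset.json",
    "outputs/statement_compressed.json",
    "<your-e5-model>",  # placeholder: the exact e5 identifier is not given here
    threshold=0.5,
)

print(" ".join(keyword_cmd[1:]))
print(" ".join(statement_cmd[1:]))
```

Calling `run(keyword_cmd)` or `run(statement_cmd)` from the repository root would then start the corresponding script.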
