SDAC

This is the repository for the paper "A Semantic-Driven Framework for Adaptive Compression in Document Retrieval".

0. Setup

To install the required packages, please run the following command:

pip install -r requirements.txt

If you need a PyTorch build that matches your CUDA version, install it before running the command above.

Input test files: Google Drive

1. Data

We have prepared the original retrieval documents of the Natural Questions and TriviaQA datasets in the inputs folder, sourced from here.

Dataset Format Instructions

If you want to use your own dataset, please follow the format below:

[
    {
        "question": "Question text",  // The question field, provide the specific question text
        "answers": [
            "Answer 1",  // The answer field, contains one or more possible answers
            "Answer 2"
        ],
        "ctxs": [  // Context array, each context contains the following fields
            {
                "id": "Context ID",  // Unique identifier for the context
                "title": "Context title",  // The title or name of the context
                "text": "Context content",  // The detailed text of the context, usually a paragraph or description
                "score": "Relevance score",  // (Optional) The relevance score of the context, higher values indicate stronger relevance
                "has_answer": true  // (Optional) Boolean value, indicates whether the context contains an answer
            },
            ...
        ]
    },
    ...
]
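
The sketch below checks that a dataset matches the format above before it is passed to the compression scripts. It is a minimal illustration, not part of the repository: the function name validate_dataset and the choice to treat "id", "title", and "text" as required are assumptions drawn from the field descriptions above. Note that the // comments in the format listing are explanatory only; standard JSON does not allow comments, so a real file must omit them.

```python
import json

# Assumption: these context fields are required; "score" and "has_answer" are optional.
REQUIRED_CTX_FIELDS = ("id", "title", "text")


def validate_dataset(records):
    """Return a list of problems; an empty list means the data fits the format."""
    if not isinstance(records, list):
        return ["top level must be a JSON array"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: must be an object")
            continue
        if not isinstance(rec.get("question"), str):
            problems.append(f"record {i}: 'question' must be a string")
        answers = rec.get("answers")
        if not isinstance(answers, list) or not answers:
            problems.append(f"record {i}: 'answers' must be a non-empty list")
        ctxs = rec.get("ctxs")
        if not isinstance(ctxs, list):
            problems.append(f"record {i}: 'ctxs' must be a list")
            continue
        for j, ctx in enumerate(ctxs):
            for field in REQUIRED_CTX_FIELDS:
                if not isinstance(ctx, dict) or field not in ctx:
                    problems.append(f"record {i}, ctx {j}: missing '{field}'")
    return problems


example = [
    {
        "question": "What is the capital of France?",
        "answers": ["Paris"],
        "ctxs": [
            {
                "id": "1",
                "title": "France",
                "text": "Paris is the capital of France.",
                "has_answer": True,
            }
        ],
    }
]

# The string you would save as your dataset file (no // comments).
serialized = json.dumps(example, indent=4)
print(validate_dataset(json.loads(serialized)))
```

Running the check on a file before compression surfaces missing fields early, rather than as a KeyError partway through a long run.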

2. Model

Keyword compression uses gpt2-xl by default.

Key statement compression uses e5 by default; it can be replaced as needed.

3. Keyword-based compression quick start

To run keyword-based compression, use the following command:

python keyword_compression.py \
    --input_file  $INPUT_FILE \
    --output_file $OUTPUT_FILE \
    --model_name  $MODEL_NAME \
    --compression_ratio 0.5
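
To give a feel for what a ratio-based keyword filter does, here is a toy sketch. It is not the repository's implementation: it assumes that --compression_ratio is the fraction of tokens retained and that each token already has an importance score (the scores below are made up; in practice they would come from the language model). The function name keep_top_fraction is hypothetical.

```python
import math


def keep_top_fraction(tokens, scores, compression_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    Assumption: compression_ratio is the fraction of tokens to retain.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 <= compression_ratio <= 1.0:
        raise ValueError("compression_ratio must be between 0 and 1")
    n_keep = math.ceil(len(tokens) * compression_ratio)
    # Rank token positions by score, highest first, then restore text order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_positions = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept_positions]


tokens = ["the", "capital", "of", "France", "is", "Paris"]
scores = [0.1, 0.8, 0.05, 0.9, 0.2, 0.95]  # made-up importance scores
print(keep_top_fraction(tokens, scores, 0.5))  # ['capital', 'France', 'Paris']
```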

4. Key statement-based compression quick start

To run key statement-based compression, use the following command:

python key_statement_compression.py \
    --input_file  $INPUT_FILE \
    --output_file $OUTPUT_FILE \
    --model_name  $MODEL_NAME \
    --threshold 0.5
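
The following toy sketch illustrates one way a similarity threshold can select key statements. It is an assumption about how --threshold might behave, not the repository's code: each sentence is kept when the cosine similarity between its embedding and the question embedding is at least the threshold. The two-dimensional vectors are invented stand-ins for embeddings that a model such as e5 would produce, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_statements(question_vec, sentences, sentence_vecs, threshold):
    """Keep sentences whose similarity to the question is at least `threshold`."""
    return [
        sentence
        for sentence, vec in zip(sentences, sentence_vecs)
        if cosine(question_vec, vec) >= threshold
    ]


question_vec = [1.0, 0.0]  # invented stand-in for the question embedding
sentences = [
    "Paris is the capital of France.",
    "France borders Spain.",
    "Bananas are yellow.",
]
sentence_vecs = [[0.9, 0.1], [0.6, 0.6], [0.0, 1.0]]  # invented embeddings

for kept in select_statements(question_vec, sentences, sentence_vecs, 0.5):
    print(kept)
```

Under this reading, raising the threshold keeps fewer statements and lowering it keeps more.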

