Feature Type
New functionality
Problem Statement
The literature discovery pipeline will benefit from an independent citation tool capable of generating citations for a given chunk of text. Currently, the only citation generation occurs within STORM, and it suffers from inaccuracies, improper formatting, hallucinations, and reliance on unreliable prompting. A dedicated citation tool can ensure a structured, reliable format and deliver more accurate outputs.
Proposed Solution
The pipeline requires a dedicated citation tool. The tool would take a chunk of text and its corresponding index of documents as input and return the cited text along with an annotated list of references. Because citations may be generated asynchronously for multiple chunks of text, a collator and corrector will also be needed to merge and renumber the reference indices across chunks (see the collation sketch after the signature below). The tool can be integrated with agents that require citations, such as the literature review and summarizer agents, and it will operate independently of the LLM that generated the text, which helps prevent incorrectly generated citations.
from typing import List, Tuple

@tool
def citation_tool(text: str, index: List[str]) -> Tuple[str, List[str]]:
    """Cite a chunk of text against a document index; return the cited text and its citations."""
    # Attribution logic: match spans of `text` to entries in `index`,
    # insert inline citation markers, and collect the referenced entries.
    return cited_text, citations
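Because chunks are cited independently, each chunk's citations are numbered locally. Below is a minimal sketch of the collator/corrector, assuming inline markers of the form [n]; all names here are illustrative, not a fixed design:

import re
from typing import Dict, List, Tuple

def collate_citations(chunks: List[Tuple[str, List[str]]]) -> Tuple[List[str], List[str]]:
    """Merge per-chunk (cited_text, citations) pairs into one global reference list."""
    global_refs: List[str] = []
    ref_to_index: Dict[str, int] = {}
    collated_texts: List[str] = []
    for cited_text, citations in chunks:
        # Map this chunk's local reference numbers to global ones,
        # deduplicating references that appear in multiple chunks.
        local_to_global: Dict[int, int] = {}
        for local_idx, ref in enumerate(citations, start=1):
            if ref not in ref_to_index:
                ref_to_index[ref] = len(global_refs) + 1
                global_refs.append(ref)
            local_to_global[local_idx] = ref_to_index[ref]
        # Rewrite all local [n] markers in a single pass to avoid collisions.
        collated_texts.append(re.sub(
            r"\[(\d+)\]",
            lambda m: f"[{local_to_global.get(int(m.group(1)), int(m.group(1)))}]",
            cited_text,
        ))
    return collated_texts, global_refs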
Alternative Solutions
The methods above can serve as a temporary placeholder until we integrate the distillation-based attribution workflow.
User Benefits
Users will be able to view the sources used to generate the text, along with corresponding links to those sources if available.
Implementation Ideas
The tool could be implemented using the following methods:
- ContextCite: employs a surrogate model to approximate how a language model's response changes when different parts of the context are included or excluded. It is available as a tool that returns attribution scores indicating which parts of the context the output relies on (see the first sketch after this list).
- PyTorch's Captum library: offers perturbation-based and gradient-based attribution algorithms that reveal how input prompts influence generated content (see the second sketch after this list).
- Prompt-based: prompt the model to generate citations from the document index and use Pydantic to structure the outputs (see the third sketch after this list). However, this approach is likely to be error-prone and should not be relied on.
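A minimal sketch of the ContextCite option. The API follows the context_cite package's published examples, and the model name and document_text variable are placeholders; verify against the current release:

from context_cite import ContextCiter

cc = ContextCiter.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # placeholder model choice
    context=document_text,                  # assumed: text of the indexed sources
    query="Summarize the key findings.",    # the generation prompt
)
print(cc.response)                                      # the generated answer
print(cc.get_attributions(as_dataframe=True, top_k=5))  # which context spans drove it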
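A minimal sketch of the Captum option using its LLM attribution wrappers (Captum >= 0.7). Here model, tokenizer, prompt, and generated_text are assumed to be defined, and the calls follow Captum's LLM-attribution tutorial:

from captum.attr import FeatureAblation, LLMAttribution, TextTokenInput

fa = FeatureAblation(model)               # perturbation-based attribution method
llm_attr = LLMAttribution(fa, tokenizer)  # adapts it to text generation
inp = TextTokenInput(prompt, tokenizer)   # the prompt whose influence we measure
result = llm_attr.attribute(inp, target=generated_text)
print(result.seq_attr)                    # per-prompt-token influence on the output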
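A minimal sketch of the prompt-based fallback, using Pydantic to constrain the output shape; the field names are illustrative:

from typing import List
from pydantic import BaseModel, Field

class Citation(BaseModel):
    span: str = Field(description="The exact text being attributed")
    source_index: int = Field(description="Position of the source in the document index")

class CitedText(BaseModel):
    cited_text: str = Field(description="The input text with inline [n] markers inserted")
    citations: List[Citation]

# The JSON schema (CitedText.model_json_schema()) can be supplied to any LLM
# that supports structured or schema-constrained output.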
Contribution
- I'm willing to submit a PR for this feature
- I'm willing to test this feature
- I'm willing to help document this feature
Additional Context
No response