add citation parsers #1955

emrgnt-cmplxty · 2025-02-10T03:20:44Z

Important

Add citation parsing and mapping functionality to enhance RAG responses with structured citation metadata.

Behavior:
- Adds citation parsing functions my_extract_citations, reassign_citations_in_order, and my_map_citations_to_sources in base_utils.py.
- Updates rag() in retrieval_service.py to include citation extraction, re-labeling, and mapping to sources.
- Modifies RAGResponse in responses.py to include citations field.
Models:
- Introduces Citation model in responses.py to represent citation metadata.
- Updates __init__.py in api/models to include Citation in exports.
Misc:
- Renames reasoning_agent to reasoningAgent in retrieval.ts.
- Removes unused modalities code in llm.py.
- Version bump in package.json from 0.4.24 to 0.4.25.

^{This description was created by}^{for 4ad0abd. It will automatically update as commits are pushed.}

ellipsis-dev

❌ Changes requested. Reviewed everything up to 4ad0abd in 2 minutes and 48 seconds

More details

Looked at 715 lines of code in 11 files
Skipped 0 files when reviewing.
Skipped posting 11 drafted comments based on config settings.

1. py/shared/utils/base_utils.py:81

Draft comment:
Consider adding a comment about the chosen regex for citation extraction. It might match unintended patterns in text.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

2. py/shared/utils/base_utils.py:139

Draft comment:
Replacing text from end to start is correct; consider noting potential issues if replacement lengths differ (though descending order minimizes index shift issues).
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

3. py/shared/utils/base_utils.py:155

Draft comment:
Mapping re-extracted citations by new index assumes uniqueness. Consider validating that new citation indices are unique to avoid potential key collisions.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 50%
None

4. py/shared/utils/base_utils.py:422

Draft comment:
In generate_id, it's safer to check 'if label is not None' rather than 'if label != None' to follow Python best practices.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

5. py/shared/utils/base_utils.py:185

Draft comment:
In my_map_citations_to_sources, consider logging a warning when a citation's index is out of range. This could help diagnose issues when a citation remains unmapped.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 50%
None

6. py/shared/utils/base_utils.py:516

Draft comment:
Overall, the citation parsing and mapping functions work as intended. Consider adding type annotations or docstrings for parameters to further improve maintainability.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

7. js/sdk/src/v3/clients/retrieval.ts:192

Draft comment:
Duplicate inclusion of the 'tools' property in the request payload. Remove one of the duplicate '...(options.tools && { tools: options.tools })' blocks to avoid conflicts.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

8. py/core/main/services/retrieval_service.py:523

Draft comment:
The new citation extraction and remapping workflow is well integrated. Consider adding unit tests for edge cases (e.g., overlapping or nested bracket references) to ensure robustness.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50%
While the comment is about code that was changed in the diff (the new citation handling), suggesting unit tests is more of a general best practice suggestion rather than a specific code issue that needs to be fixed. The comment is somewhat speculative ("edge cases...") rather than pointing out a concrete problem. We don't have enough context to know if tests already exist elsewhere.
The suggestion to add tests could be valuable since this is new code handling potentially complex citation parsing. Edge cases in text processing can be a real source of bugs.
However, per the rules, we should not make speculative comments or suggestions without strong evidence of an issue. We don't know if tests already exist, and we don't have concrete evidence of any problems with the current implementation.
The comment should be deleted as it is a general best practice suggestion rather than a specific, actionable issue with strong evidence.

9. py/shared/utils/base_utils.py:38

Draft comment:
The _expand_citation_span_to_sentence function uses a simple loop for sentence boundary detection. Consider using a more robust regex-based approach or a natural language library to handle edge cases in sentence detection.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
The current implementation is intentionally simple and works for the basic case of finding citation boundaries. The docstring explicitly acknowledges it's a simple heuristic that can be refined if needed. The comment suggests a more complex solution without demonstrating why the current approach is insufficient. Over-engineering should be avoided unless there's a clear need.
The comment could be right that there are edge cases the simple approach might miss. Perhaps there are important sentence boundary cases that would cause issues in practice.
While edge cases may exist, there's no evidence provided that they're actually causing problems. The code is intentionally simple and acknowledges its limitations. Adding complexity without demonstrated need violates YAGNI principles.
Delete the comment. While the suggestion is technically valid, it's premature optimization without evidence of actual problems. The code acknowledges its simple nature and leaves room for future refinement if needed.

10. py/shared/utils/base_utils.py:240

Draft comment:
In my_map_citations_to_sources, when mapping graph search results, the call to source_obj.content.model_dump() assumes the content object has this method. Consider adding a safe check or exception handling to avoid runtime errors if the method is absent.
Reason this comment was not posted:
Comment looked like it was already resolved.

11. py/shared/utils/base_utils.py:154

Draft comment:
The reassign_citations_in_order function uses character array replacement for modifying citation references. This approach works but can be sensitive if the text changes unexpectedly. Please ensure thorough tests cover various edge cases.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 20% <= threshold 50%
The comment is asking the PR author to ensure thorough tests cover various edge cases, which violates the rule against asking for confirmation or ensuring behavior. It does provide some insight into the sensitivity of the approach used, but the main focus is on ensuring tests, which is not allowed.

Workflow ID: wflow_zSvMP03AjMS5e7Vk

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev · 2025-02-10T19:26:43Z

py/shared/api/models/retrieval/responses.py

+                    "graph_search_results": [],
+                    "web_search_results": [],
+                },
+                "citations": {


The JSON schema example for Citation includes keys like 'uri', 'title', and 'license' which are not defined in the Citation model. Please update the schema example or extend the model to include these fields.

emrgnt-cmplxty added 5 commits February 9, 2025 19:20

add citation parsers

a3fa75c

rm cruft

c85a35b

add snippet start / end index

d85812f

Make citation class

adcc136

fix js reasoning agent name

4ad0abd

emrgnt-cmplxty marked this pull request as ready for review February 10, 2025 19:23

emrgnt-cmplxty merged commit f2f7065 into main Feb 10, 2025
11 of 15 checks passed

ellipsis-dev bot reviewed Feb 10, 2025

View reviewed changes

emrgnt-cmplxty deleted the feature/refactor-rag-reply-add-citations branch February 12, 2025 20:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

add citation parsers #1955

add citation parsers #1955

Uh oh!

emrgnt-cmplxty commented Feb 10, 2025 •

edited by ellipsis-dev bot

Loading

Uh oh!

Uh oh!

ellipsis-dev bot left a comment

Uh oh!

ellipsis-dev bot Feb 10, 2025

Uh oh!

Uh oh!

add citation parsers #1955

add citation parsers #1955

Uh oh!

Conversation

emrgnt-cmplxty commented Feb 10, 2025 • edited by ellipsis-dev bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot Feb 10, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

emrgnt-cmplxty commented Feb 10, 2025 •

edited by ellipsis-dev bot

Loading