Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add citation parsers #1955

Merged
merged 5 commits into from
Feb 10, 2025
Merged

Conversation

emrgnt-cmplxty
Copy link
Contributor

@emrgnt-cmplxty emrgnt-cmplxty commented Feb 10, 2025

Important

Add citation parsing and mapping functionality to enhance RAG responses with structured citation metadata.

  • Behavior:
    • Adds citation parsing functions my_extract_citations, reassign_citations_in_order, and my_map_citations_to_sources in base_utils.py.
    • Updates rag() in retrieval_service.py to include citation extraction, re-labeling, and mapping to sources.
    • Modifies RAGResponse in responses.py to include citations field.
  • Models:
    • Introduces Citation model in responses.py to represent citation metadata.
    • Updates __init__.py in api/models to include Citation in exports.
  • Misc:
    • Renames reasoning_agent to reasoningAgent in retrieval.ts.
    • Removes unused modalities code in llm.py.
    • Version bump in package.json from 0.4.24 to 0.4.25.

This description was created by Ellipsis for 4ad0abd. It will automatically update as commits are pushed.

@emrgnt-cmplxty emrgnt-cmplxty marked this pull request as ready for review February 10, 2025 19:23
@emrgnt-cmplxty emrgnt-cmplxty merged commit f2f7065 into main Feb 10, 2025
11 of 15 checks passed
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to 4ad0abd in 2 minutes and 48 seconds

More details
  • Looked at 715 lines of code in 11 files
  • Skipped 0 files when reviewing.
  • Skipped posting 11 drafted comments based on config settings.
1. py/shared/utils/base_utils.py:81
  • Draft comment:
    Consider adding a comment about the chosen regex for citation extraction. It might match unintended patterns in text.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
2. py/shared/utils/base_utils.py:139
  • Draft comment:
    Replacing text from end to start is correct; consider noting potential issues if replacement lengths differ (though descending order minimizes index shift issues).
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
3. py/shared/utils/base_utils.py:155
  • Draft comment:
    Mapping re-extracted citations by new index assumes uniqueness. Consider validating that new citation indices are unique to avoid potential key collisions.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50%
    None
4. py/shared/utils/base_utils.py:422
  • Draft comment:
    In generate_id, it's safer to check 'if label is not None' rather than 'if label != None' to follow Python best practices.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
5. py/shared/utils/base_utils.py:185
  • Draft comment:
    In my_map_citations_to_sources, consider logging a warning when a citation's index is out of range. This could help diagnose issues when a citation remains unmapped.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50%
    None
6. py/shared/utils/base_utils.py:516
  • Draft comment:
    Overall, the citation parsing and mapping functions work as intended. Consider adding type annotations or docstrings for parameters to further improve maintainability.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
7. js/sdk/src/v3/clients/retrieval.ts:192
  • Draft comment:
    Duplicate inclusion of the 'tools' property in the request payload. Remove one of the duplicate '...(options.tools && { tools: options.tools })' blocks to avoid conflicts.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
8. py/core/main/services/retrieval_service.py:523
  • Draft comment:
    The new citation extraction and remapping workflow is well integrated. Consider adding unit tests for edge cases (e.g., overlapping or nested bracket references) to ensure robustness.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50%
    While the comment is about code that was changed in the diff (the new citation handling), suggesting unit tests is more of a general best practice suggestion rather than a specific code issue that needs to be fixed. The comment is somewhat speculative ("edge cases...") rather than pointing out a concrete problem. We don't have enough context to know if tests already exist elsewhere.
    The suggestion to add tests could be valuable since this is new code handling potentially complex citation parsing. Edge cases in text processing can be a real source of bugs.
    However, per the rules, we should not make speculative comments or suggestions without strong evidence of an issue. We don't know if tests already exist, and we don't have concrete evidence of any problems with the current implementation.
    The comment should be deleted as it is a general best practice suggestion rather than a specific, actionable issue with strong evidence.
9. py/shared/utils/base_utils.py:38
  • Draft comment:
    The _expand_citation_span_to_sentence function uses a simple loop for sentence boundary detection. Consider using a more robust regex-based approach or a natural language library to handle edge cases in sentence detection.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
    The current implementation is intentionally simple and works for the basic case of finding citation boundaries. The docstring explicitly acknowledges it's a simple heuristic that can be refined if needed. The comment suggests a more complex solution without demonstrating why the current approach is insufficient. Over-engineering should be avoided unless there's a clear need.
    The comment could be right that there are edge cases the simple approach might miss. Perhaps there are important sentence boundary cases that would cause issues in practice.
    While edge cases may exist, there's no evidence provided that they're actually causing problems. The code is intentionally simple and acknowledges its limitations. Adding complexity without demonstrated need violates YAGNI principles.
    Delete the comment. While the suggestion is technically valid, it's premature optimization without evidence of actual problems. The code acknowledges its simple nature and leaves room for future refinement if needed.
10. py/shared/utils/base_utils.py:240
  • Draft comment:
    In my_map_citations_to_sources, when mapping graph search results, the call to source_obj.content.model_dump() assumes the content object has this method. Consider adding a safe check or exception handling to avoid runtime errors if the method is absent.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
11. py/shared/utils/base_utils.py:154
  • Draft comment:
    The reassign_citations_in_order function uses character array replacement for modifying citation references. This approach works but can be sensitive if the text changes unexpectedly. Please ensure thorough tests cover various edge cases.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 20% <= threshold 50%
    The comment is asking the PR author to ensure thorough tests cover various edge cases, which violates the rule against asking for confirmation or ensuring behavior. It does provide some insight into the sensitivity of the approach used, but the main focus is on ensuring tests, which is not allowed.

Workflow ID: wflow_zSvMP03AjMS5e7Vk


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

"graph_search_results": [],
"web_search_results": [],
},
"citations": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON schema example for Citation includes keys like 'uri', 'title', and 'license' which are not defined in the Citation model. Please update the schema example or extend the model to include these fields.

@emrgnt-cmplxty emrgnt-cmplxty deleted the feature/refactor-rag-reply-add-citations branch February 12, 2025 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant