Skip to content

[QUO-899][QUO-898] Support document metadata in Q-SDK#81

Merged
crekhari merged 14 commits intomainfrom
feature/documents-metadata
Mar 31, 2025
Merged

[QUO-899][QUO-898] Support document metadata in Q-SDK#81
crekhari merged 14 commits intomainfrom
feature/documents-metadata

Conversation

@crekhari
Copy link
Contributor

@crekhari crekhari commented Mar 27, 2025

Example code with document metadata:

quotient_logger.log(
    user_query="How do I cook a goose?",
    model_output="The capital of France is Paris",
    documents=[
        "Here is an excellent goose recipe...",
        {"page_content": "Sample document", "metadata": {"source": "google"}},
        {"page_content": "Sample document2"}
    ]
)

Validation errors are shown via logger.

@linear
Copy link

linear bot commented Mar 27, 2025

QUO-899 Modify SDK document arg to accept a union of List of string and list of doc types

List of Doc type - should include page_content, and metadata properties

or a a list of strings - what it currently is

class Document:
   content: str
   metadata: dict

List[Document]

QUO-898 Add documents metadata to detections endpoint

Allow users to pass metadata about their documents (e.g. as a dictionary). For example - Tavily mentioned they want to report where documents are being retrieved from.

Copy link
Member

@freddiev4 freddiev4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crekhari I have a suggestion down below. Also can you make the equivalent PR in quotient-typescript since we have that now? Or ask @waldnzwrld for help

Comment on lines +266 to +269
self.logger.error(f"Documents must be a list of strings or dictionaries with 'page_content' and optional 'metadata' keys. Metadata keys must be strings")
return None
else:
self.logger.error(f"Documents must be a list of strings or dictionaries with 'page_content' and optional 'metadata' keys, got {type(doc)}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something we need to keep in mind: we should not log these errors if there is nothing that the user can do about them, as that will be very confusing and noisy for people to see.

My understanding is these log if we get back basically invalid data from the API and there's a mismatch with the what the client expects (?)

I would suggest we don't log anything here, and instead raise an exception with information about what was returned from the API and what the client expected, and suggest what the user should do about it.

Copy link
Contributor Author

@crekhari crekhari Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked in office, and user errors should be surfaced as quickly and "loudly" as possible. Logging to stout is not as loud as raising exceptions. We want this to be loud because it would suck if the user doesnt see the stout error msg because its cluttered with all their other logs, and then are wondering why logs are not being saved.

from dataclasses import dataclass
from datetime import datetime
import time
from pydantic import BaseModel

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we importing pydantic here? All other classes in the sdk are instantiated without it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@waldnzwrld had a quick chat in the office: it'd be good to use pydantic for client-side validation over the dataclasses. Dataclasses are useful but don't do type validation.

@crekhari can you make a separate PR to replace stuff with pydantic entirely?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be better for us to replace everything in one go rather than doing it piece-meal

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wholeheartedly agree that if we're going to do it, we should just go for it.

If nobody has opened an issue for it, I will

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for opening the issue, Ill assign it to myself for now, but its a lowish priority for me rn

Copy link

@waldnzwrld waldnzwrld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please update the tests in tests/resources/logs and tests/client to make sure your changes are behaving as expected?

@crekhari
Copy link
Contributor Author

Could you please update the tests in tests/resources/logs and tests/client to make sure your changes are behaving as expected?

Will do!

@@ -1,2 +1,3 @@
[pytest]
addopts = -s
asyncio_default_fixture_loop_scope = function
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get a Deprecation warning when running tests (poetry run pytest --cov=quotientai tests/ --cov-report term-missing):

PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

Setting it explicitly makes it go away. Setting the event loop scope to function means a new event loop will be created for each test function that uses an async fixture

Comment on lines +265 to +269
except Exception as _:
self._logger.error(f"Documents must be a list of strings or dictionaries with 'page_content' and optional 'metadata' keys. Metadata keys must be strings")
return None
else:
self._logger.error(f"Documents must be a list of strings or dictionaries with 'page_content' and optional 'metadata' keys, got {type(doc)}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this still needs to be addressed? #81 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

crekhari and others added 2 commits March 31, 2025 10:34
Co-authored-by: Freddie Vargus <freddie@quotientai.co>
@waldnzwrld waldnzwrld dismissed their stale review March 31, 2025 15:10

Tests are in place and looking good

Copy link

@waldnzwrld waldnzwrld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non blocking but it is good to make sure that we have a happy case path in tests as well

mock_random.return_value = 0.6
assert logger._should_sample() is False

def test_log_with_invalid_document_dict(self):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a happy case test, (A test where the document data is valid) for both the sync and async client arch

@crekhari crekhari requested a review from freddiev4 March 31, 2025 16:54
Copy link
Member

@freddiev4 freddiev4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM one request below and think good to 🚢

Comment on lines +265 to +267
raise QuotientAIError("Documents must be a list of strings or dictionaries with 'page_content' and optional 'metadata' keys. Metadata keys must be strings")
else:
raise QuotientAIError(f"Documents must be a list of strings or dictionaries with 'page_content' and optional 'metadata' keys, got {type(doc)}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make these more actionable? Like in TS

@freddiev4
Copy link
Member

freddiev4 commented Mar 31, 2025

After this get's merged, can you cut a release @crekhari ? Do you know what the steps are?

@crekhari crekhari merged commit 5646dc4 into main Mar 31, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants