Conversation

@satyaog commented on Aug 19, 2025

No description provided.

@satyaog force-pushed the pdf branch 4 times, most recently from b9742d4 to 3f5c822, on August 20, 2025 14:48

@dataclass
class Refine:
    pdf: TaggedSubclass[Prompt] = field(default_factory=GenAIPrompt)
Member Author

Not sure if TaggedSubclass is needed here?

Member

If multiple classes are possible, yes.
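
As a rough illustration of why a tag is needed when several subclasses are possible, here is a minimal plain-Python sketch. It does not use serieux and is not how TaggedSubclass is implemented; the `"class"` key, the `build_tagged` helper and the `OtherPrompt` class are all hypothetical, and the default values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Prompt:
    """Base class; concrete prompt backends subclass it."""


@dataclass
class GenAIPrompt(Prompt):
    model: str = "placeholder-model"  # placeholder default, for illustration only


@dataclass
class OtherPrompt(Prompt):  # hypothetical second backend
    endpoint: str = "http://localhost"


def build_tagged(base: type, data: dict):
    """Build the direct subclass of `base` named by data["class"].

    The "class" key is an assumed tag name for this sketch, not serieux's format.
    """
    fields = dict(data)  # copy so the caller's dict is not mutated
    tag = fields.pop("class")
    candidates = {sub.__name__: sub for sub in base.__subclasses__()}
    if tag not in candidates:
        raise ValueError(f"unknown {base.__name__} subclass: {tag!r}")
    return candidates[tag](**fields)


prompt = build_tagged(Prompt, {"class": "GenAIPrompt", "model": "some-model"})
```

Without the tag, the configuration alone could not say which subclass to build; with a single possible class, the tag carries no information.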

@dataclass
class GenAIPrompt(Prompt):
    model: str
    client: genai.Client = field(default_factory=genai.Client)
Member Author

Right now, either the GEMINI_API_KEY or the GOOGLE_API_KEY environment variable has to be set for genai.Client to be instantiated. Would you put the key in the config or elsewhere?

Member

I would put it in the config. It can be set to ${env:GEMINI_API_KEY} to enable the other kind of workflow.
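
To make the `${env:GEMINI_API_KEY}` idea concrete, here is a small sketch of what resolving such a placeholder amounts to. The configuration loader used by the project presumably handles this itself; the `resolve_env_refs` helper below is hypothetical and only illustrates the behavior.

```python
import os
import re

# Matches placeholders of the form ${env:NAME}
_ENV_REF = re.compile(r"\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_refs(value: str, environ=None) -> str:
    """Replace every ${env:NAME} in `value` with the value of variable NAME.

    Hypothetical helper for illustration; raises KeyError if NAME is unset.
    """
    environ = os.environ if environ is None else environ

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in environ:
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]

    return _ENV_REF.sub(substitute, value)


# A config value can be a literal key or a reference to the environment.
fake_env = {"GEMINI_API_KEY": "key-from-env"}
api_key = resolve_env_refs("${env:GEMINI_API_KEY}", fake_env)
```

With this shape, a literal key in the config and a key taken from the environment go through the same field, so both workflows stay available.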

import gifnoc
from serieux import TaggedSubclass

from paperoni.prompt import GenAIPrompt, Prompt
Member

For consistency, we should stick to relative imports.


@dataclass
class Paper:
-    title: str
+    title: str | None
Member

In which situation would the title ever be None?


def cleanup_schema(schema: dict) -> dict:
    # Recursively clean up the schema, removing $schema and unsupported additionalProperties
    for key in ["$schema", "additionalProperties"]:
Member

serieux.schema won't set $schema when calling compile(root=False)
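
For reference, a recursive cleanup of this kind can be sketched as below. Only the first lines of the PR's function appear in the excerpt, so this is not its actual body; the `strip_schema_keys` name and the copy-instead-of-mutate behavior are assumptions of the sketch.

```python
def strip_schema_keys(schema, unwanted=("$schema", "additionalProperties")):
    """Return a copy of a JSON-schema-like structure without the unwanted keys.

    Hypothetical helper: walks nested dicts and lists and drops any dict key
    listed in `unwanted`, at every depth.
    """
    if isinstance(schema, dict):
        return {
            key: strip_schema_keys(value, unwanted)
            for key, value in schema.items()
            if key not in unwanted
        }
    if isinstance(schema, list):
        return [strip_schema_keys(item, unwanted) for item in schema]
    return schema  # scalars are returned unchanged


example = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "additionalProperties": False,
    "properties": {"title": {"type": "string"}},
}
cleaned = strip_schema_keys(example)
```

One limitation to keep in mind: a property that is itself named `additionalProperties` or `$schema` under `properties` would also be dropped by this naive walk.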

Comment on lines +56 to +57
# List of all affiliations present in the Deep Learning scientific paper
affiliations: list[Explained[str]]
Member

This seems redundant. Does it help with accuracy?

Member Author

This is only used for validation purposes. It lets us check whether the model correctly identified the list of affiliations, it makes checking an author's affiliation easier because only the index next to the author has to be checked instead of the affiliation name itself, and it helps in building the validation set for performance checking (in the utility I built, I first build the affiliations list, then enter 1 to n on the keyboard to set the affiliation). I'm not sure whether it helps accuracy, but it would be useful to check. If it doesn't help, we could keep it only for building the validation set and remove it when prompting through paperoni.
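
The index-based check described above could look roughly like this. The `AuthorEntry` structure, the 1-based indexing and the `check_affiliation_indices` helper are assumptions made for the sketch; the PR's actual data model (for example `Explained[str]`) is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorEntry:  # hypothetical structure, not the PR's model
    name: str
    # 1-based positions into the paper-level affiliations list (assumed convention)
    affiliation_indices: list[int] = field(default_factory=list)


def check_affiliation_indices(
    affiliations: list[str], authors: list[AuthorEntry]
) -> list[str]:
    """Return one message per author index that falls outside the affiliations list."""
    problems = []
    for author in authors:
        for index in author.affiliation_indices:
            if not 1 <= index <= len(affiliations):
                problems.append(
                    f"{author.name}: index {index} is outside 1..{len(affiliations)}"
                )
    return problems


affiliations = ["Mila", "Université de Montréal"]
authors = [AuthorEntry("A. Author", [1, 2]), AuthorEntry("B. Author", [3])]
problems = check_affiliation_indices(affiliations, authors)
```

A reviewer building a validation set then only has to compare small integers per author rather than full affiliation strings.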

Member

Does this have to be under source control? It can be downloaded when the tests run.
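
One way to fetch such a file on demand instead of committing it is sketched below. The `ensure_file` helper, the URL and the destination path are hypothetical, since the thread does not say which file is meant or where it is hosted.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_file(url: str, dest: Path, fetch=urlretrieve) -> Path:
    """Download `url` to `dest` unless `dest` already exists, then return `dest`.

    `fetch` is injectable so tests can avoid real network access.
    """
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        fetch(url, dest)
    return dest


# Hypothetical usage inside a test fixture (placeholder URL and path):
# pdf_path = ensure_file("https://example.invalid/paper.pdf", Path(".cache/paper.pdf"))
```

The trade-off is that the tests then depend on network access and on the remote file staying available, at least on the first run.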

Member

What is the purpose of having this file under version control?

@breuleux merged commit c2467e9 into mila-iqia:v3 on Aug 26, 2025
2 checks passed
@satyaog deleted the pdf branch on October 7, 2025 19:47