-
Notifications
You must be signed in to change notification settings - Fork 18
Add pdf refine #75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pdf refine #75
Conversation
b9742d4 to
3f5c822
Compare
|
|
||
| @dataclass | ||
| class Refine: | ||
| pdf: TaggedSubclass[Prompt] = field(default_factory=GenAIPrompt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if TaggedSubclass is needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If multiple classes are possible, yes.
src/paperoni/prompt.py
Outdated
| @dataclass | ||
| class GenAIPrompt(Prompt): | ||
| model: str | ||
| client: genai.Client = field(default_factory=genai.Client) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now the env var GEMINI_API_KEY or GOOGLE_API_KEY have to be set for the genai.Client to be instantiated. Would you put the key in the config or elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put it in the config. It can be set to ${env:GEMINI_API_KEY} to enable the other kind of workflow.
6bfb6ad to
2cb09bc
Compare
src/paperoni/config.py
Outdated
| import gifnoc | ||
| from serieux import TaggedSubclass | ||
|
|
||
| from paperoni.prompt import GenAIPrompt, Prompt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, we should stick to relative imports.
|
|
||
| @dataclass | ||
| class Refine: | ||
| pdf: TaggedSubclass[Prompt] = field(default_factory=GenAIPrompt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If multiple classes are possible, yes.
src/paperoni/model/classes.py
Outdated
| @dataclass | ||
| class Paper: | ||
| title: str | ||
| title: str | None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which situation would the title ever be None?
|
|
||
| def cleanup_schema(schema: dict) -> dict: | ||
| # recusively clean up the schema removing $schema and unsupported additionalProperties | ||
| for key in ["$schema", "additionalProperties"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
serieux.schema won't set $schema when calling compile(root=False)
src/paperoni/prompt.py
Outdated
| @dataclass | ||
| class GenAIPrompt(Prompt): | ||
| model: str | ||
| client: genai.Client = field(default_factory=genai.Client) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put it in the config. It can be set to ${env:GEMINI_API_KEY} to enable the other kind of workflow.
| # List of all affiliations present in the Deep Learning scientific paper | ||
| affiliations: list[Explained[str]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems redundant. Does it help with accuracy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only used for validation purpose. It allows to check if the model correctly identified the affiliations list, ease the check of an affiliation by checking only the index next to the author instead of the affiliation name itself and helps in building the validation set for performance checking (in the utility I built I first build the affiliations list then enter 1 to n on the keyboard to set the affiliation). Not sure if it help or not but it would be useful to check. If it doesn't help, we could keep it only for building the validation set and remove it when prompting through paperoni
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to be under source control? It can be downloaded when the tests run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of having this file under version control?
No description provided.