Conversation

Member

@satyaog satyaog commented Sep 9, 2025

No description provided.

@satyaog satyaog force-pushed the fastapi branch 5 times, most recently from 5d470e6 to 0ebe96d Compare September 12, 2025 14:25
@satyaog satyaog mentioned this pull request Sep 12, 2025
@satyaog satyaog force-pushed the fastapi branch 13 times, most recently from ddcf07d to 4569181 Compare September 23, 2025 21:07
.cursorignore Outdated
@@ -0,0 +1 @@
+**secrets**
Member

Same comment as last time, this file should just be in .gitignore

Member Author

Fixed

pyproject.toml Outdated
sync = "uv sync --active --all-extras {args}"
-make-secrets = "serieux patch -m 'paperoni.config:gifnoc_model' -f {env:PAPERONI_SECRETS} -o {env:GIFNOC_FILE} -p {env:SERIEUX_PASSWORD}"
+make-secrets = [
+    'serieux patch -m "paperoni.config:gifnoc_model" -f "{env:PAPERONI_SECRETS}" -o "$(echo "{env:GIFNOC_FILE}" | cut -d"," -f1)" -p "{env:SERIEUX_PASSWORD}"',
Member

serieux patch recognizes $SERIEUX_PASSWORD; if it's already in the environment, there shouldn't be a need to use -p.

Member Author

Ah yes, redundant

Comment on lines 39 to 44
# Page of results to display
page: int = None
# Number of results to display per page
per_page: int = None
# Max number of papers to show
limit: int = 100
Member

I think limit is best left to the user to deal with themselves. Personally I would use offset and size, with the semantics of returning papers offset to offset+size. I think it's a bit more flexible and allows the user to implement a limit themselves very easily if they want to.

Also, there should be a configurable limit to size or per_page, so that the user can't set it to e.g. 10000.
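
A minimal sketch of that paging scheme, assuming a hypothetical MAX_PAGE_SIZE taken from the server config and a placeholder run_search:

from fastapi import FastAPI, Query

app = FastAPI()

MAX_PAGE_SIZE = 100  # hypothetical cap; would come from the server config

@app.get("/search")
async def search(
    offset: int = Query(default=0, ge=0),
    size: int = Query(default=25, ge=1, le=MAX_PAGE_SIZE),
):
    # Return papers offset to offset+size; FastAPI rejects out-of-range values.
    papers = run_search()  # placeholder for the actual collection query
    return {"papers": papers[offset : offset + size], "total": len(papers)}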

Member Author

Fixed

if per_page is None:
per_page = self.per_page
if limit is None:
limit = self.limit or len(iterable)
Member

len is not part of the Iterable protocol (you can't get the len of a generator, for example). Implementations of Sequence have lengths, for what it's worth.
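
One way to guard this, as a sketch (safe_len is a hypothetical helper, not code from this PR):

from collections.abc import Sized

def safe_len(iterable, default=None):
    # Sequences and other Sized containers implement __len__;
    # generators and arbitrary iterables do not.
    if isinstance(iterable, Sized):
        return len(iterable)
    return default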

Member Author

Yeah, that was weird. I should have wrapped that in a try/except, but I was a bit lazy.

def user_id(self) -> str:
"""Get user id."""
email_hash = hashlib.sha256(self.email.encode()).hexdigest()
return f"{self.email.split('@')[0]}_{email_hash[:8]}"
Member

Why is the user_id not just the email?

Member Author

Yeah, user_id might not be the best name. It's only used to get a filesystem-friendly name.

Member Author

Removed, as it is not used anymore: users will not have their own work file but will share a global work file. If we decide to decentralize the work later we can think of a proper solution.

return SearchResponse(papers=results, total=len(coll.collection))

except Exception as e:
raise HTTPException(status_code=500, detail=f"Search failed: {str(e)}")
Member

FastAPI will produce error 500 even if you don't raise this. Error 500 is basically "uh oh something bad happened and we don't know why" and I don't think it's useful to give the user any more information. They won't be able to do anything about it and if we're unlucky the error message might contain something they shouldn't be allowed to see. It'll be in the logs for us to look at, and that's what matters.

If the user's request is bad, this should result in a different error code, typically 400 or in the 400-499 range, and the message should indicate clearly what the user did wrong.
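
Roughly this shape, as a sketch (BadQueryError and run_search are hypothetical stand-ins for a user-error type and the actual query):

from fastapi import HTTPException

@app.get("/search")
async def search_papers(request: SearchRequest):
    try:
        results = run_search(request)
    except BadQueryError as e:
        # The user can act on this, so say clearly what was wrong.
        raise HTTPException(status_code=400, detail=str(e))
    # Anything unexpected propagates: FastAPI returns a bare 500 and the
    # traceback lands in the server logs for us to look at.
    return SearchResponse(papers=results, total=len(results))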

Member Author

Ah, good point. FastAPI will also validate the user's input based on the dataclasses, so for now we can let FastAPI handle that for us?

Member Author

We can probably add proper logging in the app with a Sentry integration in a future PR?

async def download_fulltext(request: DownloadFulltextRequest):
"""Download fulltext for a paper."""
try:
pdf = request.run()
Member

This is blocking, which is not ideal.

Also, I'm not sure we want users to have access to this command. At the very least cache_policy=no_download should be forced unless the user is an admin.

We do want to allow downloading PDFs from our collection, but I would rather serve it from /download/HASH.pdf and include these links in the paper data we return from /search.
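
A sketch of that endpoint, assuming a hypothetical pdf_dir (a pathlib.Path) holding the cached PDFs keyed by hash:

from fastapi import HTTPException
from fastapi.responses import FileResponse

@app.get("/download/{pdf_hash}.pdf")
async def download_pdf(pdf_hash: str):
    # Only serve files already in the cache; never trigger a download here.
    path = pdf_dir / f"{pdf_hash}.pdf"
    if not path.exists():
        raise HTTPException(status_code=404, detail="No cached PDF for this hash.")
    return FileResponse(path, media_type="application/pdf")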

Member Author

Yes, I might be wrong, but wouldn't we need to have all the commands from __main__ and below as async to properly control the blocking within the webapp? I think I've seen a lib that handles the different async loops (at least default Python and FastAPI), so maybe we could have async functions around the code. For now I've forced the cache_policy=no_download policy.

Member

In that case, all methods from Fetcher would have to be async (httpx supports that), which would contaminate basically everything. I think it'd be simpler to use a thread pool and send long requests to it. Alternatively, we could (should?) start them in a separate process, to ensure they can't take down the server.

Only a limited number of long calls should be available from the API anyway, and they should be private and admin-restricted. They are just a nice-to-have, so the rest of the framework shouldn't be architected around them. Paper downloading, insofar as we want to make it available to users, can be special-cased.
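
The thread-pool variant could look like this sketch (run_in_threadpool is Starlette's helper, re-exported by FastAPI; the route path is hypothetical, while DownloadFulltextRequest and get_current_admin come from this PR):

from fastapi.concurrency import run_in_threadpool

@app.post("/admin/download-fulltext")
async def download_fulltext(
    request: DownloadFulltextRequest, user: User = Depends(get_current_admin)
):
    # request.run() blocks on network I/O; hand it to a worker thread so the
    # event loop stays free to serve other requests.
    pdf = await run_in_threadpool(request.run)
    return {"pdf": pdf}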

Comment on lines 470 to 490
@app.get(
"/work/include",
response_model=IncludeResponse,
)
async def work_include_papers(
request: IncludeRequest = None, user: User = Depends(get_current_admin)
):
"""Search for papers in the collection."""
request = request or IncludeRequest()

work_file = config.server.client_dir / user.user_id() / "work.json"

work = Work(command=None, work_file=work_file)

try:
added = request.run(work)

return IncludeResponse(total=added)

except Exception as e:
raise HTTPException(status_code=500, detail=f"Include failed: {str(e)}")
Member

Not sure how this is supposed to work, given that there appears to be no way to populate the work file in the first place.

My idea of the /include endpoint was rather that the request would contain a serialized Paper object (or a list) to add directly to the collection. If that paper includes an id then it would overwrite the existing paper.
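
As a sketch of that idea (treating coll.add as an upsert is an assumption about the collection API):

@app.post("/include")
async def include_papers(
    papers: list[Paper], user: User = Depends(get_current_admin)
):
    # A paper carrying an id overwrites the existing entry;
    # the rest are inserted as new records.
    for paper in papers:
        coll.add(paper)
    return IncludeResponse(total=len(papers))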

Member Author

It would probably be a good idea for each user to have their own work file if they want to work on it? And maybe have a separate one for the system's work file? Then the database will handle rejecting the inclusion of the outdated paper and ask the user to merge the data?

Member

I mean, if a user wants their own work file, can't they just run Paperoni locally? The use case for include is when we want to push a non-discoverable paper, a manual update, things like that.

Member Author
@satyaog satyaog Oct 1, 2025

I'm not sure I follow. Running paperoni locally with access to the paperoni database? Would that make everyone an admin?

But yes, I thought we could actually use the user's work file to store work-in-progress data (a non-discoverable paper, a manual update, ...)? But there's indeed a missing interface to add / update a paper in the work file.

Member Author
@satyaog satyaog Oct 3, 2025

Added a /work/add endpoint to allow users to add / modify a paper in the collection.

Comment on lines 321 to 325
focus_file = config.server.client_dir / "focuses.yaml"

if not focus_file.exists():
focus_file.parent.mkdir(exist_ok=True, parents=True)
dump(Focuses, config.focuses, dest=focus_file)
Member

This should ideally point to the same file that the CLI will use. There is a danger of desynchronization here.

Member Author

Yeah, I don't know how to retrieve that from the config file right now, as the config only contains the values (merged with the autofocuses) and not the actual source file :(. Do you think serieux could hold a special _serieux_source or something like that where we could get the source file?

So the idea right now would be to have that hardcoded file path here used in the config. Maybe not allow the user to set one, and create a file relative to the config. Same goes for the autofocuses file, which should not be set manually. Nothing is enforced right now, as I wanted to discuss what could be the best solution.

Member

What do we need the file for? We're not editing it, just the autofocuses.

Member Author
@satyaog satyaog Oct 1, 2025

Yes, that would be particularly for the autofocuses file, but since focuses and autofocuses are being merged into config.focuses (without a filename, only the merged content), I don't have a way to retrieve the filename, or the content, of only autofocuses. If we hardcode the focuses and autofocuses files to names relative to the config, then I suppose we could use GIFNOC_FILE or the config argument to get the main configuration file location, and thus the autofocuses filename.

Comment on lines 387 to 388
except Exception:
raise HTTPException(status_code=401, detail="Google authentication failed.")
Member

We might want to catch the exact exception type that failed auth raises, or at least log the exception to have better visibility.

Member Author

Modified the code a bit. Do you think rapporteur could be a solution here? How would you configure it for now?

Member

Your modification is good, I don't think there's a need for anything more than that.

For logging exceptions, the standard logging.exception should suffice. We don't necessarily want a Slack notification when it happens. If we do want Slack notifications we need to work on a way to throttle the messages we report in order to make sure we don't flood the channel with errors every time an endpoint is hit. That could be done in rapporteur, I'm just not sure we want to.
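
Roughly like this sketch, with verify_google_token standing in for the actual check:

import logging

from fastapi import HTTPException

logger = logging.getLogger(__name__)

async def authenticate(token: str) -> User:
    try:
        return await verify_google_token(token)  # stand-in for the real check
    except Exception:
        # The full traceback goes to the server logs; the client only sees the 401.
        logger.exception("Google authentication failed")
        raise HTTPException(status_code=401, detail="Google authentication failed.")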

* Use offset / size for paging requests and uniformize responses with total / count and next_offset
* Do not initiate the download of PDFs from the REST API
* Share the work file between the admins
* Add a lock mechanism for user/admin operations
* Fix search_papers total
* Add /work/add
* Organize focus file locations to be able to find the autofocuses file
@satyaog satyaog merged commit c9ba284 into mila-iqia:v3 Oct 7, 2025
2 checks passed
@satyaog satyaog deleted the fastapi branch October 7, 2025 19:46