Conversation

Member

@satyaog satyaog commented Sep 9, 2025

No description provided.

@satyaog satyaog force-pushed the fastapi branch 5 times, most recently from 5d470e6 to 0ebe96d Compare September 12, 2025 14:25
@satyaog satyaog mentioned this pull request Sep 12, 2025
@satyaog satyaog force-pushed the fastapi branch 13 times, most recently from ddcf07d to 4569181 Compare September 23, 2025 21:07
.cursorignore Outdated
@@ -0,0 +1 @@
+**secrets**
Member

Same comment as last time, this file should just be in .gitignore

Member Author

Fixed

pyproject.toml Outdated
sync = "uv sync --active --all-extras {args}"
-make-secrets = "serieux patch -m 'paperoni.config:gifnoc_model' -f {env:PAPERONI_SECRETS} -o {env:GIFNOC_FILE} -p {env:SERIEUX_PASSWORD}"
+make-secrets = [
+    'serieux patch -m "paperoni.config:gifnoc_model" -f "{env:PAPERONI_SECRETS}" -o "$(echo "{env:GIFNOC_FILE}" | cut -d"," -f1)" -p "{env:SERIEUX_PASSWORD}"',
Member

serieux patch recognizes $SERIEUX_PASSWORD; if it's already in the environment, there shouldn't be a need to use -p.

Member Author

Ah yes, redundant

Comment on lines 39 to 44
# Page of results to display
page: int = None
# Number of results to display per page
per_page: int = None
# Max number of papers to show
limit: int = 100
Member

I think limit is best left to the user to deal with themselves. Personally I would use offset and size, with the semantics of returning papers offset to offset+size. I think it's a bit more flexible and allows the user to implement a limit themselves very easily if they want to.

Also, there should be a configurable limit to size or per_page, so that the user can't set it to e.g. 10000.
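
A minimal sketch of that paging scheme, assuming a hypothetical MAX_PAGE_SIZE taken from the server config and a placeholder run_search:

from fastapi import FastAPI, Query

app = FastAPI()

MAX_PAGE_SIZE = 100  # hypothetical cap; would come from the server config

@app.get("/search")
async def search(
    offset: int = Query(default=0, ge=0),
    size: int = Query(default=25, ge=1, le=MAX_PAGE_SIZE),
):
    # Return papers offset to offset+size; FastAPI rejects out-of-range values.
    papers = run_search()  # placeholder for the actual collection query
    return {"papers": papers[offset : offset + size], "total": len(papers)}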

Member Author

Fixed

if per_page is None:
per_page = self.per_page
if limit is None:
limit = self.limit or len(iterable)
Member

len is not part of the Iterable protocol (you can't get the len of a generator, for example). Implementations of Sequence have lengths, for what it's worth.
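
One way to guard this, as a sketch (safe_len is a hypothetical helper, not code from this PR):

from collections.abc import Sized

def safe_len(iterable, default=None):
    # Sequences and other Sized containers implement __len__;
    # generators and arbitrary iterables do not.
    if isinstance(iterable, Sized):
        return len(iterable)
    return default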

Member Author

Yeah, that was weird. I should have wrapped that in a try/except, but I was a bit lazy.

def user_id(self) -> str:
"""Get user id."""
email_hash = hashlib.sha256(self.email.encode()).hexdigest()
return f"{self.email.split('@')[0]}_{email_hash[:8]}"
Member

Why is the user_id not just the email?

Member Author

Yeah, user_id might not be the best name. It's only used to get a filesystem-friendly name.

Member Author

Removed, as it is not used anymore: users will not have their own work file but will share a global work file. If we decide to decentralize the work later we can think of a proper solution.

return SearchResponse(papers=results, total=len(coll.collection))

except Exception as e:
raise HTTPException(status_code=500, detail=f"Search failed: {str(e)}")
Member

FastAPI will produce error 500 even if you don't raise this. Error 500 is basically "uh oh something bad happened and we don't know why" and I don't think it's useful to give the user any more information. They won't be able to do anything about it and if we're unlucky the error message might contain something they shouldn't be allowed to see. It'll be in the logs for us to look at, and that's what matters.

If the user's request is bad, this should result in a different error code, typically 400 or in the 400-499 range, and the message should indicate clearly what the user did wrong.
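
Roughly this shape, as a sketch (BadQueryError and run_search are hypothetical stand-ins for a user-error type and the actual query):

from fastapi import HTTPException

@app.get("/search")
async def search_papers(request: SearchRequest):
    try:
        results = run_search(request)
    except BadQueryError as e:
        # The user can act on this, so say clearly what was wrong.
        raise HTTPException(status_code=400, detail=str(e))
    # Anything unexpected propagates: FastAPI returns a bare 500 and the
    # traceback lands in the server logs for us to look at.
    return SearchResponse(papers=results, total=len(results))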

Member Author

Ah, good point. FastAPI will also validate the user's input based on the dataclasses, so for now we can let FastAPI handle that for us?

Member Author

We can probably add proper logging in the app with a Sentry integration in a future PR?

async def download_fulltext(request: DownloadFulltextRequest):
"""Download fulltext for a paper."""
try:
pdf = request.run()
Member

This is blocking, which is not ideal.

Also, I'm not sure we want users to have access to this command. At the very least cache_policy=no_download should be forced unless the user is an admin.

We do want to allow downloading PDFs from our collection, but I would rather serve it from /download/HASH.pdf and include these links in the paper data we return from /search.
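
A sketch of that endpoint, assuming a hypothetical pdf_dir (a pathlib.Path) holding the cached PDFs keyed by hash:

from fastapi import HTTPException
from fastapi.responses import FileResponse

@app.get("/download/{pdf_hash}.pdf")
async def download_pdf(pdf_hash: str):
    # Only serve files already in the cache; never trigger a download here.
    path = pdf_dir / f"{pdf_hash}.pdf"
    if not path.exists():
        raise HTTPException(status_code=404, detail="No cached PDF for this hash.")
    return FileResponse(path, media_type="application/pdf")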

Member Author

Yes, I might be wrong, but wouldn't we need to have all the commands from __main__ and below as async to properly control the blocking within the webapp? I think I've seen a lib that handles the different async loops (at least default Python and FastAPI), so maybe we could have async functions around the code. For now I've forced the cache_policy=no_download policy.

Member

In that case, all methods from Fetcher would have to be async (httpx supports that), which would contaminate basically everything. I think it'd be simpler to use a thread pool and send long requests to it. Alternatively, we could (should?) start them in a separate process, to ensure they can't take down the server.

Only a limited number of long calls should be available from the API anyway, and they should be private and admin-restricted. They are just a nice-to-have, so the rest of the framework shouldn't be architected around them. Paper downloading, insofar as we want to make it available to users, can be special-cased.
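
The thread-pool variant could look like this sketch (run_in_threadpool is Starlette's helper, re-exported by FastAPI; the route path is hypothetical, while DownloadFulltextRequest and get_current_admin come from this PR):

from fastapi.concurrency import run_in_threadpool

@app.post("/admin/download-fulltext")
async def download_fulltext(
    request: DownloadFulltextRequest, user: User = Depends(get_current_admin)
):
    # request.run() blocks on network I/O; hand it to a worker thread so the
    # event loop stays free to serve other requests.
    pdf = await run_in_threadpool(request.run)
    return {"pdf": pdf}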

Comment on lines 470 to 490
@app.get(
"/work/include",
response_model=IncludeResponse,
)
async def work_include_papers(
request: IncludeRequest = None, user: User = Depends(get_current_admin)
):
"""Search for papers in the collection."""
request = request or IncludeRequest()

work_file = config.server.client_dir / user.user_id() / "work.json"

work = Work(command=None, work_file=work_file)

try:
added = request.run(work)

return IncludeResponse(total=added)

except Exception as e:
raise HTTPException(status_code=500, detail=f"Include failed: {str(e)}")
Member

Not sure how this is supposed to work, given that there appears to be no way to populate the work file in the first place.

My idea of the /include endpoint was rather that the request would contain a serialized Paper object (or a list) to add directly to the collection. If that paper includes an id then it would overwrite the existing paper.
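
As a sketch of that idea (treating coll.add as an upsert is an assumption about the collection API):

@app.post("/include")
async def include_papers(
    papers: list[Paper], user: User = Depends(get_current_admin)
):
    # A paper carrying an id overwrites the existing entry;
    # the rest are inserted as new records.
    for paper in papers:
        coll.add(paper)
    return IncludeResponse(total=len(papers))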

Member Author

It would probably be a good idea for each user to have their own work file if they want to work on it? And maybe have a separate one for the system's work file? Then the database will handle rejecting the inclusion of the outdated paper and ask the user to merge the data?

Member

I mean, if a user wants their own work file, can't they just run Paperoni locally? The use case for include is when we want to push a non-discoverable paper, a manual update, things like that.

Member Author
@satyaog satyaog Oct 1, 2025

I'm not sure I follow. Running paperoni locally with access to the paperoni database? Would that make everyone an admin?

But yes, I thought we could actually use the user's work file to store work-in-progress data (a non-discoverable paper, a manual update, ...)? But there's indeed a missing interface to add / update a paper in the work file.

Member Author
@satyaog satyaog Oct 3, 2025

Added a /work/add endpoint to allow users to add / modify a paper in the collection.

Comment on lines 321 to 325
focus_file = config.server.client_dir / "focuses.yaml"

if not focus_file.exists():
focus_file.parent.mkdir(exist_ok=True, parents=True)
dump(Focuses, config.focuses, dest=focus_file)
Member

This should ideally point to the same file that the CLI will use. There is a danger of desynchronization here.

Member Author

Yeah, I don't know how to retrieve that from the config file right now, as the config only contains the values (merged with the autofocuses) and not the actual source file :(. Do you think serieux could hold a special _serieux_source or something like that where we could get the source file?

So the idea right now would be to have that hardcoded file path here used in the config. Maybe not allow the user to set one, and create a file relative to the config. Same goes for the autofocuses file, which should not be set manually. Nothing is enforced right now, as I wanted to discuss what could be the best solution.

Member

What do we need the file for? We're not editing it, just the autofocuses.

Member Author
@satyaog satyaog Oct 1, 2025

Yes, that would be particularly for the autofocuses file, but since focuses and autofocuses are being merged into config.focuses (without a filename, only the merged content), I don't have a way to retrieve the filename, or the content, of only autofocuses. If we hardcode the focuses and autofocuses files to names relative to the config, then I suppose we could use GIFNOC_FILE or the config argument to get the main configuration file location, and thus the autofocuses filename.

Comment on lines 387 to 388
except Exception:
raise HTTPException(status_code=401, detail="Google authentication failed.")
Member

We might want to catch the exact exception type that failed auth raises, or at least log the exception to have better visibility.

Member Author

Modified the code a bit. Do you think rapporteur could be a solution here? How would you configure it for now?

Member

Your modification is good, I don't think there's a need for anything more than that.

For logging exceptions, the standard logging.exception should suffice. We don't necessarily want a Slack notification when it happens. If we do want Slack notifications we need to work on a way to throttle the messages we report in order to make sure we don't flood the channel with errors every time an endpoint is hit. That could be done in rapporteur, I'm just not sure we want to.
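
Roughly like this sketch, with verify_google_token standing in for the actual check:

import logging

from fastapi import HTTPException

logger = logging.getLogger(__name__)

async def authenticate(token: str) -> User:
    try:
        return await verify_google_token(token)  # stand-in for the real check
    except Exception:
        # The full traceback goes to the server logs; the client only sees the 401.
        logger.exception("Google authentication failed")
        raise HTTPException(status_code=401, detail="Google authentication failed.")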

* Use offset / size for paging requests and uniformize responses with total / count and next_offset
* Do not initiate the download of PDFs from the REST API
* Share the work file between the admins
* Add a lock mechanism for user/admin operations
* Fix search_papers total
* Add /work/add
* Organize focus file locations to be able to find the autofocuses file
@satyaog satyaog merged commit c9ba284 into mila-iqia:v3 Oct 7, 2025
2 checks passed
@satyaog satyaog deleted the fastapi branch October 7, 2025 19:46