Add REST api #86
Conversation
.cursorignore
Outdated
```
@@ -0,0 +1 @@
**secrets**
```
Same comment as last time, this file should just be in .gitignore
Fixed
pyproject.toml
Outdated
```toml
sync = "uv sync --active --all-extras {args}"
make-secrets = "serieux patch -m 'paperoni.config:gifnoc_model' -f {env:PAPERONI_SECRETS} -o {env:GIFNOC_FILE} -p {env:SERIEUX_PASSWORD}"
make-secrets = [
    'serieux patch -m "paperoni.config:gifnoc_model" -f "{env:PAPERONI_SECRETS}" -o "$(echo "{env:GIFNOC_FILE}" | cut -d"," -f1)" -p "{env:SERIEUX_PASSWORD}"',
```
serieux patch recognizes $SERIEUX_PASSWORD; if it's already in the environment, there shouldn't be a need to use -p.
Ah yes, redundant
src/paperoni/restapi.py
Outdated
```python
# Page of results to display
page: int = None
# Number of results to display per page
per_page: int = None
# Max number of papers to show
limit: int = 100
```
I think limit is best left to the user to deal with themselves. Personally I would use offset and size, with the semantics of returning papers offset to offset+size. I think it's a bit more flexible and allows the user to implement a limit themselves very easily if they want to.
Also, there should be a configurable limit on size or per_page, so that the user can't set it to e.g. 10000.
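The suggested scheme could be sketched like this (illustrative only; `MAX_SIZE` and the function name are not from the PR, and the real endpoint would take these as query parameters):

```python
MAX_SIZE = 100  # hypothetical configurable cap on page size


def paginate(papers, offset=0, size=20):
    """Return papers[offset:offset + size], never more than MAX_SIZE items."""
    size = min(size, MAX_SIZE)
    return papers[offset:offset + size]
```

With these semantics, a client that wants an overall `limit` can simply stop requesting pages once it has collected enough results.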
Fixed
src/paperoni/restapi.py
Outdated
```python
if per_page is None:
    per_page = self.per_page
if limit is None:
    limit = self.limit or len(iterable)
```
len is not part of the Iterable protocol (you can't get the len of a generator, for example). Implementations of Sequence have lengths, for what it's worth.
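A minimal sketch of taking a length only when one exists (the helper name is illustrative):

```python
def maybe_len(iterable):
    """Return len(iterable) if it supports it, else None.

    Sequences (lists, tuples, ...) implement __len__; generators and
    many other iterables do not, and len() on them raises TypeError.
    """
    try:
        return len(iterable)
    except TypeError:
        return None
```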
Yeah, that was weird. I should have wrapped it in a try/except, but I was a bit lazy.
src/paperoni/restapi.py
Outdated
```python
def user_id(self) -> str:
    """Get user id."""
    email_hash = hashlib.sha256(self.email.encode()).hexdigest()
    return f"{self.email.split('@')[0]}_{email_hash[:8]}"
```
Why is the user_id not just the email?
Yeah, user_id might not be the best name. It's only used to get a filesystem-friendly name.
Removed, as it is not used anymore: users will not have their own work file but will share a global work file. If we decide to decentralize the work later, we can think of a proper solution.
src/paperoni/restapi.py
Outdated
```python
        return SearchResponse(papers=results, total=len(coll.collection))

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Search failed: {str(e)}")
```
FastAPI will produce error 500 even if you don't raise this. Error 500 is basically "uh oh something bad happened and we don't know why" and I don't think it's useful to give the user any more information. They won't be able to do anything about it and if we're unlucky the error message might contain something they shouldn't be allowed to see. It'll be in the logs for us to look at, and that's what matters.
If the user's request is bad, this should result in a different error code, typically 400 or in the 400-499 range, and the message should indicate clearly what the user did wrong.
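One way to follow this advice (a sketch, not the PR's code): detect client mistakes explicitly and map them to a 400 with an actionable message, and let anything unexpected propagate so the framework logs it and returns a generic 500. The parameter names here are illustrative.

```python
def client_error(offset, size, max_size=100):
    """Return (status_code, detail) for bad client input, or None if OK.

    Unexpected server-side failures are deliberately NOT handled here;
    letting them propagate yields a generic 500 without leaking details.
    """
    if offset < 0:
        return 400, "offset must be non-negative"
    if not 1 <= size <= max_size:
        return 400, f"size must be between 1 and {max_size}"
    return None
```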
Ah, good point. FastAPI will also check the user's input for validity based on the dataclasses, so for now can we let FastAPI handle the checks for us?
We can probably add proper logging to the app with a Sentry integration in a future PR?
src/paperoni/restapi.py
Outdated
```python
async def download_fulltext(request: DownloadFulltextRequest):
    """Download fulltext for a paper."""
    try:
        pdf = request.run()
```
This is blocking, which is not ideal.
Also, I'm not sure we want users to have access to this command. At the very least cache_policy=no_download should be forced unless the user is an admin.
We do want to allow downloading PDFs from our collection, but I would rather serve it from /download/HASH.pdf and include these links in the paper data we return from /search.
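Serving from `/download/HASH.pdf` needs a guard so the path component can't be abused for traversal; a minimal sketch (the cache directory and hash format are assumptions, not from the PR):

```python
import re
from pathlib import Path

PDF_DIR = Path("cache/fulltext")  # assumed location of cached PDFs
HASH_RE = re.compile(r"^[0-9a-f]{40,64}$")  # hex digests, e.g. sha1..sha256


def pdf_path(hash_: str):
    """Map a hex hash to a PDF path inside PDF_DIR, or None if invalid."""
    if not HASH_RE.fullmatch(hash_):
        return None  # rejects "../../etc/passwd" and other junk
    return PDF_DIR / f"{hash_}.pdf"
```

The endpoint would then return a 404 for `None` and otherwise serve the file.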
Yes, I might be wrong, but wouldn't we need to make all the commands from __main__ and below async to properly control the blocking within the webapp? I think I've seen a lib that handles the different async loops (default Python and FastAPI at least), so maybe we could have async functions around the code. For now I've forced the cache_policy=no_download policy.
In that case, all methods from Fetcher would have to be async (httpx supports that), which would contaminate basically everything. I think it'd be simpler to use a thread pool and send long requests to those thread pools. Alternatively, we could (should?) start them in a separate process, to ensure they can't take down the server.
Only a limited number of long calls should be available from the API anyway, and they should be private and admin-restricted. They are just a nice-to-have, so the rest of the framework shouldn't be architected around them. Paper downloading, insofar that we want to make it available to users, can be special-cased.
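The thread-pool idea can be sketched with `asyncio.to_thread` (a separate process would use `concurrent.futures.ProcessPoolExecutor` instead); the function names below are stand-ins, not Paperoni code:

```python
import asyncio
import time


def long_blocking_job(n):
    """Stand-in for a blocking call such as downloading a PDF."""
    time.sleep(0.01)
    return n * 2


async def handle_request(n):
    # Off-load the blocking work so the event loop stays responsive.
    return await asyncio.to_thread(long_blocking_job, n)
```

Note that FastAPI already does something similar for plain `def` (non-async) endpoints, which it runs in a thread pool.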
src/paperoni/restapi.py
Outdated
```python
@app.get(
    "/work/include",
    response_model=IncludeResponse,
)
async def work_include_papers(
    request: IncludeRequest = None, user: User = Depends(get_current_admin)
):
    """Search for papers in the collection."""
    request = request or IncludeRequest()

    work_file = config.server.client_dir / user.user_id() / "work.json"

    work = Work(command=None, work_file=work_file)

    try:
        added = request.run(work)

        return IncludeResponse(total=added)

    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Include failed: {str(e)}")
```
Not sure how this is supposed to work, given that there appears to be no way to populate the work file in the first place.
My idea of the /include endpoint was rather that the request would contain a serialized Paper object (or a list) to add directly to the collection. If that paper includes an id then it would overwrite the existing paper.
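The upsert semantics described here could look something like this (a sketch over a plain dict; the real Paper model and collection API differ):

```python
import uuid


def include(collection, papers):
    """Upsert papers into a dict keyed by id.

    A paper that carries an existing id overwrites the stored entry;
    papers without an id get a fresh one. Returns the number of papers
    that were newly added (not overwritten).
    """
    added = 0
    for paper in papers:
        pid = paper.get("id") or uuid.uuid4().hex  # assumed id scheme
        if pid not in collection:
            added += 1
        collection[pid] = paper
    return added
```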
It would probably be a good idea for each user to have their own work file if they want to work on it? And maybe have a separate one for the system's work file? Then the database will handle the rejection of including the outdated paper and ask the user to merge the data?
I mean, if a user wants their own work file, can't they just run Paperoni locally? The use case for include is when we want to push a non-discoverable paper, a manual update, things like that.
I'm not sure I follow. Running paperoni locally with access to the paperoni database? Would that make everyone an admin?
But yes, I thought we could actually use the user's work file to store work-in-progress data (a non-discoverable paper, a manual update, ...)? There's indeed a missing interface to add / update a paper in the work file.
Added /work/add endpoint to allow users to add / modify a paper in the collection
src/paperoni/restapi.py
Outdated
```python
focus_file = config.server.client_dir / "focuses.yaml"

if not focus_file.exists():
    focus_file.parent.mkdir(exist_ok=True, parents=True)
    dump(Focuses, config.focuses, dest=focus_file)
```
This should ideally point to the same file that the CLI will use. There is a danger of desynchronization here.
Yeah, I don't know how to retrieve that from the config file right now, as the config only contains the values (merged with the autofocuses) and not the actual source file :(. Do you think serieux could hold a special _serieux_source or something like that where we could get the source file?
So the idea right now would be to have that hardcoded file path here used in the config. Maybe not allow the user to set one, and instead create a file relative to the config. Same goes for the autofocuses file, which should not be set manually. Nothing is enforced right now, as I wanted to discuss what the best solution could be.
What do we need the file for? We're not editing it, just the autofocuses.
Yes, that would be particularly for the autofocuses file, but since focuses and autofocuses are merged into config.focuses (without a filename, only the merged content), I don't have a way to retrieve the filename, or the content, of only the autofocuses. If we hardcode the focuses and autofocuses files to a name relative to the config, then I suppose we could use GIFNOC_FILE or the config argument to get the main configuration file location, and thus the autofocuses filename if it is relative to the main config file.
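If the autofocuses file is fixed to a name relative to the main config, resolving it could be as simple as the following sketch (the filename and convention are assumptions, nothing here was decided in this thread):

```python
from pathlib import Path


def autofocuses_path(config_file):
    """Place the autofocuses file next to the main configuration file."""
    return Path(config_file).parent / "autofocuses.yaml"
```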
src/paperoni/restapi.py
Outdated
```python
except Exception:
    raise HTTPException(status_code=401, detail="Google authentication failed.")
```
We might want to catch the exact exception type that failed auth raises, or at least log the exception to have better visibility.
Modified the code a bit. Do you think rapporteur could be a solution here? How would you configure it for now?
Your modification is good, I don't think there's a need for anything more than that.
For logging exceptions, the standard logging.exception should suffice. We don't necessarily want a Slack notification when it happens. If we do want Slack notifications we need to work on a way to throttle the messages we report in order to make sure we don't flood the channel with errors every time an endpoint is hit. That could be done in rapporteur, I'm just not sure we want to.
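If Slack notifications are ever wanted, the throttling could be a simple time-based gate (a stdlib-only sketch; class and method names are illustrative):

```python
import time


class Throttle:
    """Allow at most one report per `interval` seconds per key."""

    def __init__(self, interval=60.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock  # injectable for testing
        self.last = {}

    def allow(self, key):
        """True if a report keyed by `key` may be sent now."""
        now = self.clock()
        if now - self.last.get(key, -self.interval) >= self.interval:
            self.last[key] = now
            return True
        return False
```

A reporter would call `allow("search-endpoint-error")` before posting to Slack, and always call `logging.exception` regardless, so nothing is lost from the logs.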
As serieux_deserialize was returning a MemCollection, nothing was dumped at save point
* lock mechanism for user/admin operations
* fix search_papers total
* add /work/add
* organise focus files locations to be able to find the autofocuses file