Searching millions of .gov PDFs
The govscape server uses Poetry as its build system. If Poetry is properly installed, the following command should install the package and its dependencies:
poetry install

To run the initial version, you first build the embeddings, indices, etc. with:
poetry run python scripts/run_embedding_pipeline.py -p "data/test_data/TechnicalReport234PDFs" -d "data/test_data"

Then, you run the RESTful API server with Gunicorn (for production, default worker_class=sync):
GUNICORN_WORKERS=2 \
poetry run gunicorn -c gunicorn.conf.py 'scripts.python_helpers.start_api_server:create_app()'

Or use the wrapper to pass your usual CLI app arguments along with Gunicorn:
poetry run -- python -m scripts.python_helpers.run_gunicorn \
-p data/test_data/TechnicalReport234PDFs \
-d data/test_data \
-tm ST -vm CLIP -k 20 -i Memory -- \
'gunicorn -c gunicorn.conf.py scripts.python_helpers.start_api_server:create_app()'

Tuning knobs (Gunicorn env vars supported by gunicorn.conf.py):
- GUNICORN_WORKERS
- GUNICORN_THREADS (only applies to gthread workers)
- GUNICORN_WORKER_CLASS
- GUNICORN_TIMEOUT
- GUNICORN_MAX_REQUESTS
- GUNICORN_MAX_REQUESTS_JITTER
- GUNICORN_PRELOAD_APP
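
For orientation, here is a minimal sketch of how a Gunicorn config file could map these environment variables onto Gunicorn settings. It is not the project's actual gunicorn.conf.py: the helper names `load_settings` and `_as_bool` are hypothetical, and the fallback values are Gunicorn's own documented defaults, which the real file may override.

```python
import os


def _as_bool(value: str) -> bool:
    """Treat common truthy strings ("1", "true", "yes", "on") as True."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=os.environ) -> dict:
    """Map GUNICORN_* variables onto Gunicorn setting names.

    Assumption: unset variables fall back to Gunicorn's stock defaults.
    """
    return {
        "workers": int(env.get("GUNICORN_WORKERS", "1")),
        # threads only has an effect with the gthread worker class
        "threads": int(env.get("GUNICORN_THREADS", "1")),
        "worker_class": env.get("GUNICORN_WORKER_CLASS", "sync"),
        "timeout": int(env.get("GUNICORN_TIMEOUT", "30")),
        "max_requests": int(env.get("GUNICORN_MAX_REQUESTS", "0")),
        "max_requests_jitter": int(env.get("GUNICORN_MAX_REQUESTS_JITTER", "0")),
        "preload_app": _as_bool(env.get("GUNICORN_PRELOAD_APP", "false")),
    }


# Gunicorn picks up settings from module-level names in the config file,
# so expose each value under its setting name.
globals().update(load_settings())
```

With a file like this, `GUNICORN_WORKERS=4 GUNICORN_WORKER_CLASS=gthread GUNICORN_THREADS=8` would start four gthread workers with eight threads each, without editing the config.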
For development, you can still use the simple runner:
poetry run python scripts/start_api_server.py -p "data/test_data/TechnicalReport234PDFs" -d "data/test_data"

The project includes a RESTful API server built with Flask and documented with Swagger/OpenAPI. To access the API playground:
- Start the server using the instructions above
- Visit http://localhost:8080/docs
- Use the Swagger UI to try out the endpoints
- Check the response codes and data formats
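
The same information shown in the Swagger UI can also be read programmatically. The sketch below lists the operations found in a Swagger/OpenAPI document. It assumes the spec is served at /swagger.json (the Flask-RESTX default, which this project may have changed), and `list_operations` and `fetch_spec` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_operations(spec: dict) -> list:
    """Return sorted (METHOD, path) pairs found in a Swagger/OpenAPI spec."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for method in item:
            # Path items may also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                operations.append((method.upper(), path))
    return sorted(operations)


def fetch_spec(base_url: str = "http://localhost:8080") -> dict:
    """Download the spec from a running server (assumed path: /swagger.json)."""
    with urlopen(f"{base_url}/swagger.json", timeout=5) as response:
        return json.load(response)
```

With the server running, `list_operations(fetch_spec())` should return one entry per documented endpoint and method.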
When adding new endpoints to the API:
- Define the request/response models using Flask-RESTX fields
- Create a new Resource class in the appropriate namespace
- Use the @ns.doc() and @ns.response() decorators for documentation
- Add example requests/responses in the Swagger UI