Just another repository to help automate the analysis of PDFs.
This project uses uv for dependency management, so no Conda environment file is required. Once uv is installed you can create a virtual environment and install the dependencies with:

    uv venv
    uv sync --all-groups

The app persists session data using a local ChromaDB store. For local development:
- Copy .env.example to .env and fill in values. The application loads this file automatically at startup, so you can usually edit it once and go (set ARXIV_2_TWEET_ENV_FILE to point at a different file if desired).
- If you prefer explicit exports, load the file into your shell before running commands, for example:

    set -a; source .env; set +a

Key settings:
- CHROMADB_PATH (optional): directory used for the persistent ChromaDB index. Defaults to .chromadb inside the project root.
- CHROMADB_COLLECTION_PREFIX (optional): prefix for created collections. Defaults to arxiv2tweet.
- ARXIV_2_TWEET_STORAGE_ROOT (optional): directory for binary assets (originals, derived markdown/images). Defaults to storage/ in the project root.
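
As a rough illustration of how these settings and their defaults could be resolved, the sketch below reads the three variables from a mapping of environment values. The `StorageSettings` dataclass and the `resolve_storage_settings` helper are hypothetical names, not part of this project's code. Only the variable names and default values come from the list above. Treating an empty value as unset is also an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class StorageSettings:
    """Hypothetical container for the three settings described above."""

    chromadb_path: Path
    collection_prefix: str
    storage_root: Path


def resolve_storage_settings(
    project_root: Path, env: Mapping[str, str] = os.environ
) -> StorageSettings:
    """Fall back to the documented defaults when a variable is unset or empty."""
    return StorageSettings(
        chromadb_path=Path(env.get("CHROMADB_PATH") or project_root / ".chromadb"),
        collection_prefix=env.get("CHROMADB_COLLECTION_PREFIX") or "arxiv2tweet",
        storage_root=Path(
            env.get("ARXIV_2_TWEET_STORAGE_ROOT") or project_root / "storage"
        ),
    )
```

Passing the environment as a mapping keeps the helper easy to exercise with a plain dictionary instead of the real process environment.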
Generation providers (optional):
- OpenAI: set OPENAI_API_KEY (and optionally OPENAI_BASE_URL).
- Groq: set GROQ_API_KEY (uses https://api.groq.com/openai/v1).
- Ollama: no key needed; ensure Ollama is running locally (default http://localhost:11434, override with OLLAMA_BASE_URL).
- Optional login gate: set APP_AUTH_CREDENTIALS to a JSON object mapping usernames to passwords (for example {"analyst": "s3cret"}). Leave unset to keep the app open.
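
To make the expected shape of APP_AUTH_CREDENTIALS concrete, here is a minimal sketch of parsing the JSON object and checking a login against it. The function names `parse_credentials` and `is_allowed` are hypothetical, and the app's actual gate may work differently. The sketch only mirrors the behaviour described above: a username-to-password mapping, with an unset value leaving the app open.

```python
import hmac
import json
from typing import Dict, Optional


def parse_credentials(raw: Optional[str]) -> Dict[str, str]:
    """Parse an APP_AUTH_CREDENTIALS-style value; unset or blank means no gate."""
    if not raw or not raw.strip():
        return {}
    data = json.loads(raw)
    if not isinstance(data, dict) or not all(
        isinstance(user, str) and isinstance(pw, str) for user, pw in data.items()
    ):
        raise ValueError("expected a JSON object mapping usernames to passwords")
    return data


def is_allowed(credentials: Dict[str, str], username: str, password: str) -> bool:
    """Return True when the gate is open or the username/password pair matches."""
    if not credentials:
        return True  # variable unset: the app stays open
    expected = credentials.get(username)
    if expected is None:
        return False
    # compare_digest performs a constant-time comparison of the two byte strings.
    return hmac.compare_digest(expected.encode("utf-8"), password.encode("utf-8"))
```

For example, `is_allowed(parse_credentials('{"analyst": "s3cret"}'), "analyst", "s3cret")` evaluates to `True`, while a wrong password or an unknown username yields `False`.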
The Streamlit UI now exposes a Local WebGPU ingestion option. When it is active, the browser runs Granite Docling, and only the structured results (markdown, JSON, images) are written to the local Chroma-backed storage.
To enable it, load a Granite Docling WebGPU bundle into the page:
- Host the bundle (for example from the docling-ts examples or the Hugging Face WebGPU demo) and inject it so that it registers window.graniteDoclingWebGPU.convertFile(file, options).
- The converter should resolve to a mapping with markdown, text, doc_json, images, and optional metadata. Each image entry should contain a base64 data payload.
- When WebGPU is unavailable, the component exposes a "Load sample" control so you can validate the rest of the ingestion pipeline without the model.
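
The payload shape described above can be sanity-checked on the Python side before anything is stored. The following is a minimal sketch, not the project's ingestion code: `decode_payload_images` is a hypothetical helper, and it assumes each image's `data` field holds raw base64 with no `data:` URL prefix.

```python
import base64
import binascii
from typing import Any, List, Mapping

# Keys the converter result is expected to carry; metadata is optional.
REQUIRED_KEYS = ("markdown", "text", "doc_json", "images")


def decode_payload_images(payload: Mapping[str, Any]) -> List[bytes]:
    """Check the documented keys and return the decoded bytes of each image."""
    missing = [key for key in REQUIRED_KEYS if key not in payload]
    if missing:
        raise ValueError(f"payload is missing keys: {', '.join(missing)}")
    decoded: List[bytes] = []
    for index, image in enumerate(payload["images"]):
        try:
            # validate=True rejects characters outside the base64 alphabet.
            decoded.append(base64.b64decode(image["data"], validate=True))
        except (KeyError, TypeError, binascii.Error) as exc:
            raise ValueError(
                f"image {index} has no valid base64 'data' field"
            ) from exc
    return decoded
```

Failing early with a `ValueError` keeps malformed converter output from reaching the Chroma-backed storage.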
Use the "Reset WebGPU session" control in the UI to clear cached payloads and start a fresh local run.
- The retrieval pipeline uses Ollama embeddings by default (nomic-embed-text).
- Make sure the Ollama daemon is running locally and the model is pulled:

    ollama pull nomic-embed-text

Set the OLLAMA_BASE_URL environment variable to point the pipeline at a different Ollama endpoint. To change the embedding model itself, create a custom OllamaEmbeddingProvider.
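
To check that the daemon and model respond, you can request an embedding from Ollama's HTTP API directly. The sketch below is an illustration only and is not how OllamaEmbeddingProvider is necessarily implemented. The helper names `build_embedding_request` and `embed` are hypothetical, and the code assumes Ollama's `/api/embeddings` endpoint, which takes a `model` and a `prompt` and returns an `embedding` list.

```python
import json
import os
import urllib.request
from typing import List

DEFAULT_BASE_URL = "http://localhost:11434"


def build_embedding_request(
    text: str, model: str = "nomic-embed-text"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a single embedding."""
    base_url = os.environ.get("OLLAMA_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = "nomic-embed-text") -> List[float]:
    """Send the request to a running Ollama daemon and return the vector."""
    request = build_embedding_request(text, model)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["embedding"]
```

Calling `embed("hello world")` needs the daemon to be running with the model pulled; otherwise `urlopen` raises a `URLError`.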
- Export the environment variables described earlier.
- Launch Streamlit directly via uv:

    uv run streamlit run src/arxiv_2_tweet/ui/app.py

Streamlit listens on port 8501 by default; override with --server.port if you need a different port.
The included Dockerfile produces an image suitable for local runs:

    docker build -t arxiv-2-tweet .
    docker run --env-file .env -p 8501:8501 arxiv-2-tweet

Bind-mount the source directory if you want live code edits:

    docker run --env-file .env -p 8501:8501 \
      -v "$(pwd)/src:/app/src" \
      arxiv-2-tweet

When running in Docker, ensure provider keys and any optional Chroma configuration values are present in the .env file (or pass them individually with -e KEY=value).