pandego/arxiv-2-tweet

arxiv-2-tweet

Just another repository to help automate the analysis of PDFs.

Development setup

This project uses uv for dependency management, so no Conda environment file is required. Once uv is installed, create a virtual environment and install the dependencies with:

uv venv
uv sync --all-groups

Environment variables

The app persists session data using a local ChromaDB store. For local development:

  • Copy .env.example to .env and fill in the values. The application loads this file automatically at startup, so editing it once is usually enough. To load a different file, set ARXIV_2_TWEET_ENV_FILE to its path.
  • If you prefer explicit exports, load the file into your shell before running commands, for example:
set -a; source .env; set +a
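
As a rough illustration of the automatic loading described above, here is a minimal standard-library sketch. It is not the project's actual loader: the function names are hypothetical, and the rule that already-exported variables take precedence over the file is an assumption.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and simple quoting from the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_env_file(environ=os.environ, default: str = ".env") -> dict:
    """Load the env file, honouring ARXIV_2_TWEET_ENV_FILE when it is set."""
    path = Path(environ.get("ARXIV_2_TWEET_ENV_FILE", default))
    if not path.is_file():
        return {}
    loaded = parse_env_text(path.read_text(encoding="utf-8"))
    for key, value in loaded.items():
        # Assumption: values already present in the environment win.
        environ.setdefault(key, value)
    return loaded
```

Calling load_env_file() once at startup would make the values available through os.environ for the rest of the process.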

Key settings:

  • CHROMADB_PATH (optional): directory used for the persistent ChromaDB index. Defaults to .chromadb inside the project root.
  • CHROMADB_COLLECTION_PREFIX (optional): prefix for created collections. Defaults to arxiv2tweet.
  • ARXIV_2_TWEET_STORAGE_ROOT (optional): directory for binary assets (originals, derived markdown/images). Defaults to storage/ in the project root.
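
The defaults listed above could be resolved along these lines. This is a sketch only: StorageSettings and resolve_storage_settings are hypothetical names, and the project may resolve its paths differently.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class StorageSettings:
    chromadb_path: Path
    collection_prefix: str
    storage_root: Path


def resolve_storage_settings(project_root: Path, environ=os.environ) -> StorageSettings:
    """Apply the documented defaults when a variable is unset or empty."""

    def path_from(name: str, default: str) -> Path:
        raw = environ.get(name)
        return Path(raw).expanduser() if raw else project_root / default

    return StorageSettings(
        chromadb_path=path_from("CHROMADB_PATH", ".chromadb"),
        collection_prefix=environ.get("CHROMADB_COLLECTION_PREFIX") or "arxiv2tweet",
        storage_root=path_from("ARXIV_2_TWEET_STORAGE_ROOT", "storage"),
    )
```

The resolved chromadb_path is the kind of value one would hand to chromadb.PersistentClient(path=...); whether the app does exactly that is not stated here.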

Generation providers (optional):

  • OpenAI: set OPENAI_API_KEY (and optionally OPENAI_BASE_URL).
  • Groq: set GROQ_API_KEY (uses https://api.groq.com/openai/v1).
  • Ollama: no key needed; ensure Ollama is running locally (default http://localhost:11434, override with OLLAMA_BASE_URL).

Login gate (optional):

  • Set APP_AUTH_CREDENTIALS to a JSON object mapping usernames to passwords (for example {"analyst": "s3cret"}). Leave unset to keep the app open.
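
To make the provider and login settings concrete, the sketch below reads them from the environment. The function names are hypothetical and the app's real logic may differ; only the variable names, URLs, and the JSON shape come from the text above. The constant-time password comparison is a design choice of this sketch, not something the project documents.

```python
import hmac
import json
import os
from typing import Optional

GROQ_BASE_URL = "https://api.groq.com/openai/v1"
OLLAMA_DEFAULT_URL = "http://localhost:11434"


def configured_providers(environ=os.environ) -> dict:
    """Return the providers usable with the current environment."""
    providers = {}
    if environ.get("OPENAI_API_KEY"):
        providers["openai"] = {
            "api_key": environ["OPENAI_API_KEY"],
            "base_url": environ.get("OPENAI_BASE_URL"),  # None means client default
        }
    if environ.get("GROQ_API_KEY"):
        providers["groq"] = {"api_key": environ["GROQ_API_KEY"], "base_url": GROQ_BASE_URL}
    # Ollama needs no key; only the base URL can be overridden.
    providers["ollama"] = {
        "api_key": None,
        "base_url": environ.get("OLLAMA_BASE_URL", OLLAMA_DEFAULT_URL),
    }
    return providers


def load_credentials(environ=os.environ) -> Optional[dict]:
    """Parse APP_AUTH_CREDENTIALS; None means the app stays open."""
    raw = environ.get("APP_AUTH_CREDENTIALS")
    if not raw:
        return None
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("APP_AUTH_CREDENTIALS must be a JSON object")
    return {str(user): str(password) for user, password in parsed.items()}


def check_login(credentials: Optional[dict], username: str, password: str) -> bool:
    """Accept everyone when no credentials are configured."""
    if credentials is None:
        return True
    expected = credentials.get(username)
    if expected is None:
        return False
    return hmac.compare_digest(expected.encode(), password.encode())
```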

Docling WebGPU mode

The Streamlit UI exposes a Local WebGPU ingestion option. When it is active, the browser runs Granite Docling, and only the structured results (markdown, JSON, images) are written to the local Chroma-backed storage.

To enable it you must load a Granite Docling WebGPU bundle into the page:

  1. Host the bundle (for example from the docling-ts examples or the Hugging Face WebGPU demo) and inject it so that it registers window.graniteDoclingWebGPU.convertFile(file, options).
  2. The converter should resolve to a mapping with markdown, text, doc_json, images, and optional metadata. Each image entry should contain a base64 data payload.
  3. When WebGPU is unavailable, the component exposes a "Load sample" control so you can validate the rest of the ingestion pipeline without the model.
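
On the Python side, the payload shape from step 2 could be checked roughly as follows before anything is stored. This is a sketch under assumptions: validate_payload is a hypothetical helper, and it assumes each image entry carries its base64 string under a "data" key with no data-URL prefix.

```python
import base64
import binascii

REQUIRED_KEYS = ("markdown", "text", "doc_json", "images")


def validate_payload(payload: dict) -> dict:
    """Check required keys and decode every image's base64 payload."""
    missing = [key for key in REQUIRED_KEYS if key not in payload]
    if missing:
        raise ValueError(f"payload is missing keys: {missing}")

    decoded_images = []
    for index, image in enumerate(payload["images"]):
        try:
            # Assumption: the base64 string lives under the "data" key.
            decoded_images.append(base64.b64decode(image["data"], validate=True))
        except (KeyError, TypeError, binascii.Error) as exc:
            raise ValueError(f"image {index} has no valid base64 'data'") from exc

    return {
        "markdown": payload["markdown"],
        "text": payload["text"],
        "doc_json": payload["doc_json"],
        "images": decoded_images,
        "metadata": payload.get("metadata") or {},  # metadata is optional
    }
```

Rejecting malformed payloads early keeps partial conversions out of the store, which also makes the "Load sample" path a useful smoke test.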

Use the "Reset WebGPU session" control in the UI to clear cached payloads and start a fresh local run.

Embeddings

  • The retrieval pipeline uses Ollama embeddings by default (nomic-embed-text).
  • Make sure the Ollama daemon is running locally and the model is pulled:
ollama pull nomic-embed-text

You can point the pipeline at a different Ollama server with the OLLAMA_BASE_URL environment variable. To use a different embedding model, create a custom OllamaEmbeddingProvider.
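
To check that the daemon and model respond before launching the app, a request like the one below can be sent to Ollama's HTTP API. This is a standalone sketch using only the standard library, not the project's OllamaEmbeddingProvider; the function names are hypothetical, and it assumes Ollama's /api/embeddings endpoint, which takes a model and a prompt and returns an embedding field.

```python
import json
import os
import urllib.request

DEFAULT_BASE_URL = "http://localhost:11434"


def build_embedding_request(text, model="nomic-embed-text", environ=os.environ):
    """Build a POST request for Ollama's embeddings endpoint."""
    base_url = environ.get("OLLAMA_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, model="nomic-embed-text", timeout=30.0):
    """Send the request and return the embedding vector as a list of floats."""
    request = build_embedding_request(text, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["embedding"]
```

With the daemon running, len(embed("hello")) gives the vector dimension; a connection error usually means Ollama is not running or OLLAMA_BASE_URL points at the wrong host.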

Local execution

Run with uv (recommended)

  1. Set the environment variables described earlier, either in .env or as explicit exports.

  2. Launch Streamlit directly via uv:

    uv run streamlit run src/arxiv_2_tweet/ui/app.py

    Streamlit listens on port 8501 by default; override with --server.port if you need a different port.

Run with Docker

The included Dockerfile produces an image suitable for local runs:

docker build -t arxiv-2-tweet .
docker run --env-file .env -p 8501:8501 arxiv-2-tweet

Bind-mount the source directory if you want live code edits:

docker run --env-file .env -p 8501:8501 \
  -v "$(pwd)/src:/app/src" \
  arxiv-2-tweet

When running in Docker, ensure provider keys and any optional Chroma configuration values are present in the .env file (or pass them individually with -e KEY=value).
