
feat: add HuggingFace as a local-first model provider (GGUF/llama-cpp) #12956

Draft
Empreiteiro wants to merge 12 commits into langflow-ai:main from Empreiteiro:upstream-pr/huggingface-provider

Conversation

@Empreiteiro
Collaborator

Draft for discussion. Opens up a parallel local-inference path that doesn't require Ollama or an external API key. Happy to split into smaller PRs (catalog/registry + adapter + frontend icon) if that's preferable for review.

Summary

  • Registers HuggingFace alongside the other configurable providers, running models locally via llama-cpp-python + GGUF (no torch, no transformers, fork-safe on macOS arm64).
  • Onboarding ships a single bundled model — bartowski/SmolLM2-360M-Instruct-GGUF (Q4_K_M, ~270MB). Settings → Model Providers shows one toggle for it, on by default.
  • Toggle-on triggers background download — flipping a HuggingFace model on in POST /enabled_models schedules a single-file download so the cache is warm by the next invocation.
  • Add more models via API — POST /api/v1/models/huggingface/download {"model_id": "<gguf-repo-id>"} accepts any HF repo id that publishes GGUF weights.
  • Optional startup prefetch — LANGFLOW_PREFETCH_HF_DEFAULT=true warms the cache on lifespan start. Off by default.
  • Subprocess-isolated downloads — if hf_hub_download ever crashes the worker, a subprocess retry kicks in so the parent uvicorn worker survives.

Why GGUF / llama-cpp instead of transformers

The first iteration used langchain_huggingface.HuggingFacePipeline (transformers + torch). On macOS arm64 + Python 3.12, two SIGSEGV vectors broke this end-to-end:

  1. Importing torch inside a forked uvicorn worker SIGSEGV'd at first device init.
  2. huggingface_hub.snapshot_download's parallel fetcher SIGSEGV'd at 0% download progress.

Mitigations (device=-1, low_cpu_mem_usage=True, max_workers=1, OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES) only narrowed the failure window. Switching to llama-cpp-python removes the torch dependency entirely — it's a C/C++ backend with a thin Python binding, fork-safe, and quantized models are smaller and faster on CPU.

What changed

Provider registry & catalog

  • src/lfx/src/lfx/base/models/huggingface_constants.py (new) — single bundled GGUF entry (bartowski/SmolLM2-360M-Instruct-GGUF, default=True); see the sketch after this list.
  • src/lfx/src/lfx/base/models/model_metadata.py — registers HuggingFace in MODEL_PROVIDER_METADATA. HUGGINGFACEHUB_API_TOKEN is optional (only needed for gated repos).
  • src/lfx/src/lfx/base/models/unified_models/provider_queries.py — wires HUGGINGFACE_MODELS_DETAILED into get_models_detailed().
  • src/lfx/src/lfx/base/models/unified_models/class_registry.py — adds ChatHuggingFace to the lazy import map.
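
A hedged sketch of what that single catalog entry could look like. The dict shape and field names here are illustrative assumptions; the real huggingface_constants.py may use the catalog's own metadata type and carry more fields:

    # Sketch only — field names are assumptions, not the actual catalog schema.
    HUGGINGFACE_MODELS_DETAILED = [
        {
            "name": "bartowski/SmolLM2-360M-Instruct-GGUF",
            "provider": "HuggingFace",
            "tool_calling": False,  # SmolLM2-360M is not reliable at tool calling (see Tool calling below)
            "default": True,        # the single toggle shown in Settings -> Model Providers, on by default
        },
    ]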

Local chat adapter (llama-cpp backend)

  • src/lfx/src/lfx/base/models/huggingface_chat_model.py (new) — the ChatHuggingFace factory translates the unified kwargs into a langchain_community.chat_models.ChatLlamaCpp instance. Picks the right .gguf filename via a per-repo override (GGUF_FILENAME_BY_REPO) with a <name>-Q4_K_M.gguf fallback for bartowski-style repos (see the sketch after this list).
    • In-process cache (_LLAMA_CACHE) keys instances by (model_path, temperature, max_tokens) so repeat calls don't re-mmap the model.
    • download_model() uses hf_hub_download (single-file). Disables xet, hf_transfer, telemetry, and progress bars at module import to avoid worker-fork SIGSEGV. If the in-process call still raises, retries the same call inside an isolated subprocess so a hard crash can't take down the parent worker.
    • list_installed_models() filters cache entries to repos that actually contain a .gguf file.
    • n_threads = cpu_count() - 1, n_gpu_layers = 0 by default (CPU-only).
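
A minimal sketch of the filename heuristic and instance cache, assuming a plain-function layout. GGUF_FILENAME_BY_REPO and _LLAMA_CACHE are the names from the description above; _get_llama is an illustrative helper, not the factory's real internals:

    from langchain_community.chat_models import ChatLlamaCpp

    GGUF_FILENAME_BY_REPO: dict[str, str] = {
        # per-repo overrides for repos that don't follow the bartowski naming scheme
    }

    def _pick_gguf_filename(repo_id: str) -> str:
        """Resolve the .gguf filename, falling back to the bartowski-style convention."""
        if repo_id in GGUF_FILENAME_BY_REPO:
            return GGUF_FILENAME_BY_REPO[repo_id]
        # bartowski/SmolLM2-360M-Instruct-GGUF -> SmolLM2-360M-Instruct-Q4_K_M.gguf
        name = repo_id.split("/")[-1].removesuffix("-GGUF")
        return f"{name}-Q4_K_M.gguf"

    _LLAMA_CACHE: dict[tuple[str, float, int], ChatLlamaCpp] = {}

    def _get_llama(model_path: str, temperature: float, max_tokens: int) -> ChatLlamaCpp:
        """Reuse an existing ChatLlamaCpp so repeat calls don't re-mmap the same GGUF."""
        key = (model_path, temperature, max_tokens)
        if key not in _LLAMA_CACHE:
            _LLAMA_CACHE[key] = ChatLlamaCpp(
                model_path=model_path,
                temperature=temperature,
                max_tokens=max_tokens,
                n_gpu_layers=0,  # CPU-only by default
            )
        return _LLAMA_CACHE[key]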

Local-provider plumbing

  • src/lfx/src/lfx/base/models/unified_models/instantiation.py — HuggingFace joins Ollama in the "no API key required" exception.
  • src/lfx/src/lfx/base/models/unified_models/credentials.py — providers whose variables are all optional now stay enabled by default; also adds an HF token validator that pings huggingface.co/api/whoami-v2.
  • src/lfx/src/lfx/base/models/unified_models/build_config.py — for optional provider variables, only auto-install load_from_db=True when the variable is actually configured. Fixes "<VAR> variable not found." errors at flow build time (see the sketch below).
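
A hedged sketch of that rule; the helper name and the field dict are illustrative, not the real build_config.py API:

    import os

    def _wire_optional_variable(field: dict, variable_name: str, user_globals: set[str]) -> None:
        """Only wire load_from_db for an optional provider variable the user actually configured."""
        if variable_name in user_globals or variable_name in os.environ:
            field["load_from_db"] = True
            field["value"] = variable_name
        # otherwise leave the field empty: the runtime then sees api_key=None, which the
        # local HuggingFace adapter accepts, instead of raising "<VAR> variable not found."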

API

  • src/backend/base/langflow/api/v1/models.py
    • GET /models/huggingface/installed — list GGUF repo ids in the local Hub cache.
    • POST /models/huggingface/download — download a .gguf file for the given repo id (validates length, reuses saved token for gated pulls).
    • POST /enabled_models schedules a background hf_hub_download whenever a HuggingFace model is toggled on. Failures are logged but never block the toggle save (sketched below).
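
A sketch of the toggle-on scheduling, assuming the strong-ref task-set pattern the commit notes mention (RUF006). download_model is the adapter helper described above; the import path is inferred from the file layout and may differ:

    import asyncio
    import logging

    from lfx.base.models.huggingface_chat_model import download_model  # module path assumed from the layout above

    logger = logging.getLogger(__name__)

    # Module-scope strong refs so in-flight tasks aren't garbage-collected (RUF006).
    _hf_download_tasks: set[asyncio.Task] = set()

    def _schedule_hf_download(repo_id: str) -> None:
        """Fire-and-forget GGUF download when a HuggingFace model is toggled on."""

        async def _run() -> None:
            try:
                # download_model wraps hf_hub_download plus the subprocess retry
                await asyncio.to_thread(download_model, repo_id)
            except Exception:  # a failed warm-up is logged but never blocks the toggle save
                logger.warning("Background HuggingFace download failed for %s", repo_id, exc_info=True)

        task = asyncio.create_task(_run())
        _hf_download_tasks.add(task)
        task.add_done_callback(_hf_download_tasks.discard)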

Lifespan startup

  • src/backend/base/langflow/main.py — _prefetch_default_huggingface_model() runs as a background task during lifespan startup only when LANGFLOW_PREFETCH_HF_DEFAULT=true. Tracked alongside sync_flows_from_fs_task and mcp_init_task; cancelled on lifespan shutdown (wiring sketched below).
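
A simplified sketch of that wiring. The real task is _prefetch_default_huggingface_model() in main.py; the lifespan shape, the download_model import, and DEFAULT_HF_REPO are stand-ins for illustration:

    import asyncio
    import os
    from contextlib import asynccontextmanager

    from lfx.base.models.huggingface_chat_model import download_model  # assumed module path

    DEFAULT_HF_REPO = "bartowski/SmolLM2-360M-Instruct-GGUF"

    def _prefetch_enabled() -> bool:
        return os.getenv("LANGFLOW_PREFETCH_HF_DEFAULT", "").lower() in {"true", "1", "yes"}

    @asynccontextmanager
    async def lifespan(app):
        prefetch_task: asyncio.Task | None = None
        if _prefetch_enabled():
            # warm the Hub cache with the bundled GGUF; failures only log, never block startup
            prefetch_task = asyncio.create_task(asyncio.to_thread(download_model, DEFAULT_HF_REPO))
        try:
            yield
        finally:
            if prefetch_task is not None and not prefetch_task.done():
                prefetch_task.cancel()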

Frontend

  • src/frontend/src/controllers/API/queries/models/use-get-model-providers.ts — adds HuggingFace to the hardcoded provider→icon lookup. Without this, the Agent component (which filters tool_calling=True and so doesn't see the HF model in its options list) was falling through to the default Bot lucide icon for HF entries while the Language Model component rendered the correct HF logo.

Packaging

  • src/backend/base/pyproject.toml — adds langflow-base[local] to the [complete] extra so a fresh pip install langflow pulls llama-cpp-python automatically.
  • uv.lock — regenerated. The Makefile uses uv sync --frozen for make backend/make run_cli, which only installs what's pinned in the lockfile; without this regeneration contributors wouldn't get the new dep on a fresh checkout.

How to use

  1. Start Langflow — make backend. Server boots clean, no prefetch by default.
  2. Settings → Model Providers → HuggingFace — toggle is on. Toggling fires a background hf_hub_download of the GGUF file. First invocation from a flow loads the GGUF and answers.
  3. Install another model — point at any GGUF repo + filename pair:
    curl -X POST http://localhost:7860/api/v1/models/huggingface/download \
         -H 'Content-Type: application/json' \
         -d '{"model_id": "bartowski/Qwen2.5-0.5B-Instruct-GGUF"}'
    Filename auto-resolves to Qwen2.5-0.5B-Instruct-Q4_K_M.gguf via the bartowski-style heuristic. For non-conforming repos, add an entry to GGUF_FILENAME_BY_REPO.
  4. Gated models: paste HUGGINGFACEHUB_API_TOKEN in Settings → Global Variables. Required for the few GGUF repos that gate downloads.
  5. (Optional) Warm the cache at startup: LANGFLOW_PREFETCH_HF_DEFAULT=true make backend.

Tool calling

The bundled SmolLM2-360M-Instruct does not have reliable tool calling — it's marked tool_calling=False in the catalog. To use the HuggingFace provider with the Agent component's tools, install a tool-calling-capable GGUF on demand, e.g.:

curl -X POST http://localhost:7860/api/v1/models/huggingface/download \
     -H 'Content-Type: application/json' \
     -d '{"model_id": "bartowski/Hermes-3-Llama-3.2-3B-GGUF"}'

Dependencies

  • llama-cpp-python — already in langflow-base[local], now pulled by [complete] (which langflow consumes).
  • langchain-community — already a core dep; provides ChatLlamaCpp.
  • huggingface_hub — already a core dep; provides hf_hub_download.

No new top-level dependencies. The transformers / torch / langchain-huggingface stack is no longer touched by the HF chat path.

Env var summary

| Variable | Default | Effect |
| --- | --- | --- |
| LANGFLOW_PREFETCH_HF_DEFAULT | unset / false | When true/1/yes, prefetch the bundled GGUF during lifespan startup. |
| HUGGINGFACEHUB_API_TOKEN | unset | Optional. Required only to download gated repos. |
| HF_HUB_DISABLE_XET | 1 (forced) | Forced on by the adapter to disable the parallel xet backend. |
| HF_HUB_ENABLE_HF_TRANSFER | 0 (forced) | Forced off by the adapter so downloads stay on the plain HTTP path. |

Test plan

  • ruff check + ruff format clean across all modified files.
  • Existing model tests pass (test_max_tokens_propagation.py, test_model_input_fixes.py) — 37 passed, 1 skipped.
  • Smoke: catalog has 1 entry (bartowski/SmolLM2-360M-Instruct-GGUF, default=True); _pick_gguf_filename resolves to SmolLM2-360M-Instruct-Q4_K_M.gguf; list_installed_models() reads the real cache.
  • Manual: macOS arm64 — server boots cleanly, no SIGSEGV. Toggle the bundled model → background download completes → flow invocation returns a response from the local GGUF.
  • Manual: Agent component on macOS arm64 — HF model dropdown trigger renders the HuggingFace logo (after frontend rebuild + hard reload).
  • Manual: POST /models/huggingface/download with a different bartowski-style GGUF repo → confirm download → flow can switch to it.
  • Manual: try a tool-calling-capable model (bartowski/Hermes-3-Llama-3.2-3B-GGUF) with the Agent component and confirm Calculator/URL tools fire.
  • Manual: try a gated model (e.g. bartowski/Llama-3.2-3B-Instruct-GGUF) with a saved HF token.
  • Manual: LANGFLOW_PREFETCH_HF_DEFAULT=true on a fresh cache actually warms it during lifespan startup without crashing.

🤖 Generated with Claude Code

Empreiteiro and others added 12 commits May 1, 2026 16:21
Registers HuggingFace alongside the other configurable providers in the
unified model catalog, but runs models locally via langchain-huggingface's
HuggingFacePipeline + transformers — no external API calls required.

- Default bundled model: HuggingFaceTB/SmolLM2-360M-Instruct (~720MB),
  small and CPU-friendly so a fresh install can answer prompts after the
  first lazy download.
- Catalog ships small instruct checkpoints (SmolLM2 135M/1.7B, Qwen2.5
  0.5B/1.5B) plus larger gated options (Llama-3.2 1B/3B, Phi-3.5-mini).
- HUGGINGFACEHUB_API_TOKEN is optional — only needed to pull gated repos.
- Providers with no required variables now stay enabled by default so the
  HF entry surfaces without the user having to configure credentials.
- New endpoints: GET  /api/v1/models/huggingface/installed lists repos
  present in the local Hub cache, and POST /api/v1/models/huggingface/
  download eagerly fetches a model via huggingface_hub.snapshot_download
  (reusing the user's saved token for gated downloads).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The HuggingFace provider failed at build time with "HUGGINGFACEHUB_API_TOKEN
variable not found." even though its token is documented as optional. Root
cause: apply_provider_variable_config_to_build_config unconditionally set
load_from_db=True with the canonical variable key on the provider's
api_key field, so the runtime tried to resolve a value that the user had
never configured and raised.

For *required* provider variables the behavior is unchanged. For optional
ones (top-level required=False) we now only auto-install load_from_db=True
when the variable is actually present in the user's globals or in the
process environment; otherwise we leave the field empty so the runtime
gets a None api_key (which the local HuggingFace adapter handles fine).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…toggle

Onboarding simplification: the HuggingFace catalog now ships exactly one
model (HuggingFaceTB/SmolLM2-360M-Instruct, ~720MB, fast on CPU). The
Settings → Model Providers screen shows a single toggle for it, defaulting
to ON.

When the user flips a HuggingFace model toggle on, POST /enabled_models
now schedules a background snapshot_download into the local Hub cache so
the first flow invocation doesn't pay the cold-start latency. Failures
are logged but never block the toggle from being saved. Strong refs to
in-flight tasks live at module scope to satisfy RUF006.

The unified catalog's "first 5 are default" auto-promotion now defers to
explicit per-model `default=True` declarations when any are present, so
HuggingFace gets exactly the bundled model on by default while the other
providers keep their existing behavior.

Additional HF models can still be installed via
POST /api/v1/models/huggingface/download with an arbitrary repo id.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier "honour explicit default=True per model" change broke the
existing TestUnifiedModelsDefaults invariants:
- IBM WatsonX declares default=True on all 7 of its models, so honouring
  the explicit flags returned 7 defaults where the test expects ≤5.
- Google Generative AI doesn't declare any explicit defaults, so the
  fallback path was the only one exercised — but the override still
  changed the contract.

Revert to the original "first 5 models per provider are default"
behavior. The HuggingFace onboarding goal (single bundled model toggled
on by default) is satisfied automatically because HUGGINGFACE_MODELS_DETAILED
now contains exactly one entry, and i=0 < 5 lands it in the default set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The local pipeline was triggering worker SIGSEGV on first model load on
macOS arm64 + Python 3.12. The crash happened at the very start of the
weight download (0% progress), which points at torch's device init path
running inside a forked uvicorn worker rather than the download itself.

- device=-1 — force CPU and skip MPS/CUDA negotiation, which is the most
  fragile leg of torch on first-import-after-fork.
- low_cpu_mem_usage=True — stream weights through the model during
  from_pretrained instead of double-buffering them, lowering peak RAM.

If the SIGSEGV still happens, the workaround is to start the server with
OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES exported (a known
torch+Objective-C fork-safety interaction on macOS, not specific to this
adapter). Documented inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first flow run that uses the local HuggingFace provider would block
the request thread for tens of seconds while transformers pulled ~720MB
to ~/.cache/huggingface. Worse, on macOS arm64 + Python 3.12 the load
inside a uvicorn worker can SIGSEGV on torch's device init.

Pre-warming the cache during lifespan startup uses
huggingface_hub.snapshot_download exclusively (no torch import), so it
cannot trigger the worker SIGSEGV — and by the time the user sends the
first message, the weights are already on disk and the inference path
only pays the load + generate cost.

- Runs as a background task; tracked alongside sync_flows_from_fs_task
  and mcp_init_task and cancelled on lifespan shutdown.
- Skippable via LANGFLOW_SKIP_HF_DEFAULT_DOWNLOAD=true (1/yes also work).
- Forwards HUGGINGFACEHUB_API_TOKEN if set in env so gated default
  models would still pull.
- Failures are logged at warning and never block startup; the first
  inference call will retry the download on demand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tened

The startup prefetch was triggering a server crash loop on macOS arm64:
huggingface_hub.snapshot_download itself segfaulted at 0% (parallel
download backend interacting badly with forked uvicorn workers), the
worker died, uvicorn auto-reload restarted, and the cycle repeated. The
log also showed a "4.66 GB" total because the unfiltered snapshot pulled
every weight format the repo carries (safetensors + pytorch_model.bin +
ONNX + GGML).

Two changes:

1. Flip prefetch to opt-in: LANGFLOW_PREFETCH_HF_DEFAULT=true (was
   "skip" via LANGFLOW_SKIP_HF_DEFAULT_DOWNLOAD). Default is now OFF so
   a fresh install never crash-loops; users who actually want the warm
   cache enable it explicitly.

2. Harden download_model:
   - allow_patterns restricts the snapshot to safetensors + tokenizer +
     config (no pytorch_model.bin, ONNX, GGML, etc.) - typically cuts
     download size by 4-6x.
   - max_workers=1 serializes file fetches; the multi-thread path is
     what was crashing inside the worker on macOS arm64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (GGUF)

The transformers + torch path was unsalvageable on macOS arm64 + Python
3.12: both the inference load (torch device init in a forked uvicorn
worker) and the snapshot_download parallel fetcher SIGSEGV'd, with no
Python-level recovery possible. Mitigations like device=-1,
low_cpu_mem_usage, max_workers=1, and the OBJC fork-safety env var only
narrowed the failure window without closing it.

This commit replaces the backend wholesale:

- ChatHuggingFace now produces a langchain_community ChatLlamaCpp
  (llama-cpp-python under the hood). No torch import, no fork-safety
  pitfall, fast on CPU thanks to quantization.
- The bundled default flips from
    HuggingFaceTB/SmolLM2-360M-Instruct (~720MB safetensors)
  to
    bartowski/SmolLM2-360M-Instruct-GGUF, file
    SmolLM2-360M-Instruct-Q4_K_M.gguf (~270MB).
  Smaller download, similar quality, runs in <500MB RAM.
- download_model uses hf_hub_download for a single .gguf file (no
  snapshot_download, no parallel fetcher).
- list_installed_models now filters cache entries to repos that actually
  contain a .gguf file we can load.
- A small in-process cache keys ChatLlamaCpp instances by
  (model_path, temperature, max_tokens) so repeat calls reuse the same
  mmaped model instead of reloading.
- Filename selection: catalog overrides via GGUF_FILENAME_BY_REPO; fallback
  is "<model-name>-Q4_K_M.gguf" which works for all bartowski-style repos.

llama-cpp-python is already in langflow-base[llama-cpp] (already part of
the full langflow install).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…allback)

The single-file hf_hub_download path was still crashing uvicorn workers
on macOS arm64. Two complementary fixes:

1. Disable accelerated backends at module import time. xet and
   hf_transfer both spawn worker threads/processes whose fork-safety is
   broken on this platform. Forcing the plain HTTP path is more than
   fast enough for ~270MB GGUFs.

   - HF_HUB_DISABLE_XET=1
   - HF_HUB_ENABLE_HF_TRANSFER=0
   - HF_HUB_DISABLE_TELEMETRY=1 (drops one more import)
   - HF_HUB_DISABLE_PROGRESS_BARS=1 (drops tqdm in worker context)

2. If the in-process call still raises, retry the exact same
   hf_hub_download in an isolated subprocess via subprocess.run. A child
   process that crashes can't take the parent uvicorn worker with it;
   the parent recovers, logs the failure, and propagates the path
   captured from stdout when the subprocess succeeds. 600s timeout to
   bound network stalls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
llama-cpp-python lives in the [local] extra of langflow-base but wasn't
included in [complete], so the langflow main install (which pulls
langflow-base[complete]) shipped without it. The new HuggingFace local
provider needs llama-cpp-python at runtime, so the user saw:

  ImportError: Could not import llama-cpp-python library.

Pulling [local] into [complete] makes the full langflow install include
it without bloating bare langflow-base setups (which can still skip it).

For an existing dev env, install on demand:
  uv pip install llama-cpp-python
or:
  uv sync --reinstall

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Makefile install_backend target uses 'uv sync --frozen', which only
installs what's in uv.lock and ignores fresh additions to pyproject.toml.
Without regenerating the lockfile, contributors running 'make run_cli'
or 'make backend' wouldn't get llama-cpp-python even though [complete]
now references [local].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Language Model component renders the HF logo correctly because it
reads icon directly from the backend's option metadata. The Agent
component filters models by tool_calling=True; HF (which doesn't claim
tool_calling) doesn't land in that filtered list, so the trigger falls
through to providersData[*].icon — which goes through the frontend's
hardcoded getProviderIcon lookup. That map didn't include HuggingFace,
so it returned 'Bot' and rendered the lucide robot icon next to the
HF model in the trigger.

Adding HuggingFace -> "HuggingFace" to the lookup makes the trigger
match the dropdown list and the Language Model component.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented May 1, 2026

Review skipped: draft detected. To trigger a single review, invoke the @coderabbitai review command.