
feat: add HuggingFace as a local-first model provider (GGUF/llama-cpp) #12956

Draft
Empreiteiro wants to merge 12 commits into langflow-ai:main from Empreiteiro:upstream-pr/huggingface-provider

Conversation

@Empreiteiro
Collaborator

Draft for discussion. Opens up a parallel local-inference path that doesn't require Ollama or an external API key. Happy to split into smaller PRs (catalog/registry + adapter + frontend icon) if that's preferable for review.

Summary

  • Registers HuggingFace alongside the other configurable providers, running models locally via llama-cpp-python + GGUF (no torch, no transformers, fork-safe on macOS arm64).
  • Onboarding ships a single bundled model — bartowski/SmolLM2-360M-Instruct-GGUF (Q4_K_M, ~270MB). Settings → Model Providers shows one toggle for it, on by default.
  • Toggle-on triggers background download — flipping a HuggingFace model on in POST /enabled_models schedules a single-file download so the cache is warm by the next invocation.
  • Add more models via API — POST /api/v1/models/huggingface/download {"model_id": "<gguf-repo-id>"} accepts any HF repo id that publishes GGUF weights.
  • Optional startup prefetch — LANGFLOW_PREFETCH_HF_DEFAULT=true warms the cache on lifespan start. Off by default.
  • Subprocess-isolated downloads — if hf_hub_download ever crashes the worker, a subprocess retry kicks in so the parent uvicorn worker survives.

Why GGUF / llama-cpp instead of transformers

The first iteration used langchain_huggingface.HuggingFacePipeline (transformers + torch). On macOS arm64 + Python 3.12, two SIGSEGV vectors broke this end-to-end:

  1. Importing torch inside a forked uvicorn worker SIGSEGV'd at first device init.
  2. huggingface_hub.snapshot_download's parallel fetcher SIGSEGV'd at 0% download progress.

Mitigations (device=-1, low_cpu_mem_usage=True, max_workers=1, OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES) only narrowed the failure window. Switching to llama-cpp-python removes the torch dependency entirely — it's a C/C++ backend with a thin Python binding, fork-safe, and quantized models are smaller and faster on CPU.

What changed

Provider registry & catalog

  • src/lfx/src/lfx/base/models/huggingface_constants.py (new) — single bundled GGUF entry (bartowski/SmolLM2-360M-Instruct-GGUF, default=True); see the sketch after this list.
  • src/lfx/src/lfx/base/models/model_metadata.py — registers HuggingFace in MODEL_PROVIDER_METADATA. HUGGINGFACEHUB_API_TOKEN is optional (only needed for gated repos).
  • src/lfx/src/lfx/base/models/unified_models/provider_queries.py — wires HUGGINGFACE_MODELS_DETAILED into get_models_detailed().
  • src/lfx/src/lfx/base/models/unified_models/class_registry.py — adds ChatHuggingFace to the lazy import map.
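
A hedged sketch of what that single catalog entry could look like. The dict shape and field names here are illustrative assumptions; the real huggingface_constants.py may use the catalog's own metadata type and carry more fields:

    # Sketch only — field names are assumptions, not the actual catalog schema.
    HUGGINGFACE_MODELS_DETAILED = [
        {
            "name": "bartowski/SmolLM2-360M-Instruct-GGUF",
            "provider": "HuggingFace",
            "tool_calling": False,  # SmolLM2-360M is not reliable at tool calling (see Tool calling below)
            "default": True,        # the single toggle shown in Settings -> Model Providers, on by default
        },
    ]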

Local chat adapter (llama-cpp backend)

  • src/lfx/src/lfx/base/models/huggingface_chat_model.py (new) — the ChatHuggingFace factory translates the unified kwargs into a langchain_community.chat_models.ChatLlamaCpp instance. Picks the right .gguf filename via a per-repo override (GGUF_FILENAME_BY_REPO) with a <name>-Q4_K_M.gguf fallback for bartowski-style repos (see the sketch after this list).
    • In-process cache (_LLAMA_CACHE) keys instances by (model_path, temperature, max_tokens) so repeat calls don't re-mmap the model.
    • download_model() uses hf_hub_download (single-file). Disables xet, hf_transfer, telemetry, and progress bars at module import to avoid worker-fork SIGSEGV. If the in-process call still raises, retries the same call inside an isolated subprocess so a hard crash can't take down the parent worker.
    • list_installed_models() filters cache entries to repos that actually contain a .gguf file.
    • n_threads = cpu_count() - 1, n_gpu_layers = 0 by default (CPU-only).
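
A minimal sketch of the filename heuristic and instance cache, assuming a plain-function layout. GGUF_FILENAME_BY_REPO and _LLAMA_CACHE are the names from the description above; _get_llama is an illustrative helper, not the factory's real internals:

    from langchain_community.chat_models import ChatLlamaCpp

    GGUF_FILENAME_BY_REPO: dict[str, str] = {
        # per-repo overrides for repos that don't follow the bartowski naming scheme
    }

    def _pick_gguf_filename(repo_id: str) -> str:
        """Resolve the .gguf filename, falling back to the bartowski-style convention."""
        if repo_id in GGUF_FILENAME_BY_REPO:
            return GGUF_FILENAME_BY_REPO[repo_id]
        # bartowski/SmolLM2-360M-Instruct-GGUF -> SmolLM2-360M-Instruct-Q4_K_M.gguf
        name = repo_id.split("/")[-1].removesuffix("-GGUF")
        return f"{name}-Q4_K_M.gguf"

    _LLAMA_CACHE: dict[tuple[str, float, int], ChatLlamaCpp] = {}

    def _get_llama(model_path: str, temperature: float, max_tokens: int) -> ChatLlamaCpp:
        """Reuse an existing ChatLlamaCpp so repeat calls don't re-mmap the same GGUF."""
        key = (model_path, temperature, max_tokens)
        if key not in _LLAMA_CACHE:
            _LLAMA_CACHE[key] = ChatLlamaCpp(
                model_path=model_path,
                temperature=temperature,
                max_tokens=max_tokens,
                n_gpu_layers=0,  # CPU-only by default
            )
        return _LLAMA_CACHE[key]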

Local-provider plumbing

  • src/lfx/src/lfx/base/models/unified_models/instantiation.py — HuggingFace joins Ollama in the "no API key required" exception.
  • src/lfx/src/lfx/base/models/unified_models/credentials.py — providers whose variables are all optional now stay enabled by default; also adds an HF token validator that pings huggingface.co/api/whoami-v2.
  • src/lfx/src/lfx/base/models/unified_models/build_config.py — for optional provider variables, only auto-install load_from_db=True when the variable is actually configured. Fixes "<VAR> variable not found." errors at flow build time (see the sketch below).
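
A hedged sketch of that rule; the helper name and the field dict are illustrative, not the real build_config.py API:

    import os

    def _wire_optional_variable(field: dict, variable_name: str, user_globals: set[str]) -> None:
        """Only wire load_from_db for an optional provider variable the user actually configured."""
        if variable_name in user_globals or variable_name in os.environ:
            field["load_from_db"] = True
            field["value"] = variable_name
        # otherwise leave the field empty: the runtime then sees api_key=None, which the
        # local HuggingFace adapter accepts, instead of raising "<VAR> variable not found."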

API

  • src/backend/base/langflow/api/v1/models.py
    • GET /models/huggingface/installed — list GGUF repo ids in the local Hub cache.
    • POST /models/huggingface/download — download a .gguf file for the given repo id (validates length, reuses saved token for gated pulls).
    • POST /enabled_models schedules a background hf_hub_download whenever a HuggingFace model is toggled on. Failures are logged but never block the toggle save (sketched below).
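
A sketch of the toggle-on scheduling, assuming the strong-ref task-set pattern the commit notes mention (RUF006). download_model is the adapter helper described above; the import path is inferred from the file layout and may differ:

    import asyncio
    import logging

    from lfx.base.models.huggingface_chat_model import download_model  # module path assumed from the layout above

    logger = logging.getLogger(__name__)

    # Module-scope strong refs so in-flight tasks aren't garbage-collected (RUF006).
    _hf_download_tasks: set[asyncio.Task] = set()

    def _schedule_hf_download(repo_id: str) -> None:
        """Fire-and-forget GGUF download when a HuggingFace model is toggled on."""

        async def _run() -> None:
            try:
                # download_model wraps hf_hub_download plus the subprocess retry
                await asyncio.to_thread(download_model, repo_id)
            except Exception:  # a failed warm-up is logged but never blocks the toggle save
                logger.warning("Background HuggingFace download failed for %s", repo_id, exc_info=True)

        task = asyncio.create_task(_run())
        _hf_download_tasks.add(task)
        task.add_done_callback(_hf_download_tasks.discard)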

Lifespan startup

  • src/backend/base/langflow/main.py — _prefetch_default_huggingface_model() runs as a background task during lifespan startup only when LANGFLOW_PREFETCH_HF_DEFAULT=true. Tracked alongside sync_flows_from_fs_task and mcp_init_task; cancelled on lifespan shutdown (wiring sketched below).
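
A simplified sketch of that wiring. The real task is _prefetch_default_huggingface_model() in main.py; the lifespan shape, the download_model import, and DEFAULT_HF_REPO are stand-ins for illustration:

    import asyncio
    import os
    from contextlib import asynccontextmanager

    from lfx.base.models.huggingface_chat_model import download_model  # assumed module path

    DEFAULT_HF_REPO = "bartowski/SmolLM2-360M-Instruct-GGUF"

    def _prefetch_enabled() -> bool:
        return os.getenv("LANGFLOW_PREFETCH_HF_DEFAULT", "").lower() in {"true", "1", "yes"}

    @asynccontextmanager
    async def lifespan(app):
        prefetch_task: asyncio.Task | None = None
        if _prefetch_enabled():
            # warm the Hub cache with the bundled GGUF; failures only log, never block startup
            prefetch_task = asyncio.create_task(asyncio.to_thread(download_model, DEFAULT_HF_REPO))
        try:
            yield
        finally:
            if prefetch_task is not None and not prefetch_task.done():
                prefetch_task.cancel()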

Frontend

  • src/frontend/src/controllers/API/queries/models/use-get-model-providers.ts — adds HuggingFace to the hardcoded provider→icon lookup. Without this, the Agent component (which filters tool_calling=True and so doesn't see the HF model in its options list) was falling through to the default Bot lucide icon for HF entries while the Language Model component rendered the correct HF logo.

Packaging

  • src/backend/base/pyproject.toml — adds langflow-base[local] to the [complete] extra so a fresh pip install langflow pulls llama-cpp-python automatically.
  • uv.lock — regenerated. The Makefile uses uv sync --frozen for make backend/make run_cli, which only installs what's pinned in the lockfile; without this regeneration contributors wouldn't get the new dep on a fresh checkout.

How to use

  1. Start Langflow — make backend. Server boots clean, no prefetch by default.
  2. Settings → Model Providers → HuggingFace — toggle is on. Toggling fires a background hf_hub_download of the GGUF file. First invocation from a flow loads the GGUF and answers.
  3. Install another model — point at any GGUF repo + filename pair:
    curl -X POST http://localhost:7860/api/v1/models/huggingface/download \
         -H 'Content-Type: application/json' \
         -d '{"model_id": "bartowski/Qwen2.5-0.5B-Instruct-GGUF"}'
    Filename auto-resolves to Qwen2.5-0.5B-Instruct-Q4_K_M.gguf via the bartowski-style heuristic. For non-conforming repos, add an entry to GGUF_FILENAME_BY_REPO.
  4. Gated models: paste HUGGINGFACEHUB_API_TOKEN in Settings → Global Variables. Required for the few GGUF repos that gate downloads.
  5. (Optional) Warm the cache at startup: LANGFLOW_PREFETCH_HF_DEFAULT=true make backend.

Tool calling

The bundled SmolLM2-360M-Instruct does not have reliable tool calling — it's marked tool_calling=False in the catalog. To use the HuggingFace provider with the Agent component's tools, install a tool-calling-capable GGUF on demand, e.g.:

curl -X POST http://localhost:7860/api/v1/models/huggingface/download \
     -H 'Content-Type: application/json' \
     -d '{"model_id": "bartowski/Hermes-3-Llama-3.2-3B-GGUF"}'

Dependencies

  • llama-cpp-python — already in langflow-base[local], now pulled by [complete] (which langflow consumes).
  • langchain-community — already a core dep; provides ChatLlamaCpp.
  • huggingface_hub — already a core dep; provides hf_hub_download.

No new top-level dependencies. The transformers / torch / langchain-huggingface stack is no longer touched by the HF chat path.

Env var summary

| Variable | Default | Effect |
| --- | --- | --- |
| LANGFLOW_PREFETCH_HF_DEFAULT | unset / false | When true/1/yes, prefetch the bundled GGUF during lifespan startup. |
| HUGGINGFACEHUB_API_TOKEN | unset | Optional. Required only to download gated repos. |
| HF_HUB_DISABLE_XET | 1 (forced) | Forced on by the adapter to disable the parallel xet backend. |
| HF_HUB_ENABLE_HF_TRANSFER | 0 (forced) | Forced off by the adapter so downloads stay on the plain HTTP path. |

Test plan

  • ruff check + ruff format clean across all modified files.
  • Existing model tests pass (test_max_tokens_propagation.py, test_model_input_fixes.py) — 37 passed, 1 skipped.
  • Smoke: catalog has 1 entry (bartowski/SmolLM2-360M-Instruct-GGUF, default=True); _pick_gguf_filename resolves to SmolLM2-360M-Instruct-Q4_K_M.gguf; list_installed_models() reads the real cache.
  • Manual: macOS arm64 — server boots cleanly, no SIGSEGV. Toggle the bundled model → background download completes → flow invocation returns a response from the local GGUF.
  • Manual: Agent component on macOS arm64 — HF model dropdown trigger renders the HuggingFace logo (after frontend rebuild + hard reload).
  • Manual: POST /models/huggingface/download with a different bartowski-style GGUF repo → confirm download → flow can switch to it.
  • Manual: try a tool-calling-capable model (bartowski/Hermes-3-Llama-3.2-3B-GGUF) with the Agent component and confirm Calculator/URL tools fire.
  • Manual: try a gated model (e.g. bartowski/Llama-3.2-3B-Instruct-GGUF) with a saved HF token.
  • Manual: LANGFLOW_PREFETCH_HF_DEFAULT=true on a fresh cache actually warms it during lifespan startup without crashing.

🤖 Generated with Claude Code

Empreiteiro and others added 12 commits May 1, 2026 16:21
Registers HuggingFace alongside the other configurable providers in the
unified model catalog, but runs models locally via langchain-huggingface's
HuggingFacePipeline + transformers — no external API calls required.

- Default bundled model: HuggingFaceTB/SmolLM2-360M-Instruct (~720MB),
  small and CPU-friendly so a fresh install can answer prompts after the
  first lazy download.
- Catalog ships small instruct checkpoints (SmolLM2 135M/1.7B, Qwen2.5
  0.5B/1.5B) plus larger gated options (Llama-3.2 1B/3B, Phi-3.5-mini).
- HUGGINGFACEHUB_API_TOKEN is optional — only needed to pull gated repos.
- Providers with no required variables now stay enabled by default so the
  HF entry surfaces without the user having to configure credentials.
- New endpoints: GET  /api/v1/models/huggingface/installed lists repos
  present in the local Hub cache, and POST /api/v1/models/huggingface/
  download eagerly fetches a model via huggingface_hub.snapshot_download
  (reusing the user's saved token for gated downloads).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The HuggingFace provider failed at build time with "HUGGINGFACEHUB_API_TOKEN
variable not found." even though its token is documented as optional. Root
cause: apply_provider_variable_config_to_build_config unconditionally set
load_from_db=True with the canonical variable key on the provider's
api_key field, so the runtime tried to resolve a value that the user had
never configured and raised.

For *required* provider variables the behavior is unchanged. For optional
ones (top-level required=False) we now only auto-install load_from_db=True
when the variable is actually present in the user's globals or in the
process environment; otherwise we leave the field empty so the runtime
gets a None api_key (which the local HuggingFace adapter handles fine).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…toggle

Onboarding simplification: the HuggingFace catalog now ships exactly one
model (HuggingFaceTB/SmolLM2-360M-Instruct, ~720MB, fast on CPU). The
Settings → Model Providers screen shows a single toggle for it, defaulting
to ON.

When the user flips a HuggingFace model toggle on, POST /enabled_models
now schedules a background snapshot_download into the local Hub cache so
the first flow invocation doesn't pay the cold-start latency. Failures
are logged but never block the toggle from being saved. Strong refs to
in-flight tasks live at module scope to satisfy RUF006.

The unified catalog's "first 5 are default" auto-promotion now defers to
explicit per-model `default=True` declarations when any are present, so
HuggingFace gets exactly the bundled model on by default while the other
providers keep their existing behavior.

Additional HF models can still be installed via
POST /api/v1/models/huggingface/download with an arbitrary repo id.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier "honour explicit default=True per model" change broke the
existing TestUnifiedModelsDefaults invariants:
- IBM WatsonX declares default=True on all 7 of its models, so honouring
  the explicit flags returned 7 defaults where the test expects ≤5.
- Google Generative AI doesn't declare any explicit defaults, so the
  fallback path was the only one exercised — but the override still
  changed the contract.

Revert to the original "first 5 models per provider are default"
behavior. The HuggingFace onboarding goal (single bundled model toggled
on by default) is satisfied automatically because HUGGINGFACE_MODELS_DETAILED
now contains exactly one entry, and i=0 < 5 lands it in the default set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The local pipeline was triggering worker SIGSEGV on first model load on
macOS arm64 + Python 3.12. The crash happened at the very start of the
weight download (0% progress), which points at torch's device init path
running inside a forked uvicorn worker rather than the download itself.

- device=-1 — force CPU and skip MPS/CUDA negotiation, which is the most
  fragile leg of torch on first-import-after-fork.
- low_cpu_mem_usage=True — stream weights through the model during
  from_pretrained instead of double-buffering them, lowering peak RAM.

If the SIGSEGV still happens, the workaround is to start the server with
OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES exported (a known
torch+Objective-C fork-safety interaction on macOS, not specific to this
adapter). Documented inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first flow run that uses the local HuggingFace provider would block
the request thread for tens of seconds while transformers pulled ~720MB
to ~/.cache/huggingface. Worse, on macOS arm64 + Python 3.12 the load
inside a uvicorn worker can SIGSEGV on torch's device init.

Pre-warming the cache during lifespan startup uses
huggingface_hub.snapshot_download exclusively (no torch import), so it
cannot trigger the worker SIGSEGV — and by the time the user sends the
first message, the weights are already on disk and the inference path
only pays the load + generate cost.

- Runs as a background task; tracked alongside sync_flows_from_fs_task
  and mcp_init_task and cancelled on lifespan shutdown.
- Skippable via LANGFLOW_SKIP_HF_DEFAULT_DOWNLOAD=true (1/yes also work).
- Forwards HUGGINGFACEHUB_API_TOKEN if set in env so gated default
  models would still pull.
- Failures are logged at warning and never block startup; the first
  inference call will retry the download on demand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tened

The startup prefetch was triggering a server crash loop on macOS arm64:
huggingface_hub.snapshot_download itself segfaulted at 0% (parallel
download backend interacting badly with forked uvicorn workers), the
worker died, uvicorn auto-reload restarted, and the cycle repeated. The
log also showed a "4.66 GB" total because the unfiltered snapshot pulled
every weight format the repo carries (safetensors + pytorch_model.bin +
ONNX + GGML).

Two changes:

1. Flip prefetch to opt-in: LANGFLOW_PREFETCH_HF_DEFAULT=true (was
   "skip" via LANGFLOW_SKIP_HF_DEFAULT_DOWNLOAD). Default is now OFF so
   a fresh install never crash-loops; users who actually want the warm
   cache enable it explicitly.

2. Harden download_model:
   - allow_patterns restricts the snapshot to safetensors + tokenizer +
     config (no pytorch_model.bin, ONNX, GGML, etc.) - typically cuts
     download size by 4-6x.
   - max_workers=1 serializes file fetches; the multi-thread path is
     what was crashing inside the worker on macOS arm64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (GGUF)

The transformers + torch path was unsalvageable on macOS arm64 + Python
3.12: both the inference load (torch device init in a forked uvicorn
worker) and the snapshot_download parallel fetcher SIGSEGV'd, with no
Python-level recovery possible. Mitigations like device=-1,
low_cpu_mem_usage, max_workers=1, and the OBJC fork-safety env var only
narrowed the failure window without closing it.

This commit replaces the backend wholesale:

- ChatHuggingFace now produces a langchain_community ChatLlamaCpp
  (llama-cpp-python under the hood). No torch import, no fork-safety
  pitfall, fast on CPU thanks to quantization.
- The bundled default flips from
    HuggingFaceTB/SmolLM2-360M-Instruct (~720MB safetensors)
  to
    bartowski/SmolLM2-360M-Instruct-GGUF, file
    SmolLM2-360M-Instruct-Q4_K_M.gguf (~270MB).
  Smaller download, similar quality, runs in <500MB RAM.
- download_model uses hf_hub_download for a single .gguf file (no
  snapshot_download, no parallel fetcher).
- list_installed_models now filters cache entries to repos that actually
  contain a .gguf file we can load.
- A small in-process cache keys ChatLlamaCpp instances by
  (model_path, temperature, max_tokens) so repeat calls reuse the same
  mmaped model instead of reloading.
- Filename selection: catalog overrides via GGUF_FILENAME_BY_REPO; fallback
  is "<model-name>-Q4_K_M.gguf" which works for all bartowski-style repos.

llama-cpp-python is already in langflow-base[llama-cpp] (already part of
the full langflow install).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…allback)

The single-file hf_hub_download path was still crashing uvicorn workers
on macOS arm64. Two complementary fixes:

1. Disable accelerated backends at module import time. xet and
   hf_transfer both spawn worker threads/processes whose fork-safety is
   broken on this platform. Forcing the plain HTTP path is more than
   fast enough for ~270MB GGUFs.

   - HF_HUB_DISABLE_XET=1
   - HF_HUB_ENABLE_HF_TRANSFER=0
   - HF_HUB_DISABLE_TELEMETRY=1 (drops one more import)
   - HF_HUB_DISABLE_PROGRESS_BARS=1 (drops tqdm in worker context)

2. If the in-process call still raises, retry the exact same
   hf_hub_download in an isolated subprocess via subprocess.run. A child
   process that crashes can't take the parent uvicorn worker with it;
   the parent recovers, logs the failure, and propagates the path
   captured from stdout when the subprocess succeeds. 600s timeout to
   bound network stalls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
llama-cpp-python lives in the [local] extra of langflow-base but wasn't
included in [complete], so the langflow main install (which pulls
langflow-base[complete]) shipped without it. The new HuggingFace local
provider needs llama-cpp-python at runtime, so the user saw:

  ImportError: Could not import llama-cpp-python library.

Pulling [local] into [complete] makes the full langflow install include
it without bloating bare langflow-base setups (which can still skip it).

For an existing dev env, install on demand:
  uv pip install llama-cpp-python
or:
  uv sync --reinstall

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Makefile install_backend target uses 'uv sync --frozen', which only
installs what's in uv.lock and ignores fresh additions to pyproject.toml.
Without regenerating the lockfile, contributors running 'make run_cli'
or 'make backend' wouldn't get llama-cpp-python even though [complete]
now references [local].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Language Model component renders the HF logo correctly because it
reads icon directly from the backend's option metadata. The Agent
component filters models by tool_calling=True; HF (which doesn't claim
tool_calling) doesn't land in that filtered list, so the trigger falls
through to providersData[*].icon — which goes through the frontend's
hardcoded getProviderIcon lookup. That map didn't include HuggingFace,
so it returned 'Bot' and rendered the lucide robot icon next to the
HF model in the trigger.

Adding HuggingFace -> "HuggingFace" to the lookup makes the trigger
match the dropdown list and the Language Model component.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented May 1, 2026

Review skipped: draft detected. To trigger a single review, invoke the @coderabbitai review command.