feat: add HuggingFace as a local-first model provider (GGUF/llama-cpp) #12956

Draft

Empreiteiro wants to merge 12 commits into langflow-ai:main from
Conversation
Registers HuggingFace alongside the other configurable providers in the unified model catalog, but runs models locally via langchain-huggingface's HuggingFacePipeline + transformers — no external API calls required.

- Default bundled model: HuggingFaceTB/SmolLM2-360M-Instruct (~720MB), small and CPU-friendly so a fresh install can answer prompts after the first lazy download.
- Catalog ships small instruct checkpoints (SmolLM2 135M/1.7B, Qwen2.5 0.5B/1.5B) plus larger gated options (Llama-3.2 1B/3B, Phi-3.5-mini).
- HUGGINGFACEHUB_API_TOKEN is optional — only needed to pull gated repos.
- Providers with no required variables now stay enabled by default so the HF entry surfaces without the user having to configure credentials.
- New endpoints: GET /api/v1/models/huggingface/installed lists repos present in the local Hub cache, and POST /api/v1/models/huggingface/download eagerly fetches a model via huggingface_hub.snapshot_download (reusing the user's saved token for gated downloads).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The HuggingFace provider failed at build time with "HUGGINGFACEHUB_API_TOKEN variable not found." even though its token is documented as optional.

Root cause: apply_provider_variable_config_to_build_config unconditionally set load_from_db=True with the canonical variable key on the provider's api_key field, so the runtime tried to resolve a value that the user had never configured and raised.

For *required* provider variables the behavior is unchanged. For optional ones (top-level required=False) we now only auto-install load_from_db=True when the variable is actually present in the user's globals or in the process environment; otherwise we leave the field empty so the runtime gets a None api_key (which the local HuggingFace adapter handles fine).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
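A minimal sketch of the new conditional; the field dict, helper name, and globals lookup are hypothetical stand-ins, not the real build_config code:

```python
import os


def apply_provider_variable(field: dict, variable_key: str, *, required: bool, user_globals: set[str]) -> None:
    """Bind a provider variable to a build-config field (hypothetical shapes)."""
    if required:
        # Required variables keep the old unconditional binding.
        field["load_from_db"] = True
        field["value"] = variable_key
        return
    # Optional variables: bind only when the user actually configured them.
    if variable_key in user_globals or os.environ.get(variable_key):
        field["load_from_db"] = True
        field["value"] = variable_key
    # Otherwise leave the field empty; the runtime receives api_key=None,
    # which the local HuggingFace adapter handles fine.
```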
…toggle

Onboarding simplification: the HuggingFace catalog now ships exactly one model (HuggingFaceTB/SmolLM2-360M-Instruct, ~720MB, fast on CPU). The Settings → Model Providers screen shows a single toggle for it, defaulting to ON.

When the user flips a HuggingFace model toggle on, POST /enabled_models now schedules a background snapshot_download into the local Hub cache so the first flow invocation doesn't pay the cold-start latency. Failures are logged but never block the toggle from being saved. Strong refs to in-flight tasks live at module scope to satisfy RUF006.

The unified catalog's "first 5 are default" auto-promotion now defers to explicit per-model `default=True` declarations when any are present, so HuggingFace gets exactly the bundled model on by default while the other providers keep their existing behavior.

Additional HF models can still be installed via POST /api/v1/models/huggingface/download with an arbitrary repo id.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
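The fire-and-forget pattern described above, as a hedged sketch (function names are hypothetical; the real endpoint code uses the project logger):

```python
import asyncio
import logging

from huggingface_hub import snapshot_download

logger = logging.getLogger(__name__)

# Module-scope strong refs so tasks aren't garbage-collected mid-flight (RUF006).
_download_tasks: set[asyncio.Task] = set()


async def _download_in_background(repo_id: str) -> None:
    try:
        # snapshot_download blocks; run it off the event loop.
        await asyncio.to_thread(snapshot_download, repo_id=repo_id)
    except Exception as exc:  # noqa: BLE001 - never block the toggle save
        logger.warning("Background HF download of %s failed: %s", repo_id, exc)


def schedule_download(repo_id: str) -> None:
    task = asyncio.create_task(_download_in_background(repo_id))
    _download_tasks.add(task)
    task.add_done_callback(_download_tasks.discard)
```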
The earlier "honour explicit default=True per model" change broke the existing TestUnifiedModelsDefaults invariants: - IBM WatsonX declares default=True on all 7 of its models, so honouring the explicit flags returned 7 defaults where the test expects ≤5. - Google Generative AI doesn't declare any explicit defaults, so the fallback path was the only one exercised — but the override still changed the contract. Revert to the original "first 5 models per provider are default" behavior. The HuggingFace onboarding goal (single bundled model toggled on by default) is satisfied automatically because HUGGINGFACE_MODELS_DETAILED now contains exactly one entry, and i=0 < 5 lands it in the default set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The local pipeline was triggering worker SIGSEGV on first model load on macOS arm64 + Python 3.12. The crash happened at the very start of the weight download (0% progress), which points at torch's device init path running inside a forked uvicorn worker rather than the download itself.

- device=-1 — force CPU and skip MPS/CUDA negotiation, which is the most fragile leg of torch on first-import-after-fork.
- low_cpu_mem_usage=True — stream weights through the model during from_pretrained instead of double-buffering them, lowering peak RAM.

If the SIGSEGV still happens, the workaround is to start the server with OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES exported (a known torch+Objective-C fork-safety interaction on macOS, not specific to this adapter). Documented inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
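The mitigated construction, sketched with langchain-huggingface's public factory (note that a later commit in this PR replaces this transformers path wholesale):

```python
from langchain_huggingface import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="HuggingFaceTB/SmolLM2-360M-Instruct",
    task="text-generation",
    device=-1,  # pin to CPU; skip the fragile MPS/CUDA negotiation after fork
    model_kwargs={"low_cpu_mem_usage": True},  # stream weights in from_pretrained
)
```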
The first flow run that uses the local HuggingFace provider would block the request thread for tens of seconds while transformers pulled ~720MB to ~/.cache/huggingface. Worse, on macOS arm64 + Python 3.12 the load inside a uvicorn worker can SIGSEGV on torch's device init.

Pre-warming the cache during lifespan startup uses huggingface_hub.snapshot_download exclusively (no torch import), so it cannot trigger the worker SIGSEGV — and by the time the user sends the first message, the weights are already on disk and the inference path only pays the load + generate cost.

- Runs as a background task; tracked alongside sync_flows_from_fs_task and mcp_init_task and cancelled on lifespan shutdown.
- Skippable via LANGFLOW_SKIP_HF_DEFAULT_DOWNLOAD=true (1/yes also work).
- Forwards HUGGINGFACEHUB_API_TOKEN if set in env so gated default models would still pull.
- Failures are logged at warning and never block startup; the first inference call will retry the download on demand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
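A compressed sketch of the lifespan wiring, assuming FastAPI's lifespan API (the real main.py tracks more tasks and uses the project logger; a later commit flips the env flag to opt-in):

```python
import asyncio
import logging
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI
from huggingface_hub import snapshot_download

logger = logging.getLogger(__name__)
DEFAULT_HF_REPO = "HuggingFaceTB/SmolLM2-360M-Instruct"  # bundled default at this point


async def _prefetch_default_hf_model() -> None:
    try:
        # huggingface_hub only -- no torch import, so the forked-worker
        # device-init SIGSEGV cannot trigger here.
        token = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
        await asyncio.to_thread(snapshot_download, repo_id=DEFAULT_HF_REPO, token=token)
    except Exception as exc:  # noqa: BLE001
        logger.warning("HF prefetch failed; first inference will retry: %s", exc)


@asynccontextmanager
async def lifespan(app: FastAPI):
    prefetch_task = None
    if os.environ.get("LANGFLOW_SKIP_HF_DEFAULT_DOWNLOAD", "").lower() not in {"true", "1", "yes"}:
        prefetch_task = asyncio.create_task(_prefetch_default_hf_model())
    try:
        yield
    finally:
        if prefetch_task is not None and not prefetch_task.done():
            prefetch_task.cancel()  # same shutdown handling as sync_flows_from_fs_task
```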
…tened
The startup prefetch was triggering a server crash loop on macOS arm64:
huggingface_hub.snapshot_download itself segfaulted at 0% (parallel
download backend interacting badly with forked uvicorn workers), the
worker died, uvicorn auto-reload restarted, and the cycle repeated. The
log also showed a "4.66 GB" total because the unfiltered snapshot pulled
every weight format the repo carries (safetensors + pytorch_model.bin +
ONNX + GGML).
Two changes:
1. Flip prefetch to opt-in: LANGFLOW_PREFETCH_HF_DEFAULT=true (was
"skip" via LANGFLOW_SKIP_HF_DEFAULT_DOWNLOAD). Default is now OFF so
a fresh install never crash-loops; users who actually want the warm
cache enable it explicitly.
2. Harden download_model:
- allow_patterns restricts the snapshot to safetensors + tokenizer +
config (no pytorch_model.bin, ONNX, GGML, etc.) - typically cuts
download size by 4-6x.
- max_workers=1 serializes file fetches; the multi-thread path is
what was crashing inside the worker on macOS arm64 (see the sketch
below).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
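Both hardening measures in one hedged sketch (the exact allow-list in the commit may differ):

```python
from huggingface_hub import snapshot_download

# Keep safetensors weights plus tokenizer/config files; skipping the
# duplicate pytorch_model.bin / ONNX / GGML formats cuts the download 4-6x.
ALLOW_PATTERNS = ["*.safetensors", "*.json", "tokenizer*", "*.txt"]


def download_model(repo_id: str, token: str | None = None) -> str:
    return snapshot_download(
        repo_id=repo_id,
        token=token,
        allow_patterns=ALLOW_PATTERNS,
        # Serialize file fetches: the multi-threaded path is what
        # segfaulted inside forked uvicorn workers on macOS arm64.
        max_workers=1,
    )
```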
… (GGUF)
The transformers + torch path was unsalvageable on macOS arm64 + Python
3.12: both the inference load (torch device init in a forked uvicorn
worker) and the snapshot_download parallel fetcher SIGSEGV'd, with no
Python-level recovery possible. Mitigations like device=-1,
low_cpu_mem_usage, max_workers=1, and the OBJC fork-safety env var only
narrowed the failure window without closing it.
This commit replaces the backend wholesale:
- ChatHuggingFace now produces a langchain_community ChatLlamaCpp
(llama-cpp-python under the hood). No torch import, no fork-safety
pitfall, fast on CPU thanks to quantization.
- The bundled default flips from
HuggingFaceTB/SmolLM2-360M-Instruct (~720MB safetensors)
to
bartowski/SmolLM2-360M-Instruct-GGUF, file
SmolLM2-360M-Instruct-Q4_K_M.gguf (~270MB).
Smaller download, similar quality, runs in <500MB RAM.
- download_model uses hf_hub_download for a single .gguf file (no
snapshot_download, no parallel fetcher).
- list_installed_models now filters cache entries to repos that actually
contain a .gguf file we can load.
- A small in-process cache keys ChatLlamaCpp instances by
(model_path, temperature, max_tokens) so repeat calls reuse the same
mmaped model instead of reloading.
- Filename selection: catalog overrides via GGUF_FILENAME_BY_REPO; fallback
is "<model-name>-Q4_K_M.gguf", which works for all bartowski-style repos
(factory, cache, and filename picker sketched below).
llama-cpp-python is already in langflow-base[llama-cpp] (already part of
the full langflow install).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
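The factory, filename heuristic, and in-process cache condense to roughly this sketch (cache-key shape and override-table contents are illustrative, not the exact adapter code):

```python
from langchain_community.chat_models import ChatLlamaCpp

# Per-repo overrides for GGUF filenames that don't follow the bartowski
# convention (illustrative contents).
GGUF_FILENAME_BY_REPO = {
    "bartowski/SmolLM2-360M-Instruct-GGUF": "SmolLM2-360M-Instruct-Q4_K_M.gguf",
}

_LLAMA_CACHE: dict[tuple[str, float, int], ChatLlamaCpp] = {}


def _pick_gguf_filename(repo_id: str) -> str:
    # Fallback "<model-name>-Q4_K_M.gguf" holds for bartowski-style repos.
    default = repo_id.rsplit("/", 1)[-1].removesuffix("-GGUF") + "-Q4_K_M.gguf"
    return GGUF_FILENAME_BY_REPO.get(repo_id, default)


def get_chat_model(model_path: str, temperature: float, max_tokens: int) -> ChatLlamaCpp:
    key = (model_path, temperature, max_tokens)
    if key not in _LLAMA_CACHE:
        # First construction mmaps the GGUF; later calls with the same key
        # reuse the instance instead of reloading the weights.
        _LLAMA_CACHE[key] = ChatLlamaCpp(
            model_path=model_path,
            temperature=temperature,
            max_tokens=max_tokens,
            n_gpu_layers=0,  # CPU-only by default
        )
    return _LLAMA_CACHE[key]
```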
…allback)

The single-file hf_hub_download path was still crashing uvicorn workers on macOS arm64. Two complementary fixes:

1. Disable accelerated backends at module import time. xet and hf_transfer both spawn worker threads/processes whose fork-safety is broken on this platform. Forcing the plain HTTP path is more than fast enough for ~270MB GGUFs.
   - HF_HUB_DISABLE_XET=1
   - HF_HUB_ENABLE_HF_TRANSFER=0
   - HF_HUB_DISABLE_TELEMETRY=1 (drops one more import)
   - HF_HUB_DISABLE_PROGRESS_BARS=1 (drops tqdm in worker context)
2. If the in-process call still raises, retry the exact same hf_hub_download in an isolated subprocess via subprocess.run. A child process that crashes can't take the parent uvicorn worker with it; the parent recovers, logs the failure, and propagates the path captured from stdout when the subprocess succeeds. 600s timeout to bound network stalls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
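Both fixes, sketched (the child-process script and timeout mirror the commit text; surrounding error handling is simplified):

```python
import os
import subprocess
import sys

# Must run before any huggingface_hub import: the xet / hf_transfer
# accelerated backends spawn threads whose fork-safety is broken here.
os.environ.setdefault("HF_HUB_DISABLE_XET", "1")
os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "0")
os.environ.setdefault("HF_HUB_DISABLE_TELEMETRY", "1")
os.environ.setdefault("HF_HUB_DISABLE_PROGRESS_BARS", "1")

from huggingface_hub import hf_hub_download  # noqa: E402


def download_gguf(repo_id: str, filename: str) -> str:
    try:
        return hf_hub_download(repo_id=repo_id, filename=filename)
    except Exception:  # noqa: BLE001
        # Retry in an isolated child process: if it segfaults, only the
        # child dies and the parent uvicorn worker survives.
        script = (
            "from huggingface_hub import hf_hub_download;"
            f"print(hf_hub_download(repo_id={repo_id!r}, filename={filename!r}))"
        )
        result = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            text=True,
            check=True,
            timeout=600,  # bound network stalls
        )
        return result.stdout.strip()
```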
llama-cpp-python lives in the [local] extra of langflow-base but wasn't included in [complete], so the langflow main install (which pulls langflow-base[complete]) shipped without it. The new HuggingFace local provider needs llama-cpp-python at runtime, so the user saw:

    ImportError: Could not import llama-cpp-python library.

Pulling [local] into [complete] makes the full langflow install include it without bloating bare langflow-base setups (which can still skip it).

For an existing dev env, install on demand:

    uv pip install llama-cpp-python

or:

    uv sync --reinstall

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Makefile install_backend target uses 'uv sync --frozen', which only installs what's in uv.lock and ignores fresh additions to pyproject.toml. Without regenerating the lockfile, contributors running 'make run_cli' or 'make backend' wouldn't get llama-cpp-python even though [complete] now references [local].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Language Model component renders the HF logo correctly because it reads icon directly from the backend's option metadata. The Agent component filters models by tool_calling=True; HF (which doesn't claim tool_calling) doesn't land in that filtered list, so the trigger falls through to providersData[*].icon — which goes through the frontend's hardcoded getProviderIcon lookup. That map didn't include HuggingFace, so it returned 'Bot' and rendered the lucide robot icon next to the HF model in the trigger. Adding HuggingFace -> "HuggingFace" to the lookup makes the trigger match the dropdown list and the Language Model component.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
- Backend: `llama-cpp-python` + GGUF (no torch, no transformers, fork-safe on macOS arm64).
- Bundled default: `bartowski/SmolLM2-360M-Instruct-GGUF` (Q4_K_M, ~270MB). Settings → Model Providers shows one toggle for it, on by default.
- `POST /enabled_models` schedules a single-file download so the cache is warm by the next invocation.
- `POST /api/v1/models/huggingface/download {"model_id": "<gguf-repo-id>"}` accepts any HF repo id that publishes GGUF weights.
- `LANGFLOW_PREFETCH_HF_DEFAULT=true` warms the cache on lifespan start. Off by default.
- If `hf_hub_download` ever crashes the worker, a subprocess retry kicks in so the parent uvicorn worker survives.

## Why GGUF / llama-cpp instead of transformers
The first iteration used `langchain_huggingface.HuggingFacePipeline` (transformers + torch). On macOS arm64 + Python 3.12, two SIGSEGV vectors broke this end-to-end:

- torch's device init crashed on first model load inside forked uvicorn workers.
- `huggingface_hub.snapshot_download`'s parallel fetcher SIGSEGV'd at 0% download progress.

Mitigations (`device=-1`, `low_cpu_mem_usage=True`, `max_workers=1`, `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`) only narrowed the failure window. Switching to `llama-cpp-python` removes the torch dependency entirely — it's a C/C++ backend with a thin Python binding, fork-safe, and quantized models are smaller and faster on CPU.

## What changed
### Provider registry & catalog
- `src/lfx/src/lfx/base/models/huggingface_constants.py` (new) — single bundled GGUF entry (`bartowski/SmolLM2-360M-Instruct-GGUF`, `default=True`).
- `src/lfx/src/lfx/base/models/model_metadata.py` — registers `HuggingFace` in `MODEL_PROVIDER_METADATA`. `HUGGINGFACEHUB_API_TOKEN` is optional (only needed for gated repos).
- `src/lfx/src/lfx/base/models/unified_models/provider_queries.py` — wires `HUGGINGFACE_MODELS_DETAILED` into `get_models_detailed()`.
- `src/lfx/src/lfx/base/models/unified_models/class_registry.py` — adds `ChatHuggingFace` to the lazy import map.

### Local chat adapter (llama-cpp backend)
- `src/lfx/src/lfx/base/models/huggingface_chat_model.py` (new) — factory `ChatHuggingFace` translates the unified kwargs into a `langchain_community.chat_models.ChatLlamaCpp` instance. Picks the right `.gguf` filename via a per-repo override (`GGUF_FILENAME_BY_REPO`) with a `<name>-Q4_K_M.gguf` fallback for bartowski-style repos.
- An in-process cache (`_LLAMA_CACHE`) keys instances by `(model_path, temperature, max_tokens)` so repeat calls don't re-mmap the model.
- `download_model()` uses `hf_hub_download` (single-file). Disables xet, hf_transfer, telemetry, and progress bars at module import to avoid worker-fork SIGSEGV. If the in-process call still raises, retries the same call inside an isolated subprocess so a hard crash can't take down the parent worker.
- `list_installed_models()` filters cache entries to repos that actually contain a `.gguf` file.
- `n_threads = cpu_count() - 1`, `n_gpu_layers = 0` by default (CPU-only).

### Local-provider plumbing
- `src/lfx/src/lfx/base/models/unified_models/instantiation.py` — `HuggingFace` joins `Ollama` in the "no API key required" exception.
- `src/lfx/src/lfx/base/models/unified_models/credentials.py` — providers whose variables are all optional now stay enabled by default; also adds an HF token validator that pings `huggingface.co/api/whoami-v2`.
- `src/lfx/src/lfx/base/models/unified_models/build_config.py` — for optional provider variables, only auto-install `load_from_db=True` when the variable is actually configured. Fixes `"<VAR> variable not found."` errors at flow build time.

### API
In `src/backend/base/langflow/api/v1/models.py`:

- `GET /models/huggingface/installed` — list GGUF repo ids in the local Hub cache.
- `POST /models/huggingface/download` — download a `.gguf` file for the given repo id (validates length, reuses saved token for gated pulls).
- `POST /enabled_models` schedules a background `hf_hub_download` whenever a HuggingFace model is toggled on. Failures are logged but never block the toggle save.
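A hedged sketch of the route shapes; router wiring, auth, and the request model are assumptions, and `list_installed_models` / `download_model` are the adapter helpers named above, assumed importable:

```python
import asyncio

from fastapi import APIRouter
from pydantic import BaseModel, Field

from lfx.base.models.huggingface_chat_model import download_model, list_installed_models

router = APIRouter(prefix="/models/huggingface")


class DownloadRequest(BaseModel):
    # "Validates length" modeled here as Field bounds (assumed values).
    model_id: str = Field(min_length=3, max_length=200)


@router.get("/installed")
async def installed_models() -> list[str]:
    # Repos in the local Hub cache that actually carry a loadable .gguf file.
    return list_installed_models()


@router.post("/download")
async def download(req: DownloadRequest) -> dict[str, str]:
    # hf_hub_download blocks; keep it off the event loop.
    path = await asyncio.to_thread(download_model, req.model_id)
    return {"model_id": req.model_id, "path": path}
```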
### Lifespan startup

- `src/backend/base/langflow/main.py` — `_prefetch_default_huggingface_model()` runs as a background task during lifespan startup only when `LANGFLOW_PREFETCH_HF_DEFAULT=true`. Tracked alongside `sync_flows_from_fs_task` and `mcp_init_task`; cancelled on lifespan shutdown.

### Frontend
- `src/frontend/src/controllers/API/queries/models/use-get-model-providers.ts` — adds `HuggingFace` to the hardcoded provider→icon lookup. Without this, the Agent component (which filters `tool_calling=True` and so doesn't see the HF model in its options list) was falling through to the default `Bot` lucide icon for HF entries while the Language Model component rendered the correct HF logo.

### Packaging
- `src/backend/base/pyproject.toml` — adds `langflow-base[local]` to the `[complete]` extra so a fresh `pip install langflow` pulls `llama-cpp-python` automatically.
- `uv.lock` — regenerated. The Makefile uses `uv sync --frozen` for `make backend` / `make run_cli`, which only installs what's pinned in the lockfile; without this regeneration contributors wouldn't get the new dep on a fresh checkout.

## How to use
1. `make backend`. Server boots clean, no prefetch by default.
2. Toggle the bundled model on in Settings → Model Providers; `POST /enabled_models` schedules a background `hf_hub_download` of the GGUF file. First invocation from a flow loads the GGUF and answers.
3. Install additional GGUF models on demand:

   ```bash
   curl -X POST http://localhost:7860/api/v1/models/huggingface/download \
     -H 'Content-Type: application/json' \
     -d '{"model_id": "bartowski/Qwen2.5-0.5B-Instruct-GGUF"}'
   ```

   The filename resolves to `Qwen2.5-0.5B-Instruct-Q4_K_M.gguf` via the bartowski-style heuristic. For non-conforming repos, add an entry to `GGUF_FILENAME_BY_REPO`.
4. Optionally set `HUGGINGFACEHUB_API_TOKEN` in Settings → Global Variables. Required for the few GGUF repos that gate downloads.
5. To warm the cache at startup: `LANGFLOW_PREFETCH_HF_DEFAULT=true make backend`.

## Tool calling
The bundled `SmolLM2-360M-Instruct` does not have reliable tool calling — it's marked `tool_calling=False` in the catalog. To use the HuggingFace provider with the Agent component's tools, install a tool-calling-capable GGUF on demand, e.g.:

```bash
curl -X POST http://localhost:7860/api/v1/models/huggingface/download \
  -H 'Content-Type: application/json' \
  -d '{"model_id": "bartowski/Hermes-3-Llama-3.2-3B-GGUF"}'
```

## Dependencies
- `llama-cpp-python` — already in `langflow-base[local]`, now pulled by `[complete]` (which `langflow` consumes).
- `langchain-community` — already a core dep; provides `ChatLlamaCpp`.
- `huggingface_hub` — already a core dep; provides `hf_hub_download`.

No new top-level dependencies. The transformers / torch / langchain-huggingface stack is no longer touched by the HF chat path.
## Env var summary
| Variable | Default | Effect |
| --- | --- | --- |
| `LANGFLOW_PREFETCH_HF_DEFAULT` | `false` | Set to `true`/`1`/`yes` to prefetch the bundled GGUF during lifespan startup. |
| `HUGGINGFACEHUB_API_TOKEN` | unset | Optional; only needed to pull gated repos. |
| `HF_HUB_DISABLE_XET` | `1` (forced) | Keeps the fork-unsafe xet download backend off. |
| `HF_HUB_ENABLE_HF_TRANSFER` | `0` (forced) | Keeps the fork-unsafe hf_transfer download backend off. |

## Test plan
- `ruff check` + `ruff format` clean across all modified files.
- Unit tests (`test_max_tokens_propagation.py`, `test_model_input_fixes.py`) — 37 passed, 1 skipped.
- Catalog sanity: single bundled entry (`bartowski/SmolLM2-360M-Instruct-GGUF`, `default=True`); `_pick_gguf_filename` resolves to `SmolLM2-360M-Instruct-Q4_K_M.gguf`; `list_installed_models()` reads the real cache.
- `POST /models/huggingface/download` with a different bartowski-style GGUF repo → confirm download → flow can switch to it.
- Tool calling with a capable GGUF (`bartowski/Hermes-3-Llama-3.2-3B-GGUF`) and the Agent component; confirm Calculator/URL tools fire.
- Gated pull (`bartowski/Llama-3.2-3B-Instruct-GGUF`) with a saved HF token.
- `LANGFLOW_PREFETCH_HF_DEFAULT=true` on a fresh cache actually warms it during lifespan startup without crashing.

🤖 Generated with Claude Code