[Bug] Incorrect content hash generated when adding new knowledge entries #6952

@v-vinson

Description

When calling await knowledge.ainsert() with different text_content values, the system generates the same content_hash for every entry. This causes new data to overwrite existing data instead of being added as unique records.

Steps to Reproduce

  1. Initialize the Knowledge instance: knowledge = Knowledge(vector_db=vector_db)
  2. Call await knowledge.ainsert(text_content="xxx") multiple times in a loop or sequence.
  3. Ensure that each text_content string is unique/different.

Agent Configuration (if applicable)

No response

Expected Behavior

  • A unique content_hash should be generated for each distinct text_content.
  • All entries should be stored independently.

Actual Behavior

  • The content_hash generated is identical for every insertion, regardless of the differing text_content.
  • Consequently, new entries overwrite the previous data instead of being appended.
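To illustrate why an identical content_hash leads to overwrites, here is a minimal sketch. It assumes records are upserted into a store keyed by content_hash; the `upsert` helper and the dict-based store are hypothetical stand-ins, not Agno's actual storage code.

```python
from typing import Dict


def upsert(store: Dict[str, str], content_hash: str, text: str) -> None:
    """Hypothetical upsert: a record with an existing key replaces the old one."""
    store[content_hash] = text


store: Dict[str, str] = {}
for text in ["alpha", "beta", "gamma"]:
    # Every insertion arrives with the same hash, as described above.
    upsert(store, "same-hash", text)

# Three inserts were made, but only the last record survives.
print(len(store), store)
```

If the hash differed per entry, each call would create its own key and all three records would be kept.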

Screenshots or Logs (if applicable)

No response

Environment

- OS: Windows 11
- Agno Version: v2.5.9

Possible Solutions (optional)

No response

Additional Context

I suspect the issue lies in the _build_content_hash function (lines 2167–2183 in knowledge.py). The current logic uses an if-elif chain, so only the first matching branch runs: when content.file_data.filename is empty but content.file_data.type is set, only the type is appended and the hash of content.file_data.content is never calculated. As a result, entries with different content but the same type produce identical hashes.

Current Code Logic

# For file_data, always add filename, type, size, or content for uniqueness
if content.file_data.filename:
    hash_parts.append(content.file_data.filename)
elif content.file_data.type:
    # Problem: If 'type' exists, it appends the type and skips the content hash below
    hash_parts.append(content.file_data.type)
elif content.file_data.size is not None:
    hash_parts.append(str(content.file_data.size))
else:
    # Fallback: use the content for uniqueness
    # Include type information to distinguish str vs bytes
    content_type = "str" if isinstance(content.file_data.content, str) else "bytes"
    content_bytes = (
        content.file_data.content.encode()
        if isinstance(content.file_data.content, str)
        else content.file_data.content
    )
    content_hash = hashlib.sha256(content_bytes).hexdigest()[:16]  # Use first 16 chars
    hash_parts.append(f"{content_type}:{content_hash}")
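A possible direction for a fix, as a self-contained sketch rather than a patch: replace the if-elif chain with independent checks so the content digest is always included. The `FileData` dataclass, the helper names, and the way `build_hash` joins the parts are assumptions for illustration only; the real classes and the final hashing step in knowledge.py may differ. The sketch also assumes text_content ends up in a file_data object whose type is the same for every call.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class FileData:
    """Hypothetical stand-in for the object behind content.file_data."""
    content: Union[str, bytes]
    type: Optional[str] = None
    filename: Optional[str] = None
    size: Optional[int] = None


def content_digest(data: Union[str, bytes]) -> str:
    """Short digest of the content, tagged with str/bytes as in the quoted code."""
    kind = "str" if isinstance(data, str) else "bytes"
    raw = data.encode() if isinstance(data, str) else data
    return f"{kind}:{hashlib.sha256(raw).hexdigest()[:16]}"


def hash_parts_current(fd: FileData) -> List[str]:
    """Mirrors the if-elif chain quoted above: only the first match is used."""
    if fd.filename:
        return [fd.filename]
    elif fd.type:
        return [fd.type]
    elif fd.size is not None:
        return [str(fd.size)]
    return [content_digest(fd.content)]


def hash_parts_proposed(fd: FileData) -> List[str]:
    """Independent checks: metadata is added when present, content always."""
    parts: List[str] = []
    if fd.filename:
        parts.append(fd.filename)
    if fd.type:
        parts.append(fd.type)
    if fd.size is not None:
        parts.append(str(fd.size))
    parts.append(content_digest(fd.content))
    return parts


def build_hash(parts: List[str]) -> str:
    """Assumed final step: join the parts and hash the result."""
    return hashlib.sha256(":".join(parts).encode()).hexdigest()


# Two entries with different content but the same (assumed) type.
a = FileData(content="first entry", type="Text")
b = FileData(content="second entry", type="Text")

print("current collides: ", build_hash(hash_parts_current(a)) == build_hash(hash_parts_current(b)))
print("proposed collides:", build_hash(hash_parts_proposed(a)) == build_hash(hash_parts_proposed(b)))
```

One trade-off to consider: with the proposed variant, changing the content of a file with an unchanged filename also changes the hash, which may or may not be the intended behavior for file-based inserts.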
