I cannot get a "Retrieval testing" of "Dataset" working when a vector similarity is set to 1.00 #13751

creasysee · 2026-03-23T19:38:49Z

I uploaded a simple one-sentence text document to a dataset as a .txt file:

The company provides a special bonus for every hardworking employee today.

I checked that the file was parsed and the chunk was created:
settings.bmp

I set Similarity threshold to 0 and Vector similarity weight to 1.00, and used the word 'money' as the test query:
retrieval.bmp

and got an empty result. At the same time, I used a script to check the vector similarity directly against the model:

import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://xx.xx.xx.xx:9997/v1", api_key="empty")
# Model UID as registered in Xinference
MODEL_UID = "bge-m3-0"

def get_embedding(text):
    response = client.embeddings.create(input=[text], model=MODEL_UID)
    return np.array(response.data[0].embedding)

def cosine_similarity(v1, v2):
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return dot_product / (norm_v1 * norm_v2)

text_en = "The company provides a special bonus for every hardworking employee today."
query_en = "money"

emb_text_en = get_embedding(text_en)
emb_query_en = get_embedding(query_en)

# Compute the cosine similarity between the document and the query
sim_en = cosine_similarity(emb_text_en, emb_query_en)

print("--- BGE-M3 test results ---")
print(f"EN: '{query_en}' -> '{text_en}'")
print(f"Score: {sim_en:.4f}")

and got this result:

D:\0>python ./tst3.py
--- BGE-M3 test results ---
EN: 'money' -> 'The company provides a special bonus for every hardworking employee today.'
Score: 0.4941
D:\0>

I've tried version 0.24.0 and a nightly build, with the same result.

Additional information:
The bge-m3 model was loaded with context window 8K:

Model config XLMRobertaConfig {
  "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 8194,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.6",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}

The context width was checked with this script:

import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://xx.xx.xx.xx:9997/v1", api_key="empty")

# 1. Create a text approximately 6000 tokens long
base_text = "word " * 6000 
# 2. Create the same text but with a different word at the very END (at position 6001)
changed_text = base_text + " FINAL"

# Get embeddings from Xinference/OpenAI API
v1 = np.array(client.embeddings.create(input=[base_text], model="bge-m3-0").data[0].embedding)
v2 = np.array(client.embeddings.create(input=[changed_text], model="bge-m3-0").data[0].embedding)

# If cosine similarity is 1.0000, it means the model TRUNCATED the text at 6000 tokens 
# and ignored the difference at the end.
similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"Similarity: {similarity:.8f}")

if similarity > 0.999999:
    print("WARNING: Model truncates context (6000 limit detected)!")
else:
    print("OK: Model recognizes changes after 6000 tokens (Long context works).")

and it returns:

D:\0>python ./tst.py
Similarity: 0.90422761
OK: Model recognizes changes after 6000 tokens (Long context works).
D:\0>

I can provide any additional information that would help resolve the issue. Thanks.

@creasysee · 2026-03-23T19:45:32Z

dosubot[bot]
Bot Mar 23, 2026

Hi @creasysee! I'm Dosu and I’m helping the ragflow team.

This is a known issue affecting RAGFlow v0.24.0 and recent versions using the Infinity document engine. The most likely cause is the missing available_int flag on your chunks.

What's happening: When chunks are created, they sometimes don't get the available_int=1 flag set properly. The Infinity search implementation automatically filters for available_int=1 during queries, so chunks without this flag are excluded before similarity scoring even happens—regardless of your similarity threshold settings.

Solutions:

1. Upgrade Infinity to v0.7.0-dev3 (v0.7.0-dev2 has known compatibility issues with RAGFlow v0.24.0)
2. Re-parse your documents after upgrading to ensure chunks are indexed with the proper flags
3. Alternatively, try the latest nightly Docker image (not nightly-slim) and re-parse

Your external test showing 0.4941 similarity confirms the embedding model is working correctly—the issue is with chunk filtering, not embeddings. There's no hardcoded minimum similarity in RAGFlow; setting the threshold to 0 should work once the visibility issue is resolved.

To verify this is the issue, you could check if your chunks have the available_int field set in the database.


5 replies

creasysee Mar 24, 2026
Author

Hi @dosu !

and recent versions using the Infinity document engine.

Sorry, I'm using elasticsearch:8.11.3.

Upgrade Infinity to v0.7.0-dev3

Do you provide Docker images for this version? What is the tag? Also, I see that in issue #13742 the retrieval settings show Similarity threshold 0.02 (vs. 0.00 in my case) and Vector similarity weight 0.3 (vs. 1.00 in my case). I do get a result with those settings, but it only includes the full-text search result, not the vector one.

2. Re-parse your documents after upgrading to ensure chunks are indexed with the proper flags

Sure, I did a full Docker cleanup several times, after each test of version 0.24.0 and the nightly build.

Alternatively, try the latest nightly Docker image (not nightly-slim) and re-parse

I did, and it doesn't help.

To verify this is the issue, you could check if your chunks have the available_int field set in the database.

Do I need to check this field in the database if I use elasticsearch:8.11.3?

Thanks.

dosubot[bot] Bot Mar 24, 2026

Thanks for the clarification! The Elasticsearch backend has a different issue than the Infinity one I initially mentioned.

The problem: In RAGFlow's Elasticsearch implementation, the query_string text search filter is always applied to KNN queries, regardless of your vector_similarity_weight setting. This means when you search for "money", Elasticsearch filters chunks to only those containing "money" before vector similarity is calculated. Since your text doesn't contain "money", you get empty results.

This is a known limitation reported in issue #12277 — pure vector search (vector_similarity_weight=1.0) doesn't work as expected with Elasticsearch because keyword matching acts as a hard pre-filter.

Why default settings return results: With vector_similarity_weight=0.3, the hybrid search relies more on text matching, so you get keyword-based results. But the vector component is still being blocked by the same filter.
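The effect of a keyword pre-filter on a KNN query can be illustrated with a small pure-Python simulation. This is only a sketch: the function names, the 3-dimensional toy vectors, and the plain substring match standing in for `query_string` are all assumptions made for illustration, not RAGFlow or Elasticsearch code.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_search(query_vec, query_text, chunks, keyword_prefilter):
    """Score chunks by cosine similarity.

    With keyword_prefilter=True, only chunks whose text contains the query
    text are scored (a crude stand-in for a query_string filter).
    """
    candidates = chunks
    if keyword_prefilter:
        candidates = [c for c in chunks if query_text.lower() in c["text"].lower()]
    scored = [(cosine(query_vec, c["vec"]), c["text"]) for c in candidates]
    return sorted(scored, reverse=True)

# Toy vectors invented for illustration; these are NOT real bge-m3 embeddings.
CHUNKS = [{
    "text": "The company provides a special bonus for every hardworking employee today.",
    "vec": [0.9, 0.3, 0.1],
}]
MONEY_VEC = [0.7, 0.6, 0.2]

with_filter = knn_search(MONEY_VEC, "money", CHUNKS, keyword_prefilter=True)
without_filter = knn_search(MONEY_VEC, "money", CHUNKS, keyword_prefilter=False)
print(len(with_filter), len(without_filter))
```

With the pre-filter on, the only chunk is discarded before any similarity is computed, so the result list is empty no matter how low the similarity threshold is set.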

Regarding available_int: Yes, it applies to Elasticsearch too. You can verify with:

GET /your_index/_search
{"query": {"match_all": {}}, "_source": ["available_int"]}

Chunks need available_int=1 to be searchable. But based on your symptoms (keyword search works, vector doesn't), the query_string pre-filter is the more likely culprit.
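The response of that `_search` request can also be checked programmatically. The helper below is a minimal sketch that works on an already-parsed response dict; the function name and the sample response are hypothetical, and only the standard `hits.hits[]._source` layout of an Elasticsearch search response is relied on.

```python
def hits_missing_field(response, field="available_int"):
    """Return the _id of every hit whose _source does not contain `field`."""
    return [
        hit["_id"]
        for hit in response["hits"]["hits"]
        if field not in hit.get("_source", {})
    ]

# Hypothetical parsed response, shaped like an Elasticsearch _search result.
sample_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "chunk-1", "_source": {}},
            {"_id": "chunk-2", "_source": {"available_int": 1}},
        ],
    }
}

print(hits_missing_field(sample_response))  # ['chunk-1']
```

Note that a `match_all` search returns only the first 10 hits by default, so a larger index would need a `size` parameter or pagination to be checked completely.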

Workarounds:

Use queries that contain words present in your documents (this isn't truly semantic search, but will return results)
Modify rag/utils/es_conn.py to remove query_string from the KNN filter when vector_similarity_weight >= 1.0 (similar to how OceanBase backend handles it)
Consider switching to Infinity backend which has proper weight-based optimization


creasysee Mar 24, 2026
Author

Use queries that contain words present in your documents

That isn't a workaround: indexing the files takes a long time and the GPU uses a lot of energy, yet vector similarity doesn't work))

Elasticsearch filters chunks to only those containing "money" before vector similarity is calculated. Since your text doesn't contain "money", you get empty results.

How can I change this? I need both results, with the appropriate weights.

creasysee Mar 24, 2026
Author

To verify this is the issue, you could check if your chunks have the available_int field set in the database.

# docker exec -it docker-es01-1 curl -u elastic:infini_rag_flow -X GET "localhost:9200/ragflow_529ba87e26e011f1a7741bc0a6deaf0e/_mapping"

{"ragflow_529ba87e26e011f1a7741bc0a6deaf0e":{"mappings":{"dynamic_templates":[{"int":{"match":"*_int","mapping":{"store":"true","type":"integer"}}},{"ulong":{"match":"*_ulong","mapping":{"store":"true","type":"unsigned_long"}}},{"long":{"match":"*_long","mapping":{"store":"true","type":"long"}}},{"short":{"match":"*_short","mapping":{"store":"true","type":"short"}}},{"numeric":{"match":"*_flt","mapping":{"store":true,"type":"float"}}},{"tks":{"match":"*_tks","mapping":{"analyzer":"whitespace","similarity":"scripted_sim","store":true,"type":"text"}}},{"ltks":{"match":"*_ltks","mapping":{"analyzer":"whitespace","store":true,"type":"text"}}},{"kwd":{"match":"^(.*_(kwd|id|ids|uid|uids)|uid)$","match_pattern":"regex","mapping":{"similarity":"boolean","store":true,"type":"keyword"}}},{"dt":{"match":"^.*(_dt|_time|_at)$","match_pattern":"regex","mapping":{"format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM-dd_HH:mm:ss","store":true,"type":"date"}}},{"nested":{"match":"*_nst","mapping":{"type":"nested"}}},{"object":{"match":"*_obj","mapping":{"dynamic":"true","type":"object"}}},{"string":{"match":"^.*_(with_weight|list)$","match_pattern":"regex","mapping":{"index":"false","store":true,"type":"text"}}},{"rank_feature":{"match":"*_fea","mapping":{"type":"rank_feature"}}},{"rank_features":{"match":"*_feas","mapping":{"type":"rank_features"}}},{"dense_vector":{"match":"*_512_vec","mapping":{"dims":512,"index":true,"similarity":"cosine","type":"dense_vector"}}},{"dense_vector":{"match":"*_768_vec","mapping":{"dims":768,"index":true,"similarity":"cosine","type":"dense_vector"}}},{"dense_vector":{"match":"*_1024_vec","mapping":{"dims":1024,"index":true,"similarity":"cosine","type":"dense_vector"}}},{"dense_vector":{"match":"*_1536_vec","mapping":{"dims":1536,"index":true,"similarity":"cosine","type":"dense_vector"}}},{"binary":{"match":"*_bin","mapping":{"type":"binary"}}}],"date_detection":true,"properties":{"content_ltks":{"type":"text","store":true,"analyzer":"whitespace"},"content_sm_ltks":{"type":"text","store":true,"analyzer":"whitespace"},"content_with_weight":{"type":"text","index":false,"store":true},"create_time":{"type":"date","store":true,"format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM-dd_HH:mm:ss"},"create_timestamp_flt":{"type":"float","store":true},"doc_id":{"type":"keyword","store":true,"similarity":"boolean"},"docnm_kwd":{"type":"keyword","store":true,"similarity":"boolean"},"id":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}},"img_id":{"type":"keyword","store":true,"similarity":"boolean"},"kb_id":{"type":"keyword","store":true,"similarity":"boolean"},"lat_lon":{"type":"geo_point","store":true},"page_num_int":{"type":"integer","store":true},"position_int":{"type":"integer","store":true},"q_1024_vec":{"type":"dense_vector","dims":1024,"index":true,"similarity":"cosine"},"title_sm_tks":{"type":"text","store":true,"analyzer":"whitespace","similarity":"scripted_sim"},"title_tks":{"type":"text","store":true,"analyzer":"whitespace","similarity":"scripted_sim"},"top_int":{"type":"integer","store":true}}}}}

Yes, the available_int field is not present. How can I fix it?

creasysee Mar 24, 2026
Author

Regarding available_int: Yes, it applies to Elasticsearch too. You can verify with:
GET /your_index/_search
{"query": {"match_all": {}}, "_source": ["available_int"]}

# docker exec -it docker-es01-1 curl -u elastic:infini_rag_flow -X GET "localhost:9200/ragflow_529ba87e26e011f1a7741bc0a6deaf0e/_search" -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}, "_source": ["available_int"]}'

{"took":13,"timed_out":false,"_shards":{"total":2,"successful":2,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"ragflow_529ba87e26e011f1a7741bc0a6deaf0e","_id":"de0a037de212bf2e","_score":1.0,"_source":{}}]}}

creasysee · 2026-03-24T10:30:49Z

I tried setting the available_int field to 1:

# docker exec -it docker-es01-1 curl -u elastic:infini_rag_flow -X POST "localhost:9200/ragflow_529ba87e26e011f1a7741bc0a6deaf0e/_update_by_query?conflicts=proceed" -H 'Content-Type: application/json' -d '{"script":{"source":"ctx._source.available_int = 1","lang": "painless"},"query":{"match_all":{}}}'

it returns:

{"took":189,"timed_out":false,"total":1,"updated":1,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}

but it doesn't help: the test 'money' -> 'The company provides a special bonus for every hardworking employee today.' still returns no results. I tried different values of Similarity threshold and Vector similarity weight, with no results.

1 reply

creasysee Mar 24, 2026
Author

JFYI, the available_int field is now present in the chunk with value 1. I sent the "_search" request again:

docker exec -it docker-es01-1 curl -u elastic:infini_rag_flow -X GET "localhost:9200/ragflow_529ba87e26e011f1a7741bc0a6deaf0e/_search" -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}, "_source": ["available_int"]}'

and it returns:

{"took":2,"timed_out":false,"_shards":{"total":2,"successful":2,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"ragflow_529ba87e26e011f1a7741bc0a6deaf0e","_id":"de0a037de212bf2e","_score":1.0,"_source":{"available_int":1}}]}}

creasysee · 2026-03-26T11:22:23Z

Additional information

I dumped the ranks variable for the word "bonus", which exists in the source document, and got this result (truncated):

ranks is: {'total': 1, 'chunks': [{'chunk_id': 'de0a037de212bf2e', 'content_ltks': 'the compani provid a special bonus for everi hardwork employe today', 'content_with_weight': '\nThe company provides a special bonus for every hardworking employee today.', 'doc_id': '199e9eb826eb11f1a69ed7ec3f1b043f', 'docnm_kwd': 'tst.txt', 'kb_id': '644158a226e111f1a69ed7ec3f1b043f', 'important_kwd': [], 'image_id': '', 'similarity': 0.5851307504086021, 'vector_similarity': 0.6171024999334356, 'term_similarity': 0.5714285720408163, 'vector': [-0.031907577998936176, 0.006040115468204021, -0.03622094094753265, 0.008580501936376096, -0.029973085038363937, -0.021988974697887898, 0.029422292299568654, -0.012456858810037375, 0.03984882012009621, -0.00394645044580102, 0.006941081583499909, -0.004493766394443811, -0.002766891544160899, 0.011725387768819928, 0.05148117672652006, -0.037967140972614284, 0.03689972450956702, -0.015277175139635803, -0.03150311640929431, -0.03667382970452309, -0....
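The three scores in this dump are consistent with a simple weighted blend of the vector and term similarities. The sketch below reproduces the reported `similarity` under the assumption that the vector similarity weight was 0.3 for this run; the weight is not stated in the dump, and 0.3 is simply the value that matches the numbers. The function name is hypothetical, and whether RAGFlow computes the score exactly this way is also an assumption.

```python
def hybrid_similarity(vector_sim, term_sim, vector_weight):
    """Linear blend: vector_weight * vector_sim + (1 - vector_weight) * term_sim."""
    return vector_weight * vector_sim + (1.0 - vector_weight) * term_sim

# Values copied from the dump above; 0.3 is an assumed weight.
score = hybrid_similarity(
    vector_sim=0.6171024999334356,
    term_sim=0.5714285720408163,
    vector_weight=0.3,
)
print(f"{score:.6f}")  # 0.585131
```

Under the same formula, a weight of 1.00 would make the blended score equal to the vector similarity alone, which is why a keyword match should not be required in that configuration.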

At the same time, I dumped the ranks variable for the word "money" and got this result:

ranks is: {'total': 0, 'chunks': [], 'doc_aggs': []}

Can anyone tell me where to change the code, or give any advice on how to implement this, so that I get the right result? I need the vector_similarity value for the word 'money', which was calculated above as 0.4941. Of course, I also need the other values: similarity=0, total, chunks, chunk_id, etc.

0 replies

creasysee · 2026-03-27T13:20:57Z

Additional information

I've found where the processing path diverges in the source code. The word 'bonus' gets as far as this line, whereas the word 'money' only gets as far as this line. This happens because sim_np has zero length. I'll investigate it further later.

Can anyone tell me where to change the code or give any advice on how to implement it in order to get the right result?

0 replies

creasysee · 2026-03-29T17:44:01Z

To make semantic (vector) search work, so that "money" finds the chunk where "bonus" is written, I completely removed the must block with query_string from the filters, specifically for the vector query.

What should a proper filter (JSON) look like?

Only the technical restrictions (kb_id and document status) should remain in the filter section, but not the question text itself.

"filter": {
    "bool": {
        /* The section "must" with query_string REMOVED */
        "filter": [
            {
                "terms": {
                    "kb_id": ["644158a226e111f1a69ed7ec3f1b043f"]
                }
            }, 
            {
                "bool": {
                    "must_not": [
                        {
                            "range": {
                                "available_int": { "lt": 1 }
                            }
                        }
                    ]
                }
            }
        ],
        "boost": 0.05
    }
}
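The same filter can be sketched as a plain-dict builder, which makes the difference between the hybrid and the vector-only case explicit. This is an illustration only: `build_knn_filter` and its parameters are hypothetical names, not part of RAGFlow, and the `query_string` clause is reduced to the options visible in this thread.

```python
def build_knn_filter(kb_ids, query_text=None, fields=None, vector_only=False):
    """Build the bool filter for a KNN query as a plain dict.

    The technical restrictions (kb_id, available_int) are always kept.
    The query_string "must" clause is added only when vector_only is False.
    """
    bool_body = {
        "filter": [
            {"terms": {"kb_id": list(kb_ids)}},
            {"bool": {"must_not": [{"range": {"available_int": {"lt": 1}}}]}},
        ],
        "boost": 0.05,
    }
    if not vector_only and query_text:
        bool_body["must"] = [{
            "query_string": {
                "query": query_text,
                "fields": list(fields or []),
                "type": "best_fields",
            }
        }]
    return {"bool": bool_body}

# Hypothetical field name, used only to show the hybrid variant.
hybrid = build_knn_filter(["644158a226e111f1a69ed7ec3f1b043f"], "money", ["content_ltks"])
vector = build_knn_filter(["644158a226e111f1a69ed7ec3f1b043f"], "money", vector_only=True)
print(sorted(hybrid["bool"]), sorted(vector["bool"]))
```

Keeping the `kb_id` and `available_int` restrictions in both variants means the vector-only query still respects dataset boundaries and disabled chunks.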

What I've changed in the Python code:

In the file where the request is generated (rag/nlp/search.py), I changed the second call to thread_pool_exec by adding isVectorOnly=1 at the end:

res = await thread_pool_exec(self.dataStore.search, src, highlightFields, filters, [matchText, matchDense, fusionExpr],
                                                    orderBy, offset, limit, idx_names, kb_ids,
                                                    rank_feature=rank_feature, isVectorOnly=1)

In the file where the request is executed (rag/utils/es_conn.py), I changed the signature of the search function by adding isVectorOnly: int = 0 at the end:

    def search(
            self, select_fields: list[str],
            highlight_fields: list[str],
            condition: dict,
            match_expressions: list[MatchExpr],
            order_by: OrderByExpr,
            offset: int,
            limit: int,
            index_names: str | list[str],
            knowledgebase_ids: list[str],
            agg_fields: list[str] | None = None,
            rank_feature: dict | None = None,
            isVectorOnly: int = 0
    ):

In the same file I changed this line from:

                bool_query.must.append(Q("query_string", fields=m.fields,
                                         type="best_fields", query=m.matching_text,
                                         minimum_should_match=minimum_should_match,
                                         boost=1))

to:

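                if isVectorOnly == 0:
                    # Hybrid search: keep the keyword clause as before
                    bool_query.must.append(Q("query_string", fields=m.fields,
                                             type="best_fields", query=m.matching_text,
                                             minimum_should_match=minimum_should_match,
                                             boost=1))
                else:
                    # Vector-only search: drop the keyword clause so it cannot
                    # pre-filter the KNN candidates
                    bool_query.must = []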

So I can now get correct (I think) vector search results, for example, for the word money:

I'm getting a Vector similarity of 52.35 while Term similarity is 0.00.

Thanks for your attention. Cheers.

0 replies
