Retrieval Test does not consistently return chunk with highest similarity score. #12601

TimD3 · 2026-01-13T16:03:24Z


I'm playing around with Ragflow and noticed a behaviour that appears odd to me. I have set up a dataset from crawled webpages, converted to markdown format, and am using a basic configuration: mostly everything default plus a few auto keywords and auto questions. When doing retrieval testing, there is one query where I had trouble getting it to retrieve the correct chunk.

I wanted to see what the similarity for that chunk is, thus I disabled all files except for the file that contained the target chunk and redid the retrieval test. Turns out the similarity was actually quite high (69.24). However when I reactivate all files in the dataset and perform the same retrieval test, not only does it not find the correct chunk, all the chunks it does show me have similarity scores lower than 69.24, ranging from 57 to 50 or so.

I can get similar behaviour by increasing the similarity threshold value. E.g. at 0.2 it only finds the incorrect chunks with similarity scores lower than 60 but if I set similarity threshold to 0.6 it does manage to find the correct chunk and only this one.

This is counterintuitive to me. Is this expected behaviour? Does Ragflow somehow not always search through all chunks in a dataset, or is there something else I am missing? I do not use any advanced features such as RAPTOR, knowledge graphs, a reranking model, etc.

@TimD3 · 2026-01-13T16:08:23Z

dosubot[bot]
bot Jan 13, 2026

Hi @TimD3! I'm Dosu and I’m helping the ragflow team.

This is expected behavior in Ragflow and comes down to how retrieval scoring and filtering work. Ragflow always searches all enabled chunks, but the chunks returned depend on the hybrid similarity score, which combines weighted keyword similarity, vector similarity, and a rank feature (like pagerank or tag features). The default weights and the similarity threshold can cause some chunks to be excluded even if they have a high raw similarity in isolation.

When you enable only one file, the system scores just those chunks, so your target chunk stands out and passes the threshold. When you enable all files, the retrieval process ranks all chunks and only returns those above the threshold or within the top-N (default is 8 or 10). If your target chunk's hybrid score is just below the threshold or not in the top-N, it won't appear, even if its raw similarity is high in isolation. Small changes in threshold or weights can cause borderline chunks to be included or excluded, which explains the behavior you see. Also, the rank feature can lower the final score for some chunks, so a chunk with a high raw similarity might still lose out if its rank feature is low or if the weights favor another component (source, source).
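As a rough illustration of the weighted blend described above, here is a minimal sketch. It is not Ragflow's actual code: the function name `hybrid_score`, the parameter `vector_weight`, and its default of 0.3 are assumptions made for the example, and the rank feature is left out entirely.

```python
def hybrid_score(keyword_sim: float, vector_sim: float, vector_weight: float = 0.3) -> float:
    """Blend keyword and vector similarity (both assumed to lie in [0, 1]).

    vector_weight is a hypothetical parameter; 0.3 is an assumed default.
    """
    if not 0.0 <= vector_weight <= 1.0:
        raise ValueError("vector_weight must be between 0 and 1")
    return (1.0 - vector_weight) * keyword_sim + vector_weight * vector_sim


# A chunk with strong vector similarity but little keyword overlap...
semantic_match = hybrid_score(keyword_sim=0.20, vector_sim=0.90)
# ...can end up below a chunk with the opposite profile when keywords dominate.
keyword_match = hybrid_score(keyword_sim=0.70, vector_sim=0.40)
```

With these made-up inputs the keyword-heavy chunk wins, which is the kind of reordering a change in weights can produce.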

To get more consistent results, try lowering the similarity threshold, increasing the top-N parameter, or adjusting the keyword/vector weights. Make sure your preprocessing is consistent between queries and stored chunks. If you want to guarantee retrieval of a specific chunk, set a higher threshold or filter more aggressively (source, source).

There was also a recent bug fix in the retrieval component (merged 2025-12-26), so if you're running an older version, consider updating (source).

To reply, just mention @dosu.


2 replies

TimD3 Jan 13, 2026
Author

I don't think this answer is helpful. I'm not using pagerank, so this should not affect the test, and my target chunk shows the single highest similarity value, so how can it be filtered away? On the contrary, it only starts showing up when I set the similarity threshold value high enough. This unfortunately makes the results on many other queries worse. I would just like to obtain the expected behaviour: the chunk with the highest similarity score, regardless of how exactly that value is calculated between keyword and vector similarity, should also show up as the top one in the retrieval test.

dosubot[bot] bot Jan 13, 2026

This behavior comes from how Ragflow applies similarity thresholding and top-N selection during retrieval. Even if your target chunk has the highest similarity score, it can be filtered out if its score is just below the configured similarity threshold, or if it doesn't make it into the top-N results after filtering and sorting. The retrieval logic first filters out all chunks below the threshold, then sorts the remaining ones by similarity, and finally returns only the top-N (default is 8 or 10) (see code).
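The filter, sort, and truncate sequence described above can be sketched as follows. This is a simplified model and not the Ragflow source: `select_chunks` and the dictionary layout are hypothetical, and the defaults (0.2 and 8) are taken from values mentioned in this thread.

```python
def select_chunks(chunks, similarity_threshold=0.2, top_n=8):
    """Drop chunks below the threshold, sort by similarity, keep the first top_n."""
    kept = [c for c in chunks if c["similarity"] >= similarity_threshold]
    kept.sort(key=lambda c: c["similarity"], reverse=True)
    return kept[:top_n]


# Made-up scores on a 0..1 scale, loosely mirroring the numbers reported above.
candidates = [
    {"id": "b", "similarity": 0.57},
    {"id": "a", "similarity": 0.69},
    {"id": "d", "similarity": 0.10},
    {"id": "c", "similarity": 0.50},
]
loose = select_chunks(candidates, similarity_threshold=0.2, top_n=2)
strict = select_chunks(candidates, similarity_threshold=0.6, top_n=2)
```

Note that under this model the highest-scoring chunk in `candidates` is returned for any threshold it passes and any `top_n` of at least 1, so a top-scoring chunk that goes missing would have to be dropped before it ever reaches this step.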

If you set the threshold too low, you may get many irrelevant chunks; if you set it too high, only the very top chunk(s) will appear, which can hurt recall for other queries. There isn't a built-in mode to always guarantee the single highest-scoring chunk is returned regardless of threshold—Ragflow expects you to tune both the similarity_threshold and top_n parameters for your use case. To maximize the chance that the highest similarity chunk is always included, set top_n high enough and similarity_threshold low enough to include all plausible candidates, then sort by similarity in your application logic if you need a strict "top-1" result.

If you want to guarantee that the chunk with the highest similarity score is always returned, you may need to set similarity_threshold very low (or zero) and set top_n to at least 1, then select the top result by similarity score yourself. This is a known limitation of the current design (discussion).

If you want more granular control, there is community recognition that separating the threshold for search and rerank phases would help, but this is not yet standard in the main pipeline (discussion).

To reply, just mention @dosu.


sarabhan · 2026-01-14T04:45:45Z


hi there @TimD3! while this does seem pretty weird at first, it’s actually pretty normal.
the thing is, ragflow doesn't rely solely on the similarity score. it uses a mix of ANN and cosine similarity.

essentially this is how ragflow works:

The ANN part

ragflow first calculates the distance metric between vector embeddings of the query and the data chunks.
then it shortlists the chunks based on this distance metric

The Similarity part

next, the cosine similarity is calculated only for those shortlisted chunks.
as per what similarity threshold you provide, it finally filters which chunks to refer to answer your query

when you remove all other chunks from your dataset, you are essentially reducing the initial search space for the ann algorithm, making the correct chunk more 'reachable'. this time, there is only 1 data chunk to calculate the distance metric w.r.t. the query. ann is now forced to shortlist the only available chunk for data retrieval, which is why you get the correct answer.

like you mentioned, all other chunks have similarity < 0.6. hence when you increase the similarity threshold to 0.6, the first shortlisted set of chunks is no longer eligible for data retrieval. the ann moves on to shortlist a new set, which this time includes the correct chunk.
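To make the shortlist-then-score idea above concrete, here is a toy two-stage search. It only illustrates the explanation given in this reply and is not Ragflow's implementation: `coarse_score` is a deliberately crude stand-in for an approximate index, and every name and vector is made up. The sketch does not model any re-search after thresholding.

```python
import math


def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_score(query, vec):
    """Crude stand-in for an approximate index: looks at one dimension only."""
    return query[0] * vec[0]


def two_stage_search(query, chunks, shortlist_size, threshold):
    # Stage 1: shortlist by the cheap, approximate score.
    ranked = sorted(chunks, key=lambda c: coarse_score(query, c["vec"]), reverse=True)
    shortlist = ranked[:shortlist_size]
    # Stage 2: exact similarity for shortlisted chunks only, then the threshold.
    scored = [(c["id"], cosine(query, c["vec"])) for c in shortlist]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


QUERY = [0.2, 1.0, 1.0]
CHUNKS = [
    {"id": "target", "vec": [0.2, 1.0, 1.0]},  # best exact match, weak coarse score
    {"id": "d1", "vec": [1.0, 0.5, 0.0]},
    {"id": "d2", "vec": [1.0, 0.0, 0.5]},
]

full = two_stage_search(QUERY, CHUNKS, shortlist_size=2, threshold=0.2)
alone = two_stage_search(QUERY, CHUNKS[:1], shortlist_size=2, threshold=0.2)
wide = two_stage_search(QUERY, CHUNKS, shortlist_size=3, threshold=0.2)
```

In this toy setup `full` never contains the target because it is cut in stage 1, while searching the target's file alone (`alone`) or widening the shortlist (`wide`) brings it back.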

although increasing the similarity threshold did give you the required answer for 1 query, it worsens the overall performance for obvious reasons.

remember that all LLMs and RAG algorithms don't really 'understand' your query. at the end of the day, they only represent words as vectors, calculate the similarity of those vectors, and get you the answer. if your query isn't specific enough or lacks necessary keywords, you will always get vague answers.

do let me know if you have any more doubts!

2 replies

TimD3 Jan 14, 2026
Author

Thanks a lot for your answer! By ANN I assume you mean approximate nearest neighbor? So the problem is that the search never considered my chunk in the first place. That changes either when the dataset to search over is reduced or when the similarity threshold is set high enough, because once too many candidates get filtered away, the ANN restarts and searches more chunks that it has not previously considered. Is that correct?

Do you know whether it's possible to configure the search algorithm further, or to switch to an exact nearest neighbor search or something like this?

sarabhan Jan 16, 2026

Yes, ann here means approximate nearest neighbor, and you’ve got the core idea right: the chunk likely wasn’t considered at all in the initial search, so its similarity never got computed. both reducing the dataset and increasing the threshold effectively changed how the search explores the space, which made previously overlooked chunks reachable. that’s why the correct one suddenly showed up.

i should mention though that i haven’t worked very deeply with ragflow internals yet, so i’m not fully sure how much of this behavior is configurable from the outside. It might be worth checking the backend or docs to see what parameters are exposed for tuning.

if ragflow allows it, one thing you could experiment with is increasing the number of candidates considered for similarity scoring (like a knn style setup). that said, it’s probably a tradeoff; higher k can help with cases like this, but may work inefficiently for broader or more generic queries like 'give me a summary of the entire dataset'. Finding a good balance would likely depend on the dataset and query patterns.
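For comparison, an exact (brute-force) search scores every chunk, so it cannot skip the best match, at the cost of one similarity computation per chunk per query. The sketch below is generic Python and says nothing about what Ragflow exposes: `exact_top_k` and the sample vectors are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query, chunks, k):
    """Score every chunk and return the k best (chunk_id, score) pairs, best first."""
    scored = ((cosine(query, vec), chunk_id) for chunk_id, vec in chunks.items())
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]


# Made-up 2-D embeddings keyed by a hypothetical chunk id.
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}
best_two = exact_top_k([1.0, 0.0], embeddings, k=2)
```

Raising `k` here only changes how many results come back. In an approximate setup, the analogous knob is the number of candidates examined before exact scoring.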

it would be interesting to hear if you find anything that helped improve the performance of your rag pipeline.
