[BUG] Fix duplicate or old message skipping logic in pull-based ingestion when versioning is not used #19591

@varunbharadwaj

Description

Describe the bug

This bug applies only when versioning is not enabled and the consumer is rewound to an earlier point in the stream. Versioning is recommended, but it is still possible to ingest without versions today.

If versioning is not used in pull-based ingestion and the streaming source pointer is rewound to an earlier position, the latest available message for a document can be skipped.

  1. Assume multiple versions of a document are present in the stream, and all of them have been persisted without versions.
  2. If the consumer is explicitly rewound to reprocess all the messages from step 1, messages determined to be duplicates (previously processed) are skipped.
  3. The problem is that the poller is only aware of the latest offset for a given document. As a result, older messages for that document are reprocessed while its latest message is skipped.
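
The steps above can be modeled with a small simulation. This is a minimal sketch, not the OpenSearch implementation: the `replay` function, the `persisted_offsets` set, and the tuple layout of the messages are all hypothetical names chosen for illustration. It assumes the duplicate check is a lookup against the set of known offsets, which holds only the latest offset per document.

```python
# Hypothetical model of the behavior described above; names are illustrative,
# not OpenSearch APIs.

def replay(stream, index, persisted_offsets, start_offset):
    """Reprocess `stream` from `start_offset`, skipping offsets treated as duplicates."""
    for offset, doc_id, body in stream:
        if offset < start_offset:
            continue
        if offset in persisted_offsets:
            # Considered already processed, so the message is skipped.
            continue
        index[doc_id] = body  # no version check: last write wins


# Three updates to the same document, in stream order: (offset, doc_id, body).
stream = [(0, "doc1", "v1"), (1, "doc1", "v2"), (2, "doc1", "v3")]

# State after the initial ingestion: the index holds only the latest body,
# and only the latest offset per document is known.
index = {"doc1": "v3"}
persisted_offsets = {2}

# Rewind to offset 0 with no new messages published afterwards.
replay(stream, index, persisted_offsets, start_offset=0)

print(index["doc1"])  # prints "v2": the latest message (offset 2) was skipped
```

In this model, offsets 0 and 1 are not recognized as duplicates and overwrite the document, while offset 2 is skipped, so the document regresses to an older state.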

More details will follow.
The fix: Remove the persisted pointer concept and rely on versioning to ensure a consistent view of documents on rewind. Pull-based ingestion will provide an at-least-once processing guarantee when versioning is not used.
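
The intended behavior after the fix can be sketched in the same simplified style. This is an illustration under stated assumptions, not the actual patch: `replay_unversioned` and `replay_versioned` are hypothetical names, messages are modeled as `(offset, doc_id, version, body)` tuples, and the versioned path assumes a write is applied only when its version is greater than the one already indexed.

```python
# Hypothetical model; names are illustrative, not OpenSearch APIs.

def replay_unversioned(stream, index, start_offset):
    """At-least-once replay: apply every message from start_offset, last write wins."""
    for offset, doc_id, _version, body in stream:
        if offset >= start_offset:
            index[doc_id] = body


def replay_versioned(stream, index, start_offset):
    """Apply a message only if its version is newer than the indexed one.

    Returns the document body visible after each replayed message.
    """
    history = []
    for offset, doc_id, version, body in stream:
        if offset < start_offset:
            continue
        current = index.get(doc_id)
        if current is None or version > current[0]:
            index[doc_id] = (version, body)
        history.append(index[doc_id][1])
    return history


# Three updates to the same document: (offset, doc_id, version, body).
stream = [(0, "doc1", 1, "v1"), (1, "doc1", 2, "v2"), (2, "doc1", 3, "v3")]

# Without versioning: every message is reprocessed and the final state is v3.
unversioned_index = {"doc1": "v3"}
replay_unversioned(stream, unversioned_index, start_offset=0)

# With versioning: older messages are rejected, so the document never regresses.
versioned_index = {"doc1": (3, "v3")}
history = replay_versioned(stream, versioned_index, start_offset=0)
```

In this sketch the unversioned replay may briefly expose older bodies while it catches up, but it ends on the latest message. The versioned replay keeps `v3` visible throughout, which is the consistent view the fix relies on versioning to provide.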

Related component

Indexing

To Reproduce

Create a pull-based index without versioning. Publish multiple updates for a document. Rewind to an earlier offset and ensure that no new versions of the document are published after the rewind. The latest available message known to the shard is skipped.

Expected behavior

The index should reflect the latest version of a document seen in the stream, without skipping valid messages, even if versioning is not used.

Additional Details

Plugins
ingestion-kafka, ingestion-kinesis

Metadata

Labels

Indexing, bug
