Skip to content

[BUG] Bitmap filtering inconsistent behaviour for unsigned integer values greater than Integer.MAX_VALUE #18422

@russcam

Description

@russcam

Describe the bug

OpenSearch 2.17 introduced bitmap filtering (🎉) in #14774, using the java implementation of RoaringBitmap.

Roaring bitmaps are designed to store sets of 32-bit unsigned integers. Java does not have a native unsigned integer type, but the Java implementation accepts (signed) int across the implementation API surface, with integers still considered to be unsigned and ordered according to Integer.compareUnsigned.

For example,

RoaringBitmap bitmap = new RoaringBitmap();
bitmap.add(-501577322);
bitmap.add(1278209963);
bitmap.iterator().forEachRemaining(System.out::println);

outputs

1278209963
-501577322

signed integer -501577322 in the Java implementation is treated as unsigned integer 3793389974

int  signed = -501577322;  
long unsigned = Integer.toUnsignedLong(signed);

Given this behavior, the current use of Roaring bitmaps for bitmap filtering is inconsistent when an input to filtering contains unsigned integer values greater than Integer.MAX_VALUE (represented as negative integer values in Java implementation - I'm going to use negative values in the context of bitmaps and the Java implementation, and unsigned integer values greater than Integer.MAX_VALUE, as interchangeable).

when a base64 encoded bitmap input to a terms query contains

  • a single negative value, it works as expected
  • multiple values where at least one is a positive integer value and one is a negative integer value,
    • an exception is thrown on OpenSearch 2.17 and 2.18
    • on 2.19+, only integers iterated in the bitmap up to the first negative value are considered when determining matches

Looking at the implementation, the behavior in 2.17 and 2.18 is related to the logic around bitmap iteration to build a PointInSetQuery

static PointInSetQuery bitmapIndexQuery(String field, RoaringBitmap bitmap) {
final BytesRef encoded = new BytesRef(new byte[Integer.BYTES]);
return new PointInSetQuery(field, 1, Integer.BYTES, new PointInSetQuery.Stream() {
final Iterator<Integer> iterator = bitmap.iterator();
@Override
public BytesRef next() {
int value;
if (iterator.hasNext()) {
value = iterator.next();
} else {
return null;
}
IntPoint.encodeDimension(value, encoded.bytes, 0);
return encoded;
}
}) {
@Override
public Query rewrite(IndexSearcher indexSearcher) throws IOException {
if (bitmap.isEmpty()) {
return new MatchNoDocsQuery();
}
return super.rewrite(indexSearcher);
}
@Override
protected String toString(byte[] value) {
assert value.length == Integer.BYTES;
return Integer.toString(IntPoint.decodeDimension(value, 0));
}
};
}

and how this is used in PointInSetQuery ctor that compares values as they are iterated to the previous value, and throws if they are out of order:

https://github.com/apache/lucene/blob/0c087dfdd10e0f6f3f6faecc6af4415e671a9e69/lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java#L95-L119

The exception stack trace is

"[bitmap-demo/c2Uf79GuRYi0Go3oPI9i6g] QueryShardException[failed to create query: null]; nested: UnsupportedOperationException;
at org.opensearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:590)
at org.opensearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:573)
at org.opensearch.search.SearchService.parseSource(SearchService.java:1373)
at org.opensearch.search.SearchService.createContext(SearchService.java:1109)
at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:706)
at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:679)
at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: java.lang.UnsupportedOperationException
at org.apache.lucene.util.BytesRefBuilder.hashCode(BytesRefBuilder.java:172)
at java.base/java.lang.Object.toString(Object.java:257)
at java.base/java.lang.StringConcatHelper.stringOf(StringConcatHelper.java:467)
at org.apache.lucene.search.PointInSetQuery.<init>(PointInSetQuery.java:116)
at org.opensearch.index.mapper.NumberFieldMapper$NumberType$10.<init>(NumberFieldMapper.java:1528)
at org.opensearch.index.mapper.NumberFieldMapper$NumberType.bitmapIndexQuery(NumberFieldMapper.java:1513)
at org.opensearch.index.mapper.NumberFieldMapper$NumberType$6.bitmapQuery(NumberFieldMapper.java:891)
at org.opensearch.index.mapper.NumberFieldMapper$NumberFieldType.bitmapQuery(NumberFieldMapper.java:1626)
at org.opensearch.index.query.TermsQueryBuilder.doToQuery(TermsQueryBuilder.java:552)
at org.opensearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:117)
at org.opensearch.index.query.QueryShardContext.lambda$toQuery$3(QueryShardContext.java:574)
at org.opensearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:586)\n\t... 16 more\n"

In #16936, the implementation was changed to introduce BitmapIndexQuery which encapsulates iteration and matching:

@Override
public Query bitmapQuery(String field, BytesArray bitmapArray, boolean isSearchable, boolean hasDocValues) {
RoaringBitmap bitmap = new RoaringBitmap();
try {
bitmap.deserialize(ByteBuffer.wrap(bitmapArray.array()));
} catch (Exception e) {
throw new IllegalArgumentException("Failed to deserialize the bitmap.", e);
}
if (isSearchable && hasDocValues) {
return new IndexOrDocValuesQuery(new BitmapIndexQuery(field, bitmap), new BitmapDocValuesQuery(field, bitmap));
}
if (isSearchable) {
return new BitmapIndexQuery(field, bitmap);
}
return new BitmapDocValuesQuery(field, bitmap);
}

Of particular note, the matches comparison now simply stops if a packed value (integer bytes) comparison to the query point determines that the query point is after the index point:

private boolean matches(byte[] packedValue) {
while (nextQueryPoint != null) {
int cmp = comparator.compare(nextQueryPoint.bytes, nextQueryPoint.offset, packedValue, 0);
if (cmp == 0) {
return true;
} else if (cmp < 0) {
// Query point is before index point, so we move to next query point
iterator.advance(packedValue);
nextQueryPoint = iterator.next();
} else {
// Query point is after index point, so we don't collect and we return:
break;
}
}
return false;
}

which they will be for negative integer values in the bitmap because it is iterated in Integer.compareUnsigned order.

Related component

Search:Query Capabilities

To Reproduce

On OpenSearch 2.19

PUT /bitmap-demo
{
  "mappings": {
    "properties": {
      "bitmap_values": {
        "type": "integer"
      }
    }
  }
}


PUT /bitmap-demo/_bulk
{ "index": { "_id": "1" } }
{ "bitmap_values": [388265405, -501577322, -2080915710] }
{ "index": { "_id": "2" } }
{ "bitmap_values": [-501577322] }
{ "index": { "_id": "3" } }
{ "bitmap_values": [-2080915710] }
{ "index": { "_id": "4" } }
{ "bitmap_values": [1278209963] }

# bitmap of 388265405, -501577322, 1278209963
GET /bitmap-demo/_search
{
  "query": {
    "terms": {
      "bitmap_values": ["OjAAAAMAAAAkFwAAL0wAABriAAAgAAAAIgAAACQAAAC9davvlok="],
      "value_type": "bitmap"
    }
  }
}

returns

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "bitmap-demo",
        "_id": "1",
        "_score": 1,
        "_source": {
          "bitmap_values": [
            388265405,
            -501577322,
            -2080915710
          ]
        }
      },
      {
        "_index": "bitmap-demo",
        "_id": "4",
        "_score": 1,
        "_source": {
          "bitmap_values": [
            1278209963
          ]
        }
      }
    ]
  }
}

Note that doc with id 2 is not returned, because -501577322 is returned from the bitmap iterator after 388265405

Another query

# bitmap of -501577322
GET /bitmap-demo/_search
{
  "query": {
    "terms": {
      "bitmap_values": ["OjAAAAEAAAAa4gAAEAAAAJaJ"],
      "value_type": "bitmap"
    }
  }
}

returns

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "bitmap-demo",
        "_id": "1",
        "_score": 1,
        "_source": {
          "bitmap_values": [
            388265405,
            -501577322,
            -2080915710
          ]
        }
      },
      {
        "_index": "bitmap-demo",
        "_id": "2",
        "_score": 1,
        "_source": {
          "bitmap_values": [
            -501577322
          ]
        }
      }
    ]
  }
}

Note that docs with ids 1 and 2 are returned that contain value -501577322 in bitmap values.

Expected behavior

Mandatory: Better document the current behavior

Explicitly call out that bitmap filtering works with unsigned 32-bit integers, and how this interacts with the nuances of the java implementation of roaring bitmaps, so that when targeting an integer field, usable values fall in the range [0 ... 231 -1].


I see a few options to address the current behaviour:

Option 1: Throw an exception if the bitmap contains unsigned integer values greater than Integer.MAX_VALUE

Document that bitmap filtering with bitmap inputs including values that would be represented as negative integer values in Java RoaringBitmap, or targeting fields with negative integer values is not supported, and throw an exception if found in the input, with a clear error message.

Option 2: Support unsigned integer values greater than Integer.MAX_VALUE

If targeting an integer field, indicate that unsigned integer values greater than Integer.MAX_VALUE can be supported for bitmap filtering by indexing negative integer values. Internally, iterate bitmaps with ordering based on Integer::compareTo to support this.

This option could be potentially confusing for users creating roaring bitmaps using implementations that do use unsigned integers, but does provide a larger usable range, and could help support other uses e.g. hashing strings with murmur3_32 to produce bitmaps.

Option 3: Introduce an unsigned_int type and support bitmap filtering on full range

unsigned_long is supported, it may be a natural extension to also support unsigned_int and support bitmap filtering on it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Search:Query CapabilitiesbugSomething isn't workingdiscussIssues intended to help drive brainstorming and decision makinglucene

    Type

    No type

    Projects

    Status

    🆕 New

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions