-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Description
Describe the bug
OpenSearch 2.17 introduced bitmap filtering (🎉) in #14774, using the java implementation of RoaringBitmap.
Roaring bitmaps are designed to store sets of 32-bit unsigned integers. Java does not have a native unsigned integer type, but the Java implementation accepts (signed) int
across the implementation API surface, with integers still considered to be unsigned and ordered according to Integer.compareUnsigned
.
For example,
RoaringBitmap bitmap = new RoaringBitmap();
bitmap.add(-501577322);
bitmap.add(1278209963);
bitmap.iterator().forEachRemaining(System.out::println);
outputs
1278209963
-501577322
signed integer -501577322 in the Java implementation is treated as unsigned integer 3793389974
int signed = -501577322;
long unsigned = Integer.toUnsignedLong(signed);
Given this behavior, the current use of Roaring bitmaps for bitmap filtering is inconsistent when an input to filtering contains unsigned integer values greater than Integer.MAX_VALUE
(represented as negative integer values in Java implementation - I'm going to use negative values in the context of bitmaps and the Java implementation, and unsigned integer values greater than Integer.MAX_VALUE
, as interchangeable).
when a base64 encoded bitmap input to a terms query contains
- a single negative value, it works as expected
- multiple values where at least one is a positive integer value and one is a negative integer value,
- an exception is thrown on OpenSearch 2.17 and 2.18
- on 2.19+, only integers iterated in the bitmap up to the first negative value are considered when determining matches
Looking at the implementation, the behavior in 2.17 and 2.18 is related to the logic around bitmap iteration to build a PointInSetQuery
OpenSearch/server/src/main/java/org/opensearch/index/mapper/NumberFieldMapper.java
Lines 1511 to 1543 in ac38878
static PointInSetQuery bitmapIndexQuery(String field, RoaringBitmap bitmap) { | |
final BytesRef encoded = new BytesRef(new byte[Integer.BYTES]); | |
return new PointInSetQuery(field, 1, Integer.BYTES, new PointInSetQuery.Stream() { | |
final Iterator<Integer> iterator = bitmap.iterator(); | |
@Override | |
public BytesRef next() { | |
int value; | |
if (iterator.hasNext()) { | |
value = iterator.next(); | |
} else { | |
return null; | |
} | |
IntPoint.encodeDimension(value, encoded.bytes, 0); | |
return encoded; | |
} | |
}) { | |
@Override | |
public Query rewrite(IndexSearcher indexSearcher) throws IOException { | |
if (bitmap.isEmpty()) { | |
return new MatchNoDocsQuery(); | |
} | |
return super.rewrite(indexSearcher); | |
} | |
@Override | |
protected String toString(byte[] value) { | |
assert value.length == Integer.BYTES; | |
return Integer.toString(IntPoint.decodeDimension(value, 0)); | |
} | |
}; | |
} |
and how this is used in PointInSetQuery
ctor that compares values as they are iterated to the previous value, and throws if they are out of order:
The exception stack trace is
"[bitmap-demo/c2Uf79GuRYi0Go3oPI9i6g] QueryShardException[failed to create query: null]; nested: UnsupportedOperationException;
at org.opensearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:590)
at org.opensearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:573)
at org.opensearch.search.SearchService.parseSource(SearchService.java:1373)
at org.opensearch.search.SearchService.createContext(SearchService.java:1109)
at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:706)
at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:679)
at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: java.lang.UnsupportedOperationException
at org.apache.lucene.util.BytesRefBuilder.hashCode(BytesRefBuilder.java:172)
at java.base/java.lang.Object.toString(Object.java:257)
at java.base/java.lang.StringConcatHelper.stringOf(StringConcatHelper.java:467)
at org.apache.lucene.search.PointInSetQuery.<init>(PointInSetQuery.java:116)
at org.opensearch.index.mapper.NumberFieldMapper$NumberType$10.<init>(NumberFieldMapper.java:1528)
at org.opensearch.index.mapper.NumberFieldMapper$NumberType.bitmapIndexQuery(NumberFieldMapper.java:1513)
at org.opensearch.index.mapper.NumberFieldMapper$NumberType$6.bitmapQuery(NumberFieldMapper.java:891)
at org.opensearch.index.mapper.NumberFieldMapper$NumberFieldType.bitmapQuery(NumberFieldMapper.java:1626)
at org.opensearch.index.query.TermsQueryBuilder.doToQuery(TermsQueryBuilder.java:552)
at org.opensearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:117)
at org.opensearch.index.query.QueryShardContext.lambda$toQuery$3(QueryShardContext.java:574)
at org.opensearch.index.query.QueryShardContext.toQuery(QueryShardContext.java:586)\n\t... 16 more\n"
In #16936, the implementation was changed to introduce BitmapIndexQuery
which encapsulates iteration and matching:
OpenSearch/server/src/main/java/org/opensearch/index/mapper/NumberFieldMapper.java
Lines 935 to 951 in 99642ff
@Override | |
public Query bitmapQuery(String field, BytesArray bitmapArray, boolean isSearchable, boolean hasDocValues) { | |
RoaringBitmap bitmap = new RoaringBitmap(); | |
try { | |
bitmap.deserialize(ByteBuffer.wrap(bitmapArray.array())); | |
} catch (Exception e) { | |
throw new IllegalArgumentException("Failed to deserialize the bitmap.", e); | |
} | |
if (isSearchable && hasDocValues) { | |
return new IndexOrDocValuesQuery(new BitmapIndexQuery(field, bitmap), new BitmapDocValuesQuery(field, bitmap)); | |
} | |
if (isSearchable) { | |
return new BitmapIndexQuery(field, bitmap); | |
} | |
return new BitmapDocValuesQuery(field, bitmap); | |
} |
Of particular note, the matches
comparison now simply stops if a packed value (integer bytes) comparison to the query point determines that the query point is after the index point:
OpenSearch/server/src/main/java/org/opensearch/search/query/BitmapIndexQuery.java
Lines 187 to 202 in 370cd8c
private boolean matches(byte[] packedValue) { | |
while (nextQueryPoint != null) { | |
int cmp = comparator.compare(nextQueryPoint.bytes, nextQueryPoint.offset, packedValue, 0); | |
if (cmp == 0) { | |
return true; | |
} else if (cmp < 0) { | |
// Query point is before index point, so we move to next query point | |
iterator.advance(packedValue); | |
nextQueryPoint = iterator.next(); | |
} else { | |
// Query point is after index point, so we don't collect and we return: | |
break; | |
} | |
} | |
return false; | |
} |
which they will be for negative integer values in the bitmap because it is iterated in Integer.compareUnsigned
order.
Related component
Search:Query Capabilities
To Reproduce
On OpenSearch 2.19
PUT /bitmap-demo
{
"mappings": {
"properties": {
"bitmap_values": {
"type": "integer"
}
}
}
}
PUT /bitmap-demo/_bulk
{ "index": { "_id": "1" } }
{ "bitmap_values": [388265405, -501577322, -2080915710] }
{ "index": { "_id": "2" } }
{ "bitmap_values": [-501577322] }
{ "index": { "_id": "3" } }
{ "bitmap_values": [-2080915710] }
{ "index": { "_id": "4" } }
{ "bitmap_values": [1278209963] }
# bitmap of 388265405, -501577322, 1278209963
GET /bitmap-demo/_search
{
"query": {
"terms": {
"bitmap_values": ["OjAAAAMAAAAkFwAAL0wAABriAAAgAAAAIgAAACQAAAC9davvlok="],
"value_type": "bitmap"
}
}
}
returns
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "bitmap-demo",
"_id": "1",
"_score": 1,
"_source": {
"bitmap_values": [
388265405,
-501577322,
-2080915710
]
}
},
{
"_index": "bitmap-demo",
"_id": "4",
"_score": 1,
"_source": {
"bitmap_values": [
1278209963
]
}
}
]
}
}
Note that doc with id 2 is not returned, because -501577322
is returned from the bitmap iterator after 388265405
Another query
# bitmap of -501577322
GET /bitmap-demo/_search
{
"query": {
"terms": {
"bitmap_values": ["OjAAAAEAAAAa4gAAEAAAAJaJ"],
"value_type": "bitmap"
}
}
}
returns
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "bitmap-demo",
"_id": "1",
"_score": 1,
"_source": {
"bitmap_values": [
388265405,
-501577322,
-2080915710
]
}
},
{
"_index": "bitmap-demo",
"_id": "2",
"_score": 1,
"_source": {
"bitmap_values": [
-501577322
]
}
}
]
}
}
Note that docs with ids 1 and 2 are returned that contain value -501577322
in bitmap values.
Expected behavior
Mandatory: Better document the current behavior
Explicitly call out that bitmap filtering works with unsigned 32-bit integers, and how this interacts with the nuances of the java implementation of roaring bitmaps, so that when targeting an integer
field, usable values fall in the range [0 ... 231 -1].
I see a few options to address the current behaviour:
Option 1: Throw an exception if the bitmap contains unsigned integer values greater than Integer.MAX_VALUE
Document that bitmap filtering with bitmap inputs including values that would be represented as negative integer values in Java RoaringBitmap, or targeting fields with negative integer values is not supported, and throw an exception if found in the input, with a clear error message.
Option 2: Support unsigned integer values greater than Integer.MAX_VALUE
If targeting an integer
field, indicate that unsigned integer values greater than Integer.MAX_VALUE
can be supported for bitmap filtering by indexing negative integer values. Internally, iterate bitmaps with ordering based on Integer::compareTo
to support this.
This option could be potentially confusing for users creating roaring bitmaps using implementations that do use unsigned integers, but does provide a larger usable range, and could help support other uses e.g. hashing strings with murmur3_32 to produce bitmaps.
Option 3: Introduce an unsigned_int
type and support bitmap filtering on full range
unsigned_long
is supported, it may be a natural extension to also support unsigned_int
and support bitmap filtering on it.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status