Skip to content

Secondary Indexing

Eric Robertson edited this page Jun 25, 2015 · 24 revisions

=Preliminary

Examples in this document refer to a feature type with the following attributes

ATTRIBUTE NAME

ATTRIBUTE TYPE

LOCATION

Geometry

PERSONS

Numeric (Long)

RECORD_DATE

Date

INCOME_CATEGORY

String (specific set of values)

AFFILIATION

String (mixed text)

=Problem Description

Find the optimal set of one ore more indices to answer a query. A query requests a data store to find features matching the query’s search criteria. A feature is a single instance of data (e.g record, entity, etc.).

Query Syntax: ECQL http://docs.geoserver.org/stable/en/user/filter/ecql_reference.html Query Example: INTERSECTS(LOCATION, "Polygon-109.0448 37.0004, -102.0424 36.9949, -102.0534 41.0006, -109.0489 40.9996, -109.0448 37.0004") AND RECORD_DATE DURING 2006-11-30T00:30:00Z/2006-12-31T00:30:00Z AND PERSONS BETWEEN 1000000 AND 3000000 AND INCOME_CATEGORY in ('10-15', '15-22')

Possible Indices

  1. Primary Geospatial Index on LOCATION

  2. Primary Geospatial-Temporal Index on LOCATION and RECORD_DATE

  3. Secondary Time or Numeric Index on RECORD_DATE

  4. Secondary Text Index on INCOME_CATEGORY

  5. Secondary Numeric Index on PERSONS

=Type of Secondary Indices

  1. Text

    1. Optimized for Regular Expression

    2. Optimized for equality

  2. Numeric

    1. Range or Single Value

    2. Numeric, Floating Point, Big Integer/Decimal?

  3. Date/Times

    1. Treat as Numeric (GMT milliseconds from epoch) or Alternate Representation (e.g. YYYYMMDDHHMISSMMMMM)

      1. Range or Single Value

  4. Geometry (Alternate from primary)

==Role of the Primary Index

The primary index maintains a copy of each feature. A primary index may have more than copy of feature. Each copy is given its own unique index identifier (ROW ID). A feature may be stored in multiple primary indices.

For GeoWave, the primary index is range based index (e.g. bounding rectangle, bounding time).

.Compound Indices

The generalization of an index should include the ability to create compound keys. Geo-temporal is an example. There may be some limitations. Numeric indices can be managed using techniques for primary indices (ie. Space Filling Curve). Text may be limited to equality. The design should not impede further future development.

Numeric ranges derived from two attributes (e.g. start and end) are considered a type of compound key for this discussion. For the moment, this document will not derive a distinction from a single attribute representative of a range, such as start+end or start+duration.

=Persistence Design

The design in this section assumes Accumulo.

A secondary index is a key derived from an attribute value (or attributes in a compound scenario) associated with a primary index identifier. For example, attribute PERSONS with value 29383 is associated with a list of primary index identifiers of those features that have a PERSONS attribute value matching exactly 29383 for a specific primary index.

==Tables

A each primary index is a separate table. A primary index table is named according to name space and index type. For example, geospatial and geo-temporal indices are separate tables.

Does each type of secondary index reside in its own table?

Yes. The approach is consistent with primary index separation. It can be potentially easier manage multiple tables from the perspective of writes and backups.

The alternative is to maintain a single table using the column family with locality configured to isolate each type of secondary index. This approach reduces the growth of secondary index tables, one per each type of secondary index. It is unclear if there are any tangible performance concerns, as locality groups organize column families in separate files.

What is the basic structure of a table?

ROW ID

CF

CQ

VIS

VALUE

Attribute Value

Attribute Type

Attribute Name

Attribute Visibility

Primary Index ROW ID

Which primary index ROW ID is referenced in a secondary index for features indexed in more than one primary index?

In consistency with approaches used for index and statistic selection, as communicated by the data adapter, the default option is to use the index with the least number if dimensions (e.g. geo-spatial over geo-temporal). In cases where the decision requires a change or there is a conflict (two primaries with same number of dimensions), the adapter provides additional guidance.

Should secondary indices reference primary index ROW IDs for all primary indices?

In this solution, only one ID is required to look up a feature. Extra IDs are redundant, requiring more space and logic. The chosen approach does prevent intersecting joins of lookups between secondary and primary indices. This decision is discussed next.

A primary index may have more than one ROW ID per feature. Does the secondary index reference all ROW IDs?

The secondary index references only one selected ROW ID. The index references the first ROW ID, lexicographically.

Concerns on efficiently managing the a single table with respect to writes, backups and table level functions D secondary index - how we persist them - (simplest case is what we do with feature id now - basically another table where the row id is the feature id, and the value is the row-id in the main geowave table) - how we index has an impact on the ability to accelerate some conditions (LIKE/ILIKE (case sensitive/insensitive prefix matching). We don’t have to necessarily support every possible condition - just need to know what we are and aren’t supporting.

statistics - information about cardinality (number of unique values), distribution (histogram), ranges (min/max), etc. for each indexed attribute. - how statistics are set, managed, modified, etc. - making sure the whole workflow "works" - also, exposed in geoserver? - Eric has already done a good bit of this

query planning - I think what we want is basically a cost based optimizer - Basically this is going to take the ECQL portion of a query and determine the best use of indexes to satisfy the query - http://docs.geoserver.org/latest/en/user/filter/ecql_reference.html - no joins for us to optimize across, so problem set is much simpler - typical decision might be something like "where carmake = "toyota" and carcolor = "white""  —  What’s the cardinality of carmake - 100 out of 200 (total count) features? That means, on average, a particular value migh only comprise 2 rows. (if we have histogram values we can do better here). So say a cost of 2.  — What’s the cardinality of carcolor = 2 out of 200? Typical value might still cost us 100 rows? Might be a threshold where it’s not worth using an index - have to experimentally determine where that’s at.  — So in this case the optimizer might decide to do an index lookup for carmake, but just do a full scan on the results of that for carcolor. - Typical choices here, since we don’t have to do joins, are basically going to be  — Do I use an index or not  — Which stats are the best to use to answer the question above. - I think we start out with some sort of cost factor, but will have to run some benchmarks to determine what the final cost factors are (could even turn this into an auto-tune system for different installations) - Would be nice to have a log output optionally available (at some log level) for the results of the optimizer.

predicate pushdown - for the temporal and spatial predicates - (before, during, after, DWITHIN, TOUCHES, etc.) - implementing logic to accelerate these with indexes already in place

Clone this wiki locally