This repository was archived by the owner on Mar 27, 2021. It is now read-only.

Commit c6b5a6f

Implement new Bigtable timeout & retry settings (#733)
* heroic-730 Implement new Bigtable timeout & retry settings
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* heroic-733 lowered timeouts, added more block comments.
  - passing all tests
  - next is to refactor the 5 parameters into a class
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* heroic-733 updated failing unit test
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* heroic-733 attempting refactor connection settings into one class so that they're not repeated several times over
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* updated timeout settings as agreed with @SergeyR on Slack
* heroic-733 small settings & comment corrections
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* heroic-733 small settings & comment corrections
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* fixed runtime IT error in console & syntax warnings
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* added tagValuesTruncatedSuggestMany to exercise difference between CI and localhost IT test runs where it works locally but not in CI.
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* fixed checkstyle error.
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* fixed checkstyle error.
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* fixes logging in unit & integration tests.
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* fixes inconsistent tagValuesTruncatedSuggest behaviour
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* made the tests in *SuggestBackend*IT.java deterministic
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* increased time available for deleteSeries() to complete
  down the road all tests of this kind (timer-based) will need redoing as they're lame and brittle.
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* refactored 6 retry & timeout settings into new POD class which is then added as a field to BigtableBackend, Module etc.
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* reduced cohesion between modules by switching implementations from MetricsConnectionSettingsModule to MetricsConnectionSettings
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* moved MetricsConnectionSettingsModule to Bigtable module since... that's where it belongs, basically.
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* *actually* moved MetricsConnectionSettingsModule to Bigtable module since...that's where it belongs, basically.
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* added {@link ...} javadoc
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* fixed comment
  Signed-off-by: Peter Kingswell <peterk@spotify.com>
* implemented review feedback
  - removed useless/confusing @JSON annotations
  - changed fields from Integer to int
  - also fixed code analysis warnings
* Rename .java to .kt
* PR feedback - refactored POD BT Java class to Kotlin
* implemented a workaround for @Inject not working as documented (seemingly)
* improved comments, removed unnecessary subclass
* io.grpc -> 1.35.0 & bigtable-client-core -> 1.19.0
* increasing heroic system test startup time by 30s to prevent the intermittent timeouts that are being observed.
* fixed Could not find policy 'pick_first' exception.
  - occurred when a timeout happened
  - full message: java.lang.IllegalStateException: Could not find policy 'pick_first'. Make sure its implementation is either registered to LoadBalancerRegistry or included in META-INF/services/io.grpc.LoadBalancerProvider from your jar files.
* adds integration test for Bigtable timeouts
* report the failing test's actual content
* fix testBackendTimesOutCorrectly assertion
* unit test fix attempt #2
* set Google-default timeout and retry settings. tidied up docs too
1 parent 2e63167 commit c6b5a6f

37 files changed

Lines changed: 1291 additions & 577 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -40,3 +40,4 @@ gradle-app.setting

 # direnv config file - https://direnv.net/
 /.envrc
+logs

build.gradle

Lines changed: 3 additions & 2 deletions
@@ -165,7 +165,7 @@ allprojects {

             dependency 'io.netty:netty-tcnative-boringssl-static:2.0.28.Final'

-            dependencySet(group: 'io.grpc', version: '1.16.1') {
+            dependencySet(group: 'io.grpc', version: '1.35.0') {
                 entry 'grpc-auth'
                 entry 'grpc-core'
                 entry 'grpc-netty'
@@ -200,7 +200,7 @@ allprojects {
                 entry 'google-cloud-core'
                 entry 'google-cloud-core-grpc'
             }
-            dependency 'com.google.cloud.bigtable:bigtable-client-core:1.12.1'
+            dependency 'com.google.cloud.bigtable:bigtable-client-core:1.19.0'

             dependency 'com.addthis:stream-lib:3.0.0'
             dependency 'org.xerial.snappy:snappy-java:1.1.7.2'
@@ -313,6 +313,7 @@ subprojects {
     }

     test {
+        testLogging.showStandardStreams = true
         testLogging {
             events "passed", "skipped", "failed", "standardOut", "standardError"
             outputs.upToDateWhen { false }

docs/content/_docs/config.md

Lines changed: 43 additions & 6 deletions
@@ -132,41 +132,47 @@ Precedence for each flag is defined as the following:
 The following features are available:

 #### com.spotify.heroic.deterministic_aggregations
+
 {:.no_toc}

 Enable feature to only perform aggregations that can be performed with limited resources. Disabled by default.

 Aggregations are commonly performed per-shard, and the result concatenated. This enabled experimental support for distributed aggregations which behave transparently across shards.

 #### com.spotify.heroic.distributed_aggregations
+
 {:.no_toc}

 Enable feature to perform distributed aggregations. Disabled by default.

 Aggregations are commonly performed per-shard, and the result concatenated. This enables experimental support for distributed aggregations which behave transparently across shards. Typically this will cause more data to be transported across shards for each request.

 #### com.spotify.heroic.shift_range
+
 {:.no_toc}

 Enable feature to cause range to be rounded on the current cadence. Enabled by default.

 This will assert that there are data outside of the range queried for and that the range is aligned to the queried cadence. Which is a useful feature when using a dashboarding system.

 #### com.spotify.heroic.sliced_data_fetch
+
 {:.no_toc}

 Enable feature to cause data to be fetched in slices. Enabled by default.

 This will cause data to be fetched and consumed by the aggregation framework in pieces avoiding having to load all data into memory before starting to consume it.

 #### com.spotify.heroic.end_bucket_stategy
+
 {:.no_toc}

 Enabled by default.

 Use the legacy bucket strategy by default where the resulting value is at the end of the timestamp of the bucket.

 #### com.spotify.heroic.cache_query
+
 {:.no_toc}

 Disabled by default.
@@ -492,6 +498,39 @@ batchSize: <int>
 # If set, the Bigtable client will be configured to use this address as a Bigtable emulator.
 # Default CBT emulator runs at: "localhost:8086"
 emulatorEndpoint: <string>
+
+# Reference: https://cloud.google.com/bigtable/docs/hbase-client/javadoc/com/google/cloud/bigtable/config/CallOptionsConfig.Builder
+# The amount of milliseconds to wait before issuing a client side timeout for mutation remote
+# procedure calls.
+# In other words, If timeouts are set, how many milliseconds should pass before a
+# DEADLINE_EXCEEDED is thrown. The Google default is 600_000 ms (10 minutes).
+# Currently, this feature is experimental.
+mutateRpcTimeoutMs: int
+
+# ReadRowsRpcTimeoutMs
+# The amount of milliseconds to wait before issuing a client side timeout for readRows streaming remote procedure calls.
+# In other words, from https://github.com/hegemonic/cloud-bigtable-client/blob/master/bigtable-client-core-parent/bigtable-client-core/src/main/java/com/google/cloud/bigtable/config/CallOptionsConfig.java :
+# The default duration to wait before timing out read stream RPC (default value: 12 hours).
+
+readRowsRpcTimeoutMs: int
+
+# ShortRpcTimeoutMs - The amount of milliseconds to wait before issuing a client side timeout for short remote procedure calls.
+# In other words, the default duration to wait before timing out RPCs (default
+# value: 60 seconds)
+# from https://cloud.google.com/bigtable/docs/hbase-client/javadoc/com/google/cloud/bigtable/config/CallOptionsConfig#SHORT_TIMEOUT_MS_DEFAULT
+shortRpcTimeoutMs: int
+
+# MaxScanTimeoutRetries
+# The maximum number of times to retry after a scan timeout.
+# https://cloud.google.com/bigtable/docs/hbase-client/javadoc/com/google/cloud/bigtable/config/RetryOptions.html#getmaxscantimeoutretries
+# Default is 3.
+maxScanTimeoutRetries: int
+
+# maxElapsedBackoffMs
+# Maximum amount of time we will retry an operation that is failing.
+# So if this is 5,000ms and we retry every 2,000ms, we would do 2 retries.
+# Default is 60 seconds
+maxElapsedBackoffMs: int
 ```

 ##### `<bigtable_credentials>`
@@ -542,18 +581,16 @@ Maximum number of distinct groups a single result group may contain.

 ##### seriesLimit

-Maximum amount of time series a single request is allowed to fetch, per cluster (if federated).
+Maximum amount of time series a single request is allowed to fetch, per cluster (if federated).

 A note: when using resource identifiers this limit only applies to the number of series found in the metadata backend, *not* the total series returned.

 It is therefore possible to have a low limit *not* be exceeded with the number of series found in metadata, however, return far more series from the metrics backend when resource identifiers are taken into account (which may trigger additional limits).

-##### failOnLimits
+##### failOnLimits

 When true, any limits applied will be reported as a failure.

-
-
 ### [`<metadata_backend>`](#metadata_backend)

 Metadata acts as the index to time series data, it is the driving force behind our [Query Language](docs/query_language).
@@ -717,7 +754,6 @@ sniff: <bool> default = false
 nodeSamplerInterval: <duration> default = 30s
 ```

-
 #### [Memory](#memory)

 An in-memory datastore. This is intended only for testing and is definitely not something you should run in production.
@@ -970,6 +1006,7 @@ level: <string> default = TRACE
 ```

 #### Query log output
+
 {:.no_toc}

 Each successful query will result in several output entries in the query log. Entries from different stages of the query. Example output:
@@ -996,6 +1033,7 @@ Each successful query will result in several output entries in the query log. En
 | `data` | Contains data relevant to this query stage. This might for example be the original query, a partial response or the final response.

 #### Contextual information
+
 {:.no_toc}

 It's possible to supply contextual information in the query. This information will then be included in the query log, to ease mapping of performed query to the query log output.
@@ -1033,7 +1071,6 @@ Enable distributed tracing output of Heroic's operations. Tracing is instrumente

 A few tags are added to incoming requests such as the java version. If running on GCP, zone and region tags are added as well.

-
 ```yaml
 # Probability, between 0.0 and 1.0, of sampling each trace.
 probability: <float> default = 0.01
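
The `maxElapsedBackoffMs` comment added to config.md in this commit gives a worked case: a 5,000 ms budget with a retry every 2,000 ms allows 2 retries. The Kotlin sketch below reproduces only that arithmetic. It assumes a fixed retry interval, which is a simplification (the Bigtable client's actual backoff schedule may differ), and `retriesWithinBudget` is a hypothetical helper that exists in neither Heroic nor the Bigtable client.

```kotlin
// Hypothetical helper, for illustration only: how many retries fit into an
// elapsed-time budget when one retry is attempted every `retryIntervalMs`.
// Assumes a fixed interval; a real client may back off differently.
fun retriesWithinBudget(maxElapsedBackoffMs: Int, retryIntervalMs: Int): Int {
    require(maxElapsedBackoffMs >= 0) { "budget must be non-negative" }
    require(retryIntervalMs > 0) { "retry interval must be positive" }
    // Integer division drops the partial interval left at the end of the budget.
    return maxElapsedBackoffMs / retryIntervalMs
}

fun main() {
    // The example from the config.md comment: 5,000 ms budget, 2,000 ms interval.
    println(retriesWithinBudget(5_000, 2_000)) // prints 2
}
```

Under the same fixed-interval assumption, a budget shorter than one interval yields zero retries.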
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
+/*
+ * Copyright (c) 2015 Spotify AB.
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package com.spotify.heroic.metric
+
+import com.spotify.heroic.metric.consts.ApiQueryConsts
+import org.apache.commons.lang3.builder.ToStringBuilder
+import org.apache.commons.lang3.builder.ToStringStyle
+import java.util.*
+
+open class MetricsConnectionSettings(
+    maxWriteBatchSize: Optional<Int>,
+    mutateRpcTimeoutMs: Optional<Int>,
+    readRowsRpcTimeoutMs: Optional<Int>,
+    shortRpcTimeoutMs: Optional<Int>,
+    maxScanTimeoutRetries: Optional<Int>,
+    maxElapsedBackoffMs: Optional<Int>
+) {
+    /**
+     * See [ApiQueryConsts.DEFAULT_MUTATE_RPC_TIMEOUT_MS]
+     */
+    @JvmField
+    var mutateRpcTimeoutMs: Int
+
+    /**
+     * See [ApiQueryConsts.DEFAULT_READ_ROWS_RPC_TIMEOUT_MS]
+     */
+    @JvmField
+    var readRowsRpcTimeoutMs: Int
+
+    /**
+     * See [ApiQueryConsts.DEFAULT_SHORT_RPC_TIMEOUT_MS]
+     */
+    @JvmField
+    var shortRpcTimeoutMs: Int
+
+    /**
+     * See [ApiQueryConsts.DEFAULT_MAX_SCAN_TIMEOUT_RETRIES]
+     */
+    @JvmField
+    var maxScanTimeoutRetries: Int
+
+    /**
+     * See [ApiQueryConsts.DEFAULT_MAX_ELAPSED_BACKOFF_MILLIS]
+     */
+    @JvmField
+    var maxElapsedBackoffMs: Int
+
+    /**
+     * See [MetricsConnectionSettings.DEFAULT_MUTATION_BATCH_SIZE]
+     */
+    @JvmField
+    var maxWriteBatchSize: Int
+
+    protected constructor() : this(
+        Optional.of<Int>(MAX_MUTATION_BATCH_SIZE),
+        Optional.of<Int>(ApiQueryConsts.DEFAULT_MUTATE_RPC_TIMEOUT_MS),
+        Optional.of<Int>(ApiQueryConsts.DEFAULT_READ_ROWS_RPC_TIMEOUT_MS),
+        Optional.of<Int>(ApiQueryConsts.DEFAULT_SHORT_RPC_TIMEOUT_MS),
+        Optional.of<Int>(ApiQueryConsts.DEFAULT_MAX_SCAN_TIMEOUT_RETRIES),
+        Optional.of<Int>(ApiQueryConsts.DEFAULT_MAX_ELAPSED_BACKOFF_MILLIS)
+    ) {
+    }
+
+    override fun toString(): String {
+        return ToStringBuilder(this, ToStringStyle.MULTI_LINE_STYLE)
+            .append("maxWriteBatchSize", maxWriteBatchSize)
+            .append("mutateRpcTimeoutMs", mutateRpcTimeoutMs)
+            .append("readRowsRpcTimeoutMs", readRowsRpcTimeoutMs)
+            .append("shortRpcTimeoutMs", shortRpcTimeoutMs)
+            .append("maxScanTimeoutRetries", maxScanTimeoutRetries)
+            .append("maxElapsedBackoffMs", maxElapsedBackoffMs)
+            .toString()
+    }
+
+    companion object {
+        /**
+         * default number of Cells for each batch mutation
+         */
+        const val DEFAULT_MUTATION_BATCH_SIZE = 1000
+
+        /**
+         * maximum possible number of Cells for each batch mutation
+         */
+        const val MAX_MUTATION_BATCH_SIZE = 100000
+
+        /**
+         * minimum possible number of Cells supported for each batch mutation
+         */
+        const val MIN_MUTATION_BATCH_SIZE = 10
+        @JvmStatic
+        fun createDefault(): MetricsConnectionSettings {
+            return MetricsConnectionSettings()
+        }
+    }
+
+    init {
+        // Basically make sure that maxWriteBatchSize, if set, is sane
+        var maxWriteBatch = maxWriteBatchSize.orElse(DEFAULT_MUTATION_BATCH_SIZE)
+        maxWriteBatch = maxWriteBatch.coerceAtLeast(MIN_MUTATION_BATCH_SIZE)
+        maxWriteBatch = maxWriteBatch.coerceAtMost(MAX_MUTATION_BATCH_SIZE)
+        this.maxWriteBatchSize = maxWriteBatch
+
+        this.mutateRpcTimeoutMs =
+            mutateRpcTimeoutMs.orElse(ApiQueryConsts.DEFAULT_MUTATE_RPC_TIMEOUT_MS)
+        this.readRowsRpcTimeoutMs =
+            readRowsRpcTimeoutMs.orElse(ApiQueryConsts.DEFAULT_READ_ROWS_RPC_TIMEOUT_MS)
+        this.shortRpcTimeoutMs =
+            shortRpcTimeoutMs.orElse(ApiQueryConsts.DEFAULT_SHORT_RPC_TIMEOUT_MS)
+        this.maxScanTimeoutRetries =
+            maxScanTimeoutRetries.orElse(ApiQueryConsts.DEFAULT_MAX_SCAN_TIMEOUT_RETRIES)
+        this.maxElapsedBackoffMs =
+            maxElapsedBackoffMs.orElse(ApiQueryConsts.DEFAULT_MAX_ELAPSED_BACKOFF_MILLIS)
+    }
+}
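
The `init` block of `MetricsConnectionSettings` resolves each optional setting to a default and clamps `maxWriteBatchSize` into a fixed range. The standalone Kotlin sketch below shows that pattern in isolation. It assumes the same three bounds as the companion object (10, 1000 and 100000), `resolveBatchSize` is a hypothetical name used only for illustration, and `coerceIn` stands in for the paired `coerceAtLeast`/`coerceAtMost` calls as a one-call equivalent.

```kotlin
import java.util.Optional

// Bounds copied from the companion object of MetricsConnectionSettings.
const val DEFAULT_MUTATION_BATCH_SIZE = 1000
const val MAX_MUTATION_BATCH_SIZE = 100000
const val MIN_MUTATION_BATCH_SIZE = 10

// Hypothetical helper (not part of Heroic): fall back to the default when the
// setting is absent, then clamp the result into [MIN, MAX].
fun resolveBatchSize(configured: Optional<Int>): Int =
    configured.orElse(DEFAULT_MUTATION_BATCH_SIZE)
        .coerceIn(MIN_MUTATION_BATCH_SIZE, MAX_MUTATION_BATCH_SIZE)

fun main() {
    println(resolveBatchSize(Optional.empty()))     // unset: falls back to the default
    println(resolveBatchSize(Optional.of(3)))       // below the minimum: raised to 10
    println(resolveBatchSize(Optional.of(500_000))) // above the maximum: lowered to 100000
}
```

One detail visible in the diff itself: the protected no-argument constructor passes `MAX_MUTATION_BATCH_SIZE` rather than `DEFAULT_MUTATION_BATCH_SIZE`, so `createDefault()` appears to yield a batch size of 100000 rather than 1000.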