@@ -17,7 +17,7 @@ It is capable of using the following Thrift transports:

As of version 1.0, it supports asynchronous execution of queries. This allows you to submit
a query, disconnect, then reconnect later to check the status and retrieve the results.
- This frees systems of the need to keep a persistent TCP connection.
+ This frees systems from the need to keep a persistent TCP connection.

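A minimal sketch of that round trip, using the asynchronous methods covered later in this README (`async_is_complete?` is documented below; `async_execute` and `async_fetch` as used here are assumptions about the exact signatures):

    # Submit the query and keep the handles, then drop the session
    # and pick the query up again later (a sketch, not tested).
    handles = nil
    RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
      handles = connection.async_execute('SELECT city, country FROM cities')
    end

    # ...later, from a fresh connection...
    RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
      puts connection.async_fetch(handles).inspect if connection.async_is_complete?(handles)
    end
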
## About Thrift services and transports

@@ -29,7 +29,7 @@ BufferedTransport.

### Hiveserver2

- [Hiveserver2](https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2)
+ [Hiveserver2](https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2)
(the new Thrift interface) can support many concurrent client connections. It is shipped
with Hive 0.10 and later. In Hive 0.10, only BufferedTransport and SaslClientTransport are
supported; starting with Hive 0.12, HTTPClientTransport is also supported.
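
So, for example, reaching a Hive 0.12 server over HTTP might look like this (a sketch; the host and port are placeholders, and the option keys are the ones documented in the Hiveserver2 connection section below):

    # Illustrative only: :transport and :hive_version are the
    # connection options described later in this README.
    RBHive.tcli_connect('hive.server.address', 10_000,
                        { :transport => :http, :hive_version => 12 }) do |connection|
      connection.fetch('SHOW TABLES')
    end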

@@ -63,7 +63,7 @@ Otherwise you'll get this nasty-looking exception in the logs:
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
-       at java.lang.Thread.run(Thread.java:662)
+       at java.lang.Thread.run(Thread.java:662)

### Other Hive-compatible services

@@ -77,20 +77,20 @@ Since Hiveserver has no options, connection code is very simple:

    RBHive.connect('hive.server.address', 10_000) do |connection|
      connection.fetch 'SELECT city, country FROM cities'
-   end
+   end
    ➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]

### Hiveserver2

Hiveserver2 has several options for how it is run. The connection code takes
a hash with these possible parameters:
* `:transport` - one of `:buffered` (BufferedTransport), `:http` (HTTPClientTransport), or `:sasl` (SaslClientTransport)
- * `:hive_version` - the number after the period in the Hive version; e.g. `10`, `11`, `12`, `13` or one of
+ * `:hive_version` - the number after the period in the Hive version; e.g. `10`, `11`, `12`, `13` or one of
  a set of symbols; see [Hiveserver2 protocol versions](#hiveserver2-protocol-versions) below for details
* `:timeout` - if using BufferedTransport or SaslClientTransport, this is how long the timeout on the socket will be
* `:sasl_params` - if using SaslClientTransport, this is a hash of parameters to set up the SASL connection

- If you pass either an empty hash or nil in place of the options (or do not supply them), the connection
+ If you pass either an empty hash or nil in place of the options (or do not supply them), the connection
is attempted with the Hive version set to 0.10, using `:buffered` as the transport, and a timeout of 1800 seconds.

Connecting with the defaults:
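
A minimal sketch of such a call (the host name is a placeholder; `tcli_connect` is the Hiveserver2 entry point used throughout this README):

    # No options hash, so Hive 0.10, :buffered and an 1800s timeout apply.
    RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
      connection.fetch('SHOW TABLES')
    end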

@@ -117,7 +117,17 @@ Connecting with a specific Hive version (0.12) and using the `:http` transport:
      connection.fetch('SHOW TABLES')
    end

- We have not tested the SASL connection, as we don't run SASL; pull requests and testing are welcomed.
+ Connecting with SASL and Kerberos v5:
+
+     RBHive.tcli_connect('hive.hadoop.forward.co.uk', 10_000, {
+       :transport => :sasl,
+       :sasl_params => {
+         :mechanism => 'GSSAPI',
+         :remote_host => 'example.com',
+         :remote_principal => 'hive/[email protected]'
+       }}) do |connection|
+       connection.fetch("show tables")
+     end

#### Hiveserver2 protocol versions

@@ -204,7 +214,7 @@ one of the following values and meanings:
| :unknown | The query is in an unknown state
| :pending | The query is ready to run but is not running

- There are also the utility methods `async_is_complete?(handles)`, `async_is_running?(handles)`,
+ There are also the utility methods `async_is_complete?(handles)`, `async_is_running?(handles)`,
`async_is_failed?(handles)` and `async_is_cancelled?(handles)`.

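For example, a caller could poll with those predicates before fetching results (a sketch; the five-second interval and the surrounding connection and handles are illustrative):

    # Wait until the query reaches a terminal state (sketch only).
    until connection.async_is_complete?(handles) ||
          connection.async_is_failed?(handles) ||
          connection.async_is_cancelled?(handles)
      sleep 5
    end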

#### `async_cancel(handles)`

@@ -225,14 +235,14 @@ same way as the normal synchronous methods.

    RBHive.connect('hive.server.address', 10_000) do |connection|
      connection.fetch 'SELECT city, country FROM cities'
-   end
+   end
    ➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]

#### Hiveserver2

    RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
      connection.fetch 'SELECT city, country FROM cities'
-   end
+   end
    ➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]

### Executing a query

@@ -266,13 +276,13 @@ Then for Hiveserver:

    RBHive.connect('hive.server.address', 10_000) do |connection|
      connection.create_table(table)
-   end
+   end

Or Hiveserver2:

    RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
      connection.create_table(table)
-   end
+   end

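The `table` passed to `create_table` is a schema object. As a hedged sketch (the `TableSchema` DSL shown here is an assumption based on the gem's schema builder, and the columns are purely illustrative):

    # Illustrative schema: name, comment, columns and a partition.
    table = TableSchema.new('person', 'List of people that owe me money') do
      column :name,    :string, 'Full name of debtor'
      column :address, :string, 'Address of debtor'
      column :amount,  :float,  'The amount of money borrowed'

      partition :dated, :string, 'The date money was given'
    end
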
### Modifying table schema

@@ -290,18 +300,18 @@ Then for Hiveserver:

    RBHive.connect('hive.server.address') do |connection|
      connection.replace_columns(table)
-   end
+   end

Or Hiveserver2:

    RBHive.tcli_connect('hive.server.address') do |connection|
      connection.replace_columns(table)
-   end
+   end

### Setting properties

You can set various properties for Hive tasks, some of which change how they run. Consult the Apache
- Hive documentation and Hadoop's documentation for the various properties that can be set.
+ Hive documentation and Hadoop's documentation for the various properties that can be set.
For example, you can set the map-reduce job's priority with the following:

    connection.set("mapred.job.priority", "VERY_HIGH")

@@ -310,15 +320,15 @@ For example, you can set the map-reduce job's priority with the following:

#### Hiveserver

-   RBHive.connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
+   RBHive.connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
      result = connection.fetch("describe some_table")
      puts result.column_names.inspect
      puts result.first.inspect
    }

#### Hiveserver2

-   RBHive.tcli_connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
+   RBHive.tcli_connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
      result = connection.fetch("describe some_table")
      puts result.column_names.inspect
      puts result.first.inspect
    }
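
Since `fetch` returns an array-like result (as in the fetch examples above), iterating over all rows is straightforward; a sketch, assuming that array-like behaviour, with host and table names as placeholders:

    RBHive.tcli_connect('hive.hadoop.forward.co.uk', 10_000) { |connection|
      result = connection.fetch("describe some_table")
      # Each row is a symbol-keyed hash, as in the fetch examples above.
      result.each { |row| puts row.inspect }
    }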