This repository was archived by the owner on Sep 23, 2020. It is now read-only.

Commit 05a5d85

John Glorioso committed
REPT-179-7 - Incorporate pull request forward3d#48 for forward3d/rbhive
1 parent 468396e commit 05a5d85


4 files changed: +133 -99 lines changed


README.md

Lines changed: 27 additions & 17 deletions
@@ -17,7 +17,7 @@ It is capable of using the following Thrift transports:

 As of version 1.0, it supports asynchronous execution of queries. This allows you to submit
 a query, disconnect, then reconnect later to check the status and retrieve the results.
-This frees systems of the need to keep a persistent TCP connection.
+This frees systems of the need to keep a persistent TCP connection.

 ## About Thrift services and transports

@@ -29,7 +29,7 @@ BufferedTransport.

 ### Hiveserver2

-[Hiveserver2](https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2)
+[Hiveserver2](https://cwiki.apache.org/confluence/display/Hive/Setting+up+HiveServer2)
 (the new Thrift interface) can support many concurrent client connections. It is shipped
 with Hive 0.10 and later. In Hive 0.10, only BufferedTranport and SaslClientTransport are
 supported; starting with Hive 0.12, HTTPClientTransport is also supported.
@@ -63,7 +63,7 @@ Otherwise you'll get this nasty-looking exception in the logs:
         at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
-        at java.lang.Thread.run(Thread.java:662)
+        at java.lang.Thread.run(Thread.java:662)

 ### Other Hive-compatible services

@@ -77,20 +77,20 @@ Since Hiveserver has no options, connection code is very simple:

     RBHive.connect('hive.server.address', 10_000) do |connection|
       connection.fetch 'SELECT city, country FROM cities'
-    end
+    end
     ➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]

 ### Hiveserver2

 Hiveserver2 has several options with how it is run. The connection code takes
 a hash with these possible parameters:
 * `:transport` - one of `:buffered` (BufferedTransport), `:http` (HTTPClientTransport), or `:sasl` (SaslClientTransport)
-* `:hive_version` - the number after the period in the Hive version; e.g. `10`, `11`, `12`, `13` or one of
+* `:hive_version` - the number after the period in the Hive version; e.g. `10`, `11`, `12`, `13` or one of
 a set of symbols; see [Hiveserver2 protocol versions](#hiveserver2-protocol-versions) below for details
 * `:timeout` - if using BufferedTransport or SaslClientTransport, this is how long the timeout on the socket will be
 * `:sasl_params` - if using SaslClientTransport, this is a hash of parameters to set up the SASL connection

-If you pass either an empty hash or nil in place of the options (or do not supply them), the connection
+If you pass either an empty hash or nil in place of the options (or do not supply them), the connection
 is attempted with the Hive version set to 0.10, using `:buffered` as the transport, and a timeout of 1800 seconds.

 Connecting with the defaults:
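
The defaulting rule described in this hunk (nil or an empty hash falls back to Hive 0.10, `:buffered`, and an 1800 second timeout) can be sketched in plain Ruby without the gem. `resolve_tcli_options` is a hypothetical helper name, not part of RBHive, and unlike the constructor in this commit it copies the hash instead of modifying the caller's.

```ruby
# Hypothetical helper (not part of RBHive) mirroring the documented defaults.
def resolve_tcli_options(options = {})
  options ||= {} # nil is accepted for backwards compatibility
  raise ArgumentError, "'options' parameter must be a hash" unless options.is_a?(Hash)

  options = options.dup # assumption: avoid mutating the caller's hash

  if options[:transport] == :sasl && options[:sasl_params].nil?
    raise ArgumentError, ":transport is set to :sasl, but no :sasl_params option was supplied"
  end

  options[:transport]    ||= :buffered # BufferedTransport
  options[:hive_version] ||= 10        # Hive 0.10
  options[:timeout]      ||= 1800      # seconds
  options
end

defaults = resolve_tcli_options(nil)
custom   = resolve_tcli_options({ :hive_version => 12, :transport => :http })
```

Here `defaults` carries all three fallback values, while `custom` keeps the supplied version and transport and only picks up the default timeout.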
@@ -117,7 +117,17 @@ Connecting with a specific Hive version (0.12) and using the `:http` transport:
       connection.fetch('SHOW TABLES')
     end

-We have not tested the SASL connection, as we don't run SASL; pull requests and testing are welcomed.
+Connecting with SASL and Kerberos v5:
+
+    RBHive.tcli_connect('hive.hadoop.forward.co.uk', 10_000, {
+      :transport => :sasl,
+      :sasl_params => {
+        :mechanism => 'GSSAPI',
+        :remote_host => 'example.com',
+        :remote_principal => 'hive/[email protected]'
+    ) do |connection|
+      connection.fetch("show tables")
+    end

 #### Hiveserver2 protocol versions

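
As committed, the SASL example added in this hunk does not appear to close its `:sasl_params` hash or the outer options hash before `) do |connection|`, so it would likely not parse as written. A balanced version might look like the sketch below. The principal is a labeled placeholder because the one in the diff is obscured, and the `RBHive.tcli_connect` call is left as a comment because it needs the gem and a reachable server.

```ruby
# Options hash for the SASL/Kerberos example, with both hashes closed.
sasl_options = {
  :transport => :sasl,
  :sasl_params => {
    :mechanism        => 'GSSAPI',
    :remote_host      => 'example.com',
    :remote_principal => 'hive/HOST@REALM' # hypothetical placeholder, substitute your own
  }
}

# With rbhive loaded and a reachable HiveServer2, the call would then be:
#   RBHive.tcli_connect('hive.hadoop.forward.co.uk', 10_000, sasl_options) do |connection|
#     connection.fetch("show tables")
#   end
```

Building the hash separately also keeps the `do ... end` block visually apart from the nested braces.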
@@ -204,7 +214,7 @@ one of the following values and meanings:
 | :unknown | The query is in an unknown state
 | :pending | The query is ready to run but is not running

-There are also the utility methods `async_is_complete?(handles)`, `async_is_running?(handles)`,
+There are also the utility methods `async_is_complete?(handles)`, `async_is_running?(handles)`,
 `async_is_failed?(handles)` and `async_is_cancelled?(handles)`.

 #### `async_cancel(handles)`
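
The status predicates named in this hunk lend themselves to a simple polling loop. The sketch below is a hypothetical helper (`wait_for_query`, with assumed `interval` and `timeout` keyword arguments), not part of RBHive. It relies only on the three predicate methods, so any object that responds to them can stand in for a connection.

```ruby
# Hypothetical helper, not part of RBHive: poll until the query finishes,
# fails, is cancelled, or the deadline passes.
def wait_for_query(connection, handles, interval: 5, timeout: 600)
  deadline = Time.now + timeout
  loop do
    return :finished  if connection.async_is_complete?(handles)
    return :error     if connection.async_is_failed?(handles)
    return :cancelled if connection.async_is_cancelled?(handles)
    return :timed_out if Time.now >= deadline

    sleep interval
  end
end
```

Each predicate issues its own status request in the code shown in this commit, so a variant that calls `async_state(handles)` once per iteration and compares the returned symbol would make fewer round trips.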
@@ -225,14 +235,14 @@ same way as the normal synchronous methods.

     RBHive.connect('hive.server.address', 10_000) do |connection|
       connection.fetch 'SELECT city, country FROM cities'
-    end
+    end
     ➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]

 #### Hiveserver2

     RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
       connection.fetch 'SELECT city, country FROM cities'
-    end
+    end
     ➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]

 ### Executing a query
@@ -266,13 +276,13 @@ Then for Hiveserver:

     RBHive.connect('hive.server.address', 10_000) do |connection|
       connection.create_table(table)
-    end
+    end

 Or Hiveserver2:

     RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
       connection.create_table(table)
-    end
+    end

 ### Modifying table schema

@@ -290,18 +300,18 @@ Then for Hiveserver:

     RBHive.connect('hive.server.address') do |connection|
       connection.replace_columns(table)
-    end
+    end

 Or Hiveserver2:

     RBHive.tcli_connect('hive.server.address') do |connection|
       connection.replace_columns(table)
-    end
+    end

 ### Setting properties

 You can set various properties for Hive tasks, some of which change how they run. Consult the Apache
-Hive documentation and Hadoop's documentation for the various properties that can be set.
+Hive documentation and Hadoop's documentation for the various properties that can be set.
 For example, you can set the map-reduce job's priority with the following:

     connection.set("mapred.job.priority", "VERY_HIGH")
@@ -310,15 +320,15 @@ For example, you can set the map-reduce job's priority with the following:

 #### Hiveserver

-    RBHive.connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
+    RBHive.connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
       result = connection.fetch("describe some_table")
       puts result.column_names.inspect
       puts result.first.inspect
     }

 #### Hiveserver2

-    RBHive.tcli_connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
+    RBHive.tcli_connect('hive.hadoop.forward.co.uk', 10_000) {|connection|
       result = connection.fetch("describe some_table")
       puts result.column_names.inspect
       puts result.first.inspect

lib/rbhive/t_c_l_i_connection.rb

Lines changed: 29 additions & 27 deletions
@@ -30,7 +30,7 @@ def flush
 end

 module RBHive
-
+
   HIVE_THRIFT_MAPPING = {
     10 => 0,
     11 => 1,
@@ -84,32 +84,32 @@ class TCLIConnection
     def initialize(server, port = 10_000, options = {}, logger = StdOutLogger.new)
       options ||= {} # backwards compatibility
       raise "'options' parameter must be a hash" unless options.is_a?(Hash)
-
+
       if options[:transport] == :sasl and options[:sasl_params].nil?
         raise ":transport is set to :sasl, but no :sasl_params option was supplied"
       end
-
+
       # Defaults to buffered transport, Hive 0.10, 1800 second timeout
       options[:transport] ||= :buffered
       options[:hive_version] ||= 10
       options[:timeout] ||= 1800
       @options = options
-
+
       # Look up the appropriate Thrift protocol version for the supplied Hive version
       @thrift_protocol_version = thrift_hive_protocol(options[:hive_version])
-
+
       @logger = logger
       @transport = thrift_transport(server, port)
       @protocol = Thrift::BinaryProtocol.new(@transport)
       @client = Hive2::Thrift::TCLIService::Client.new(@protocol)
       @session = nil
       @logger.info("Connecting to HiveServer2 #{server} on port #{port}")
     end
-
+
     def thrift_hive_protocol(version)
       HIVE_THRIFT_MAPPING[version] || raise("Invalid Hive version")
     end
-
+
     def thrift_transport(server, port)
       @logger.info("Initializing transport #{@options[:transport]}")
       case @options[:transport]
@@ -188,7 +188,7 @@ def set(name,value)
       @logger.info("Setting #{name}=#{value}")
       self.execute("SET #{name}=#{value}")
     end
-
+
     # Async execute
     def async_execute(query)
       @logger.info("Executing query asynchronously: #{query}")
@@ -204,35 +204,35 @@ def async_execute(query)

       # Return handles to get hold of this query / session again
       {
-        session: @session.sessionHandle,
-        guid: op_handle.operationId.guid,
+        session: @session.sessionHandle,
+        guid: op_handle.operationId.guid,
         secret: op_handle.operationId.secret
       }
     end
-
+
     # Is the query complete?
     def async_is_complete?(handles)
       async_state(handles) == :finished
     end
-
+
     # Is the query actually running?
     def async_is_running?(handles)
       async_state(handles) == :running
     end
-
+
     # Has the query failed?
     def async_is_failed?(handles)
       async_state(handles) == :error
     end
-
+
     def async_is_cancelled?(handles)
       async_state(handles) == :cancelled
     end
-
+
     def async_cancel(handles)
       @client.CancelOperation(prepare_cancel_request(handles))
     end
-
+
     # Map states to symbols
     def async_state(handles)
       response = @client.GetOperationStatus(
@@ -262,18 +262,18 @@ def async_state(handles)
         return :state_not_in_protocol
       end
     end
-
+
     # Async fetch results from an async execute
     def async_fetch(handles, max_rows = 100)
       # Can't get data from an unfinished query
       unless async_is_complete?(handles)
         raise "Can't perform fetch on a query in state: #{async_state(handles)}"
       end
-
+
       # Fetch and
       fetch_rows(prepare_operation_handle(handles), :first, max_rows)
     end
-
+
     # Performs a query on the server, fetches the results in batches of *batch_size* rows
     # and yields the result batches to a given block as arrays of rows.
     def async_fetch_in_batch(handles, batch_size = 1000, &block)
@@ -290,12 +290,12 @@ def async_fetch_in_batch(handles, batch_size = 1000, &block)
         yield rows
       end
     end
-
+
     def async_close_session(handles)
       validate_handles!(handles)
       @client.CloseSession(Hive2::Thrift::TCloseSessionReq.new( sessionHandle: handles[:session] ))
     end
-
+
     # Pull rows from the query result
     def fetch_rows(op_handle, orientation = :first, max_rows = 1000)
       fetch_req = prepare_fetch_results(op_handle, orientation, max_rows)
@@ -304,7 +304,7 @@ def fetch_rows(op_handle, orientation = :first, max_rows = 1000)
       rows = fetch_results.results.rows
       TCLIResultSet.new(rows, TCLISchemaDefinition.new(get_schema_for(op_handle), rows.first))
     end
-
+
     # Performs a explain on the supplied query on the server, returns it as a ExplainResult.
     # (Only works on 0.12 if you have this patch - https://issues.apache.org/jira/browse/HIVE-5492)
     def explain(query)
@@ -323,7 +323,7 @@ def fetch(query, max_rows = 100)

       # Get search operation handle to fetch the results
       op_handle = exec_result.operationHandle
-
+
       # Fetch the rows
       fetch_rows(op_handle, :first, max_rows)
     end
@@ -332,7 +332,7 @@ def fetch(query, max_rows = 100)
     # and yields the result batches to a given block as arrays of rows.
     def fetch_in_batch(query, batch_size = 1000, &block)
       raise "No block given for the batch fetch request!" unless block_given?
-
+
       # Execute the query and check the result
       exec_result = execute(query)
       raise_error_if_failed!(exec_result)
@@ -375,7 +375,9 @@ def method_missing(meth, *args)
     private

     def prepare_open_session(client_protocol)
-      req = ::Hive2::Thrift::TOpenSessionReq.new( @options[:sasl_params].nil? ? [] : @options[:sasl_params] )
+      req = ::Hive2::Thrift::TOpenSessionReq.new( @options[:sasl_params].nil? ? [] : {
+        :username => @options[:sasl_params][:username],
+        :password => @options[:sasl_params][:password]})
       req.client_protocol = client_protocol
       req
     end
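
The change to `prepare_open_session` narrows what is handed to `TOpenSessionReq.new` to the `:username` and `:password` entries of `:sasl_params`, presumably so that SASL-only keys such as `:mechanism` or `:remote_principal` are not passed to the Thrift struct. A standalone sketch of that selection follows. `session_credentials` is a hypothetical helper name, and it returns an empty hash for the no-SASL case where the diff uses `[]`.

```ruby
# Hypothetical helper, not part of RBHive: pick only the session credentials
# out of the connection options.
def session_credentials(options)
  sasl_params = options[:sasl_params]
  return {} if sasl_params.nil? # assumption: empty hash instead of the diff's []

  {
    :username => sasl_params[:username],
    :password => sasl_params[:password]
  }
end

credentials = session_credentials(
  :sasl_params => { :mechanism => 'GSSAPI', :username => 'alice', :password => 'secret' }
)
```

As in the diff, a `:sasl_params` hash without `:username` or `:password` yields nil for those entries rather than omitting them.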
@@ -410,13 +412,13 @@ def prepare_operation_handle(handles)
         hasResultSet: false
       )
     end
-
+
     def prepare_cancel_request(handles)
       Hive2::Thrift::TCancelOperationReq.new(
         operationHandle: prepare_operation_handle(handles)
       )
     end
-
+
     def validate_handles!(handles)
       unless handles.has_key?(:guid) and handles.has_key?(:secret) and handles.has_key?(:session)
         raise "Invalid handles hash: #{handles.inspect}"
