Solution to connect R (sparklyr) to Apache Spark via Knox Gateway and Livy on air-gapped Cloudera CDP 7.1.9 clusters (Spark 3.3.2).
Standard sparklyr fails to connect to Livy through Knox Gateway on air-gapped Cloudera CDP 7.1.9 clusters due to three issues:
- `httr` library incompatibility with Knox - The R `httr` package returns empty responses when communicating through Knox Gateway
- Missing proxyUser - Spark sessions run as 'livy' user instead of the authenticated Knox user, causing permission errors
- JAR download failure - sparklyr tries to download its JAR from GitHub, which fails on air-gapped clusters
This repo provides:
- Runtime patches that replace `httr` with `system()` curl calls
- Automatic proxyUser configuration using Knox username
- Automated JAR installation on HDFS via WebHDFS (no direct HDFS access needed)
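To illustrate the proxyUser fix, here is a minimal sketch of the JSON body that could be sent to Livy's `POST /sessions` endpoint so that the session runs as the Knox user. The helper name `build_session_payload` is hypothetical and is not taken from this repo; the field names (`proxyUser`, `driverMemory`, `executorMemory`, `numExecutors`, `queue`) are those of the Livy REST API, and the default values mirror the configuration template.

```r
library(jsonlite)

# Hypothetical helper (not part of this repo): builds the JSON body for
# Livy's POST /sessions request.
build_session_payload <- function(username,
                                  driver_memory = "4G",
                                  executor_memory = "4G",
                                  num_executors = 2,
                                  queue = "default") {
  payload <- list(
    kind = "spark",
    proxyUser = username,            # run the session as the Knox user, not 'livy'
    driverMemory = driver_memory,
    executorMemory = executor_memory,
    numExecutors = num_executors,
    queue = queue
  )
  # auto_unbox = TRUE writes length-1 vectors as JSON scalars instead of arrays
  toJSON(payload, auto_unbox = TRUE)
}

payload <- build_session_payload("your_username")
cat(payload, "\n")
```

Without the `proxyUser` field, Livy starts the session under its own service account, which is the cause of the permission errors described above.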
install.packages("sparklyr")
install.packages("dplyr")
install.packages("jsonlite")
git clone https://github.com/ab2dridi/sparklyr-livy.git
cd sparklyr-livy
cp knox_config.R.template knox_config.R
Edit knox_config.R with your Knox credentials and URLs.
source("setup_jar.R")
This uploads the sparklyr JAR to your HDFS directory using WebHDFS through Knox (no direct HDFS access needed).
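As a rough sketch of what a WebHDFS upload through Knox can look like, the snippet below builds a WebHDFS `CREATE` URL and the matching curl arguments. Both helper names and the target path `/user/your_username/sparklyr.jar` are hypothetical, and this is not necessarily how setup_jar.R does it. Whether a single request with `-L` is enough, or a two-step upload is needed, depends on how the gateway handles the WebHDFS redirect.

```r
# Hypothetical helper: builds the WebHDFS CREATE URL for a target HDFS path.
build_webhdfs_create_url <- function(webhdfs_url, hdfs_path, overwrite = TRUE) {
  base <- sub("/+$", "", webhdfs_url)             # drop trailing slashes
  path <- paste0("/", sub("^/+", "", hdfs_path))  # exactly one leading slash
  paste0(base, path, "?op=CREATE&overwrite=", tolower(as.character(overwrite)))
}

# Hypothetical helper: curl arguments for uploading a local file.
# -T uploads the file with HTTP PUT, -L follows the WebHDFS redirect.
build_upload_args <- function(url, local_file, username, password) {
  c("-s", "-L",
    "-u", shQuote(paste0(username, ":", password)),
    "-T", shQuote(local_file),
    shQuote(url))
}

url <- build_webhdfs_create_url(
  "https://knox-host:8443/gateway/cdp-proxy-api/webhdfs/v1",
  "/user/your_username/sparklyr.jar"   # assumed target path, for illustration
)
args <- build_upload_args(url, "sparklyr.jar", "your_username", "your_password")

# The actual upload would then be (not run here):
# system2("curl", args)
```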
source("sparklyr_connection.R")

library(dplyr)
library(DBI)
# SQL queries
dbGetQuery(sc, "SHOW DATABASES")
# Hive tables
tbl(sc, "my_table") %>%
filter(date >= "2024-01-01") %>%
collect()
# Disconnect
spark_disconnect(sc)

- knox_config.R.template - Configuration template (credentials, URLs, Spark parameters)
- setup_jar.R - Upload sparklyr JAR to HDFS via WebHDFS
- sparklyr_connection.R - Patched connection script (fixes httr, proxyUser, JAR issues)
- examples/basic_examples.R - Usage examples
- PROBLEM.md - Detailed technical analysis
- INSTALLATION.md - Step-by-step installation guide
# knox_config.R
KNOX_USERNAME <- "your_username"
KNOX_PASSWORD <- "your_password"
KNOX_MASTER_URL <- "https://knox-host:8443/gateway/cdp-proxy-api/livy_for_spark3"
KNOX_WEBHDFS_URL <- "https://knox-host:8443/gateway/cdp-proxy-api/webhdfs/v1"
SPARK_DRIVER_MEMORY <- "4G"
SPARK_EXECUTOR_MEMORY <- "4G"
SPARK_NUM_EXECUTORS <- 2
SPARK_QUEUE <- "default"

- setup_jar.R downloads the sparklyr JAR (from your R installation or GitHub) and uploads it to HDFS via WebHDFS
- sparklyr_connection.R patches 7 sparklyr functions at runtime:
  - Replaces `httr` calls with `system("curl ...")` to bypass Knox incompatibility
  - Adds `proxyUser` automatically from Knox username
  - Uses HDFS JAR path instead of trying to download from GitHub
No modifications to sparklyr source code required.
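The curl-based replacement for `httr` can be sketched roughly as follows. This is an illustration under assumptions, not the repo's actual patch code: the helper names are hypothetical, it uses `system2()` rather than `system()` so that arguments are passed as a vector, and the `X-Requested-By` header is assumed to be needed only when CSRF protection is enabled on the Livy side.

```r
library(jsonlite)

# Hypothetical helper: curl arguments for an authenticated JSON request.
build_curl_args <- function(url, username, password,
                            method = "GET", body = NULL) {
  args <- c("-s",
            "-u", shQuote(paste0(username, ":", password)),
            "-X", method,
            "-H", shQuote("Content-Type: application/json"),
            # Assumption: only required if CSRF protection is enabled
            "-H", shQuote("X-Requested-By: sparklyr"))
  if (!is.null(body)) {
    args <- c(args, "-d", shQuote(body))
  }
  c(args, shQuote(url))
}

# Hypothetical helper: runs curl and parses the JSON response body.
curl_json <- function(...) {
  out <- system2("curl", build_curl_args(...), stdout = TRUE)
  fromJSON(paste(out, collapse = "\n"))
}

# Example: arguments for listing Livy sessions through Knox
sessions_url <- paste0(
  "https://knox-host:8443/gateway/cdp-proxy-api/livy_for_spark3",
  "/sessions"
)
args <- build_curl_args(sessions_url, "your_username", "your_password")

# The actual request would then be (not run here):
# sessions <- curl_json(sessions_url, "your_username", "your_password")
```

Keeping the argument builder separate from the call that runs curl makes it possible to inspect the command without a reachable gateway.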
Connection fails: Reduce resources in knox_config.R (e.g., SPARK_NUM_EXECUTORS <- 1)
JAR upload fails: See manual_jar_install.md for manual HDFS upload
More details: See PROBLEM.md and INSTALLATION.md
MIT License - see LICENSE