This guide describes alternative methods for installing the sparklyr JAR on HDFS when the automated `setup_jar.R` script cannot be used.

Environment:
- Cloudera CDP: 7.1.9
- Spark: 3.3.2 (or 3.x versions)
- Scala: 2.12
For Spark 3.3.2 on Cloudera CDP 7.1.9, you should use one of these JARs (to be tested):

- `sparklyr-3.0-2.12.jar` (try this first)
- `sparklyr-3.5-2.12.jar` (alternative to test)

Note: The exact compatibility between sparklyr JAR versions and Spark 3.3.2 needs to be tested in your environment. Start with `sparklyr-3.0-2.12.jar`; if you encounter issues, try `sparklyr-3.5-2.12.jar`.
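The try-this-first rule above can be encoded in a small helper; a shell sketch (the version-to-JAR mapping is an assumption to validate in your environment, not a tested compatibility matrix):

```bash
# pick_sparklyr_jar: suggest which sparklyr JAR to try first for a given
# Spark version string. The ranges are an assumption, not a tested matrix.
pick_sparklyr_jar() {
  case "$1" in
    3.5*) echo "sparklyr-3.5-2.12.jar" ;;  # Spark 3.5.x
    3.*)  echo "sparklyr-3.0-2.12.jar" ;;  # other Spark 3.x: start here
    *)    echo "unknown: test manually" ;;
  esac
}

pick_sparklyr_jar "3.3.2"   # prints sparklyr-3.0-2.12.jar
```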
This method uses WebHDFS through Knox Gateway without requiring direct HDFS access or edge node access.
Find the JAR in your local R sparklyr installation:

```bash
R --slave -e "system.file('java', 'sparklyr-3.0-2.12.jar', package='sparklyr')"
```

Or download it from GitHub:

```bash
# For Spark 3.0
wget https://github.com/sparklyr/sparklyr/releases/download/v1.8.5/sparklyr-3.0-2.12.jar

# For Spark 3.5 (alternative)
wget https://github.com/sparklyr/sparklyr/releases/download/v1.8.5/sparklyr-3.5-2.12.jar
```

Create the target directory via WebHDFS:

```bash
curl -k -u "username:password" -X PUT \
  "https://knox_host:8443/gateway/cdp-proxy-api/webhdfs/v1/user/username/sparklyr?op=MKDIRS"
```

Initiate the file creation:

```bash
curl -k -i -u "username:password" -X PUT \
  "https://knox_host:8443/gateway/cdp-proxy-api/webhdfs/v1/user/username/sparklyr/sparklyr-3.0-2.12.jar?op=CREATE&overwrite=true"
```

This returns a `Location:` header with the upload URL.
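Extracting that `Location:` header can be scripted; a sketch using a synthetic response (the datanode URL below is made up; in practice, feed in the saved output of the `curl -i` call above):

```bash
# Synthetic CREATE response; in practice, save the real 'curl -k -i' output here.
printf 'HTTP/1.1 307 Temporary Redirect\r\nLocation: https://datanode.example:9864/webhdfs/v1/upload\r\n\r\n' \
  > create_response.txt

# Strip carriage returns, then take the URL from the Location line.
location=$(tr -d '\r' < create_response.txt | awk '/^Location:/ {print $2}')
echo "$location"   # the upload URL to pass to the next curl call
```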
Upload the JAR to that URL:

```bash
curl -k -u "username:password" -X PUT -T sparklyr-3.0-2.12.jar \
  "LOCATION_URL_FROM_STEP_3"
```

Then configure in your R session:

```r
SPARKLYR_JAR_PATH <- "hdfs:///user/username/sparklyr/sparklyr-3.0-2.12.jar"
```

If you have access to an edge node with an HDFS client:
```bash
# Create directory
hdfs dfs -mkdir -p /user/your_username/sparklyr

# Upload JAR
hdfs dfs -put sparklyr-3.0-2.12.jar /user/your_username/sparklyr/

# Verify
hdfs dfs -ls /user/your_username/sparklyr/
```

Then configure:

```r
SPARKLYR_JAR_PATH <- "hdfs:///user/your_username/sparklyr/sparklyr-3.0-2.12.jar"
```

If sparklyr JARs are already available on your cluster (some Cloudera deployments include them):
```bash
# Search for existing JARs
find /opt/cloudera/parcels -name "sparklyr*.jar" 2>/dev/null
```

If found, use the `file://` protocol:

```r
SPARKLYR_JAR_PATH <- "file:///opt/cloudera/parcels/CDH/jars/sparklyr-3.0-2.12.jar"
```

After installation, verify the JAR is accessible. Through Knox:

```bash
curl -k -u "username:password" \
  "https://knox_host:8443/gateway/cdp-proxy-api/webhdfs/v1/user/username/sparklyr/sparklyr-3.0-2.12.jar?op=GETFILESTATUS"
```

Or from an edge node:

```bash
hdfs dfs -ls /user/your_username/sparklyr/sparklyr-3.0-2.12.jar
```

If you get "JAR not found" when connecting:
- Check the path is correct in `knox_config.R`
- Verify the HDFS path format: `hdfs:///user/...` (three slashes)
- Verify file permissions: `hdfs dfs -ls -h /user/your_username/sparklyr/`
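The three-slash check above can be automated; an illustrative sketch (`check_hdfs_path` is a hypothetical helper, not part of `knox_config.R`):

```bash
# check_hdfs_path: flag paths missing the hdfs:/// (three-slash) prefix.
check_hdfs_path() {
  case "$1" in
    hdfs:///*) echo "ok" ;;
    *)         echo "bad: expected an hdfs:///... path (three slashes)" ;;
  esac
}

check_hdfs_path "hdfs:///user/username/sparklyr/sparklyr-3.0-2.12.jar"  # ok
check_hdfs_path "hdfs://namenode/user/username/sparklyr"                # bad
```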
If you get Spark version compatibility errors:

- Try the alternative JAR version (switch between `sparklyr-3.0-2.12.jar` and `sparklyr-3.5-2.12.jar`)
- Check the Spark version: `spark-submit --version`
- Ensure the Scala version matches (2.12 for Spark 3.x)
Common issues:
- 403 Forbidden: Check Knox credentials and permissions
- 404 Not Found: Verify `KNOX_WEBHDFS_URL` is correct
- Connection refused: Check Knox service is running and accessible
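When probing the endpoint with curl, the HTTP status code maps onto the causes above; a sketch (the `op=LISTSTATUS` probe and the `explain_status` helper are illustrative, not part of the setup scripts):

```bash
# explain_status: translate an HTTP status code from a WebHDFS probe into
# the likely cause. 000 is curl's code when no connection could be made.
explain_status() {
  case "$1" in
    200) echo "OK" ;;
    403) echo "Forbidden: check Knox credentials and permissions" ;;
    404) echo "Not Found: verify KNOX_WEBHDFS_URL is correct" ;;
    000) echo "No connection: check that Knox is running and reachable" ;;
    *)   echo "Unexpected status: $1" ;;
  esac
}

# Probe (needs a live endpoint, shown commented):
# code=$(curl -k -s -o /dev/null -w '%{http_code}' -u "username:password" \
#   "https://knox_host:8443/gateway/cdp-proxy-api/webhdfs/v1/?op=LISTSTATUS")
# explain_status "$code"
explain_status 403
```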
- JAR size is typically 5-10 MB
- Upload time depends on network speed (usually < 1 minute)
- The JAR is shared across all your Spark sessions
- You only need to upload once per Spark version
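Putting the WebHDFS method together, a consolidated sketch; the host, credentials, and filenames are placeholders, and the curl calls that need a live Knox endpoint are left commented so the wiring can be read offline:

```bash
# One-shot WebHDFS upload sketch -- adjust the placeholder values first.
KNOX_BASE="https://knox_host:8443/gateway/cdp-proxy-api/webhdfs/v1"
HDFS_USER="username"
JAR="sparklyr-3.0-2.12.jar"
DIR_URL="${KNOX_BASE}/user/${HDFS_USER}/sparklyr"

# 1. Create the target directory.
# curl -k -u "$HDFS_USER" -X PUT "${DIR_URL}?op=MKDIRS"

# 2. Initiate the two-step CREATE and capture the redirect target.
# location=$(curl -k -s -i -u "$HDFS_USER" -X PUT \
#   "${DIR_URL}/${JAR}?op=CREATE&overwrite=true" \
#   | tr -d '\r' | awk '/^Location:/ {print $2}')

# 3. Upload the JAR body to the returned location.
# curl -k -u "$HDFS_USER" -X PUT -T "$JAR" "$location"

# 4. The value to set in knox_config.R:
echo "SPARKLYR_JAR_PATH <- \"hdfs:///user/${HDFS_USER}/sparklyr/${JAR}\""
```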