Description
Hi,
I am trying to create a Spark session with raydp.init_spark() and am running into an issue. I believe the problem stems from passing spark.driver.extraClassPath into the init configs: when I remove spark.driver.extraClassPath from the config in the code below, everything works. I need to add jars to spark.driver.extraClassPath because some of the jars I use must be on the driver's classpath.
From my understanding, Spark on Ray runs in client mode. The documentation for spark.driver.extraClassPath states:
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
I have set spark.driver.extraClassPath in my default properties file but this did not seem to fix the issue.
Is there a way to set spark.driver.extraClassPath before the spark driver JVM starts?
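For reference, this is roughly the kind of workaround I am hoping exists. As far as I can tell, PySpark builds its spark-submit command from the PYSPARK_SUBMIT_ARGS environment variable when it launches the JVM gateway, so putting --driver-class-path there before anything starts the driver JVM might have the same effect as spark.driver.extraClassPath. I have not verified that this plays well with raydp; the snippet below is only a sketch of the idea:

import os
import glob
import pyspark

spark_home = os.environ.get("SPARK_HOME", os.path.dirname(pyspark.__file__))
spark_jars = glob.glob(os.path.join(spark_home, "jars/*"))

# Untested idea: pass the jars on the driver class path through the
# spark-submit arguments PySpark uses when launching the gateway JVM.
# This has to happen before raydp.init_spark() / SparkContext creation,
# i.e. before the driver JVM exists.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path " + ":".join(spark_jars) + " pyspark-shell"
)

import raydp  # init_spark() would then be called as in the script below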
Please let me know if I need to provide more information.
Environment Details
python==3.9.13
pyspark==3.5.1
ray==2.5.0
raydp==1.6.1
Reproducible Script
import raydp
import pyspark
import numpy as np
import glob
import os
spark_home = os.environ.get("SPARK_HOME", os.path.dirname(pyspark.__file__))
spark_jars = os.path.abspath(os.path.join(spark_home, "jars/*"))
spark_config = {
    'spark.driver.host': '127.0.0.1',
    'spark.driver.bindAddress': '0.0.0.0',
    'spark.driver.memory': '10G',
    'spark.driver.maxResultSize': '6G',
    'spark.ui.port': '4041',
}
spark_config.update(
    {
        "spark.jars": ",".join(glob.glob(spark_jars)),
        "spark.driver.extraClassPath": ":".join(glob.glob(spark_jars)),
        "spark.executor.extraClassPath": ":".join(glob.glob(spark_jars)),
        "raydp.executor.extraClassPath": ":".join(glob.glob(spark_jars)),
    }
)
app_name = "RAYAPPTEST"
spark_app_id = f"Spark for {app_name} - {np.random.randint(1, 1000)}"
num_executors = 1
cores_per_executor = 1
memory_per_executor = "2G"
spark = raydp.init_spark(spark_app_id, num_executors, cores_per_executor, memory_per_executor, configs=spark_config)
print(spark)
Stack Trace
Exception in thread "main" org.apache.spark.SparkException: Master must either be yarn or start with spark, mesos, k8s, or local
    at org.apache.spark.deploy.SparkSubmit.error(SparkSubmit.scala:1047)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:256)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:964)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "reproduce_ray_issue.py", line 23, in <module>
spark = raydp.init_spark(spark_app_id, num_executors, cores_per_executor, memory_per_executor, configs=spark_config)
File "/env/lib/python3.9/site-packages/raydp/context.py", line 215, in init_spark
return _global_spark_context.get_or_create_session()
File "/env/lib/python3.9/site-packages/raydp/context.py", line 122, in get_or_create_session
self._spark_session = spark_cluster.get_spark_session()
File "/env/lib/python3.9/site-packages/raydp/spark/ray_cluster.py", line 189, in get_spark_session
spark_builder.appName(self._app_name).master(self.get_cluster_url()).getOrCreate()
File "/env/lib/python3.9/site-packages/pyspark/sql/session.py", line 497, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/env/lib/python3.9/site-packages/pyspark/context.py", line 515, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/env/lib/python3.9/site-packages/pyspark/context.py", line 201, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/env/lib/python3.9/site-packages/pyspark/context.py", line 436, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/env/lib/python3.9/site-packages/pyspark/java_gateway.py", line 107, in launch_gateway
raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.