I found that by just following the instructions at https://hadoop-user-guide.web.cern.ch/hadoop-user-guide/gettingstarted_md.html I can submit this minimal job:
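from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
# Run on the YARN cluster under a recognizable application name
conf = SparkConf().setMaster("yarn").setAppName("CMS Working Set")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
# Avro reader supplied by the spark-avro package passed to spark-submit via --packages
readavro = spark.read.format("com.databricks.spark.avro")
# The glob selects the 2017, 2018 and 2019 directories and every .avro file three levels below them
fwjr = readavro.load("/cms/wmarchive/avro/fwjr/201[789]/*/*/*.avro")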
with
spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 test.py
Perhaps this would be a gentler introduction than starting with the complexity of RDDs? There also seem to be options for running from lxplus.
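In case it is unclear what the glob in the load() call picks up, here is a rough sketch in plain Python (no Spark needed). It assumes Hadoop-style globbing where each `*` stays within one path component, and it approximates that with the standard-library `fnmatch` module rather than using Hadoop's own matcher. The helper `glob_matches` and the example file names are hypothetical, made up for illustration only.

```python
from fnmatch import fnmatchcase

# Pattern taken from the job above
PATTERN = "/cms/wmarchive/avro/fwjr/201[789]/*/*/*.avro"

def glob_matches(path, pattern=PATTERN):
    """Return True if path matches pattern, compared one path component at a time."""
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    # Assumption: "*" never spans a "/", so the component counts must agree
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(p, q) for p, q in zip(path_parts, pattern_parts))

# Hypothetical file names, for illustration only
examples = [
    "/cms/wmarchive/avro/fwjr/2018/03/14/part-0.avro",  # year in [789], right depth
    "/cms/wmarchive/avro/fwjr/2016/03/14/part-0.avro",  # year outside 201[789]
    "/cms/wmarchive/avro/fwjr/2018/03/part-0.avro",     # one directory level short
]
for path in examples:
    print(path, glob_matches(path))
```

Only the first path should be selected; the other two fail on the year and on the directory depth respectively.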