`# Gaussian Mixture clustering
from time import time
from pyspark.ml import Pipeline
from pyspark.ml.clustering import GaussianMixture
# seed, pca_slicer, pca and the scaled_*_df DataFrames are defined in earlier cells
t0 = time()
gm = GaussianMixture(k=8, maxIter=150, seed=seed, featuresCol="pca_features",
                     predictionCol="cluster", probabilityCol="gm_prob")
gm_pipeline = Pipeline(stages=[pca_slicer, pca, gm])
gm_model = gm_pipeline.fit(scaled_train_df)
gm_train_df = gm_model.transform(scaled_train_df).cache()
gm_cv_df = gm_model.transform(scaled_cv_df).cache()
gm_test_df = gm_model.transform(scaled_test_df).cache()
# Collect the mean vector and covariance matrix of each Gaussian component
gm_params = (gm_model.stages[2].gaussiansDF.rdd
             .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
             .collect())
gm_weights = gm_model.stages[2].weights
# count() forces the cached DataFrames to be computed
print(gm_train_df.count())
print(gm_cv_df.count())
print(gm_test_df.count())
print(time() - t0)`
When I run this part in a Jupyter notebook, an error appears:
`---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
in
14
15 gm_params = (gm_model.stages[2].gaussiansDF.rdd
---> 16 .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
17 .collect())
18 gm_weights = gm_model.stages[2].weights
C:\Spark\python\pyspark\rdd.py in collect(self)
813 to be small, as all the data is loaded into the driver's memory.
814 """
--> 815 with SCCallSiteSync(self.context) as css:
816 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
817 return list(_load_from_socket(sock_info, self._jrdd_deserializer))
C:\Spark\python\pyspark\traceback_utils.py in __enter__(self)
70 def __enter__(self):
71 if SCCallSiteSync._spark_stack_depth == 0:
---> 72 self._context._jsc.setCallSite(self._call_site)
73 SCCallSiteSync._spark_stack_depth += 1
74
AttributeError: 'NoneType' object has no attribute 'setCallSite'`
I did some research but found few answers; some people say it is a bug in Spark itself. By the way, I am not using a Docker image; I set up an "Anaconda 3.7.6 + pyspark 2.4.5" environment to run this code.
Can you please help me solve this problem? Thank you very much!