Problem in the Gaussian Mixture clustering part

```python
# Gaussian Mixture clustering
from pyspark.ml.clustering import GaussianMixture

t0 = time()
gm = GaussianMixture(k=8, maxIter=150, seed=seed, featuresCol="pca_features",
                     predictionCol="cluster", probabilityCol="gm_prob")

gm_pipeline = Pipeline(stages=[pca_slicer, pca, gm])
gm_model = gm_pipeline.fit(scaled_train_df)

gm_train_df = gm_model.transform(scaled_train_df).cache()
gm_cv_df = gm_model.transform(scaled_cv_df).cache()
gm_test_df = gm_model.transform(scaled_test_df).cache()

gm_params = (gm_model.stages[2].gaussiansDF.rdd
                  .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
                  .collect())
gm_weights = gm_model.stages[2].weights

print(gm_train_df.count())
print(gm_cv_df.count())
print(gm_test_df.count())
print(time() - t0)
```

When I run this part in a Jupyter notebook, the following error appears:
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-105-a786719bf1ab> in <module>
     14 
     15 gm_params = (gm_model.stages[2].gaussiansDF.rdd
---> 16                   .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
     17                   .collect())
     18 gm_weights = gm_model.stages[2].weights

C:\Spark\python\pyspark\rdd.py in collect(self)
    813             to be small, as all the data is loaded into the driver's memory.
    814         """
--> 815         with SCCallSiteSync(self.context) as css:
    816             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    817         return list(_load_from_socket(sock_info, self._jrdd_deserializer))

C:\Spark\python\pyspark\traceback_utils.py in __enter__(self)
     70     def __enter__(self):
     71         if SCCallSiteSync._spark_stack_depth == 0:
---> 72             self._context._jsc.setCallSite(self._call_site)
     73         SCCallSiteSync._spark_stack_depth += 1
     74 

AttributeError: 'NoneType' object has no attribute 'setCallSite'
```

I did some research but found few answers; some people said it is a bug in Spark itself. By the way, I didn't use the Docker image. Instead, I built an "Anaconda 3.7.6 + pyspark 2.4.5" environment to run this code.
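A note on the traceback: the last frame fails because `self._context._jsc` is `None`. PySpark sets `_jsc` to `None` when a SparkContext is stopped, so one possible cause (an assumption, not confirmed for this notebook) is that the RDD still points at a context that is no longer running. The sketch below shows a minimal check for that condition. The helper name `context_is_alive` and the `_FakeContext` stand-in are hypothetical and exist only so the example runs without Spark.

```python
# Hypothetical helper: reports whether a SparkContext-like object still
# holds a JVM handle. PySpark sets `_jsc` to None when a context is stopped.
def context_is_alive(sc):
    return sc is not None and getattr(sc, "_jsc", None) is not None


# Stand-in object so this sketch runs without a Spark installation.
class _FakeContext:
    def __init__(self, jsc):
        self._jsc = jsc


stopped = _FakeContext(jsc=None)       # mimics a stopped context
running = _FakeContext(jsc=object())   # mimics a live context

print(context_is_alive(stopped))
print(context_is_alive(running))
```

In the notebook, the same check could be applied to the context the failing RDD is bound to, for example `context_is_alive(gm_model.stages[2].gaussiansDF.rdd.context)`. If that reports a dead context, recreating the SparkSession and re-running the earlier cells may be worth trying.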

Could you please help me solve this problem? Thank you very much!
