I am comparing the results of the k-means implementations in Spark and sklearn. With both I am plotting the sum of squared distances (SSE) from the samples to their nearest cluster center.
from pyspark.sql import SparkSession
import pandas as pd
import pyspark.sql.functions as sf
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans as spark_KMeans
spark = SparkSession.builder.master("local[*]").getOrCreate()
iris = load_iris()
pdf = pd.DataFrame(iris.data, columns=iris.feature_names)
# strip " (cm)" and replace spaces so the column names are valid Spark identifiers
pdf.columns = pdf.columns.str.replace(r' \(cm\)', '', regex=True).str.replace(' ', '_').tolist()
sdf = spark.createDataFrame(pdf)
vecAssembler = VectorAssembler(inputCols=sdf.columns, outputCol="features")
sdf = vecAssembler.transform(sdf)
clustering_spark_model = spark_KMeans()\
    .setFeaturesCol("features")\
    .setPredictionCol("cluster")
# also tried: .setInitMode("k-means||") and .setInitSteps(2)
sse_pdf, sse_sdf = {}, {}
for k in range(2, 30):
    # sklearn k-means: inertia_ is the SSE of the best of n_init runs
    km_pdf = KMeans(n_clusters=k, n_init=20)
    km_pdf.fit(pdf)
    sse_pdf[k] = km_pdf.inertia_
    # Spark ML k-means: computeCost returns the SSE of the single fitted model
    km_sdf = clustering_spark_model.setK(k)
    km_model_sdf = km_sdf.fit(sdf)
    sse_sdf[k] = km_model_sdf.computeCost(sdf)
plt.figure()
x_pdf, y_pdf = zip(*sse_pdf.items())
x_sdf, y_sdf = zip(*sse_sdf.items())
plt.plot(x_pdf, y_pdf, label="sklearn")
plt.plot(x_sdf, y_sdf, label="spark")
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.legend(loc="upper right")
plt.show()
I see some fluctuations in the Spark curve and some differences from the sklearn results. Is this to be expected? I have tried setting initMode, increasing initSteps and fixing the seed in both algorithms, but for large numbers of clusters there are still some differences.
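For reference, here is a minimal sketch of how the initialisation can be pinned in both libraries (parameter names are from the documented sklearn and pyspark.ml APIs; the concrete values are illustrative, not the exact settings used above):

from sklearn.cluster import KMeans
from pyspark.ml.clustering import KMeans as spark_KMeans

k = 10
# sklearn: fix the RNG and keep the best of several restarts
km_pdf = KMeans(n_clusters=k, init="k-means++", n_init=50, random_state=42)
# Spark ML: fix the seed and raise the k-means|| init steps and iteration budget
km_sdf = spark_KMeans(featuresCol="features", predictionCol="cluster", k=k,
                      initMode="k-means||", initSteps=5, maxIter=100, seed=42)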