I generated clusters with the following result:
+-------------+------------+-------------+------------+-----------------+----------+
|SepalLengthCm|SepalWidthCm|PetalLengthCm|PetalWidthCm|         features|prediction|
+-------------+------------+-------------+------------+-----------------+----------+
|          5.1|         3.5|          1.4|         0.2|[5.1,3.5,1.4,0.2]|         1|
|          4.9|         3.0|          1.4|         0.2|[4.9,3.0,1.4,0.2]|         1|
|          4.7|         3.2|          1.3|         0.2|[4.7,3.2,1.3,0.2]|         0|
|          4.6|         3.1|          1.5|         0.2|[4.6,3.1,1.5,0.2]|         1|
|          5.0|         3.6|          1.4|         0.2|[5.0,3.6,1.4,0.2]|         2|
|          5.4|         3.9|          1.7|         0.4|[5.4,3.9,1.7,0.4]|         0|
Now I want to predict the cluster for new data points, for example ones with these values (the prediction column is what I need to fill in):
+-------------+------------+-------------+------------+-----------------+----------+
|SepalLengthCm|SepalWidthCm|PetalLengthCm|PetalWidthCm|         features|prediction|
+-------------+------------+-------------+------------+-----------------+----------+
|          3.2|         5.2|          0.4|         0.2|[3.2,5.2,0.4,0.2]|          |
|          5.9|         3.5|          2.4|         0.6|[5.9,3.5,2.4,0.6]|          |
The code I have written so far:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Use every column of df as an input feature
feature_cols = df.columns
vecAssembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
new_df = vecAssembler.transform(df)

kmeans = KMeans(k=3)  # 3 clusters
model = kmeans.fit(new_df.select('features'))

# Adds a "prediction" column holding the cluster index of each row
transformed = model.transform(new_df)
transformed.show()