Finally, I can answer my own question! Here allData is an RDD[LabeledPoint]:
// LabeledPoint and Vector are assumed to come from the spark.ml packages here,
// so that the ML MinMaxScaler accepts the vector column
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{LabeledPoint, MinMaxScaler}
import org.apache.spark.ml.linalg.Vector

val sqlContext = SparkSession
  .builder()
  .appName("Spark In Action")
  .master("local")
  .getOrCreate()
// This import only compiles on a stable identifier (a val), because the
// implicits object is defined inside the session instance
import sqlContext.implicits._
// Create a DataFrame from RDD[LabeledPoint]
val all = allData.map(e => (e.label, e.features))
val df_all = all.toDF("labels", "features")
// Configure the scaler with the same range as before: min = 0, max = 1
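// MinMaxScaler rescales each feature column to [min, max]:
//   x' = (x - E_min) / (E_max - E_min) * (max - min) + min
// where E_min / E_max are the observed minimum / maximum of that column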
val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("featuresScaled")
  .setMax(1)
  .setMin(0)
// Fit the scaler, transform the data, and drop the original unscaled column
val df_scaled = scaler.fit(df_all).transform(df_all).drop("features")
// Convert the DataFrame back to an RDD[LabeledPoint]
val rdd_scaled = df_scaled.rdd.map(row => LabeledPoint(
  row.getAs[Double]("labels"),
  row.getAs[Vector]("featuresScaled")
))
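As a quick sanity check (just inspecting the results, using the names defined above), every scaled feature should now lie in [0, 1]:

// Peek at a few scaled rows; all feature values should fall within [0, 1]
df_scaled.show(3, truncate = false)
rdd_scaled.take(3).foreach(println)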
I hope this helps someone else!