код выглядит следующим образом:
val tokenizer = new RegexTokenizer().setPattern("[\\W_]+").setMinTokenLength(4).setInputCol("sendcontent").setOutputCol("tokens")
var tokenized_df = tokenizer.transform(sourDF)
import org.apache.spark.sql.functions.{concat_ws}
val mkString = udf((arrayCol: Seq[String]) => arrayCol.mkString(","))
tokenized_df=tokenized_df.withColumn("words",mkString($"tokens")).drop("tokens")
tokenized_df.createOrReplaceTempView("tempview")
sql(s"drop table if exists $result_table")
sql(s"create table $result_table as select msgid,sendcontent,cast (words as string) as words from tempview")
Исключение составляют следующие:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, svr14614hw2288.hadoop.sh.ctripcorp.com, executor 2): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$2: (string) => array)