Making findFirstMatchIn skip a line when no match is found (Spark, Scala)
February 15, 2019

I'm new to Scala. I'm trying to write code that matches and parses numbers out of a series of XML files. My code works on a small RDD, as shown below:

val myrdd = sc.parallelize(Array("FavoriteCount=\"23\" Score=\"43\"","FavoriteCount=\"12\" Score=\"32\"","FavoriteCount=\"32\" Score=\"2\""))

def successMatches(s: String): (String,Int) = {
  val fcountMatcher = """FavoriteCount=\"(\d+)\"""".r
  val scoreMatcher = """Score=\"(\d+)\"""".r
  val fcount = fcountMatcher.findFirstMatchIn(s).get.group(1)
  val score = scoreMatcher.findFirstMatchIn(s).get.group(1)
  (fcount,score.toInt)
}


val myWords = myrdd.map(x => successMatches(x))
myWords.take(3)

myrdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:29
successMatches: (s: String)(String, Int)
myWords: Array[(String, Int)] = Array((23,43), (12,32), (32,2))
res1: Array[(String, Int)] = Array((23,43), (12,32), (32,2))

On the real XML RDD, however, it fails with the error shown below:

    val myWords = valid_lines.take(1).map(x => successMatches(x))
    myWords.take(1)

java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at successMatches(<console>:52)
  at $anonfun$1.apply(<console>:57)
  at $anonfun$1.apply(<console>:57)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  ... 42 elided

What am I missing?
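The `None.get` at the top of the trace suggests that at least one of the two patterns did not match the input: `findFirstMatchIn` returns an `Option[Match]`, and calling `.get` on `None` throws. A minimal sketch of that behaviour in plain Scala follows; the object name `NoneGetDemo` and the sample string `noFavorite` are made up for illustration and are not taken from the real data.

```scala
import scala.util.Try

object NoneGetDemo {
  val fcountMatcher = """FavoriteCount="(\d+)"""".r

  // Hypothetical input: a row with a Score but no FavoriteCount attribute.
  val noFavorite = """<row AnswerCount="0" Score="5" />"""

  // findFirstMatchIn returns Option[Match]; for this input it is None.
  val result = fcountMatcher.findFirstMatchIn(noFavorite)

  // Calling .get on None throws NoSuchElementException; Try captures it
  // so the failure can be inspected instead of aborting.
  val attempt = Try(result.get.group(1))
}
```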

Here is what the first element of the XML RDD looks like:

valid_lines.take(1)

res51: Array[String] = Array("  <row AnswerCount="0" Body="&lt;p&gt;I'm having trouble with a basic machine learning methodology question. I understand the concept of not using the same data to both train and evaluate a classifier, and furthermore when there are parameters in an algorithm to be optimized, you should use an independent third test set to get the final reportable performance figures (e.g. recall rate). However, using a &lt;em&gt;single&lt;/em&gt; test set to measure performance seems to be problematic because the measures of performance will likely differ depending on how the data is partitioned into training (plus validation) and test sets, especially for small datasets. It would be better to average the results of N different partitions.&lt;/p&gt;&#10;&#10;&lt;p&gt;For t...
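One possible way to skip lines that do not match, sketched under the assumption that rows lacking either attribute should simply be dropped: have the parser return an `Option` and use `flatMap` instead of `map`. The names `SkipUnmatched` and `safeMatches` and the sample lines are illustrative only. The example uses a plain `Seq` so it runs without Spark; on the RDD, the equivalent call would presumably be `valid_lines.flatMap(x => safeMatches(x))`.

```scala
object SkipUnmatched {
  // Compile the patterns once instead of on every call.
  private val fcountMatcher = """FavoriteCount="(\d+)"""".r
  private val scoreMatcher = """Score="(\d+)"""".r

  // Returns Some((favoriteCount, score)) only when both attributes are present;
  // if either pattern fails to match, the whole result is None.
  def safeMatches(s: String): Option[(String, Int)] =
    for {
      fcount <- fcountMatcher.findFirstMatchIn(s)
      score  <- scoreMatcher.findFirstMatchIn(s)
    } yield (fcount.group(1), score.group(1).toInt)

  // Made-up sample lines; the second one has no FavoriteCount attribute.
  val lines = Seq(
    """FavoriteCount="23" Score="43"""",
    """<row AnswerCount="0" Score="5" />""",
    """FavoriteCount="32" Score="2""""
  )

  // flatMap drops the None results and unwraps the Some results.
  val parsed: Seq[(String, Int)] = lines.flatMap(s => safeMatches(s))
}
```

With this shape, unmatched rows vanish silently from the result. If they need to be counted or logged, keeping the `Option` values with `map` and inspecting them afterwards may be a better fit.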