Is it possible to find out which regular expression matched when using rlike in Apache Spark?
1 vote
/ January 24, 2020

Example:

val surveyDF = List(
  ("I like pizza"),
  ("I love French fries"),
  ("Milkshake is so cute"),
  ("Icecream is yummy")
).toDF("survey")

val items = List("piz.*", "Ice.*")

I would like to find out how many responses mention items such as pizza and ice cream.

With the rlike function available in Apache Spark, I can get this result:

// Keep the DataFrame itself; show() returns Unit, so it is called separately
val result = surveyDF
  .withColumn(
    "contains_items",
    col("survey").rlike(items.mkString("|"))
  )

result.show(truncate = false)

Results:

+--------------------+--------------+
|survey              |contains_items|
+--------------------+--------------+
|I like pizza        |true          |
|I love French fries |false         |
|Milkshake is so cute|false         |
|Icecream is yummy   |true          |
+--------------------+--------------+

As we know, rlike returns only true or false. I would like to know whether there is any option to find out which regular expression matched when the result is true.

Expected results:

+--------------------+--------------+-----+
|survey              |contains_items|regex|
+--------------------+--------------+-----+
|I like pizza        |true          |piz.*|
|I love French fries |false         |null |
|Milkshake is so cute|false         |null |
|Icecream is yummy   |true          |Ice.*|
+--------------------+--------------+-----+

Answers [ 3 ]

0 votes
/ January 24, 2020

Update: this answer has been revised following the edits to the question, so it now shows the matching pattern instead of the matched text.

Here is sample code for the new approach:

import org.apache.spark.sql.functions.{col, length, regexp_extract, udf}

val items = List("piz.*", "Ice.*")

// Returns the patterns from items that match somewhere in input, comma-separated.
// Matches are prepended, so they come out in reverse list order.
def matchedPattern(input: String): String = {
  var matchedList = List[String]()
  items.foreach { x =>
    // findFirstIn searches anywhere in the string (unanchored)
    if (x.r.findFirstIn(input).isDefined) matchedList = x :: matchedList
  }
  matchedList.mkString(",")
}
val matchedPatternUDF = udf[String, String](matchedPattern)

val surveyDF = List(
  ("I like pizza "),
  ("I love French fries"),
  ("Milkshake is so cute"),
  ("Icecream is yummy"),
  ("pizza and Icecreams are yummy")
).toDF("survey")

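// Alternation of all patterns, used for the boolean contains_items column
val regexPattern = items.mkString("|")

val resultDF = surveyDF.
  withColumn("contains_items", length(regexp_extract(col("survey"), regexPattern, 0)) > 0).
  withColumn("likes", matchedPatternUDF(col("survey")))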

resultDF.show(10,false)

Output:

+-----------------------------+--------------+-----------+
|survey                       |contains_items|likes      |
+-----------------------------+--------------+-----------+
|I like pizza                 |true          |piz.*      |
|I love French fries          |false         |           |
|Milkshake is so cute         |false         |           |
|Icecream is yummy            |true          |Ice.*      |
|pizza and Icecreams are yummy|true          |Ice.*,piz.*|
+-----------------------------+--------------+-----------+
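
The same column can probably be built without a UDF. The following is a minimal sketch, not part of the original answer: it assumes a local SparkSession, and the names MatchedRegexBuiltins and withMatchedRegex are hypothetical. It relies on when(...) without otherwise yielding null and on concat_ws skipping null arguments, so only the patterns that match are joined.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, lit, when}

object MatchedRegexBuiltins {
  // Hypothetical helper: adds contains_items and regex columns using built-in functions only.
  def withMatchedRegex(df: DataFrame, items: Seq[String]): DataFrame = {
    // One expression per pattern: the pattern text when it matches, null otherwise.
    val perPattern: Seq[Column] = items.map(p => when(col("survey").rlike(p), lit(p)))

    df
      .withColumn("contains_items", col("survey").rlike(items.mkString("|")))
      // concat_ws ignores nulls, so only matching patterns are joined, in list order.
      // The outer when keeps the column null for rows where nothing matched.
      .withColumn("regex", when(col("contains_items"), concat_ws(",", perPattern: _*)))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder().master("local[*]").appName("matched-regex").getOrCreate()
    import spark.implicits._

    val surveyDF = List(
      "I like pizza",
      "I love French fries",
      "Milkshake is so cute",
      "Icecream is yummy",
      "pizza and Icecreams are yummy"
    ).toDF("survey")

    withMatchedRegex(surveyDF, List("piz.*", "Ice.*")).show(truncate = false)
    spark.stop()
  }
}
```

Unlike the UDF version, this lists patterns in the order of items and leaves the column null (not an empty string) for rows without a match. Each pattern is evaluated as a separate rlike expression, so it may not suit very long pattern lists.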

0 votes
/ January 24, 2020

Here you go, with a UDF and a closure:

val surveyDF = List( ("I like pizza"), ("I love French fries"), ("Milkshake is so cute"), ("Icecream is yummy"), ("pizza Ice") ).toDF("survey")

val items = List("piz.*", "Ice.*")

val rgxFind = udf((survey: String) => items.filter(x => x.r.findFirstMatchIn(survey).nonEmpty).mkString(","))

// result is assumed to be surveyDF with the contains_items column already added via rlike
result.withColumn("regex", rgxFind($"survey")).show(false)

+--------------------+--------------+-----------+
|survey              |contains_items|regex      |
+--------------------+--------------+-----------+
|I like pizza        |true          |piz.*      |
|I love French fries |false         |           |
|Milkshake is so cute|false         |           |
|Icecream is yummy   |true          |Ice.*      |
|pizza Ice           |true          |piz.*,Ice.*|
+--------------------+--------------+-----------+
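
The closure above calls x.r for every row, which rebuilds each regex from its string each time, and it would throw on a null survey value. A minimal sketch in plain Scala (no Spark) of the same matching logic, with the patterns compiled once and null input handled, might look like this. The names MatchedPatterns, compiled and matchedPatterns are illustrative and do not come from the answer.

```scala
import scala.util.matching.Regex

object MatchedPatterns {
  val items: List[String] = List("piz.*", "Ice.*")

  // Compile each pattern once, keeping the original pattern text next to it.
  val compiled: List[(String, Regex)] = items.map(p => p -> p.r)

  // Returns the patterns found anywhere in the input, in list order.
  // findFirstIn is unanchored, so a pattern matches if it occurs at any position.
  // A null input yields an empty list instead of an exception.
  def matchedPatterns(input: String): List[String] =
    if (input == null) Nil
    else compiled.collect { case (text, re) if re.findFirstIn(input).isDefined => text }

  def main(args: Array[String]): Unit = {
    // Print the comma-separated matches for a sample sentence.
    println(matchedPatterns("pizza and Icecreams are yummy").mkString(","))
  }
}
```

Wrapping it for Spark would presumably be a one-liner such as udf((s: String) => MatchedPatterns.matchedPatterns(s).mkString(",")), though that wiring is an assumption and has not been run here.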

0 votes
/ January 24, 2020

scala> import org.apache.spark.sql.expressions.UserDefinedFunction

scala> import org.apache.spark.sql.functions.{col, lit, udf, when}

scala> surveyDF.show()
+-----------------------------+
|survey                       |
+-----------------------------+
|I like pizza                 |
|I love French fries          |
|Milkshake is so cute         |
|Icecream is yummy            |
|pizza and Icecreams are yummy|
+-----------------------------+



scala> def MatchWord:UserDefinedFunction = udf((line:String,pattern:String) => {
     | var out = ""
     | import scala.util.matching.Regex
     | val patternList = pattern.split("~").toList
     | patternList.foreach{ x =>
     |           val patternRgx = new Regex(x)
     |           val patternCheck = (patternRgx findAllIn line).mkString(",")
     | if(patternCheck != "")
     | {out = out + "," + x}
     | }
     | out.replaceFirst(s""",""","") })

scala> val items = List("piz.*", "Ice.*")

scala> surveyDF.withColumn("contains_items",col("survey").rlike(items.mkString("|")))
           .withColumn("regex", when(col("contains_items"), MatchWord(col("survey"),lit(items.mkString("~")))))
           .show(false)
+-----------------------------+--------------+-----------+
|survey                       |contains_items|regex      |
+-----------------------------+--------------+-----------+
|I like pizza                 |true          |piz.*      |
|I love French fries          |false         |null       |
|Milkshake is so cute         |false         |null       |
|Icecream is yummy            |true          |Ice.*      |
|pizza and Icecreams are yummy|true          |piz.*,Ice.*|
+-----------------------------+--------------+-----------+
...