Create a nested array column in a DataFrame from an existing DataFrame
0 votes
/ April 23, 2019

I am trying to create a nested array-of-structs column from a DataFrame during a join operation in Scala. The only thing I can get is an array wrapped in an `element` level, which does not look like what should be written to the JSON output.

The schemas I am starting with:

root
 |-- memberId: integer (nullable = false)
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)

root
 |-- memberSubscriberId: integer (nullable = false)
 |-- subscriberaddresstypecode: string (nullable = false)
 |-- lineOne: string (nullable = false)
 |-- lineTwo: string (nullable = false)
 |-- lineThree: string (nullable = false)
 |-- cityName: string (nullable = false)
 |-- stateCode: string (nullable = false)
 |-- zipCode: string (nullable = false)
 |-- countyCode: string (nullable = false)
 |-- countryCode: string (nullable = false)
 |-- subscriberphonenumber: string (nullable = false)
 |-- subscriberphoneextensionnumber: string (nullable = false)
 |-- subscriberfaxnumber: string (nullable = false)
 |-- subscriberfaxextensionnumber: string (nullable = false)
 |-- address: string (nullable = false)

What I have in mind:

root
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)
 |-- memberAddresses: array (nullable = false)
 |    |-- lineOne: string (nullable = false)
 |    |-- lineTwo: string (nullable = false)
 |    |-- lineThree: string (nullable = false)
 |    |-- cityName: string (nullable = false)
 |    |-- stateCode: string (nullable = false)
 |    |-- zipCode: string (nullable = false)
 |    |-- countyCode: string (nullable = false)
 |    |-- countryCode: string (nullable = false)
 |-- memeberPhoneNumbers: array (nullable = false)
 |    |-- phoneNumber: string (nullable = false)
 |    |-- effectiveDate: null (nullable = true)
 |    |-- terminationDate: null (nullable = true)
 |    |-- isCurrent: null (nullable = true)
 |    |-- isActive: null (nullable = true)
 |    |-- telecomType: string (nullable = false)

Current code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, lit, struct}
import spark.implicits._

// Inputs: mbrDF (members) and addrDF (addresses), both of type DataFrame,
// with the two schemas shown above.

val nestedAddr = addrDF.select(
  $"memberSubscriberId",
  // One-element array holding this row's address as a struct.
  array(
    struct(
      $"lineOne",
      $"lineTwo",
      $"lineThree",
      $"cityName",
      $"stateCode",
      $"zipCode",
      $"countyCode",
      $"countryCode"
    )
  ).as("memberAddresses"),
  // Two-element array: the home phone and the fax number as structs.
  array(
    struct(
      $"subscriberphonenumber".alias("phoneNumber"),
      //$"subscriberphoneextensionnumber"
      lit(null).alias("effectiveDate"),
      lit(null).alias("terminationDate"),
      lit(null).alias("isCurrent"),
      lit(null).alias("isActive"),
      lit("home").alias("telecomType")
    ),
    struct(
      $"subscriberfaxnumber".alias("phoneNumber"),
      //$"subscriberfaxextensionnumber".map(c => col(c).as("phoneNumber"))
      lit(null).alias("effectiveDate"),
      lit(null).alias("terminationDate"),
      lit(null).alias("isCurrent"),
      lit(null).alias("isActive"),
      lit("fax").alias("telecomType")
    )
  ).as("memeberPhoneNumbers")
)

// Join the nested address and phone columns onto the member rows.
val addrMbrDF = mbrDF.join(nestedAddr, Seq("memberSubscriberId"))

Resulting schema:

root
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)
 |-- memberAddresses: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- lineOne: string (nullable = false)
 |    |    |-- lineTwo: string (nullable = false)
 |    |    |-- lineThree: string (nullable = false)
 |    |    |-- cityName: string (nullable = false)
 |    |    |-- stateCode: string (nullable = false)
 |    |    |-- zipCode: string (nullable = false)
 |    |    |-- countyCode: string (nullable = false)
 |    |    |-- countryCode: string (nullable = false)
 |-- memeberPhoneNumbers: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- phoneNumber: string (nullable = false)
 |    |    |-- effectiveDate: null (nullable = true)
 |    |    |-- terminationDate: null (nullable = true)
 |    |    |-- isCurrent: null (nullable = true)
 |    |    |-- isActive: null (nullable = true)
 |    |    |-- telecomType: string (nullable = false)


Expected schema:
root
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)
 |-- memberAddresses: array (nullable = false)
 |    |-- lineOne: string (nullable = false)
 |    |-- lineTwo: string (nullable = false)
 |    |-- lineThree: string (nullable = false)
 |    |-- cityName: string (nullable = false)
 |    |-- stateCode: string (nullable = false)
 |    |-- zipCode: string (nullable = false)
 |    |-- countyCode: string (nullable = false)
 |    |-- countryCode: string (nullable = false)
 |-- memeberPhoneNumbers: array (nullable = false)
 |    |-- phoneNumber: string (nullable = false)
 |    |-- effectiveDate: null (nullable = true)
 |    |-- terminationDate: null (nullable = true)
 |    |-- isCurrent: null (nullable = true)
 |    |-- isActive: null (nullable = true)
 |    |-- telecomType: string (nullable = false)

I have tried a few different things to get it to work:

      ).as("memberAddresses"),
      array(
        struct(

      ).as("memberAddresses"),
       struct(

      ).as("memberAddresses"),
      array(

      ).as("memberAddresses"),
      collect_list(
        struct(
1 Answer

0 votes
/ April 23, 2019

Simply put, the expected schema you want is not possible. An array always contains an `element` with a given schema, which in your case is a struct. The `element` line in the `printSchema` output only describes the type of the array's items; it is not a field of its own, so it does not appear as a key in the JSON output. So I would say the schema you are getting is exactly what you are trying to achieve.
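
To check this against the JSON output, here is a minimal sketch. It assumes a local SparkSession and a made-up one-row stand-in for `addrDF` with only two of its columns; the object name `ElementDemo` and the helper `nestedJson` are hypothetical. It builds the same `array(struct(...))` column and returns the first row serialized as JSON.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, struct}

object ElementDemo {
  // Hypothetical helper: builds an array-of-structs column the same way as
  // in the question and returns the first row as a JSON string.
  def nestedJson(spark: SparkSession): String = {
    import spark.implicits._

    // Made-up stand-in for addrDF, reduced to two address columns.
    val addrDF = Seq((1, "1 Main St", "Springfield"))
      .toDF("memberSubscriberId", "lineOne", "cityName")

    val nested = addrDF.select(
      $"memberSubscriberId",
      array(struct($"lineOne", $"cityName")).as("memberAddresses")
    )

    // The schema tree lists an "element: struct" level under memberAddresses.
    nested.printSchema()

    // In the serialized row, memberAddresses is a plain JSON array of objects.
    nested.toJSON.first()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("element-demo")
      .getOrCreate()
    println(nestedJson(spark))
    spark.stop()
  }
}
```

The printed JSON should show `memberAddresses` as an array whose items carry `lineOne` and `cityName` directly, with no `element` key in between.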
